Over the last few days I have been wondering about something, and I don't know whether it is possible or has already been done somewhere. Is it possible to merge the weights of two transformer models, the way people merge Stable Diffusion models? For example, can I merge BioBERT and LegalBERT and get a model that can do both?

Comments

[deleted] t1_jdq3nn5 wrote on March 26, 2023 at 8:42 AM

#2,366,903

[deleted]

incrapnito t1_jdqamby wrote on March 26, 2023 at 10:25 AM

#2,367,534

I think you are looking for federated learning, which is a complete research field of its own. It digs into combining the weights of two neural networks such that both tasks can still be accomplished. Existing approaches should apply to transformers too.

tdgros t1_jdqarbq wrote on March 26, 2023 at 10:27 AM

#2,367,559

Replying to [deleted] (#2,366,903)

What's the connection between LoRA and the question about merging weights here?

edit: weird, I saw a notification for an answer from you, but can't see the message...

LoRA is a fine-tuning method that learns low-rank update matrices, which are added to the frozen pretrained weights, for a single task. It does not merge models or weights.

tdgros t1_jdqbgqy wrote on March 26, 2023 at 10:37 AM

#2,367,630

The model merging offered by some Stable Diffusion UIs does not merge the weights of a network! It merges the denoising results for a single diffusion step from two different denoisers, which is very different!

Merging the weights of two different models does not, in general, produce something functional, and it can only work for two models with exactly the same structure. It certainly does not "mix their functionality".

[deleted] t1_jdqc0ax wrote on March 26, 2023 at 10:45 AM

#2,367,681

Replying to tdgros (#2,367,559)

[removed]

Jean-Porte t1_jdqfhrf wrote on March 26, 2023 at 11:31 AM

#2,368,062

Model averaging sounds stupid but it actually kind of works, so you could try it. But does it make sense? It will not work as well as the individual models.

Co0k1eGal3xy t1_jdqfxcr wrote on March 26, 2023 at 11:36 AM

#2,368,123

Replying to tdgros (#2,367,630)

Most Stable Diffusion UIs DO merge weights by averaging them.
Averaging weights between checkpoints works really well with CLIP fine-tuning, improving performance over both checkpoints on their respective validation sets. https://github.com/mlfoundations/wise-ft
Git Re-Basin found that their method of merging weights works for merging checkpoints with completely different pretraining data and initial weights, and improves accuracy on a mixed validation set over just using one model or the other. https://arxiv.org/abs/2209.04836

You're right that merging the model outputs has higher quality than merging the weights, but OP was asking if it was possible and it is very much possible if the weight tensors have the same shape.
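
As a minimal sketch of what averaging weights means here, assuming both checkpoints expose the same parameter names and shapes: the function below linearly interpolates two checkpoints. Plain Python lists stand in for flattened weight tensors, and `interpolate_weights`, `alpha`, and the toy checkpoints are illustrative names, not taken from any of the linked repositories.

```python
from typing import Dict, List

Checkpoint = Dict[str, List[float]]  # parameter name -> flattened weights


def interpolate_weights(a: Checkpoint, b: Checkpoint, alpha: float = 0.5) -> Checkpoint:
    """Return alpha * a + (1 - alpha) * b, parameter by parameter."""
    if a.keys() != b.keys():
        raise ValueError("checkpoints must have identical parameter names")
    merged: Checkpoint = {}
    for name in a:
        if len(a[name]) != len(b[name]):
            raise ValueError(f"shape mismatch for {name!r}")
        merged[name] = [alpha * x + (1.0 - alpha) * y
                        for x, y in zip(a[name], b[name])]
    return merged


# Toy checkpoints with hypothetical values.
ckpt_a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
ckpt_b = {"layer.weight": [3.0, 6.0], "layer.bias": [1.0]}
merged = interpolate_weights(ckpt_a, ckpt_b, alpha=0.5)
```

With `alpha=0.5` this is a plain average; other values weight one checkpoint more heavily. With a deep learning framework, the same loop would presumably run over the two models' parameter dictionaries instead of lists.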

Co0k1eGal3xy t1_jdqgwlh wrote on March 26, 2023 at 11:48 AM

#2,368,250

BioBERT base and LegalBERT use the same architecture, so using a technique like Git Re-Basin would improve performance over using just one model or the other. However, if you want to merge the models and get the best of both, you should retrain on a merged dataset or use model ensembles instead (i.e., load and run both models and intelligently pick which model to listen to for which type of data).

You cannot (easily) merge BioBERT large, since that checkpoint uses a custom vocabulary, but BioBERT base looks perfectly fine.
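
One rough way to "pick which model to listen to" is to run every expert and keep the most confident prediction. The sketch below assumes each model returns class probabilities; `pick_most_confident` and the two stand-in models are hypothetical, and maximum probability is only a crude routing signal.

```python
from typing import Callable, List, Sequence, Tuple

Classifier = Callable[[str], Sequence[float]]  # text -> class probabilities


def pick_most_confident(text: str,
                        experts: List[Tuple[str, Classifier]]) -> Tuple[str, int]:
    """Run every expert and keep the prediction with the highest max probability."""
    best_name, best_label, best_conf = "", -1, float("-inf")
    for name, model in experts:
        probs = list(model(text))
        conf = max(probs)
        if conf > best_conf:
            best_name, best_label, best_conf = name, probs.index(conf), conf
    return best_name, best_label


# Stand-in models (hypothetical): real code would wrap the two BERT checkpoints.
def bio_model(text: str) -> List[float]:
    return [0.9, 0.1] if "protein" in text else [0.5, 0.5]


def legal_model(text: str) -> List[float]:
    return [0.2, 0.8] if "contract" in text else [0.5, 0.5]


experts = [("bio", bio_model), ("legal", legal_model)]
choice = pick_most_confident("the contract was void", experts)
```

Note that every expert is still run on each input, so the inference cost grows with the number of models.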

tdgros t1_jdqjc8q wrote on March 26, 2023 at 12:15 PM

#2,368,597

Replying to Co0k1eGal3xy (#2,368,123)

There's also weight averaging in ESRGAN, which I knew about, but it always irked me. The permutation argument from your third point is the usual reason I invoke on this subject, and the paper does show why it's not as simple as just blending weights! The same reasoning also shows why blending successive checkpoints isn't like blending random networks.
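
The permutation argument can be illustrated with a toy network. This is a sketch under simplifying assumptions (one input, two ReLU hidden units, no biases), with hypothetical helper names: reordering hidden units leaves the function unchanged, yet naively averaging the original with its reordered copy destroys it, while averaging after undoing the permutation does not.

```python
from typing import List, Tuple

# A one-input MLP with ReLU hidden units: f(x) = sum_i v[i] * relu(w[i] * x)
Mlp = Tuple[List[float], List[float]]  # (hidden weights w, output weights v)


def forward(model: Mlp, x: float) -> float:
    w, v = model
    return sum(vi * max(0.0, wi * x) for wi, vi in zip(w, v))


def permute(model: Mlp, order: List[int]) -> Mlp:
    """Reorder hidden units; the computed function is unchanged."""
    w, v = model
    return [w[i] for i in order], [v[i] for i in order]


def average(a: Mlp, b: Mlp) -> Mlp:
    """Element-wise mean of two models' weights."""
    return ([(x + y) / 2 for x, y in zip(a[0], b[0])],
            [(x + y) / 2 for x, y in zip(a[1], b[1])])


model = ([1.0, -1.0], [1.0, 2.0])
swapped = permute(model, [1, 0])   # same function, different weight layout
naive = average(model, swapped)    # hidden weights cancel to zero
aligned = average(model, permute(swapped, [1, 0]))  # undo the permutation first
```

Here the aligning permutation is known by construction. For independently trained networks it has to be searched for, which is the problem the Git Re-Basin paper addresses.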

_Arsenie_Boca_ t1_jdqy1n8 wrote on March 26, 2023 at 2:28 PM

#2,370,681

Replying to Co0k1eGal3xy (#2,368,123)

Merging model outputs also means you have to run both models. I think the best option is to merge the weights and recover performance using datasets from both domains and distillation from the respective expert model.

locomoto00 t1_jdr34oo wrote on March 26, 2023 at 3:06 PM

#2,371,321

For some models you can simply average the model weights: see https://arxiv.org/pdf/2208.03306.pdf

TeH_Venom t1_je21d7u wrote on March 28, 2023 at 8:59 PM

#2,427,999

Not quite merging across model architectures, but it's not impossible to merge different fine-tunes of a model into one.

I personally have a few scripts for a few strategies such as

Average merge;
Diff merge;
Block merging. (link)

I haven't tested diff merging or block merges too much (a friend and I finished adapting SD's block merge to LMs last week), but weighted average merges are a pretty safe way of mixing models.
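
For readers unfamiliar with these names, here is a guess at what a diff merge and a block merge could look like. It is not the scripts mentioned above. The sketch assumes "diff merge" means adding the difference between a fine-tune and its base onto another model, and "block merge" means using a different mixing ratio per group of parameters. Plain lists stand in for tensors, all names are hypothetical, and matching parameter names and shapes are assumed rather than checked.

```python
from typing import Callable, Dict, List

Checkpoint = Dict[str, List[float]]  # parameter name -> flattened weights


def add_difference(target: Checkpoint, tuned: Checkpoint, base: Checkpoint,
                   scale: float = 1.0) -> Checkpoint:
    """target + scale * (tuned - base): transplant what fine-tuning changed."""
    return {
        name: [t + scale * (f - b)
               for t, f, b in zip(target[name], tuned[name], base[name])]
        for name in target
    }


def block_merge(a: Checkpoint, b: Checkpoint,
                ratio_for: Callable[[str], float]) -> Checkpoint:
    """Interpolate a and b with a mixing ratio chosen per parameter name."""
    merged: Checkpoint = {}
    for name in a:
        r = ratio_for(name)  # r = 1.0 keeps a, r = 0.0 keeps b
        merged[name] = [r * x + (1.0 - r) * y for x, y in zip(a[name], b[name])]
    return merged


# Toy base model and two fine-tunes that each changed a different block.
base = {"encoder.0.w": [1.0, 1.0], "encoder.1.w": [2.0]}
tune_a = {"encoder.0.w": [1.5, 1.0], "encoder.1.w": [2.0]}
tune_b = {"encoder.0.w": [1.0, 1.0], "encoder.1.w": [4.0]}

diff_merged = add_difference(tune_a, tune_b, base)
blocks = block_merge(tune_a, tune_b,
                     lambda n: 1.0 if n.startswith("encoder.0") else 0.0)
```

In this toy case both strategies keep the change each fine-tune made, because the two fine-tunes touched different parameters. Real fine-tunes usually overlap, so the results would differ.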