Submitted by _Arsenie_Boca_ t3_10vwm8k in MachineLearning

I am looking for papers that inject information into LMs directly via embeddings (i.e., without formatting the information as text). I find it notoriously hard to search for these papers because they could come from many different domains, so I thought asking here might be a good way to reach people across those domains.

Some examples I already found are from the domain of knowledge-graph-augmented LMs: ERNIE (https://arxiv.org/abs/1904.09223) and K-BERT (https://arxiv.org/abs/1909.07606).

Prefix Tuning / Prompt Tuning are also somewhat similar to the idea, but they don't depend on any external information.
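
To make the distinction concrete, here is a minimal PyTorch sketch of what I mean (the module names are mine, not from any of the papers above): prompt tuning learns a fixed prefix, whereas the injection I am after maps per-example external embeddings into the LM's embedding space.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Prompt/prefix tuning: a fixed set of learnable prefix embeddings,
    independent of any external information."""
    def __init__(self, n_prefix: int, d_model: int):
        super().__init__()
        self.prefix = nn.Parameter(torch.randn(n_prefix, d_model) * 0.02)

    def forward(self, token_embeds):
        # token_embeds: (batch, seq_len, d_model)
        batch = token_embeds.size(0)
        prefix = self.prefix.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prefix, token_embeds], dim=1)

class ExternalInjection(nn.Module):
    """What I am looking for: external embeddings (e.g. KG entity or image
    vectors) are aligned to the LM's embedding space and prepended per example."""
    def __init__(self, d_external: int, d_model: int):
        super().__init__()
        self.align = nn.Linear(d_external, d_model)  # alignment layer

    def forward(self, token_embeds, external_embeds):
        # external_embeds: (batch, n_ext, d_external), depends on the input
        injected = self.align(external_embeds)
        return torch.cat([injected, token_embeds], dim=1)
```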

Can you think of other papers that inject additional information into LMs via embeddings?

8

Comments

wittfm t1_j7jvhoc wrote

Maybe this can help: https://www.youtube.com/live/FKsARHV3ZTI. They mention the SetFit method, which seems similar to what you are looking for.

2

wittfm t1_j7jvkoo wrote

They mention it as an alternative to prompt engineering.

1

_Arsenie_Boca_ OP t1_j7jxkxr wrote

Thanks for the answer, but I'm afraid the idea there is quite different. They take embeddings from LMs and fine-tune them, rather than aligning and injecting external embeddings.

1

PassingTumbleweed t1_j7lt1o5 wrote

Any LM with multimodal input? PaLI?

2

_Arsenie_Boca_ OP t1_j7miglb wrote

Thanks, good pointer. I am particularly interested in the different mechanisms by which the embeddings might be integrated into LMs. E.g., in PaLI and SimVLM, the external embeddings (here, image encodings) are simply treated as token embeddings; others use modified attention mechanisms to potentially make better use of the information. Are you aware of any work that directly compares multiple integration mechanisms?
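
To illustrate what I mean by the second family, here is a rough PyTorch sketch of an attention-based fusion block (purely illustrative, not taken from any particular paper): instead of prepending the projected external vectors as extra tokens, the text stream attends to them in a separate block.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative fusion via a modified attention path: hidden states of the
    LM query the external embeddings, and the result is added back through a
    residual connection, leaving the LM's own self-attention untouched."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, hidden_states, external_embeds):
        # hidden_states:   (batch, seq_len, d_model) from an LM layer
        # external_embeds: (batch, n_ext, d_model), already projected
        attended, _ = self.cross_attn(
            query=hidden_states, key=external_embeds, value=external_embeds
        )
        return self.norm(hidden_states + attended)
```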

1

PassingTumbleweed t1_j7mlwls wrote

I'm not aware of any comparison. Maybe it doesn't matter that much?

PaLI feeds embeddings from the Vision Transformer to the LM after a linear projection layer. It allows backpropagation through the ViT's weights so that the image encoding can be learned for the task. The ability to tune the embeddings in an end-to-end fashion might be an important consideration.
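
Something like this, as a minimal sketch (names and shapes are illustrative, not the actual PaLI code): the projection, and optionally the vision encoder, stays in the computation graph, so the LM loss can tune the image embeddings end to end.

```python
import torch
import torch.nn as nn

class VisionToLMBridge(nn.Module):
    """Sketch of the setup described above: image features from a vision
    encoder pass through a linear projection and are handed to the LM as if
    they were token embeddings."""
    def __init__(self, vision_encoder: nn.Module, d_vision: int, d_model: int,
                 tune_encoder: bool = True):
        super().__init__()
        self.vision_encoder = vision_encoder  # assumed to return (batch, n_patches, d_vision)
        self.project = nn.Linear(d_vision, d_model)
        if not tune_encoder:  # freeze to compare against end-to-end tuning
            for p in self.vision_encoder.parameters():
                p.requires_grad = False

    def forward(self, images, token_embeds):
        patches = self.vision_encoder(images)   # (batch, n_patches, d_vision)
        visual_tokens = self.project(patches)   # (batch, n_patches, d_model)
        # Gradients from the LM loss flow back through the projection (and the
        # encoder, unless frozen), so the image encoding is learned for the task.
        return torch.cat([visual_tokens, token_embeds], dim=1)
```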

3

_Arsenie_Boca_ OP t1_j7ommq8 wrote

Yes, seamless joint training is definitely one of the perks. I will keep looking to see if I can find anything about the effectiveness of different injection/fusion mechanisms.

1

edunuke t1_j7nyh34 wrote

I found this one under the keyword "embedding fusion" in LLMs:

https://arxiv.org/abs/2101.12294

It provides an overview of many methods.

And, as others said, anything on multimodal fusion transformers is relevant.

2

dancingnightly t1_j7s355b wrote

In a sense, you can communicate between semantic text embeddings and LMs through this method (it would operate differently from multimodal embeddings): https://www.lesswrong.com/posts/mkbGjzxD8d8XqKHzA/the-singular-value-decompositions-of-transformer-weight

This method, which is really only practical for toy problems right now, would allow you to use semantic embeddings to find what to look for when doing SVD on an (autoregressive) LM. You could make this depend on the input, for example by transforming your embedding into the keys that the process is applied with, thereby affecting the generation of logits. I'm not sure this would behave much differently from altering the logit_bias of tokens, but it would be interesting to hear if it did.
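
Very loosely, the comparison I have in mind looks something like this (a toy PyTorch sketch; the shapes, the scoring heuristic, and the bias scale are all made up, not from the linked post):

```python
import torch

# Toy shapes: W_U is an unembedding-style weight matrix, query_embed is a
# semantic embedding already mapped into the model's hidden space.
d_model, vocab = 512, 32000
W_U = torch.randn(d_model, vocab)
query_embed = torch.randn(d_model)

# SVD route: inspect singular directions of the weight matrix, score them
# against the semantic embedding, and emphasize the output directions that
# the top-scoring singular vectors map to.
U, S, Vh = torch.linalg.svd(W_U, full_matrices=False)
direction_scores = U.T @ query_embed                       # (d_model,)
top = torch.topk(direction_scores.abs(), k=8).indices
token_bias_from_svd = (S[top, None] * Vh[top]).sum(dim=0)  # (vocab,)

# Baseline mentioned above: simply bias the logits of selected tokens, as
# with an API-style logit_bias parameter.
logits = torch.randn(vocab)
biased_logits = logits + 2.0 * torch.sign(token_bias_from_svd)
```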

2