Submitted by Avelina9X t3_109yuvi in MachineLearning

I'm a machine learning PhD student and I'm doing research on LMs and how to reduce their memory footprint.

One idea I've been toying with is Vector Quantized LMs. I'm not talking about quantization as a technique to speed up compute with int8 activations and the like, but rather VQ using a learned codebook.
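To be concrete, by "codebook" I mean VQ-VAE-style nearest-neighbour quantization with a straight-through gradient. A minimal PyTorch sketch of what I have in mind (the codebook size and dimensions are arbitrary placeholders):

```python
import torch
import torch.nn as nn

class Codebook(nn.Module):
    """Nearest-neighbour vector quantizer with a straight-through gradient."""
    def __init__(self, num_codes: int = 512, dim: int = 256):
        super().__init__()
        self.codes = nn.Embedding(num_codes, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim) -> distance to every codebook entry
        codebook = self.codes.weight.unsqueeze(0).expand(x.size(0), -1, -1)
        d = torch.cdist(x, codebook)                 # (batch, seq, num_codes)
        idx = d.argmin(dim=-1)                       # nearest code per token
        q = self.codes(idx)                          # quantized token vectors
        # straight-through estimator: values come from q, gradients flow to x
        return x + (q - x).detach()
```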

The idea is based on a uni-directional RNN that reconstructs the source sequence after quantization. Unlike MLM, where the corruption comes from masking and replacing tokens, we instead quantize the token vectors and try to predict the original token from the quantized version of that token and the unquantized short/long-term memory states produced at the previous timestep.
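Roughly, the training setup I'm imagining looks something like the sketch below (reusing the Codebook sketch from above, with a plain GRU standing in for the short/long-term memory; all names and sizes are placeholders, not a finished design):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VQReconstructionLM(nn.Module):
    """Reconstruct each token from its quantized embedding plus the
    unquantized recurrent state carried over from the previous step."""
    def __init__(self, vocab_size: int, dim: int = 256, num_codes: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.quantizer = Codebook(num_codes, dim)   # Codebook sketch from above
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens: torch.Tensor):
        x = self.embed(tokens)                      # (batch, seq, dim)
        q = self.quantizer(x)                       # quantized token vectors
        h, _ = self.rnn(q)                          # memory states stay unquantized
        logits = self.head(h)                       # predict the original token
        # per-token reconstruction loss = how hard each token was to recover
        loss = F.cross_entropy(
            logits.transpose(1, 2), tokens, reduction="none"
        )                                           # (batch, seq)
        return logits, loss
```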

The reason I'm interested in such a convoluted idea is to effectively create a metric for the entropy of tokens in a sequence: if the VQ-LM can reconstruct the correct token with high likelihood then that token is unimportant, but if the VQ-LM fails to predict a token it is likely that the token is important, e.g. a rare word that carries higher entropy within the sequence.

The motivation behind wanting to measure such a phenomenon is to use it to guide the memory of a transformer: models like the Transformer-XL operate on longer sequences by keeping memory around for keys and values, and the Compressive Transformer takes it a step further by compressing older tokens... Well... what if we used the reconstruction loss from the VQ-LM along with an 'age' metric to guide the memory bank of such a transformer architecture, discarding easily predicted tokens early while keeping higher-entropy tokens around for longer?
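As a very hand-wavy example of the eviction policy I'm picturing, assuming we already have a per-token reconstruction loss from the VQ-LM and a memory bank of cached keys/values (the scoring rule and `age_weight` are just placeholders):

```python
import torch

def prune_memory(keys, values, recon_loss, age,
                 keep: int, age_weight: float = 0.01):
    """Keep the `keep` memory slots with the highest retention score.

    keys, values: (mem_len, dim) cached states from previous segments
    recon_loss:   (mem_len,) per-token reconstruction loss from the VQ-LM
    age:          (mem_len,) how many segments ago each token was seen
    """
    # high reconstruction loss (high entropy) -> worth keeping;
    # older tokens are gradually penalised regardless of entropy
    score = recon_loss - age_weight * age
    idx = score.topk(keep).indices.sort().values    # preserve temporal order
    return keys[idx], values[idx], recon_loss[idx], age[idx]
```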

Has anyone considered such a system before? I've done a lot of searching and come up blank so far.

7

Comments


dojoteef t1_j41m2hd wrote

4

Avelina9X OP t1_j4vn244 wrote

Thank you for the resource! I'll have a deep dive into this!

1

dojoteef t1_j4vnho4 wrote

Note that the authors have an earlier paper introducing discrete latents for NLP, and there are a number of follow-up papers to this one as well. So if you're interested in a deep dive, you should investigate the citation graph of this paper. Good luck!

1

gunshoes t1_j42zyap wrote

Sounds like HuBERT and other MLMs used for ASR pretraining. Look for seq2seq work in the world of TTS and ASR.

3

Avelina9X OP t1_j4vn6su wrote

Ahhhh! So it seems like this is something that's been explored in the slightly parallel domain of TTS and ASR rather than in pure text LMs. Thanks for pointing me in this direction!

1

gunshoes t1_j4voru9 wrote

Trade secret for ML: your problem is always an alteration of a preexisting CV/speech/NLP framework.

1

C0hentheBarbarian t1_j468jz4 wrote

It's pretty old in the context of deep learning, but OpenAI Jukebox uses VQ codebooks for audio if I remember correctly.

1

Avelina9X OP t1_j4vngcs wrote

Ahah! It seems like the reason I couldn't find anything is that I was being too specific about text sequence models and disregarding the audio domain. Thank you!

1