Submitted by super_deap t3_11tmpc5 in MachineLearning
harharveryfunny t1_jckltrp wrote
> I think it should be possible to replicate even GPT-4 with open-source tools: something like BLOOM + FlashAttention, fine-tuned on 32k tokens.
So you mean build a model with a 32K attention window, but somehow initialize it with weights from BLOOM (2K window) and then finetune? Are you aware of any attempts to do this sort of thing?
super_deap OP t1_jckpoey wrote
I think one just needs to duplicate the positional embeddings and we are good to go. Of course, this needs more comprehensive empirical analysis, and I have not come across any such attempts. I did a basic experiment and it seems to work, but we will have to wait and see.
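Concretely, "duplicating the positional embeddings" could look something like the minimal PyTorch sketch below. It assumes a GPT-style checkpoint with a *learned* position table (note that BLOOM itself uses ALiBi rather than learned positional embeddings, so there this step is different); the `wpe` attribute name, shapes, and tiling strategy are all illustrative, not a tested recipe:

```python
import torch

def extend_positional_embeddings(old_pos_emb: torch.Tensor, new_len: int = 32768) -> torch.Tensor:
    """Tile a pretrained positional-embedding table (old_len, dim) out to
    new_len positions by repeating it, as an initialization for finetuning."""
    old_len, dim = old_pos_emb.shape
    n_copies = -(-new_len // old_len)  # ceiling division
    # Repeat the table along the position axis, then truncate to new_len.
    return old_pos_emb.repeat(n_copies, 1)[:new_len].clone()

# Hypothetical usage: patch the extended table into the model before finetuning.
# model.transformer.wpe.weight.data = extend_positional_embeddings(
#     model.transformer.wpe.weight.data, new_len=32768)
```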
Sad-Comedian-711 t1_jcqgv1x wrote
This approach has been shown to work. Longformer even provided a script that did this for you: https://github.com/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb
I think for FlashAttention you do not want to use Longformer's attention pattern, though; you want something like BigBird's block-sparse attention with specific block sizes.
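For reference, BigBird's block-sparse attention is configurable out of the box in Hugging Face `transformers`; the block sizes below are illustrative, not tuned, and wiring this up to FlashAttention would be a separate exercise:

```python
from transformers import BigBirdConfig, BigBirdModel

config = BigBirdConfig(
    attention_type="block_sparse",  # block-sparse rather than full attention
    block_size=64,                  # tokens per attention block (illustrative)
    num_random_blocks=3,            # random blocks each query block attends to
    max_position_embeddings=4096,
)
model = BigBirdModel(config)
```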
BungaBunga6767 t1_jcl6vf9 wrote
Longformer does it, but not with FlashAttention.