harharveryfunny t1_jckltrp wrote

> I think it should be possible to replicate even GPT-4 with open-source tools: something like BLOOM + FlashAttention, fine-tuned on 32k tokens.

So you mean build a model with a 32K attention window, but somehow initialize it with weights from BLOOM (2K window) and then fine-tune? Are you aware of any attempts to do this sort of thing?

10

super_deap OP t1_jckpoey wrote

I think one just needs to duplicate the positional embeddings and we are good to go. Of course, this needs more comprehensive empirical analysis, and I have not come across any such attempts. I did a basic experiment and it seems to work, but we will have to wait and see.
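
Roughly, the duplication would look like this (a minimal sketch, assuming a GPT-2-style model with a learned absolute position-embedding table; the helper name `extend_positional_embeddings` and the dimensions are just for illustration, and the exact mechanism would depend on the architecture):

```python
import torch

def extend_positional_embeddings(old_emb: torch.nn.Embedding,
                                 new_max_positions: int) -> torch.nn.Embedding:
    """Tile a learned positional-embedding table to cover a longer context.

    Hypothetical helper: assumes learned absolute position embeddings of
    shape (old_max_positions, hidden_dim). The old table is repeated
    end-to-end to fill the longer table, and the result is fine-tuned
    on long sequences afterwards.
    """
    old_max, dim = old_emb.weight.shape
    assert new_max_positions % old_max == 0, "new length must be a multiple of the old one"
    new_emb = torch.nn.Embedding(new_max_positions, dim)
    with torch.no_grad():
        # Duplicate the original table (e.g. 2K positions, 16x) to reach 32K.
        new_emb.weight.copy_(old_emb.weight.repeat(new_max_positions // old_max, 1))
    return new_emb

# Example: grow a 2,048-position table to 32,768 positions before fine-tuning.
old = torch.nn.Embedding(2048, 1024)
extended = extend_positional_embeddings(old, 32768)
print(extended.weight.shape)  # torch.Size([32768, 1024])
```

The point of repeating the table rather than randomly initializing the new positions is that every position starts from a weight the model has already seen, which is presumably why a short fine-tune could be enough.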

9