
lucidraisin t1_jcl2rkh wrote

yea, the literature is scant and all over the place in the efficient-attention field. in this paper, i believe they claim it is the query-key dimension (d_dot), but i think it should depend on the number of heads too. i don't know of any other papers that explore this topic. i just don't want people to be surprised if they fine-tune to greater context lengths and things don't work as well as gpt-4
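
for reference, here is a minimal sketch (plain pytorch, not from the paper, all names and default sizes illustrative) of where the two quantities live in a standard multi-head attention block: d_head is the per-head query-key dimension (the d_dot above), while n_heads multiplies how many of those dot products the layer computes, which is why capacity arguments could plausibly involve either one.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Standard multi-head attention, shown only to locate d_head vs. n_heads."""

    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads          # per-head query-key dimension ("d_dot")
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, seq, d_head): each head runs its own
        # d_head-dimensional dot product, and the layer has n_heads of them
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
                   for z in (q, k, v))
        attn = (q @ k.transpose(-2, -1)) / self.d_head ** 0.5
        out = attn.softmax(dim=-1) @ v
        return self.out(out.transpose(1, 2).reshape(b, t, -1))
```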

2

super_deap OP t1_jcl3whl wrote

That is understandable. I am working under that assumption as well. (I have failed too many such experiments to have blind faith 🙈)

2

lucidraisin t1_jcl6ecd wrote

no worries, thanks for running the experiments and sharing your results 🙏

2