Submitted by super_deap t3_11tmpc5 in MachineLearning
lucidraisin t1_jcl2rkh wrote
Reply to comment by super_deap in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
Yeah, the literature is scant and all over the place in the efficient-attention field. In this paper, I believe they claim the relevant factor is the query-key dimension (d_dot), but I think it should depend on the number of heads too. I don't know of any other papers that explore this topic. I just don't want people to be surprised if they fine-tune to greater context lengths and things don't work as well as GPT-4.
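A minimal sketch of the dimension being discussed, assuming a standard multi-head setup: for a fixed model width, the per-head query-key dimension (d_dot) equals embed_dim / n_heads, so it is tied to the head count. The embed_dim, n_heads, and seq_len values below are illustrative assumptions, not numbers from the thread; the call itself is PyTorch 2.0's native scaled_dot_product_attention from the linked post.

```python
import torch
import torch.nn.functional as F

# Illustrative numbers (assumptions): the per-head query-key
# dimension d_dot = embed_dim / n_heads shrinks as heads are added.
embed_dim, n_heads, seq_len, batch = 1024, 16, 4096, 1
head_dim = embed_dim // n_heads  # d_dot = 64 here

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Random tensors standing in for projected queries/keys/values,
# shaped (batch, n_heads, seq_len, head_dim).
q = torch.randn(batch, n_heads, seq_len, head_dim, device=device, dtype=dtype)
k = torch.randn_like(q)
v = torch.randn_like(q)

# PyTorch 2.0's native kernel; on CUDA with fp16 inputs it can dispatch
# to FlashAttention, which is what makes long (e.g. 32k) contexts
# fit in memory during fine-tuning.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # (1, 16, 4096, 64)
```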
super_deap OP t1_jcl3whl wrote
That is understandable. I am working with that assumption as well. (I have failed too many such experiments to have blind faith 🙈)
lucidraisin t1_jcl6ecd wrote
no worries, thanks for running the experiments and sharing your results 🙏