
Hsemar t1_jalp8as wrote

But does FlashAttention help with auto-regressive generation? My understanding was that it avoids materializing the large QK^T attention matrix during training. At inference (one token at a time) with KV caching, this shouldn't be that relevant, right?
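The point about one-token-at-a-time decoding can be sketched with a toy KV-cache attention step (plain NumPy, illustrative names; this is not FlashAttention itself): at each decode step, the single new query only produces a length-t score vector against the cached keys, so the full t×t attention matrix is never formed in the first place.

```python
import numpy as np

def attend(q, K_cache, V_cache):
    """Single-query attention over a KV cache.

    q: (d,) query for the one new token
    K_cache, V_cache: (t, d) keys/values of all previous tokens
    """
    scores = K_cache @ q / np.sqrt(q.shape[0])  # (t,) -- a vector, not a t x t matrix
    w = np.exp(scores - scores.max())           # numerically stable softmax
    w /= w.sum()
    return w @ V_cache                          # (d,) attended output

rng = np.random.default_rng(0)
d = 4
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))

# Decode 3 tokens one at a time, appending each token's key/value to the cache.
for _ in range(3):
    k, v, q = rng.normal(size=(3, d))           # toy per-token projections
    K_cache = np.vstack([K_cache, k])
    V_cache = np.vstack([V_cache, v])
    out = attend(q, K_cache, V_cache)
```

Per step the memory for scores grows only linearly with the cached sequence length, which is why the training-time benefit of not materializing the attention matrix matters much less here.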
