
cfoster0 t1_j4alveu wrote

FWIW, in a certain sense this goes against the design philosophy of transformers, which is to jointly compute all representations within a layer at once, maximizing the degree of parallelism on GPUs and other accelerators.
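To make that concrete, here's a minimal NumPy sketch (dimensions and weight names are illustrative, not from any particular model) of single-head self-attention: every token's representation comes out of the same batched matmuls, rather than being computed one position at a time.

```python
import numpy as np

# Toy single-head self-attention. The point: all seq_len positions are
# updated jointly by a few matrix multiplies, which is what maps well
# onto GPU/accelerator parallelism.
rng = np.random.default_rng(0)
seq_len, d = 4, 8
x = rng.standard_normal((seq_len, d))            # all token embeddings at once
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

Q, K, V = x @ Wq, x @ Wk, x @ Wv                 # one matmul covers every position
scores = Q @ K.T / np.sqrt(d)                    # (seq_len, seq_len) attention logits
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax over keys
out = weights @ V                                # every row computed in parallel
print(out.shape)                                 # (4, 8)
```

Nothing in the layer forces position i to wait for position i-1; any scheme that computes representations sequentially within a layer gives up exactly this parallelism.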

1