
avocadoughnut t1_j7yaq8w wrote

I'm considering a higher-level idea. There's no way that transformers are the end-all-be-all model architecture. By identifying the mechanisms that large models learn, I'm hoping a better architecture can be found, one that reduces the total number of multiplications and training samples needed. It's like feature engineering.
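To make the "fewer multiplications" point concrete, here's a minimal sketch of one way it could play out (my own illustration, not something from the comment; all sizes are made up): if interpretability showed that a learned dense map is effectively low-rank, replacing it with two thin factors would cut the per-token multiply count.

```python
# Hypothetical sketch: a d x d dense layer vs. a rank-r factorization.
# If the mechanism a model learned only needs rank r << d, the factored
# form does the same job with far fewer multiplications.
import numpy as np

d, r = 4096, 64                     # hidden size and assumed effective rank (made up)
x = np.random.randn(d)

W = np.random.randn(d, d)           # dense map: d * d multiplies per token
U = np.random.randn(d, r)           # low-rank factors: 2 * d * r multiplies per token
V = np.random.randn(r, d)

y_dense = W @ x
y_lowrank = U @ (V @ x)

print(f"dense multiplies:    {d * d:,}")        # 16,777,216
print(f"low-rank multiplies: {2 * d * r:,}")    # 524,288 (~32x fewer)
```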
