
suflaj t1_j63bf1q wrote

Well, for starters, it would probably perform worse because of all the redundant features, and it would be much slower.

Remember that the embedding layer carries a lot of overhead, since we're talking about a V × d matrix. For a vocabulary of 250k and an embedding dimension of 768, e.g., that's 192M parameters just for the embedding layer. Maybe you can save some space with a sparse embedder, but find me a free implementation of sparse layers that works as well as dense ones. Other than that, those 192M parameters take up, before any compression techniques, roughly 768 MB in float32. And that's just the weights in memory; the gradient, unless sparsified, is another 768 MB PER BATCH.
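A quick back-of-the-envelope check of those numbers in PyTorch (just a sketch, assuming float32 at 4 bytes per parameter, with the example vocab size and dim from above):

```python
# Sketch only: parameter count and memory of a dense embedding layer.
# V and d are the example numbers from the comment, not any specific model.
import torch.nn as nn

V, d = 250_000, 768                      # vocabulary size, embedding dimension
emb = nn.Embedding(V, d)                 # dense V x d lookup table

n_params = sum(p.numel() for p in emb.parameters())
print(n_params)                          # 192000000 parameters
print(n_params * 4 / 1e6, "MB")          # ~768 MB for the weights alone in float32

# The gradient has the same shape, so a plain dense backward pass materializes
# another ~768 MB per batch. nn.Embedding(V, d, sparse=True) instead produces
# gradient rows only for the tokens that actually appear in the batch.
```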

This is without mentioning that you would likely need to increase the embedding dimension to account for the vocabulary being roughly 8 times bigger.
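To put a rough number on that: if the dimension had to grow from 768 to, say, 1024 (a purely hypothetical figure), the table alone would be 250k × 1024 = 256M parameters, or about 1 GB in float32, before counting the gradient.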
