_Arsenie_Boca_

_Arsenie_Boca_ t1_ityccjh wrote

I don't think there is a fundamental difference between CV and NLP. However, we expect language models to be much more generalist than any vision model (have you ever seen a vision model that performs well on both discriminative and generative tasks across domains without finetuning?). I believe this is where scale is the enabling factor.

1

_Arsenie_Boca_ t1_isnyhtb wrote

You should test whether this happens only during training or also when evaluating on the train set afterwards. As others have mentioned, dropout could be a factor. But you should also consider that the train accuracy is computed during the training process, while the model is still learning, i.e. the final weights are not reflected in the averaged train accuracy.
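For example, here is a minimal PyTorch-style sketch of re-scoring the train set after training is finished (the `model` / `train_loader` names are hypothetical, assuming a standard classification setup):

```python
import torch

def train_set_accuracy(model, train_loader, device="cpu"):
    model.eval()                      # disables dropout, puts batchnorm in eval mode
    correct, total = 0, 0
    with torch.no_grad():             # evaluation only, no gradients needed
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            preds = model(x).argmax(dim=-1)
            correct += (preds == y).sum().item()
            total += y.numel()
    return correct / total
```

If this number matches your validation accuracy, the gap you saw was just the "still learning" effect; if it doesn't, look at dropout or other train/eval differences.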

1

_Arsenie_Boca_ t1_isbgyhs wrote

If the hardware is optimized for it, there is probably not a huge difference in speed, but the performance gain is probably negligible too.

The real reason people don't use 64-bit is mainly memory usage. When you train a large model, you can fit much larger 32-bit/16-bit batches into memory and thereby speed up training.
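As an illustration, this is a sketch of standard PyTorch mixed-precision training, one common way to exploit 16-bit to fit bigger batches (it assumes `model`, `optimizer`, `loss_fn` and `train_loader` already exist; they are placeholders, not from the comment above):

```python
import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()                # loss scaling keeps fp16 gradients from underflowing

for x, y in train_loader:            # hypothetical data loader
    optimizer.zero_grad()
    with autocast():                 # forward pass runs in fp16 where it is safe
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()    # backward on the scaled loss
    scaler.step(optimizer)
    scaler.update()
```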

3

_Arsenie_Boca_ t1_irwzk3j wrote

The point is that you cannot confirm the superiority of an architecture (or any other component) when you change multiple things at once. And yes, it does matter where an improvement comes from; isolating it is the only scientifically sound way to improve. Otherwise we might as well try random things until we find something that works.

To come back to LSTMs vs. Transformers: I'm not saying LSTMs are better or anything. I'm just saying that if LSTMs had received the amount of engineering attention that went into making transformers better and faster, who knows whether they might be similarly successful.

8

_Arsenie_Boca_ t1_irw1ti7 wrote

No, I believe you are right to think that an arbitrary image-captioning model cannot accurately generate prompts that actually lead to a very similar image. After all, prompts are very model-dependent.

Maybe you could use something similar to prompt tuning: start from a number of randomly initialized prompt embeddings, generate an image, and backprop the distance between your target image and the generated image. After convergence, you can perform a nearest-neighbor search to find the words closest to the learned embeddings.

Not sure if this has been done, but I think it should work reasonably well.
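To make the idea concrete, here is a rough, hypothetical sketch. `generator` stands in for a differentiable text-to-image model conditioned directly on prompt embeddings, `token_embeddings` for its vocabulary embedding matrix, and `target_image` for the image to invert; none of these names refer to an actual library API.

```python
import torch
import torch.nn.functional as F

# Assumed to exist: generator(prompt_embeds) -> image (differentiable),
# token_embeddings [V, D], target_image with the same shape as the output.
n_tokens = 8
dim = token_embeddings.shape[1]
prompt_embeds = torch.randn(n_tokens, dim, requires_grad=True)
optimizer = torch.optim.Adam([prompt_embeds], lr=1e-2)

for step in range(1000):
    optimizer.zero_grad()
    image = generator(prompt_embeds)          # differentiable generation from soft prompts
    loss = F.mse_loss(image, target_image)    # distance to the target image
    loss.backward()
    optimizer.step()

# After convergence, snap each learned embedding to its nearest vocabulary token.
with torch.no_grad():
    dists = torch.cdist(prompt_embeds, token_embeddings)   # [n_tokens, V]
    nearest_ids = dists.argmin(dim=-1)                      # closest word per prompt slot
```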

19

_Arsenie_Boca_ t1_irvjdtn wrote

I don't have the papers on hand that investigate this, but here are two things that don't make me proud of being part of this field.

Are transformers really architecturally better than LSTMs, or is their success mainly due to the huge amount of compute and data we throw at them? More generally, papers tend to make many changes to a system and credit the improvement to the component they are most proud of, without a fair comparison.

Closed-source models like GPT-3 don't make their training dataset public. People evaluate their performance on benchmarks, but nobody can say for sure whether the benchmark data was in the training data. ML used to be very cautious about data leakage, but this is simply ignored in most cases when it comes to these models.

91

_Arsenie_Boca_ t1_irrcmwh wrote

If all you want to see is the two curves close to each other, I guess you could size up the model so that it overfits terribly. But is that really desirable?

If my assumption that you predict the whole graph autoregressively is correct, then I believe it works just fine. You should check the forecast horizon and think about what it is you want to achieve in the end.

1

_Arsenie_Boca_ t1_irr9bb4 wrote

No, this is not a modelling issue. It actually isn't a real issue at all. Predicting a very long trajectory is simply very hard: at each timestep a slight error occurs, and those errors compound, even if the error per timestep is marginal. Imagine being asked to predict a certain stock price. Given some expertise and current information, you might be able to do it for tomorrow, but can you do it precisely for a year from now?
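A toy simulation of that compounding (purely illustrative, nothing to do with your actual model): roughly 1% noise per predicted step already snowballs over a long autoregressive rollout.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value, pred = 1.0, 1.0

for t in range(250):
    true_value *= 1.01                        # ground-truth dynamics: 1% growth per step
    step_noise = 1 + rng.normal(0, 0.01)      # ~1% prediction error at this step
    pred = pred * 1.01 * step_noise           # prediction feeds on its own previous output

print(abs(pred - true_value) / true_value)    # relative error after 250 steps (~10-20%)
```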

1

_Arsenie_Boca_ t1_iqze3a1 wrote

As many others have mentioned, the decision boundaries of piecewise-linear models actually end up quite smooth, given a sufficient number of layers.

But to get to the core of your question: why would you prefer many simple neurons over a few smart ones? I believe there is a relatively simple explanation for why the former is better. Having more complex neurons would mean that the computational complexity goes up while the number of parameters stays the same; i.e., for the same compute, you can train bigger models (in parameter count) if the neurons are simple. A high number of parameters is important for optimization, as the extra dimensions can help escape local minima. This hasn't been fully explained, but it is part of the reason why pruning works so well: we wouldn't need that many parameters to represent a good fit, but a good fit is much easier to find in high dimensions, from where we can prune down to simpler models (only 5% of the parameters with almost the same performance).
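As a loose illustration of that last point, here is a sketch using PyTorch's built-in magnitude pruning; the layer sizes and the 95% figure are arbitrary choices for the example, not from any specific paper.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# A deliberately over-parameterized MLP, then magnitude-prune 95% of its weights.
model = nn.Sequential(nn.Linear(784, 4096), nn.ReLU(), nn.Linear(4096, 10))

for module in model:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.95)  # zero out smallest 95%
        prune.remove(module, "weight")                             # make the sparsity permanent

# Report how many weight entries were removed; in practice you would fine-tune
# afterwards and compare accuracy against the dense model.
zeros = sum((p == 0).sum().item() for p in model.parameters() if p.dim() > 1)
total = sum(p.numel() for p in model.parameters() if p.dim() > 1)
print(f"Pruned {zeros / total:.0%} of weight parameters")
```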

1