v2thegreat t1_j5s39fb wrote

It really depends on how often you think you'll train with the model.

If it's something that you'll do daily for at least 3 months, then I'd argue you can justify the 4090.

Otherwise, if this is a single model you want to play around with, use an appropriate EC2 instance with GPUs (remember: start with a small instance and upgrade as you need more compute, and remember to turn off your instance when you're not using it).
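For the turn-it-off part, here's a minimal boto3 sketch (the region and instance ID are placeholders, not real values):

```python
import boto3

# Region and instance ID are placeholders -- substitute your own.
ec2 = boto3.client("ec2", region_name="us-east-1")

# Stopping (not terminating) keeps your disk around but stops the
# per-hour compute charges while you're not training.
ec2.stop_instances(InstanceIds=["i-0123456789abcdef0"])

# Start it back up when you're ready to train again:
# ec2.start_instances(InstanceIds=["i-0123456789abcdef0"])
```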

I don't really know what type of data you're playing around with (image, text, or audio, for example), but you should be able to get pretty far without a GPU by doing small-scale experiments and debugging, and then finally using a GPU for the final training run.

You can also use TensorFlow datasets (the tf.data API) to stream data from disk at training time, meaning you won't need to hold all of your files in memory during training and can get away with a fairly modest computer.
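For example, here's a minimal tf.data sketch that streams images from disk (the file pattern and image size are just assumptions for illustration):

```python
import tensorflow as tf

# File pattern is a placeholder -- point it at your own data.
files = tf.data.Dataset.list_files("data/images/*.jpg")

def load_image(path):
    raw = tf.io.read_file(path)               # read one file from disk
    img = tf.io.decode_jpeg(raw, channels=3)  # decode on the fly
    return tf.image.resize(img, [224, 224]) / 255.0

dataset = (
    files
    .map(load_image, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(32)                    # batches are assembled lazily
    .prefetch(tf.data.AUTOTUNE)   # overlap data loading with training
)

# model.fit(dataset, epochs=10)  # consumes batches as they stream in
```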

Good luck!

7

v2thegreat t1_j2oablu wrote

For transformers that's likely a difficult question to answer without experimentation, but I always recommend starting small. It's generally hard enough to go from 0 to 1 without also worrying about scaling things up.

Currently, we're seeing that larger and larger models aren't really slowing down; they continue to become more powerful.

I'd say this deserves its own post rather than a simple question.

Good luck and please respond when you end up solving it!

1

v2thegreat t1_j2o816v wrote

Well, to answer your original question: it depends on what problem you're trying to solve!

In theory, yes, you can work with a large corpus of data using a large language model, but as ChatGPT showed us, a larger model won't always do better; fine-tuning can give better results.

I hope this helps!

1

v2thegreat t1_j2o5v0y wrote

It can, but I want to know why you want to use transformers in the first place. Having the entire context is important to avoid solving the wrong problem, especially one that might get expensive depending on what you're trying to do.

1

v2thegreat t1_j2lpumb wrote

These come under hyperparameter optimization, so you'll definitely need to play around with them, but here are my rules of thumb (take them with a grain of salt!), with a small sketch after the list:

Learning rate: start with a large learning rate (e.g. 1e-3), and if the model overfits, reduce it down toward 1e-6. There's a Stack Overflow post that explains this quite well.

Number of epochs: stop right before your training loss starts diverging from the validation loss. Plot both curves; the point where they diverge is where the overfitting begins.

Batch size: as large as will fit in memory, to speed things up in general.
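Here's a minimal Keras sketch tying those three together (the model, loss, and data are placeholders; the optimizer and callbacks are the point):

```python
import tensorflow as tf

# Placeholder model -- swap in your own architecture.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])

# Start with a largish learning rate (1e-3) and let the callbacks
# below walk it down from there.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="mse",
)

callbacks = [
    # Step the learning rate down toward ~1e-6 when validation
    # loss stops improving.
    tf.keras.callbacks.ReduceLROnPlateau(
        monitor="val_loss", factor=0.1, patience=3, min_lr=1e-6
    ),
    # Stop training where training and validation loss diverge,
    # keeping the best weights seen so far.
    tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=5, restore_best_weights=True
    ),
]

# x_train / y_train are placeholders for your own data.
# model.fit(x_train, y_train, validation_split=0.2,
#           batch_size=256, epochs=100, callbacks=callbacks)
```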

3