Submitted by Zealousideal-Copy463 t3_10khmxo in deeplearning

So I've been wondering for a while whether I should get a 4090 or just stick with AWS or something.

For context: I work at a tech company and we use TensorFlow/PyTorch, so I have decent experience with both. I've mostly used AWS to train and test things. The problem is that, in my experience, moving data from S3 to SageMaker is a pain in the ass, and I've only worked with 1-2 GB of data, mostly tabular.

Now I want to test a few things myself and train some image models. I've been playing with some models and have about 100 GB of data that I want to fit a model on. I've tried Colab with the data in Google Drive, but Drive gets confused when there are lots of files, so it's really annoying.

Any suggestions on how to do this in the cloud? I also have some experience with GCP and Azure, but AWS is the provider I know best. Can I do this without suffering too much while moving data around, or should I just buy a 4090 and train stuff locally?

10

Comments


agentfuzzy999 t1_j5qy82t wrote

“Should I just buy a 4090”

Ok Jeff Bezos

A 4090's clock speed is going to be faster than the T4s in comparable instances, plus wayyyyyyy more CUDA cores. Training will be significantly faster, if you can fit the model on the 4090. If you can "business expense" a 4090 for your own machine, good lord, do that.

14

v2thegreat t1_j5s39fb wrote

It really depends on how often you think you'll train with the model.

If it's something that you'll do daily for at least 3 months, then I'd argue you can justify the 4090.

Otherwise, if this is a single model you want to play around with, then use an appropriate EC2 instance with GPUs (remember: start with a small instance and upgrade as you need more compute, and remember to turn off your instance when you're not using it).
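
Something like this is enough to script the stop/resize/start cycle with boto3 (just a sketch; the region, instance ID, and instance type below are placeholders):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region
instance_id = "i-0123456789abcdef0"                 # placeholder instance ID

# Stop the instance when you're done so you only pay for the EBS volume
ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

# Resize to a bigger GPU instance type before a heavy run (instance must be stopped)
ec2.modify_instance_attribute(
    InstanceId=instance_id,
    InstanceType={"Value": "g5.2xlarge"},  # pick whatever GPU type you actually need
)

# Start it back up for the training run
ec2.start_instances(InstanceIds=[instance_id])
```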

I don't really know what type of data you're playing around with (image, text, or audio, for example), but you should be able to get pretty far without a GPU by doing small-scale experiments and debugging, and then only use a GPU for the final training.

You can also use TensorFlow datasets (tf.data), which can stream data from disk during training, meaning you won't need to hold all of your files in memory and can get away with a fairly decent computer.
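
For example, a minimal tf.data pipeline that streams images off disk one batch at a time (the path and image size are placeholders):

```python
import tensorflow as tf

IMG_SIZE = (224, 224)  # placeholder size
BATCH_SIZE = 32

def load_image(path):
    # Read and decode one file at a time; nothing is loaded into memory up front
    image = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    return tf.image.resize(image, IMG_SIZE) / 255.0

# list_files globs lazily; map/batch/prefetch stream from disk during training
dataset = (
    tf.data.Dataset.list_files("data/train/*.jpg")  # placeholder path
    .map(load_image, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(BATCH_SIZE)
    .prefetch(tf.data.AUTOTUNE)
)
```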

Good luck!

7

FuB4R32 t1_j5tdt1c wrote

We use Google Cloud buckets + TensorFlow - it works well since you can always point a VM at a cloud bucket (e.g. TFRecords) and it just has access to the data. I know you can do something similar in JAX; I haven't tried PyTorch. It's the same in a Colab notebook. I'm not sure you can point a local machine at a cloud location, though, but as others are saying the 4090 might not be the best use of money anyway (e.g. you can use a TPU in a Colab notebook to get similar performance).
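
Roughly what that looks like, assuming the TFRecords already live in a bucket (the bucket name and feature schema below are placeholders):

```python
import tensorflow as tf

# TensorFlow reads gs:// paths directly, so the VM/Colab runtime never "downloads" the data
files = tf.io.gfile.glob("gs://my-bucket/tfrecords/train-*.tfrecord")  # placeholder bucket

feature_spec = {
    "image": tf.io.FixedLenFeature([], tf.string),  # placeholder schema
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse(example):
    parsed = tf.io.parse_single_example(example, feature_spec)
    image = tf.io.decode_jpeg(parsed["image"], channels=3)
    return image, parsed["label"]

dataset = (
    tf.data.TFRecordDataset(files, num_parallel_reads=tf.data.AUTOTUNE)
    .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)
)
```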

1

Zealousideal-Copy463 OP t1_j5tih5j wrote

Sorry, I wrote it in a hurry and now I realize it came out wrong.

What I meant is that, in my experience, moving data between buckets/VMs, uploading data, logging into a terminal via SSH, using notebooks that crash from time to time (SageMaker is a bit buggy), or just training models in the cloud all come with annoyances that are hard to avoid and make the whole experience horrible. So maybe I should "just buy a good GPU" (a 4090 is a "good" deal where I live) and stop messing around in the cloud.

1

Zealousideal-Copy463 OP t1_j5tj30h wrote

My first idea was a 3090, but I'm not based in the US, and getting a used GPU here is risky; it's easy to get scammed. A 4080 is around $2,000 here, a new 3090 is $1,800, and a 4090 is $2,500. So I figured that if I decide to get a desktop, I should "just" go for the 4090, since it's $500-700 more but I'd get roughly double the speed of a 3090 and 8 GB more VRAM than the 4080.

1

Zealousideal-Copy463 OP t1_j5tjusi wrote

Thanks for your comment! I've tried using EC2 and keeping the data in EBS, but I'm not sure it's the best solution. What's your workflow there?

I'm playing around mostly with NLP and image models. Right now I'm trying to process videos, around 200 GB, for a retrieval problem. What I do is: extract frames, get feature vectors from pretrained ResNet and ResNeXt models (this takes a lot of time), and then train a Siamese network on all of those vectors. As I said, I've tried S3 and SageMaker, but I have to move the data into the SageMaker notebooks and I waste a lot of time there. I also tried processing everything on EC2, but setting the whole thing up took me a while (downloading data, installing libraries, writing shell scripts to process the videos, etc.).
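
Roughly, the feature-extraction step looks something like this (just a sketch, with ResNet50 standing in for my actual ResNet/ResNeXt setup; the frame directory and batch size are placeholders):

```python
import numpy as np
import tensorflow as tf

# Pretrained backbone with the classifier head removed; global average pooling
# gives one 2048-d vector per frame
backbone = tf.keras.applications.ResNet50(
    include_top=False, weights="imagenet", pooling="avg"
)

def load_frame(path):
    image = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    image = tf.image.resize(image, (224, 224))
    return tf.keras.applications.resnet50.preprocess_input(image)

frames = (
    tf.data.Dataset.list_files("frames/*.jpg", shuffle=False)  # placeholder path
    .map(load_frame, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)
)

# One feature vector per frame, which then goes into the Siamese network
features = backbone.predict(frames)
np.save("frame_features.npy", features)
```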

1

Zealousideal-Copy463 OP t1_j5tk24z wrote

Ohh, I didn't know that about GCP. So you can point a VM at a bucket and it just "reads" the data? You don't have to "upload" the data into the VM?

As I said in a previous comment, my problem with AWS (S3 and SageMaker) is that the data sits in a different network, and even though it's still an AWS network, you have to move the data around and that takes a while (when it's 200 GB of data).

1

v2thegreat t1_j60fvmd wrote

Well, there are EC2 instances that come already set up. How often do you do this sort of thing? It might be justified to build your own home setup, but as someone who did exactly that, I can tell you it's kinda tedious and you end up being your own IT guy.

1

Zealousideal-Copy463 OP t1_j61j6n2 wrote

I was checking Marketplace and couldn't find a used one below $1,500. Also, I just discovered that a 3090 is $2.2k here now lol (that would be the cheapest option)... meanwhile at Best Buy it costs $1k; I was half thinking about traveling to the US with the other grand lol.

1