Submitted by Zatania t3_xw5hhl in MachineLearning

So I'm doing a thesis paper using BERT and FAISS. Google Colab [haven't tried Pro yet] works fine with datasets that are less than 100 MB using the GPU runtime. But when the dataset is bigger than that, Google Colab just crashes.

Will Colab Pro help with this, or is there another alternative?

Edit: The dataset file size that crashed Colab is somewhere around 1 GB to 1.5 GB.

29

Comments


supreethrao t1_ir4ojh3 wrote

You might want to check your data processing pipeline and maybe optimise how you're allocating GPU RAM / system RAM. Colab Pro will help, but I'd suggest you try to optimise the way you deal with your data, as the Colab free tier should easily handle datasets in the few-GB range.

21

No_Bullfrog5936 t1_ir927xk wrote

This… I'm fine-tuning a large wav2vec2 model on Colab with a 1+ GB dataset, no issues here.

3

diehardwalnut t1_ir51084 wrote

Paperspace is an alternative. They have a free tier too.

10

FirstBabyChancellor t1_ir55gtb wrote

Their free machines are almost never available, in my experience. Also, all notebooks in their free tier are publicly available, which may be a major downside for some folks.

8

incrediblediy t1_ir4ssxg wrote

What is the "sequence length" in BERT?

3

minimaxir t1_ir6bisc wrote

The number of tokens in the input. Compute scales quadratically with sequence length (self-attention is O(n²)).

Pretrained BERT takes in a maximum of 512 tokens.
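
For reference, a minimal sketch of enforcing that 512-token limit, assuming the Hugging Face transformers tokenizer is being used (the model name and texts are placeholders):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

texts = ["first document ...", "second document ..."]  # placeholder data
encoded = tokenizer(
    texts,
    padding=True,        # pad the batch to a common length
    truncation=True,     # drop tokens beyond max_length
    max_length=512,      # BERT's pretrained positional limit
    return_tensors="pt",
)
print(encoded["input_ids"].shape)  # (batch_size, <= 512)

Shorter max_length values cut attention compute and memory roughly quadratically, which matters a lot on Colab GPUs.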

2

incrediblediy t1_ir7bljy wrote

Yeah! I mean what is the seq_length used by the OP :) Also, what batch size? :) I have tried seq_length = 300, but with a small batch size in Colab, especially with AdamW instead of Adam.

3

[deleted] t1_ir5jkx4 wrote

Random q, why FAISS over ScaNN?

3

minimaxir t1_ir6bbja wrote

The biggest obstacle to using ScaNN over FAISS is that ScaNN is Linux only.

FAISS can also use the GPU for larger workloads.

For ANN in practice they are close enough.
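
To illustrate the GPU point, a minimal sketch of moving a flat FAISS index onto the GPU (this assumes the faiss-gpu build is installed; the dimension and data are placeholders):

import numpy as np
import faiss

d = 768                                           # embedding dimension (BERT base)
xb = np.random.rand(10_000, d).astype("float32")  # placeholder embeddings

index = faiss.IndexFlatL2(d)                      # exact (flat) index on CPU
if faiss.get_num_gpus() > 0:                      # only available in the faiss-gpu build
    res = faiss.StandardGpuResources()
    index = faiss.index_cpu_to_gpu(res, 0, index)

index.add(xb)
distances, ids = index.search(xb[:5], 10)         # top-10 neighbours for 5 queries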

2

Top-Perspective2560 t1_ir4pqs7 wrote

Are you trying to load the file straight into Colab or are you mounting your Google drive and loading from there?

2

Zatania OP t1_ir5dmxw wrote

I load it straight into Colab.

As a test, I downloaded a 1 GB dataset from Kaggle directly into Colab.

1

Top-Perspective2560 t1_ir5eoku wrote

Try uploading it to your Google Drive first.

Then you can mount your drive in your notebook by using:

from google.colab import drive
drive.mount("/content/mnt")

Run the cell and allow access to your Drive when the prompt appears.

In the Files tab on the left-hand pane you should now see a folder called mnt, which will contain the contents of your Google Drive. To get the path to a file, just right-click on the file > Copy path.

14

Zatania OP t1_ir5kskz wrote

I'll try this solution; if it works, I'll get back to you.

2

you-get-an-upvote t1_ir9978p wrote

FYI, loading many small files from Drive is very slow. If this applies to you, I recommend zipping the files, uploading the zip to Drive, copying the zipped file onto your Colab machine, and unzipping it there.

import os
from google.colab import drive

drive.mount('/content/drive')

# Copy the zip from Drive onto the Colab machine's local disk, then unzip it there.
!cp '/content/drive/My Drive/foo.zip' '/tmp/foo.zip'
os.chdir('/tmp')
!unzip -qq 'foo.zip'

Otherwise, if your dataloader is trying to copy files over from Drive one at a time it's going to be really slow.

Also I'd make sure you're not accidentally loading the entire dataset into RAM (assuming your crash is due to lack of RAM?).
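
If the dataset lives in a single .npy file, one option (the file name here is hypothetical) is to memory-map it so only the slices you actually touch are read into RAM:

import numpy as np

embeddings = np.load("/tmp/embeddings.npy", mmap_mode="r")  # nothing is loaded yet
chunk = np.asarray(embeddings[:10_000])                     # only this slice hits RAM
print(embeddings.shape, chunk.dtype)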

2

alesi_97 t1_ir742br wrote

Bad advice

Google Drive access bandwidth is limited and far lower than the Google Colab runtime's (temporary) local disk storage.

Source: worked on training a CNN for my bachelor's thesis

2

Top-Perspective2560 t1_ir86fh8 wrote

It may actually solve the problem. I’ve run into similar issues before.

Source: CompSci PhD. I use Colab a lot.

3

Sonoff t1_ir5dzpa wrote

Well, put the files in your Google Drive and mount your Drive.

3

overschythe t1_ir55dtr wrote

Why not use a Jupyter notebook? If your own machine isn't good enough, you can use AWS or another cloud provider. It's really cheap, like $0.30 per hour for 32 GB RAM and 8 vCPUs.

2

EmbarrassedHelp t1_ir6q39f wrote

Even with Colab Pro, you are going to run out of GPU time really quickly. Kaggle's free tier, for example, currently gives you more GPU time than Colab Pro does.

2

David202023 t1_ir74yts wrote

First, I really recommend going Pro+. I have been working on Colab for two years now and it is usually sufficient for NLP/vision POCs. Later, as projects get more complicated, it is recommended to move to a larger machine (I'm a master's student as well and I work on the university cluster; check if there is one at your university, most likely there is). Second, maybe try to change your training setup and work with generators instead of loading the whole dataset into memory at once.
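
On the generator point, a minimal sketch that streams a text file in batches instead of reading it all at once (the file name and batch size are placeholders):

def batched_lines(path, batch_size=64):
    """Yield lists of lines without reading the whole file into memory."""
    batch = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            batch.append(line.rstrip("\n"))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch

for batch in batched_lines("dataset.txt"):
    pass  # e.g. tokenize and embed one batch at a time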

2

labloke11 t1_iraxq76 wrote

Try Intel DevCloud. It's free and you get beefy resources.

2

MediumInterview t1_ir5laad wrote

Have you tried controlling the batch size (e.g. 32 or 64) and truncating the sequence length to 512?

1

TrPhantom8 t1_ir93mm0 wrote

I regularly use Colab for an EfficientNet and a dataset 30 GB big. Though my dataset is properly written in the TFRecord format, and it is not necessary to load it into memory.
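
For anyone curious, a minimal sketch of that kind of streaming input pipeline with tf.data (the feature schema and file name are placeholders, not the commenter's actual setup):

import tensorflow as tf

feature_spec = {
    "text": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse(example_proto):
    return tf.io.parse_single_example(example_proto, feature_spec)

dataset = (
    tf.data.TFRecordDataset(["train-00000.tfrecord"])   # hypothetical shard name
    .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)                          # stream from disk, never load it all
)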

1

Dear-Acanthisitta698 t1_ir4rpk8 wrote

Maybe using NumPy arrays instead of Python lists can help.

0

Zatania OP t1_ir5dsre wrote

I'm using NumPy.

My problem is during the FAISS indexing of the embeddings; this is where it crashes.
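
If the crash happens while building the index, one thing worth trying (a sketch under assumptions: embeddings saved as a float32 .npy file with a hypothetical name, inner-product index) is memory-mapping the embeddings and adding them to FAISS in chunks, so a full in-memory copy never exists alongside the index:

import numpy as np
import faiss

d = 768                                                 # embedding dimension; adjust to your model
embeddings = np.load("embeddings.npy", mmap_mode="r")   # memory-mapped, not loaded into RAM

index = faiss.IndexFlatIP(d)
chunk_size = 50_000
for start in range(0, embeddings.shape[0], chunk_size):
    chunk = np.ascontiguousarray(embeddings[start:start + chunk_size], dtype=np.float32)
    index.add(chunk)                                    # only one chunk is materialised at a time

faiss.write_index(index, "index.faiss")

A flat index still keeps every added vector in RAM, so if the vectors alone exceed Colab's memory, the next step is a compressed index type such as IVF-PQ or a high-RAM runtime.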

2