Submitted by Zatania t3_xw5hhl in MachineLearning

So I'm doing a thesis paper using BERT and FAISS. Google Colab (I haven't tried Pro yet) works fine on the GPU runtime with datasets smaller than 100 MB, but when the dataset is bigger than that, Colab just crashes.

Will Colab Pro help with this, or is there another alternative?

Edit: the dataset file size that crashed Colab is somewhere around 1 GB to 1.5 GB.

29

Comments

supreethrao t1_ir4ojh3 wrote

You might want to check your data processing pipeline and maybe optimise how you're allocating GPU RAM / system RAM. Colab Pro will help, but I'd suggest you try to optimise the way you deal with your data, since the Colab free tier should easily handle datasets in the few-GB range.
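
For example, instead of reading the whole file into memory at once, you can stream it in chunks. A minimal sketch with pandas (the file name, chunk size, and the encode_with_bert / index names are placeholders, not something from OP's setup):

import pandas as pd

# stream the dataset ~10k rows at a time instead of loading it all into RAM
for chunk in pd.read_csv("dataset.csv", chunksize=10_000):
    embeddings = encode_with_bert(chunk["text"].tolist())  # hypothetical BERT encoding step
    index.add(embeddings)  # add vectors to a FAISS index incrementally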

21

Top-Perspective2560 t1_ir4pqs7 wrote

Are you trying to load the file straight into Colab, or are you mounting your Google Drive and loading from there?

2

incrediblediy t1_ir4ssxg wrote

What is the "sequence length" in BERT?

3

diehardwalnut t1_ir51084 wrote

Paperspace is an alternative. They have a free tier too.

10

overschythe t1_ir55dtr wrote

Why not use a Jupyter notebook? If your own machine isn't good enough, you can use AWS or another cloud. It's really cheap, around $0.30 per hour for 32 GB RAM and 8 vCPUs.

2

Top-Perspective2560 t1_ir5eoku wrote

Try uploading it to your Google Drive first.

Then you can mount your drive in your notebook by using:

from google.colab import drive
drive.mount("mnt")

Run the cell and allow access to your Drive when the prompt appears.

In the Files tab in the left-hand pane you should now see a folder called mnt, which will contain the contents of your Google Drive. To get the path to a file, you can just right-click on the file > Copy path.
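
Once it's mounted, you can read a file directly from that path. A quick sketch, assuming a CSV somewhere in your Drive (use whatever path Copy path actually gives you; the name here is made up):

import pandas as pd

df = pd.read_csv("mnt/MyDrive/dataset.csv")  # placeholder path copied from the Files pane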

14

[deleted] t1_ir5jkx4 wrote

Random q, why FAISS over ScaNN?

3

MediumInterview t1_ir5laad wrote

Have you tried controlling the batch size (e.g. 32 or 64) and truncating the sequence length to 512?
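
With Hugging Face transformers, that looks roughly like this (a sketch; "bert-base-uncased", texts, and the batch size are placeholders):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder model name
batch_size = 32
for i in range(0, len(texts), batch_size):
    # truncate every example to BERT's 512-token limit and process one batch at a time
    batch = tokenizer(texts[i:i + batch_size], truncation=True, max_length=512,
                      padding=True, return_tensors="pt")
    # feed `batch` to the model here, then let it go out of scope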

1

minimaxir t1_ir6bbja wrote

The biggest obstacle to using ScaNN over FAISS is that ScaNN is Linux only.

FAISS can also use the GPU for larger workloads.

For ANN in practice they are close enough.
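
If you do want FAISS on the GPU, the faiss-gpu build lets you move a CPU index over. A minimal sketch (the dimension and random data are stand-ins for real embeddings):

import numpy as np
import faiss

d = 768  # BERT-base embedding dimension
xb = np.random.rand(10_000, d).astype("float32")  # stand-in for real embeddings

cpu_index = faiss.IndexFlatL2(d)
res = faiss.StandardGpuResources()
gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index)  # move the index to GPU 0
gpu_index.add(xb)
distances, ids = gpu_index.search(xb[:5], 10)  # 10 nearest neighbours of the first 5 vectors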

2

EmbarrassedHelp t1_ir6q39f wrote

Even with Colab Pro, you are going to run out of GPU time really quickly. Kaggle's free tier, for example, currently gives you more GPU time than Colab Pro does.

2

David202023 t1_ir74yts wrote

First, I really recommend going Pro+. I've been working on Colab for two years now, and it's usually sufficient for NLP/vision POCs. Later, as projects get more complicated, it's worth moving to a larger machine (I'm a master's student as well and I work on the university cluster; check whether your university has one, it most likely does). Second, try changing how you train and work with generators instead of loading the whole dataset into memory at once.
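
A minimal sketch of the generator idea, assuming the data is a plain text file with one example per line (the file name and batch size are made up):

def batch_generator(path, batch_size=32):
    """Yield small batches of lines instead of holding the whole file in memory."""
    batch = []
    with open(path) as f:
        for line in f:
            batch.append(line.strip())
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch

for batch in batch_generator("dataset.txt"):
    ...  # tokenize/encode this batch, then move on so it can be garbage-collected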

2

incrediblediy t1_ir7bljy wrote

Yeah! I mean, what is the seq_length used by OP :) and also the batch size :) I have tried seq_length = 300 in Colab, but with a small batch size, especially with AdamW instead of Adam.

3

TrPhantom8 t1_ir93mm0 wrote

I regularly use Colab for an EfficientNet and a dataset around 30 GB. My dataset is properly written in the TFRecord format, though, so it doesn't need to be loaded into memory.
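
The streaming part looks roughly like this with tf.data (the file pattern and feature spec are placeholders for whatever your records actually contain):

import tensorflow as tf

def parse_example(serialized):
    # placeholder feature spec; replace with your real record schema
    features = {"text": tf.io.FixedLenFeature([], tf.string)}
    return tf.io.parse_single_example(serialized, features)

files = tf.data.Dataset.list_files("data/*.tfrecord")
dataset = (tf.data.TFRecordDataset(files)
           .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))  # batches stream from disk instead of filling RAM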

1

you-get-an-upvote t1_ir9978p wrote

FYI, loading many small files from Drive is very slow. If this applies to you, I recommend zipping the files, uploading the zip to Drive, copying the zipped file onto your Colab machine, and unzipping it there.

import os
from google.colab import drive

drive.mount('/content/drive')

!cp '/content/drive/My Drive/foo.zip' '/tmp/foo.zip'
os.chdir("/tmp")
!unzip -qq 'foo.zip'

Otherwise, if your dataloader is trying to copy files over from Drive one at a time it's going to be really slow.

Also I'd make sure you're not accidentally loading the entire dataset into RAM (assuming your crash is due to lack of RAM?).

2

labloke11 t1_iraxq76 wrote

Try Intel DevCloud. It's free and you get beefy resources.

2