Comments


SeucheAchat9115 t1_izdbkkz wrote

Try using smaller subsets of your data. It is very likely that performance will then scale with the amount of data.
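
For example, a minimal sketch of carving out a fixed random subset with PyTorch (the 10% fraction and the Dataset setup are my assumptions, not something from the thread):

```python
# Hypothetical sketch: a fixed random subset for fast iteration.
import torch
from torch.utils.data import Subset

def make_subset(dataset, fraction=0.1, seed=0):
    # Fix the seed so every experiment sees the same subset and
    # results stay comparable across runs.
    g = torch.Generator().manual_seed(seed)
    n = int(len(dataset) * fraction)
    idx = torch.randperm(len(dataset), generator=g)[:n]
    return Subset(dataset, idx.tolist())
```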

11

fasttosmile t1_izgxj4n wrote

Careful. There are literally dozens of language-modeling papers that report an improvement on PTB (Penn Treebank) which does not scale to larger datasets.

3

farmingvillein t1_izi021q wrote

True, but no one has really come up with a better methodology.

The best you can do is train on smaller data + make sure that you can tell yourself a story about how the new technique will still help when data is scaled up (and then hope that you are right).

(The latter is certainly an argument for staying at least semi-current with the literature, as it will help you get an intuition for what might scale up and what probably won't.)

2

SeucheAchat9115 t1_izdbmzj wrote

Or you could compare your training runs after, e.g., two epochs and only run the best ones for the full 500 epochs.
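
A minimal sketch of that two-stage triage (run_experiment is an assumed helper that trains a config for a given number of epochs and returns validation loss; keeping the top 3 is an arbitrary choice):

```python
# Hypothetical sketch: screen every config cheaply, then train only the best.
def triage(configs, run_experiment, screen_epochs=2, full_epochs=500, keep=3):
    # Stage 1: a cheap screening run for every candidate.
    scores = {name: run_experiment(cfg, epochs=screen_epochs)
              for name, cfg in configs.items()}
    # Stage 2: train only the `keep` lowest-loss candidates to completion.
    best = sorted(scores, key=scores.get)[:keep]
    return {name: run_experiment(configs[name], epochs=full_epochs)
            for name in best}
```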

1

VirtualHat t1_izczlg3 wrote

I have a system where I can go from idea to initial results in two hours and full results by the next day. I've found a short loop like this critical for testing the hundreds of ideas that come to mind.

4

1bir t1_izelhl5 wrote

>I have a system where I can go from idea to initial results in 2-hours

I think the OP is asking for a description of that...

10

VirtualHat t1_izfu724 wrote

I use three scripts.

train.py (which trains my model)

worker.py (which picks up the next job and runs it using train.py)

runner.py (which is basically a list of jobs and code to display what's happening).

I then have multiple machines running multiple instances of worker.py. When a new job is created, the workers see it and start processing it. Work is broken into 5-epoch blocks, and at the end of each block, a new job from the priority queue is selected.

This way I can simply add a new job, and within 30 minutes or so one of the workers will finish its current block and pick it up. Also, because of the chunking, I get early results on all the jobs rather than having to wait for them to finish. This is important, as I often know early on whether a job is worth finishing.

I evaluate the results in a Jupyter notebook using the logs that each job creates.
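
A simplified sketch of the worker loop (the JSON job-file format and train.py's --config/--epochs flags here are illustrative, not the exact code):

```python
# Simplified worker.py sketch: job files are JSON with a priority and an
# epoch counter; train.py is assumed to checkpoint and resume on its own.
import json
import pathlib
import subprocess
import time

PENDING = pathlib.Path("jobs/pending")
BLOCK_EPOCHS = 5     # work is chunked into 5-epoch blocks
TOTAL_EPOCHS = 500

def next_job():
    """Return the path of the highest-priority pending job, or None."""
    jobs = sorted(PENDING.glob("*.json"),
                  key=lambda p: json.loads(p.read_text())["priority"],
                  reverse=True)
    return jobs[0] if jobs else None

while True:
    path = next_job()
    if path is None:
        time.sleep(60)   # queue is empty; poll again shortly
        continue
    job = json.loads(path.read_text())
    # Run one 5-epoch block; train.py resumes from its latest checkpoint.
    subprocess.run(["python", "train.py",
                    "--config", job["config"],
                    "--epochs", str(job["epoch"] + BLOCK_EPOCHS)],
                   check=True)
    job["epoch"] += BLOCK_EPOCHS
    if job["epoch"] >= TOTAL_EPOCHS:
        path.unlink()                     # finished; drop it from the queue
    else:
        path.write_text(json.dumps(job))  # requeue for the next block
```

A real multi-machine setup also needs some form of job locking so two workers don't grab the same file at once.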

edit: fixed links.

5

moyle t1_izgsce9 wrote

Guild.ai can easily automate this process. I really recommend checking it out.

3

RSchaeffer t1_izgxqod wrote

These links don't work for me. Can you double check them?

2

thundergolfer t1_izgyu6x wrote

They're not actually links; they've just been formatted like they are. They point to train.py, which is not a website.

3

VirtualHat t1_izjmbm0 wrote

Oh my bad, didn't realise Reddit automatically created links when writing abc.xyz. I've edited the reply to include links to my code.

2

AmalgamDragon t1_izfizm8 wrote

Pics or it didn't happen (i.e. please share the details of this system).

2

iamr0b0tx t1_izgdmkj wrote

Check out Weights & Biases; I believe it can help you manage multiple experiments. As for speed, you may be able to run them concurrently once you have them all set up separately. And as someone already mentioned, you can use a smaller dataset to make the process faster.
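
For instance, a minimal W&B logging sketch (project name and the training stub are placeholders; assumes you've run `wandb login`):

```python
# Minimal Weights & Biases sketch (pip install wandb).
import wandb

def train_one_epoch():
    return 1.0  # stand-in for a real training step

run = wandb.init(project="idea-triage", config={"lr": 3e-4, "subset": 0.1})
for epoch in range(5):
    wandb.log({"epoch": epoch, "val_loss": train_one_epoch()})
run.finish()
```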

2

mlisnifty t1_izk4hvw wrote

Yeah, I'd keep the data lineage for each project stored in something like CometML. I'd probably create a different project for each idea, so multiple training runs would live in each project. Then you've got all the graphics you need to compare models in the same project, plus hyperparameters, code, dependencies, and data, all ready for you if you decide to come back to one of the projects after chasing something else for a month.
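
A minimal sketch of that per-idea layout with CometML (project and metric names are placeholders; assumes COMET_API_KEY is set in the environment):

```python
# Minimal CometML sketch (pip install comet_ml); one project per idea.
from comet_ml import Experiment

def train_one_epoch():
    return 1.0  # stand-in for a real training step

exp = Experiment(project_name="idea-attention-variant")
exp.log_parameters({"lr": 3e-4, "batch_size": 64})
for epoch in range(5):
    exp.log_metric("val_loss", train_one_epoch(), epoch=epoch)
exp.end()
```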

2

thundergolfer t1_izgiaa4 wrote

I'm sorry to shill, but Modal.com is easily the best thing for this. Here's a demo video showing how fast you can edit code, run it in the cloud, and then edit it some more, all in a handful of seconds.

I was the ML Platform lead at Canva and quick iteration was the #1 pain point of our data scientists and MLEs. I left Canva to join Modal because it can do heavy serverless compute and keep your inner dev loop tight.

Again, sorry to shill, but I've been in this sub for like 8 years and think tools like Modal and Metaflow are finally getting us to a place where ML development isn't a painful mess.
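
To give a sense of the shape of it, a minimal sketch against Modal's Python API (simplified; check the current docs for exact details):

```python
# Minimal Modal sketch: a local entrypoint dispatching work to a cloud GPU.
import modal

app = modal.App("quick-experiments")

@app.function(gpu="T4")
def train(lr: float) -> float:
    # Your training code runs here, in the cloud, on a GPU.
    return lr * 2  # stand-in result

@app.local_entrypoint()
def main():
    # `modal run thisfile.py` runs this locally and `train` remotely.
    print(train.remote(lr=3e-4))
```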

1

GinoAcknowledges t1_izl699d wrote

This is great. I would encourage my organization to use this, except the restriction to T4 GPUs renders this somewhat unusable for us. What’s the ETA on more modern GPUs?

1