
Dependent_Change_831 t1_j1hc9x4 wrote

You’re getting a lot of hate but IMO you’re totally right. Python may be convenient short term but it really does not scale.

I’ve been working in ML for around 8 years now, and I’ve just joined a new project where we’re training models on billions of pieces of dialog for semantic parsing. It took us weeks to fix a bug in the way torch and our code would interact with multiprocessing…

There are memory leaks caused by using Python objects like lists in dataset objects, but only if you use DistributedDataParallel (or a library like DeepSpeed, i.e. multiple processes)…
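
For anyone hitting something similar: the comment doesn't say what the root cause was, but one commonly reported cause of this symptom is that every Python object carries a reference count, so merely reading a big list of objects in a forked process writes to those memory pages and forces them to be copied per process. A minimal sketch of the usual workaround, assuming that is the problem, is to pack the items into a couple of flat buffers. `PackedStrings` and `DialogDataset` are hypothetical names for illustration, not torch APIs, and only the standard library is used:

```python
from array import array


class PackedStrings:
    """Read-only sequence of strings kept in two flat buffers.

    Instead of N separate str objects (each with its own refcount),
    this holds one bytes blob and one array of offsets.
    """

    def __init__(self, items):
        encoded = [s.encode("utf-8") for s in items]
        # All strings stored back to back in a single bytes object.
        self._blob = b"".join(encoded)
        # Item i occupies blob[offsets[i]:offsets[i + 1]].
        self._offsets = array("q", [0])
        for chunk in encoded:
            self._offsets.append(self._offsets[-1] + len(chunk))

    def __len__(self):
        return len(self._offsets) - 1

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        start, end = self._offsets[index], self._offsets[index + 1]
        return self._blob[start:end].decode("utf-8")


class DialogDataset:
    """Hypothetical map-style dataset exposing __len__ and __getitem__."""

    def __init__(self, utterances):
        self._utterances = PackedStrings(utterances)

    def __len__(self):
        return len(self._utterances)

    def __getitem__(self, index):
        return self._utterances[index]
```

Usage is the same as with a plain list, e.g. `DialogDataset(["hi", "how are you?"])[1]`. The trade-off is that each access decodes a fresh string, which costs a little CPU in exchange for the worker processes sharing the underlying buffers.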

Loading and saving our custom data format requires calling into our own C++ code to avoid waiting hours to deserialize data for every training run…

I wish I could say there’s a better alternative now, but given the community resources that already exist around Python, there isn’t one yet. We can hope for the future.


vprokopev OP t1_j1j0xs5 wrote

Thank you for sharing your experience!

My intuition is that C++ Python extensions make it easier to do easy things (than in C++) but make it harder to do hard things.

People always go for convenience first and then move to something more fundamental and flexible.

Data science was mostly done in R and MATLAB about 12-15 years ago. Then people moved to the more general-purpose Python. The next step is a compiled language with static types, IMO.
