Submitted by jrmylee t3_1056uhp in MachineLearning

Hi r/MachineLearning!


A few months ago I quit my job to join my co-founders in making training open-source models much faster and easier for engineers.


We're building Rubbrband. It's a web app that takes any ML repo on GitHub and gives you an in-browser terminal and Jupyter notebook with dependencies and GPUs automatically set up.


Why did we build this?

My co-founders and I have been working on this because, as researchers, we found the dependency-setup process super tedious and draining.


What's included?

- Automatic dependency setup for any Python repo on GitHub

- Integrated Terminal and Notebooks

- A server with an NVIDIA GPU

- Code explanations for functions

- Simple pricing: $75/month for 3 repos running at a time. The first week is free.


I'd love to get your feedback on:

  1. Does the value we provide resonate with you? Would you try it out?
  2. Does dependency and environment setup take up a large chunk of your time?

We're currently working on acquiring more GPUs to onboard more users, but if you'd like access to the product please let me know.


Thank you very much in advance!

50

Comments


JackBlemming t1_j3997nh wrote

Couple thoughts:

  1. Setting up an environment is typically harder than cloning the repo and running pip install on the requirements.txt file. Many Python packages require system (Linux) packages to be installed beforehand. Your service should ideally take care of this for me. Some obvious examples are OpenCV, CUDA/GPU drivers, MySQL clients, etc.

  2. Dataset management is the most annoying part of machine learning for me, not setting up environments, which is typically a Dockerfile or docker-compose file and maybe one shell script to bootstrap everything. By dataset management I mean letting my models access the dataset quickly, updating the dataset, etc. Ideally your service should make it easy to upload data and then make it accessible to the training code. This is assuming you want to allow people to train models on the service.

34

jrmylee OP t1_j39ara0 wrote

  1. Great point, we have this covered. We intelligently install apt dependencies alongside pip dependencies, and CUDA drivers are also all installed properly.
  2. This makes sense. If I understand you correctly, the difficult part is uploading and managing datasets on the server, plus writing data loaders to feed the model?
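In its simplest form, the pip-to-apt resolution mentioned above could look something like this sketch; the mapping table and function name are illustrative assumptions, not Rubbrband's actual implementation:

```python
# Hypothetical mapping of pip packages to the apt packages they need.
# The entries here are illustrative, not exhaustive or authoritative.
PIP_TO_APT = {
    "opencv-python": ["libgl1", "libglib2.0-0"],
    "mysqlclient": ["default-libmysqlclient-dev", "build-essential"],
    "psycopg2": ["libpq-dev"],
}

def apt_packages_for(requirements: list[str]) -> list[str]:
    """Collect the system packages implied by a pip requirements list."""
    needed = []
    for req in requirements:
        # Strip any pinned version ("mysqlclient==2.1.0" -> "mysqlclient").
        name = req.split("==")[0].strip().lower()
        needed.extend(PIP_TO_APT.get(name, []))
    return sorted(set(needed))
```

A real resolver would also need to handle version-specific system requirements and differences between distros.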
5

JackBlemming t1_j39baqq wrote

Per 2, yes, exactly right. Some of my datasets are millions of images with metadata. As you can imagine, uploading and consuming data at that magnitude is slow and tedious, and then it all has to be integrated with the remote machine actually running the training script.
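For what it's worth, a common stopgap here is to stream records from a manifest instead of copying the whole dataset up front. A minimal sketch, assuming a hypothetical JSONL manifest with `path` and `meta` fields:

```python
import json

def stream_batches(manifest_path: str, batch_size: int = 256):
    """Yield lists of (image_path, metadata) pairs without ever loading
    the full manifest into memory — one JSON record per line (JSONL)."""
    batch = []
    with open(manifest_path) as f:
        for line in f:
            record = json.loads(line)
            batch.append((record["path"], record.get("meta", {})))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:  # flush the final partial batch
        yield batch
```

The actual image bytes would still live on fast storage next to the training machine; this only avoids shipping the metadata around wholesale.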

7

jrmylee OP t1_j39dpr0 wrote

Got it, appreciate the feedback!

4

i_ikhatri t1_j3xlleh wrote

Just to add onto this feedback (because I think /u/JackBlemming is 100% correct) you would probably benefit from storing some of the most popular datasets (ImageNet, MS COCO, whatever is relevant to the fields you're targeting) somewhere in the cloud where you can provide fast read access (or fast copies) to any number of training workers that get spun up.

Research datasets tend to be fairly standardized so I think you could get a high amount of coverage by just having a few common datasets available. I only gave computer vision examples because that's what I'm most familiar with but if you get a few CV datasets, a few NLP ones etc. you should be able to provide a killer UX.

Bonus points if you're somehow able to configure the repos to read from the centralized datastore properly automatically (though this is probably difficult/impossible).
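A crude version of that centralized-store idea is a read-through cache in front of the shared datastore, so each training worker copies a file only on first access. The paths and helper name here are made up for illustration:

```python
import shutil
from pathlib import Path

def fetch(relative_path: str, shared_root: Path, cache_root: Path) -> Path:
    """Return a fast local path for a dataset file, copying it from the
    shared store (e.g. an NFS or cloud bucket mount) only on first access."""
    cached = cache_root / relative_path
    if not cached.exists():
        cached.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(shared_root / relative_path, cached)
    return cached
```

Subsequent reads hit local disk, which is roughly what a worker warming up on ImageNet or COCO would want.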

2

the__itis t1_j3dogz7 wrote

Also, if the stack isn't 100% native, compiling libraries for different architectures is a big complication.

1

RuairiSpain t1_j39q68d wrote

I like the idea. I work for a large enterprise on their ML platform team, providing similar services internally to all dev, ML, and analytics teams. I think there is a business in it; it's a competitive space, but the acquisition potential is great (being bought and merged into a larger org).

I suggest you check out https://www.gitpod.io, which does more general provisioning of GitOps clusters/pods in their managed Kubernetes clusters. It's not ML-specific, but we've looked at it for POC ML projects that want basic hosting.

Also check out https://github.com/ml-tooling/ml-workspace; it's a nice open-source project with lots of packages ready to use.

Also look at JupyterLab offerings; they'll be your main competition on pricing.

You are going to have a headache with Python version compatibility between your base dependencies, the ones used on GitHub, and the ones needed by Jupyter notebooks. Same with CUDA drivers; I suggest you lock down the AWS node instance types so it's less confusing for end users.

If you are turning it into a business, I'd recommend a tiered approach based on the size of the ML project. Simple POC ML projects with a tiny dataset are a good starting point for most people. But then people want data ingest, cleaning, and ETL from big-data and enterprise sources; this gets complex fast (and is where most teams waste time and money). Either keep your focus on POCs and grow it as an ML hosting company for SMEs, or embrace the ETL side and simplify data ingest for larger enterprise companies. The second option is more of a consulting business, but you can charge high fees.

ML ETL space: https://www.ycombinator.com/companies/dagworks-inc

https://www.gathr.one/plans-pricing/

https://www.snowflake.com/en/

Of these 3 ETL companies, I've played with Snowflake and like what they do and their direction. I especially like that they acquired https://streamlit.io/, which is a fun way to deploy Python apps without dealing with infrastructure and DevOps tasks.

My final comment: include data ingest and ETL in your story to customers. ML training and deploying training pipelines is not where data scientists spend their time; 80% is spent on data collection, reshaping, and validation.

FYI, I think you'll burn through $75 very quickly for an NVIDIA GPU. I presume you are running these at on-demand rather than spot prices. That monthly price seems generous for an average ML training pipeline.

14

jrmylee OP t1_j39rh76 wrote

Got it, that makes a lot of sense. We'll definitely be focusing on POC projects. For me, I mainly wanted a better, faster version of Google Colab. It's difficult to compete with their offering because of their free tier, but we think solving Colab's problems is still worthwhile.

I'm wondering, would you or anyone you know be willing to give this a spin? It would really help us to know whether:

a. the product works on a variety of repos

b. the UI is fully functional and easy to use

3

RuairiSpain t1_j39zy50 wrote

Sure, DM me and I'll send my email address. I don't have much time to spend on it, but I'll give it a spin.

Have you looked at Hugging Face? They have an elegant way to package sample datasets with notebooks, and their docs are easy to digest.

3

_Arsenie_Boca_ t1_j3bbl7h wrote

For me personally, it would be very important not to be tied to Jupyter notebooks. Ideally, integrate VS Code and automatically load settings from the .vscode directory in the repo.

6

jrmylee OP t1_j3dl18u wrote

Gotcha. We have a VS Code editor built in, but haven't implemented .vscode integration yet, so we'll add that in.

Do you also use Github Copilot?

1

_Arsenie_Boca_ t1_j3dntcr wrote

Awesome. I don't, but there is a VS Code extension, so that would be covered already. Or do you have a special integration for Copilot?

1

jrmylee OP t1_j3dp0tc wrote

Yeah, we don't currently have extensions implemented (not sure it's possible in a third-party web app, actually).


I've been using Copilot a ton, so I was curious if you were also using it.

1

_Arsenie_Boca_ t1_j3dpfxv wrote

Ah OK, I didn't know that was an issue. Extensions are really important, so you should definitely look into that.

1

montcarl t1_j3a96mi wrote

It's a good idea, but one that already has a lot of options: Google Colab, Codespaces, Binder, SageMaker, Kaggle notebooks, etc.

5

jrmylee OP t1_j3dlb06 wrote

Yeah, definitely true. I've used all of these except SageMaker, and I felt the solutions weren't perfect for my workflows. I guess we're figuring out if other people feel the same way!

2

chief167 t1_j3borij wrote

First thought: decide for yourself who your target audience is

If you hope to sell this to companies, or even start-ups, be prepared for a lot of questions around data governance, security, ....

Second: do you have an idea how many users you need to break even, and how the infrastructure needs to scale to cope with that? GPUs aren't cheap, of course, and neither are electricity or cloud providers.

3

Aggravating-Act-1092 t1_j3bzclk wrote

I think it’s interesting, but for a hobbyist the pricing is too high. Some kind of tiered access would let people casually try it before committing.

I would like to try it, and I can afford $75/month, but it’s too much for something casual that I might forget about. Codec and MidJourney I both signed up to straight away.

3

jrmylee OP t1_j3dll1u wrote

OK, got it, that makes sense. We're actually trying to figure out how to do tiered access; possibly a free tier with CPUs only.


I'll also DM you with a link to try the app!

2

brucebay t1_j3bb7t2 wrote

This seems to be a very ambitious project, as there are several ML projects with very obscure dependencies that don't work out of the box. This is especially true for older repos. I would personally be very interested, at a reasonable price level (comparable to Vast.ai or RunPod), in checking out some repos without hassling with setup.


But I'm just a hobbyist. In a professional environment, I don't know if I would be interested in an automated ML setup as a long-term development/production solution. My company uses H2O, DataRobot, some IBM solution (and another one, but I forget the name). They have some attractive features for everyday data analysts, but they mostly limit advanced users. In a corporate environment, your solution seems to fit between an expert developer who does all the work and an AutoML solution that does most of the work.


I think it is a great idea for rapid experimentation in mid-to-high-end development, so I suspect your target audience for those features is going to be either educational institutions or research centers, be they military or commercial. I hope it will have enough interest to support you financially. Good luck with your company.


PS: if you can find a way to let users download the environment where the target repo runs (or provide a tool to mirror it) for local development, maybe at an extra cost, that would be a very useful feature for most people. I would even pay for such standalone software.

2

jrmylee OP t1_j3dnlv6 wrote

>I would personally be very interested at a reasonable price level (compatible to vast ai or runpod) to check out some repos without hassling with setup.

Yeah, that makes sense, appreciate the feedback! We're hoping it works out as well, haha.

You mentioned users between expert developers and non-technical folks, and we think that's the intended audience for our app. Mostly that's because we're building this for ourselves (as recent ML grad students), and it made sense to solve a problem we're familiar with.

I also DM'd you a link to the app; if you have time to check it out, I'd appreciate your feedback.

1

AGI_69 t1_j3bw8mp wrote

Error happened while submitting your request. Please try again later.
2

jrmylee OP t1_j3dylup wrote

How did you run into this error?

1

AGI_69 t1_j3e0uou wrote

Tried to join the waitlist. It works now.

1

GFrings t1_j3cc42r wrote

I'm sort of confused as to what this buys me as a developer. Sure, I can run the model with one click, maybe. But that doesn't seem to get me any closer to my typical goal, which is to have a module I can drop right into my codebase and use to solve one sub-problem of a much larger system. I can see using this as a fast way to demo a model, but most repos are clean enough that it takes maybe 30 minutes to reproduce the environment and run the author's model myself.

There are already a lot of open-source tools that solve the other problem, by the way. One is pytorch-liberator, which can extract all the code and dependencies from a module and package them in a nice portable format for integration elsewhere.

As a general tip to you and your sales team: when you go to market with something like this, you should have value propositions lined up already instead of asking us whether we think it's valuable. Most folks will assume it's not unless you can help them see what makes it useful.

2

fakesoicansayshit t1_j3d8bg3 wrote

All I really need is storage that doesn't make me move large multi-GB files up and down like Colab does (it takes forever and has to be done every time), and that lets me use an A100 on the fly when needed (instead of the confusing compute-units BS) without having to switch runtimes (which makes you move data again).

1

The_Rational_Player t1_j3ewfe3 wrote

Really feels like this is redundant given what's available out there to date.

1