big_ol_tender t1_jdjcfc8 wrote

The Alpaca dataset has a non-commercial license, so idk what they are doing. I’ve asked Stanford to change it but heard nothing back.

13

Colecoman1982 t1_jdjkgjy wrote

When you asked, did you clarify that you were asking about the training data rather than the whole project? The final Alpaca project was built, in part, on top of Meta's LLaMA. Since LLaMA has a strictly non-commercial license, there is no way that Stanford can ever release their final project for commercial use (as they've already stated in their initial release of the project). On the other hand, any training data they've created on their own (without needing any code from LLaMA) should be within their power to re-license. If they think you are asking for the whole project to be re-licensed, they are likely to just ignore your request.

23

MjrK t1_jdjqz9h wrote

> We emphasize that Alpaca is intended only for academic research and any commercial use is prohibited. There are three factors in this decision: First, Alpaca is based on LLaMA, which has a non-commercial license, so we necessarily inherit this decision. Second, the instruction data is based on OpenAI’s text-davinci-003, whose terms of use prohibit developing models that compete with OpenAI. Finally, we have not designed adequate safety measures, so Alpaca is not ready to be deployed for general use.

https://crfm.stanford.edu/2023/03/13/alpaca.html

22

Esquyvren t1_jdjsw1j wrote

They said it wasn’t ready but deployed it anyways… lol

4

MjrK t1_jdk4ig1 wrote

For demonstration and research, not widely nor generally.

9

Disastrous_Elk_6375 t1_jdlix6j wrote

The demo was up for a couple of days. The first hours of it being online were rough (80-200 people in queue). It got better the following day, and better still the third day. I believe they removed the demo ~1 week later. IMO they've proven a point: the demo was extremely impressive for a 7B model.

1

big_ol_tender t1_jdjl1wx wrote

I opened an issue on GitHub specifically about the data license and linked to the Databricks release :)

10

danielbln t1_jdjt8zh wrote

Why has no one regenerated the training set? With GPT-3.5 that's like 50 bucks. I can be the change I want to see in the world, but am I missing something?

8

mxby7e t1_jdjzkzy wrote

Using OpenAI’s models to generate competing models violates the terms of use, which is why the Stanford dataset is restricted.

17

__Maximum__ t1_jdkepie wrote

Also, it's very shady for a company called OpenAI. They claimed they became for-profit because they needed the money to grow, but these restrictions just show that they are filthy liars and only care about keeping the power and making a profit. I'm sure they already have a strategy for getting around that 30B cap, just like they planned to steal money and talent by calling themselves non-profit first.

17

throwaway2676 t1_jdl0y80 wrote

Alpaca was only trained on 50k instructions, right? A large group of grad students or a forum like reddit could construct that many manually in a couple weeks. I'm surprised they even had to resort to using ClosedAI

8

mxby7e t1_jdl18t6 wrote

Maybe. Open Assistant, a LAION project, is doing this type of manual dataset collection. The training data and the model weights are supposed to be released once training is complete.

11

WarAndGeese t1_jdl5t0z wrote

Boo hoo to OpenAI, people should do it anyway. Are the terms of service the only reason not to do it, or are there actual material barriers? If it's a problem of money, then as long as people know how much is needed, it can be crowdfunded. If it's a matter of people power, then there are already large volunteer networks. Or is it just something that isn't practical or feasible?

7

visarga t1_jdlpae7 wrote

OpenAI has first-hand RLHF data. Alpaca has second-hand. Wondering if third-hand is good enough and free of any restrictions.

2

lexcess t1_jdlj8tf wrote

Classy, especially when they are breezing past any copyright of the datasets they are training off of. I wonder if they can legally enforce that without creating a potentially bad precedent for themselves. Or if it could be worked around if the training was indirect through something like Alpaca.

3

ebolathrowawayy t1_jdnc05i wrote

But what if you're training a model for a narrow use-case and don't intend for anyone to use it except for a niche set of users? Is that enough to be in the clear? Or is any use of OpenAI's model output to train a model for any purpose a no-no?

1

mxby7e t1_jdncs51 wrote

From my understanding it's limited to non-commercial use, so you can use it for what you need, but not commercially.

1

mxby7e t1_jdktvqr wrote

The license won’t change. The dataset was collected in a way that violates the terms of service of OpenAI, whose models they used to generate the data. If they allowed commercial use, it would open them up to a lawsuit.

8

visarga t1_jdlpf0h wrote

What about data generated from Alpaca, is that unrestricted?

1

impossiblefork t1_jdlddlt wrote

Model weights, though, are, I assume, not copyrightable.

Is there actually a law giving Stanford any special rights to the weights?

1