Submitted by JohnyWalkerRed t3_123oovw in MachineLearning

I love seeing all this great progress with LLMs being made more accessible to all, but all of the new efficient models (Dolly, Alpaca, etc.) depend on the Alpaca dataset, which was generated from a GPT3 davinci model, and is subject to non-commercial use. Are there efforts in the community to replicate this dataset for commercial use? This seems to me to be the “secret sauce”: a good-quality instruction dataset you can use to “unlock” the potential of smaller models.

50

Comments


big_ol_tender t1_jdvu92g wrote

Thank you for posting this. I’ve raised this issue on a number of threads and even opened an issue on the Alpaca repo. Everyone seems to ignore this, and I’m worried about downstream issues with these models. I would love an open source alternative (I have been exploring making one myself).

19

JohnyWalkerRed OP t1_jdwjvxy wrote

Yeah, the Databricks Dolly post is funny to me because they are an enterprise software company and Dolly is not really useful in the context they operate in. I guess they just wanted to get some publicity.

Looks like Open Assistant, when mature, could enable this. Although it seems the precursor to an Alpaca-like dataset is an RLHF model, which itself needs a human-labeled dataset, so that bottleneck needs to be solved too.

9

Taenk t1_jdwlejh wrote

The Open Assistant project is working on that as well.

2

rshah4 t1_jdxhz3d wrote

I agree with the sentiments here and don’t think it’s OK to use some of these datasets that appear to violate OpenAI’s terms. I dealt with it by making a funny video: https://youtu.be/31u88EDmIwc

2

sad_dad_is_a_mad_lad t1_jdwhg8a wrote

OpenAI's commercial-use restrictions will not be easily enforced... They used copyrighted data to train their own models.

13

big_ol_tender t1_jdy0c6t wrote

100% agree, but as someone working for a company, I can’t knowingly open us up to that risk, even if the probability is 1%.

6

Taenk t1_jdw3pn3 wrote

https://open-assistant.io / /r/openassistant

12

wind_dude t1_jdxqp0v wrote

Last I checked they still hadn't open-sourced the training data... which is bizarre since they used humans to train it, with all the talk of it being open source.

−1

ninjasaid13 t1_jdxrw65 wrote

They're going to open source it on April 15, last I heard. They're still gathering data, with the cutoff date on April 12.

11

esquire900 t1_jdw02ut wrote

I wondered this as well. Generating one through ChatGPT should be relatively cheap (in the range of ~$50 for 50k examples?), but I find the commercial use of it dubious. I can't really find any explicit statement on the license of data that comes out of ChatGPT, or davinci or similar.
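A rough back-of-the-envelope check of that figure, as a sketch only. The token count per example (500) and the price ($0.002 per 1K tokens, the gpt-3.5-turbo list price around the time of this thread) are assumptions and not numbers stated in the comment; `estimate_cost_usd` is a hypothetical helper.

```python
def estimate_cost_usd(n_examples: int, tokens_per_example: int,
                      usd_per_1k_tokens: float) -> float:
    """Estimate API cost for generating a synthetic instruction dataset."""
    total_tokens = n_examples * tokens_per_example
    return total_tokens / 1000 * usd_per_1k_tokens


# Assumed inputs: 50k examples, ~500 tokens each (prompt + completion),
# $0.002 per 1K tokens. Swap in current pricing before relying on this.
cost = estimate_cost_usd(n_examples=50_000,
                         tokens_per_example=500,
                         usd_per_1k_tokens=0.002)
print(f"Estimated cost: ${cost:.2f}")
```

The estimate scales linearly in both the number of examples and the tokens per example, so longer prompts or a pricier model change the total proportionally.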

If some users here are interested, it might be worth the effort to design some proper prompts, all chip in a small amount, and let GPT do the churning?

5

quitenominal t1_jdw15ao wrote

It's in the terms that you can't use data generated through OpenAI to compete with OpenAI - and I believe they'd be able to argue competition were the trained model to be used commercially.

See section 2.C.iii of https://openai.com/policies/terms-of-use

8

esquire900 t1_jdwh7v8 wrote

Yea I was afraid so, just hadn't found it. Thank you for pointing that out :)

1

Smallpaul t1_jdw0vx9 wrote

It seems to me that if a researcher uses OpenAI to generate an open source Instruct dataset, and a different corporation takes that dataset and uses it commercially, they are both legally in the clear unless they collude. The entity that is legally in contract with OpenAI has a legitimately non-commercial purpose, and the entity doing the commercial work has no relationship with OpenAI.

2

ninjasaid13 t1_jdx9x3n wrote

can you even copyright a dataset generated by an AI?

3

Smallpaul t1_jdxf3u3 wrote

Probably not legally different than a document you created with a word processor.

−1

learn-deeply t1_jdxgxsx wrote

This isn't correct, at least in the US. AI-generated material is not considered copyrightable unless there has been significant human involvement.

https://www.federalregister.gov/documents/2023/03/16/2023-05321/copyright-registration-guidance-works-containing-material-generated-by-artificial-intelligence

6

Raywuo t1_jeal232 wrote

Exactly, they could never appeal or they would be in contradiction with themselves.

1

wind_dude t1_jdxrcpp wrote

>depend on the Alpaca dataset, which was generated from a GPT3 davinci model, and is subject to non-commercial use

Where do you get that? tatsu-lab/stanford_alpaca is Apache 2.0, so you can use it for whatever.

As for OpenAI:

"""

(c) Restrictions. You may not (i) use the Services in a way that infringes, misappropriates or violates any person’s rights; (ii) reverse assemble, reverse compile, decompile, translate or otherwise attempt to discover the source code or underlying components of models, algorithms, and systems of the Services (except to the extent such restrictions are contrary to applicable law); (iii) use output from the Services to develop models that compete with OpenAI; (iv) except as permitted through the API...

"""

So as far as I'm concerned you are allowed to use the generated dataset for commercial purposes...

The only issue might be the licensing on the LLaMA models... but you can train another LLM.

2

lazybottle t1_jec8i0c wrote

Alpaca is not Apache 2.0

https://huggingface.co/datasets/tatsu-lab/alpaca#licensing-information

> The dataset is available under the Creative Commons NonCommercial (CC BY-NC 4.0).

Edit: I see the source of confusion. https://github.com/tatsu-lab/stanford_alpaca

While the code is released under Apache 2.0, the instruct dataset, as pointed out by OP, is not. One could potentially reproduce the steps, possibly with human ground truth, and release the result under a more amenable data license.

1

kawin_e t1_jdxz4bh wrote

The Stanford Human Preferences dataset (SHP): https://huggingface.co/datasets/stanfordnlp/SHP

It contains pairwise preferences for posts (so tuples (post, response_A, response_B)), but you can certainly turn it into an instruction dataset by only considering responses that meet a certain cut-off. I'm currently aware of one academic/industry group that is already doing this.
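A minimal sketch of that conversion, under stated assumptions: the field names (`history`, `human_ref_A`, `human_ref_B`, `labels`, `score_A`, `score_B`) are modeled on the SHP dataset card and should be verified before use, `labels == 1` is taken to mean response A was preferred, and the cut-off of 10 is arbitrary. The helper `preferences_to_instructions` and the sample rows are hypothetical.

```python
from typing import Dict, List


def preferences_to_instructions(rows: List[Dict], min_score: int = 10) -> List[Dict]:
    """Keep the preferred response of each pair if its score meets the cut-off."""
    examples = []
    for row in rows:
        # Assumption: labels == 1 means response A was preferred, else B.
        if row["labels"] == 1:
            response, score = row["human_ref_A"], row["score_A"]
        else:
            response, score = row["human_ref_B"], row["score_B"]
        if score >= min_score:
            examples.append({"instruction": row["history"], "output": response})
    return examples


# Hypothetical rows standing in for real SHP records.
sample_rows = [
    {"history": "How do I sharpen a knife?", "human_ref_A": "Use a whetstone.",
     "human_ref_B": "Buy a new one.", "labels": 1, "score_A": 25, "score_B": 3},
    {"history": "Why is the sky blue?", "human_ref_A": "Because.",
     "human_ref_B": "Magic.", "labels": 0, "score_A": 2, "score_B": 4},
    {"history": "How do I boil an egg?", "human_ref_A": "Microwave it.",
     "human_ref_B": "Simmer it for about ten minutes.", "labels": 0,
     "score_A": 5, "score_B": 40},
]

instruction_data = preferences_to_instructions(sample_rows, min_score=10)
for example in instruction_data:
    print(example)
```

The same post can appear in several preference pairs, so a real pipeline would likely also deduplicate by post and keep only the best-scoring response.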

2

abnormal_human t1_jdyxteq wrote

Model weights are not currently considered to be copyrightable, and there is no DMCA/RIAA/MPAA machinery providing additional consequences for "pirating" them. At least for the moment, it's not a big risk to use LLaMA/Alpaca models for commercial use so long as you have not made an agreement with Facebook not to do it.

The OpenAI policy is about competing models, and comes from the TOS for using their API. Stanford agreed to that TOS, then released the text (which is, again, not copyrightable). Random people downloading that dataset aren't party to that agreement or bound by it.

I'm sure that Google, Facebook, Amazon, Netflix, etc will be cautious here, but for a random smaller org, this is a risk/benefit tradeoff, not an absolute.

A person who takes a torrented LLaMA and finetunes it using the Stanford data set didn't necessarily engage in any contracts prohibiting that.

The original leaker of LLaMA weights broke the rules. That's about it. Tsk tsk.

2

rshah4 t1_jdxhesh wrote

It’s possible to pay one of the labeling companies for an instruction dataset. Right now most companies aren’t donating 50k+ datasets to the public, but I expect this will change soon.

1

ninjasaid13 t1_jdy2mgw wrote

>Right now most companies aren’t donating 50k+ datasets to the public, but I expect this will change soon.

See the Open Assistant dataset, which will be publicly released as open source on April 15th.

4

Raywuo t1_jeadybx wrote

Well, data generated by GPT cannot be used to train a new AI commercially, but what about data generated by an AI that was itself trained on GPT data? (2 levels of abstraction) haha

1