I love seeing all this great progress with LLMs being made more accessible to all, but all of the new efficient models (Dolly, Alpaca, etc.) depend on the Alpaca dataset, which was generated from a GPT-3 davinci model and is licensed for non-commercial use only. Are there efforts in the community to replicate this dataset for commercial use? This seems to me to be the “secret sauce”: a good-quality instruction dataset you can use to “unlock” the potential of smaller models.

Comments

big_ol_tender t1_jdvu92g wrote on March 27, 2023 at 3:57 PM

Thank you for posting this. I’ve raised this issue on a number of threads and even opened an issue on the Alpaca repo. Everyone seems to ignore this, and I’m worried about downstream issues with these models. I would love an open source alternative (I have been exploring making one myself).

JohnyWalkerRed OP t1_jdwjvxy wrote on March 27, 2023 at 6:41 PM

Yeah like the databricks dolly post is funny to me because they are an enterprise software company and dolly is not really useful in the context they operate in. I guess they just wanted to get some publicity.

Looks like Open Assistant, once mature, could enable this. Although it seems the precursor to an Alpaca-like dataset is an RLHF model, which itself needs a human-labeled dataset, so that bottleneck needs to be solved too.

Taenk t1_jdwlejh wrote on March 27, 2023 at 6:50 PM

The Open Assistant project is working on that as well.

rshah4 t1_jdxhz3d wrote on March 27, 2023 at 10:23 PM

I agree with the sentiments here and don’t think it’s ok to use some of these datasets that appear to violate OpenAI’s terms. I dealt with it by making a funny video: https://youtu.be/31u88EDmIwc

sad_dad_is_a_mad_lad t1_jdwhg8a wrote on March 27, 2023 at 6:25 PM

OpenAI's commercial-use restriction will not be easily enforced... They used copyrighted data to train their own models.

big_ol_tender t1_jdy0c6t wrote on March 28, 2023 at 12:38 AM

100% agree, but as someone working for a company, I can’t knowingly open us up to that risk even if the probability is 1%.

Taenk t1_jdw3pn3 wrote on March 27, 2023 at 4:58 PM

https://open-assistant.io / /r/openassistant

wind_dude t1_jdxqp0v wrote on March 27, 2023 at 11:27 PM

Last I checked they still hadn't opensourced the training data... which is bizarre since they used humans to train it, with all the talk of it being opensource.

ninjasaid13 t1_jdxrw65 wrote on March 27, 2023 at 11:36 PM

They're going to open source it on April 15, last I heard. They're still gathering data, with the cutoff date at April 12.

KungFuScubaMaster t1_jdvmcyw wrote on March 27, 2023 at 3:06 PM

Just adding, I'm also very interested in this!

esquire900 t1_jdw02ut wrote on March 27, 2023 at 4:35 PM

I wondered this as well. Generating one through ChatGPT should be relatively cheap (in the range of ~$50 for 50k examples?), but I find the commercial use of it dubious. I can't really find any explicit statement on the license of data that comes out of ChatGPT, or davinci or similar.

If some users here are interested, it might be worth the effort to design some proper prompts, all chip in a small amount, and let GPT do the churning?
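A rough back-of-the-envelope check of that ~$50 figure, as a sketch only. The 500 tokens per example and the $0.002 per 1K tokens price are assumptions chosen for illustration, not numbers from this thread, and `estimate_cost_usd` is a hypothetical helper; check current API pricing before relying on it.

```python
def estimate_cost_usd(num_examples: int,
                      tokens_per_example: int,
                      usd_per_1k_tokens: float) -> float:
    """Estimate API spend for generating a synthetic instruction dataset."""
    total_tokens = num_examples * tokens_per_example
    return total_tokens / 1000 * usd_per_1k_tokens


# Assumed inputs: 50k examples, ~500 tokens each (prompt + completion),
# billed at an assumed $0.002 per 1K tokens.
cost = estimate_cost_usd(num_examples=50_000,
                         tokens_per_example=500,
                         usd_per_1k_tokens=0.002)
print(f"Estimated cost: ${cost:.2f}")
```

Under those assumptions the estimate lands at roughly $50, so the order of magnitude quoted above is plausible. Longer examples or a pricier model scale the total linearly.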

quitenominal t1_jdw15ao wrote on March 27, 2023 at 4:42 PM

It's in the terms that you can't use data generated through OpenAI to compete with OpenAI - and I believe they'd be able to argue competition were the trained model to be used commercially.

See section 2.C.iii of https://openai.com/policies/terms-of-use

esquire900 t1_jdwh7v8 wrote on March 27, 2023 at 6:24 PM

Yea I was afraid so, just hadn't found it. Thank you for pointing that out :)

nullbyte420 t1_jdw1g9t wrote on March 27, 2023 at 4:44 PM

And also against the terms of use

Smallpaul t1_jdw0vx9 wrote on March 27, 2023 at 4:40 PM

It seems to me that if a researcher uses OpenAI to generate an open source Instruct dataset, and a different corporation takes that dataset and uses it commercially, they are both legally in the clear unless they collude. The entity that is legally in contract with OpenAI has a legitimately non-commercial purpose, and the entity doing the commercial work has no relationship with OpenAI.

ninjasaid13 t1_jdx9x3n wrote on March 27, 2023 at 9:28 PM

can you even copyright a dataset generated by an AI?

Smallpaul t1_jdxf3u3 wrote on March 27, 2023 at 10:03 PM

Probably not legally different than a document you created with a word processor.

learn-deeply t1_jdxgxsx wrote on March 27, 2023 at 10:16 PM

This isn't correct, at least in the US. AI-generated material is not considered copyrightable unless there has been significant human involvement.

https://www.federalregister.gov/documents/2023/03/16/2023-05321/copyright-registration-guidance-works-containing-material-generated-by-artificial-intelligence

Raywuo t1_jeal232 wrote on March 30, 2023 at 4:52 PM

Exactly, they could never appeal or they would be in contradiction with themselves.

wind_dude t1_jdxrcpp wrote on March 27, 2023 at 11:32 PM

>depend on the Alpaca dataset, which was generated from a GPT3 davinci model, and is subject to non-commercial use

Where do you get that? tatsu-lab/stanford_alpaca is apache 2.0, so you can use it for whatever.

for OpenAI

"""

(c) Restrictions. You may not (i) use the Services in a way that infringes, misappropriates or violates any person’s rights; (ii) reverse assemble, reverse compile, decompile, translate or otherwise attempt to discover the source code or underlying components of models, algorithms, and systems of the Services (except to the extent such restrictions are contrary to applicable law); (iii) use output from the Services to develop models that compete with OpenAI; (iv) except as permitted through the API...

"""

So as far as I'm concerned you are allowed to use the generated dataset for commercial purposes...

The only issue might be the licensing on the LLaMA models... but you can train another LLM.

lazybottle t1_jec8i0c wrote on March 30, 2023 at 11:19 PM

Alpaca is not Apache 2.0

https://huggingface.co/datasets/tatsu-lab/alpaca#licensing-information

> The dataset is available under the Creative Commons NonCommercial (CC BY-NC 4.0).

Edit: I see the source of confusion. https://github.com/tatsu-lab/stanford_alpaca

While the code is released under apache 2.0, the instruct dataset as pointed out by OP is not. One could potentially repro the steps, possibly with human ground truth, and release under a more amenable data license.

wind_dude t1_jec9lb4 wrote on March 30, 2023 at 11:27 PM

Interesting I didn't realise the dataset was on HF with a different license. The dataset (https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json) is also in the code repo which has the apache 2.0 license, so the dataset would be covered by it.

kawin_e t1_jdxz4bh wrote on March 28, 2023 at 12:29 AM

The Stanford Human Preferences dataset (SHP): https://huggingface.co/datasets/stanfordnlp/SHP

It contains pairwise preferences for posts (so tuples (post, response_A, response_B)), but you can certainly turn it into an instruction dataset by only considering responses that meet a certain cut-off. I'm currently aware of one academic/industry group that is already doing this.
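The filtering idea above could be sketched roughly as follows. This is a minimal illustration, not the method any group actually uses. The field names (`history`, `human_ref_A`, `human_ref_B`, `score_A`, `score_B`, `labels`, `score_ratio`) are assumed to follow the SHP dataset card, with `labels == 1` meaning response A was preferred; the thresholds, the toy rows, and the `preferences_to_instructions` helper are all hypothetical.

```python
from typing import Dict, List


def preferences_to_instructions(rows: List[Dict],
                                min_score: int = 10,
                                min_ratio: float = 2.0) -> List[Dict]:
    """Keep only the preferred response of each pair, and only if it clears the cut-offs."""
    examples = []
    for row in rows:
        # Assumption: labels == 1 means response A was preferred, otherwise B.
        if row["labels"] == 1:
            winner, score = row["human_ref_A"], row["score_A"]
        else:
            winner, score = row["human_ref_B"], row["score_B"]
        # Cut-off: the winner needs enough votes and a clear margin over the loser.
        if score >= min_score and row["score_ratio"] >= min_ratio:
            examples.append({"instruction": row["history"], "output": winner})
    return examples


# Toy rows for illustration only (not real SHP data).
sample_rows = [
    {"history": "How do I season a cast iron pan?",
     "human_ref_A": "Rub it with oil and bake it.",
     "human_ref_B": "Just buy a new one.",
     "score_A": 40, "score_B": 5, "labels": 1, "score_ratio": 8.0},
    {"history": "Is it rude to leave a party early?",
     "human_ref_A": "Depends.",
     "human_ref_B": "Not really.",
     "score_A": 3, "score_B": 4, "labels": 0, "score_ratio": 1.33},
    {"history": "Why is the sky blue?",
     "human_ref_A": "Because it is.",
     "human_ref_B": "Shorter wavelengths of sunlight scatter more in the atmosphere.",
     "score_A": 6, "score_B": 30, "labels": 0, "score_ratio": 5.0},
]

dataset = preferences_to_instructions(sample_rows)
for example in dataset:
    print(example)
```

In practice the rows would come from the Hugging Face dataset rather than a hand-written list, and the cut-offs would need tuning. Filtering on both absolute score and score ratio is one possible choice; a stricter or looser rule may suit a given use case better.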

ninjasaid13 t1_jdy2pqq wrote on March 28, 2023 at 12:56 AM

>one academic/industry group

which one?

abnormal_human t1_jdyxteq wrote on March 28, 2023 at 5:18 AM

Model weights are not currently considered to be copyrightable, and there is no DMCA/RIAA/MPAA machinery providing additional consequences for "pirating" them. At least for the moment, it's not a big risk to use LLaMA/Alpaca models for commercial use so long as you have not made an agreement with Facebook not to do it.

The OpenAI policy is about competing models, and comes from the TOS of using their API. Stanford agreed to that TOS, then released the text (which is again, not copyrightable). Random people downloading that data set aren't party to that agreement or bound by it.

I'm sure that Google, Facebook, Amazon, Netflix, etc will be cautious here, but for a random smaller org, this is a risk/benefit tradeoff, not an absolute.

A person who takes a torrented LLaMA and finetunes it using the Stanford data set didn't necessarily engage in any contracts prohibiting that.

The original leaker of LLaMA weights broke the rules. That's about it. Tsk tsk.

rshah4 t1_jdxhesh wrote on March 27, 2023 at 10:19 PM

It’s possible to pay one of the labeling companies for an instruction dataset. Right now most companies aren’t donating 50k+ datasets to the public, but I expect this will change soon.

ninjasaid13 t1_jdy2mgw wrote on March 28, 2023 at 12:55 AM

>Right now most companies aren’t donating 50k+ datasets to the public, but I expect this will change soon.

See the Open Assistant dataset, which will be publicly released as open source on April 15th.

Raywuo t1_jeadybx wrote on March 30, 2023 at 4:06 PM

Well, data generated by GPT cannot be used to train a new AI commercially, but what about data generated by an AI that was itself trained on GPT data? (2 levels of abstraction) haha
