Comments


ThatInternetGuy t1_jcs253z wrote

You know ChatGPT and GPT-4's licenses forbid using their output data to train competing AI models. What Stanford did was show a proof of concept for their paper, not open-source the model at all.

25

frownGuy12 t1_jcsfnh7 wrote

If OpenAI wants people to respect their IP, they should take the word "open" out of their name. They scraped our data to train their models, after all; it's not like OpenAI themselves aren't pushing the boundaries of what's acceptable when it comes to copyright law.

Legally it's questionable, but ethically speaking I think it's a fine idea.

53

throwaway957280 t1_jcsjj07 wrote

Is OpenAI actually legally allowed to do that? How is using their model for training different from training on copyrighted data which all these models do?

19

Anjz t1_jcsktsf wrote

It's probably untested in the courts, and there are so many loopholes and variables. What counts as a competing AI model? Companies usually just throw a bunch of stuff into their terms of use, some of which has no legal basis.

19

kex t1_jcsm7kh wrote

I'd say enjoy it while it lasts, at the very least

6

hughperman t1_jcswzfh wrote

Train a model that's designated as non-competing but open, then train a competing model from the output of that one.

4

starstruckmon t1_jct0s11 wrote

They are. It's less to do with copyright and more to do with the fact that you agreed to the T&C before using their system (and then broke them). It's similar to the LinkedIn data-scraping case, where the court ruled that the scraping wasn't illegal (nor did it violate copyright), but the scrapers still got in trouble (and had to settle) for violating the T&C.

One way around this is to have two parties: one generating and publishing the dataset (doesn't violate the T&C) and another independent party (who never agreed to the T&C) fine-tuning a model on the dataset.

6

RoyalCities t1_jctcu1m wrote

Couldn't it be possible to set up a large community Q/A repository then? Just crowdsource whatever it outputs and document it collectively.

2

bitchslayer78 t1_jcsz4s3 wrote

No they aren't. They have no claim on transformers, that would be Google Brain, but you don't see Alphabet throwing a sissy fit.

1

yaosio t1_jcsob5z wrote

The output of AI can't be copyrighted so OpenAI has no say in what somebody does with the output.

10

lxe t1_jcsqk7t wrote

Copyright and license terms are different things.

3

yaosio t1_jcsqxwf wrote

It doesn't matter what the license terms say if they can't be enforced.

9

Uptown-Dog t1_jct32n7 wrote

I think you'd be dismayed at how easy it is to enforce these things when you have OpenAI money.

1

objectdisorienting t1_jcsu3xk wrote

Will be interesting to see where lawmakers and courts ultimately land on this, but the current status quo is that AI generated text and images (or any other works) cannot be copyrighted. In other words for now all output is public domain and OpenAI can kick rocks on this. A TOS violation just means you might get banned from using their service lol.

1

VertexMachine t1_jct3b51 wrote

It's most likely enforceable, but even if it's not, they can simply ban OP for doing that. If OP is using their API in any way that's important to him, it's something to consider.

1

Either-Job-341 t1_jcrhysc wrote

The OpenAI API charges based on how many tokens you use, doesn't it? Afaik, the fixed price ($20) is for using it via the UI (probably max one session).
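As a rough back-of-envelope check (a sketch only: the per-token price is the gpt-3.5-turbo rate as of March 2023, and the token counts are assumptions, not numbers from the post):

```python
# Back-of-envelope cost estimate for generating an instruction dataset via the API.
# Price is the gpt-3.5-turbo rate as of March 2023; token counts are guesses.
price_per_1k_tokens = 0.002          # USD, prompt + completion combined
examples = 2_000_000                 # the 2M question/response target mentioned in the thread
avg_tokens_per_example = 400         # assumed prompt + response length

total_tokens = examples * avg_tokens_per_example
cost = total_tokens / 1000 * price_per_1k_tokens
print(f"~{total_tokens:,} tokens -> ~${cost:,.0f}")   # ~800,000,000 tokens -> ~$1,600
```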

24

A1-Delta t1_jcrpd05 wrote

Interesting project! I’ve seen many suggest that the training data for transfer learning might actually be the biggest thing holding Alpaca back from a ChatGPT like experience. In other words, that although the OpenAI model allows for the creation of a lot of training data, that data might include a lot of low quality pairs that in an ideal world wouldn’t be included. Do you have any plan to increase the quality of your dataset in addition to the size of it?

I hear your concern about the LLaMA license. It might be bad advice, but personally I wouldn’t worry about it. This is a very popular model people are using for all sorts of things. The chance they are going to come after you seems to me to be small and my understanding is that it’s sort of uncharted legal ground once you’ve done significant fine tuning. That being said, I’m not a lawyer.

LLaMA is a very powerful model and I would hate for you to put all this effort into creating something that ends up being limited and not clearly better than Alpaca simply because of license fears. If I were you though, I’d go with the 13B version. Still small enough to run on many high end consumer GPUs after quantization while providing significantly better baseline performance than the 7B version.

20

starstruckmon t1_jct0dxy wrote

Just publish the diff between the original model and the fine-tuned model. That's what a lot of people are doing to avoid any license issues.
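In case it helps, a rough sketch of what publishing a weight diff can look like in PyTorch (file names are made up; this assumes both checkpoints are full-precision and share identical parameter names):

```python
# Rough sketch: publish only per-parameter deltas instead of the base weights.
# Paths and file names are illustrative, not from any actual release.
import torch

base = torch.load("llama-7b/consolidated.00.pth", map_location="cpu")
tuned = torch.load("finetuned-7b/consolidated.00.pth", map_location="cpu")

# Compute and save only the deltas; the base weights are never redistributed.
delta = {name: tuned[name] - base[name] for name in tuned}
torch.save(delta, "finetune-delta.pth")

# Anyone with legitimate access to the base weights can reconstruct the model.
restored = {name: base[name] + delta[name] for name in delta}
torch.save(restored, "reconstructed-7b.pth")
```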

2

MysteryInc152 t1_jcrnqc8 wrote

You can try training ChatGLM. 6B parameters, initially trained on 1T English/Chinese tokens, and completely open source. However, it has already been fine-tuned and had RLHF, but that was optimized for Chinese Q/A. It could use some English work.

Another option is RWKV. There are 7B and 14B models (I would go with the 14B, it's the better of the two) fine-tuned to a context length of 8196 tokens. He plans on increasing the context further too.

17

Craiglbl t1_jcrxjy2 wrote

ChatGLM is really good. I sometimes have a hard time distinguishing its Chinese outputs from those of ChatGPT.

Sadly, its English could use some improvement, as it often uses Chinese adjectives when similar words are lacking in English.

8

cthorrez t1_jcvhg41 wrote

RWKV is recurrent, right? Why is it token-limited?

1

noobgolang t1_jcrvlfl wrote

Someone needs to take the plunge and release all of this to the wild, rather than keeping it closed source.

6

ReasonablyBadass t1_jcs32ea wrote

Careful. That MIT license won't work, I think, thanks to ClosedAI's licences.

6

ninjasaid13 t1_jcsth4w wrote

>Careful. That MIT license won't work, I think, thanks to ClosedAI's licences.

Generally, copyright requires human authorship. If the output of an AI model is solely generated by a machine without human input, it may not be eligible for copyright protection and may fall into the public domain.

4

ReasonablyBadass t1_jcsu1yv wrote

Not sure how much this is established law.

Anyway, Alpaca says so themselves on their website: https://crfm.stanford.edu/2023/03/13/alpaca.html

1

ninjasaid13 t1_jcsv2oi wrote

It's what the copyright office said regarding that Midjourney comic someone tried to register for copyright.

Since it was created by an AI, the output cannot be registered for copyright, and a license holds no power over something that's in the public domain.

2

[deleted] OP t1_jcsjd9y wrote

For those who wish for LLaMA to become truly open source, please vote on this:

https://github.com/facebookresearch/llama/pull/184

5

VertexMachine t1_jct3jwb wrote

And what does voting there do to make it open source? LeCun already knows that the majority of people don't like this licensing; people have been tweeting that at him since the LLaMA release...

0

[deleted] OP t1_jct6a1x wrote

It provides a clear and formal way for the community to express its opinion. You know, as opposed to tweeting at one person who does not have absolute control over Meta AI. Notable people have brought attention to that pull request, and it is currently gaining traction.

2

Either-Job-341 t1_jcrhe3r wrote

RemindMe! 2 days

4

Euphoric-Escape-9492 t1_jcvdk04 wrote

Very sad, considering his account was deleted. I hope he still finds a way to post his results (if he decides to still go through with the idea).

3

RoyalCities t1_jcrxlvr wrote

I was talking to GPT-4 about this and it said that it seems plausible and can dramatically bring down costs.

It called it "knowledge distillation".

It also mentioned that if we had access to the weights from OpenAI, you could use a process called model compression to scale down the hardware requirements and run it on less powerful GPUs or distributed GPUs (like how render farms work).

This also explains why OpenAI is so cagey about releasing weights: the initial training cost is where the money sink is, but once the weights are out there are ways to make the model run on cheaper hardware.

But I'm wondering: does this mean the smaller model can ONLY respond to the questions you're generating, or will it have latent knowledge beyond just the knowledge transfer? Like, would a smaller model trained with this approach also be able to answer questions on topics that are "restricted" in OpenAI's view, ones you couldn't ask it about, or do you absolutely need to get an initial answer for such restricted content for it to be able to produce a response?

I'm talking about things like writing malicious code or whatnot. I don't plan on doing that, obviously, but I'm curious whether this means these smaller models will basically be totally unrestricted now, or whether, if one is just trained on, say, tons of Python code, it could create said malicious code from scratch without actually being exposed to examples of "how" to make it (since it has a greater knowledge of the underlying principles of Python).

Edit: Okay, guess it can, per GPT-4.

Damn, these things are fascinating.

>Yes, the same concerns can apply to a smaller model being trained from a larger one via knowledge distillation. Knowledge distillation is a technique where the smaller model learns to mimic the larger model's behavior by training on a dataset generated using the larger model's outputs. The smaller model effectively learns from the larger model's knowledge and understanding of language patterns and concepts.

>As a result, the smaller model can also gain latent knowledge about various topics and domains, even if it hasn't been explicitly exposed to specific examples during training. This means that the smaller model could potentially generate undesirable content based on its understanding of the relationships between words and concepts, similar to the larger model.
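For anyone who wants to poke at this themselves, the data-generation half of the distillation loop is fairly simple. A rough sketch (this uses the pre-1.0 openai Python client that was current at the time of this thread; the prompts, model name, and file names are illustrative, not the OP's actual pipeline):

```python
# Rough sketch: collect instruction/response pairs from a teacher model,
# then fine-tune a smaller student model on them (self-instruct style).
import json
import openai

seed_instructions = [
    "Explain what a hash map is in one paragraph.",
    "Write a Python function that reverses a string.",
]

pairs = []
for instruction in seed_instructions:
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": instruction}],
        temperature=0.7,
    )
    pairs.append({
        "instruction": instruction,
        "output": resp["choices"][0]["message"]["content"],
    })

# The resulting JSONL file becomes the student's fine-tuning dataset.
with open("distilled_pairs.jsonl", "w") as f:
    for p in pairs:
        f.write(json.dumps(p) + "\n")
```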

3

Smallpaul t1_jcsah9r wrote

I think the new model gets most of its knowledge from its original base model, and the training is mostly about how to act like an RLHF model.

3

starstruckmon t1_jct06xj wrote

• There are already a couple of high-quality instruction datasets/compilations, like FLAN, that I think should also be mixed in.

• Be sure to check the generated dataset for issues. It might require some cleanup like the original did (a rough sketch of that is below).
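A rough sketch of the kind of cleanup pass that might help (the field names and thresholds are assumptions, not the project's actual schema):

```python
# Rough sketch: drop empty, duplicate, and suspiciously short/long generations
# from an instruction dataset stored as JSONL. Field names and thresholds are
# illustrative assumptions.
import json

seen = set()
kept = []

with open("distilled_pairs.jsonl") as f:
    for line in f:
        ex = json.loads(line)
        instruction = ex.get("instruction", "").strip()
        output = ex.get("output", "").strip()

        if not instruction or not output:
            continue                              # empty generations
        if len(output) < 20 or len(output) > 4000:
            continue                              # likely truncated or runaway outputs
        key = (instruction.lower(), output.lower())
        if key in seen:
            continue                              # exact duplicates
        seen.add(key)
        kept.append(ex)

with open("cleaned_pairs.jsonl", "w") as f:
    for ex in kept:
        f.write(json.dumps(ex) + "\n")
```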

3

Stock-Nebula2185 t1_jcsuzb5 wrote

You can also query Codex for free. It might not be as good as ChatGPT, but perhaps still worth trying?

2

raduqq t1_jcslcbz wrote

I thought their ToS doesn't allow you to train another model on the output of ChatGPT, GPT-4, or their other models.

1

FaceDeer t1_jcsot55 wrote

All these weird restrictions and regulations seem pretty squirrelly to me.

Maybe this could be "laundered" by doing two separate projects. Have one project gather the 2 million question/response interactions into a big archive, which is then released publicly. Then some other project comes along and uses it for training, without directly interacting with ChatGPT itself.

I'm sure this won't really stop a lawsuit, but the more complicated it can be made for OpenAI to pursue it the less likely they are to go ahead.

5

asraniel t1_jcsr22m wrote

So how does that work? Soon a good chunk of the internet will be text written by GPT (including Wikipedia). Does that mean that going forward you can't legally use the internet as a data source to train an LLM?

5

Long19980 t1_jcsm4ni wrote

Can I see your python script? How did you balance your programming language data between the various languages?

1

assimil8or t1_jcsnwuh wrote

Would UL2 be a good basis?

1

Seromelhor t1_jcsov3a wrote

You can use NLLB from Facebook to translate the sentences from English to more than 200 other languages. That would be interesting.
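If anyone wants to try that, a minimal sketch with the transformers translation pipeline (the distilled 600M checkpoint and the target language are just examples; language codes follow FLORES-200):

```python
# Minimal sketch: translate instruction/response pairs with NLLB-200.
# Checkpoint and language codes are examples, not a recommendation.
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="spa_Latn",   # Spanish; swap for any of the 200+ supported codes
)

pair = {
    "instruction": "Explain what a hash map is in one paragraph.",
    "output": "A hash map stores key-value pairs using a hash function...",
}

translated = {
    k: translator(v, max_length=512)[0]["translation_text"]
    for k, v in pair.items()
}
print(translated)
```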

1

lxe t1_jcsqmdi wrote

You should try fine-tuning OpenChatKit; it's Apache 2 licensed afaik. Or GPT-NeoX-20B if you have the hardware.

1

hapliniste t1_jcsxpna wrote

Nice 👍 good project, I'm eager to see the result. It would be great to make a torrent of the dataset too, to avoid unnecessary costs in the future.

1

Euphoric-Escape-9492 t1_jcvcyoj wrote

Very sad that the post and his account were deleted. I wonder whether he did this intentionally or not.

1

wywywywy t1_jct2wjz wrote

Are you doing a LoRA or full weights?

> I wanted to train Meta's LLaMA model on this data, but considering their license, I'm not sure if that is the best way. Suggestions will be appreciated.

If we ignore OpenAI's licence, is it OK to perhaps ignore Meta's licence as well? Or is that going too far?

> The trained model will be open source, under MIT License.

Is the dataset going to be open source as well, so that other people can use it to train other models?
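(On the LoRA question, for anyone unfamiliar: a minimal sketch of a LoRA fine-tune using the peft library. The model name, target modules, and hyperparameters here are illustrative assumptions, not the OP's actual setup.)

```python
# Minimal LoRA sketch with peft + transformers. Model name, target modules,
# and hyperparameters are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "decapoda-research/llama-7b-hf"  # assumed HF conversion of LLaMA
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections in LLaMA
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA adapters are trainable
# ...train as usual (e.g. with transformers.Trainer) on the instruction dataset,
# then share just the adapter weights instead of the full model.
```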

1

baffo32 t1_jcsy5mb wrote

Maybe set up the training code so different foundation models can be plugged in for fine-tuning, and then it's just compute if somebody wants a different starting model.

Note there are free interfaces to these models such as https://spellbook.scale.com/ . Also note there is a lot of data collected out there already.

0