Comments


ThatInternetGuy t1_jcs253z wrote

You know ChatGPT and GPT-4's licenses forbid using their output data to train competing AI models. What Stanford did was show a proof of concept for their paper, not open-source the model at all.

25

frownGuy12 t1_jcsfnh7 wrote

If OpenAI wants people to respect their IP, they should take the word "open" out of their name. They scraped our data to train their models, after all; it's not like OpenAI themselves aren't pushing the boundaries of what's acceptable when it comes to copyright law.

Legally it's questionable, but ethically speaking I think it's a fine idea.

53

throwaway957280 t1_jcsjj07 wrote

Is OpenAI actually legally allowed to do that? How is using their model for training different from training on copyrighted data which all these models do?

19

Anjz t1_jcsktsf wrote

It's probably untested in the courts, and there are so many loopholes and variables. What counts as a competing AI model? Companies usually just throw a bunch of stuff into their terms of use, some of which has no legal basis.

19

kex t1_jcsm7kh wrote

I'd say enjoy it while it lasts, at the very least

6

hughperman t1_jcswzfh wrote

Train a model that's designated as non-competing but open, then train a competing model from the output of that one.

4

starstruckmon t1_jct0s11 wrote

They are. It's less to do with copyright and more to do with the fact that you agreed to the T&C before using their system (and then broke them). It's similar to the LinkedIn data-scraping case, where the court ruled that the scraping wasn't illegal (nor did it violate copyright), but the scrapers still got in trouble (and had to settle) for violating the T&C.

One way around this is to have two parties: one generating and publishing the dataset (doesn't violate the T&C) and another independent party (who never agreed to the T&C) fine-tuning a model on the dataset.

6

RoyalCities t1_jctcu1m wrote

Couldn't it be possible to set up a large community Q/A repository then? Just crowdsource whatever it outputs and document it collectively.

2

bitchslayer78 t1_jcsz4s3 wrote

No they aren't. They have no claim on transformers, that would be Google Brain, but you don't see Alphabet throwing a sissy fit.

1

yaosio t1_jcsob5z wrote

The output of AI can't be copyrighted so OpenAI has no say in what somebody does with the output.

10

lxe t1_jcsqk7t wrote

Copyright and license terms are different things.

3

yaosio t1_jcsqxwf wrote

It doesn't matter what the license terms say if they can't be enforced.

9

Uptown-Dog t1_jct32n7 wrote

I think you'd be dismayed at how easy it is to enforce these things when you have OpenAI money.

1

objectdisorienting t1_jcsu3xk wrote

Will be interesting to see where lawmakers and courts ultimately land on this, but the current status quo is that AI generated text and images (or any other works) cannot be copyrighted. In other words for now all output is public domain and OpenAI can kick rocks on this. A TOS violation just means you might get banned from using their service lol.

1

VertexMachine t1_jct3b51 wrote

It's most likely enforceable, but even if it's not, they can simply ban OP for doing that. If OP is using their API in any way that's important to him, it's something to consider.

1

Either-Job-341 t1_jcrhysc wrote

The OpenAI API charges based on how many tokens you use, doesn't it? Afaik, the fixed price ($20) is for using it via the UI (probably max one session).
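As a rough back-of-envelope check (a sketch only: the per-token price is the gpt-3.5-turbo rate as of March 2023, and the token counts are assumptions, not numbers from the post):

```python
# Back-of-envelope cost estimate for generating an instruction dataset via the API.
# Price is the gpt-3.5-turbo rate as of March 2023; token counts are guesses.
price_per_1k_tokens = 0.002          # USD, prompt + completion combined
examples = 2_000_000                 # the 2M question/response target mentioned in the thread
avg_tokens_per_example = 400         # assumed prompt + response length

total_tokens = examples * avg_tokens_per_example
cost = total_tokens / 1000 * price_per_1k_tokens
print(f"~{total_tokens:,} tokens -> ~${cost:,.0f}")   # ~800,000,000 tokens -> ~$1,600
```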

24

A1-Delta t1_jcrpd05 wrote

Interesting project! I’ve seen many suggest that the training data for transfer learning might actually be the biggest thing holding Alpaca back from a ChatGPT like experience. In other words, that although the OpenAI model allows for the creation of a lot of training data, that data might include a lot of low quality pairs that in an ideal world wouldn’t be included. Do you have any plan to increase the quality of your dataset in addition to the size of it?

I hear your concern about the LLaMA license. It might be bad advice, but personally I wouldn’t worry about it. This is a very popular model people are using for all sorts of things. The chance they are going to come after you seems to me to be small and my understanding is that it’s sort of uncharted legal ground once you’ve done significant fine tuning. That being said, I’m not a lawyer.

LLaMA is a very powerful model and I would hate for you to put all this effort into creating something that ends up being limited and not clearly better than Alpaca simply because of license fears. If I were you though, I’d go with the 13B version. Still small enough to run on many high end consumer GPUs after quantization while providing significantly better baseline performance than the 7B version.

20

starstruckmon t1_jct0dxy wrote

Just publish the diff between the original model and the fine-tuned model. That's what a lot of people are doing to avoid any license issues.
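In case it helps, a rough sketch of what publishing a weight diff can look like in PyTorch (file names are made up; this assumes both checkpoints are full-precision and share identical parameter names):

```python
# Rough sketch: publish only per-parameter deltas instead of the base weights.
# Paths and file names are illustrative, not from any actual release.
import torch

base = torch.load("llama-7b/consolidated.00.pth", map_location="cpu")
tuned = torch.load("finetuned-7b/consolidated.00.pth", map_location="cpu")

# Compute and save only the deltas; the base weights are never redistributed.
delta = {name: tuned[name] - base[name] for name in tuned}
torch.save(delta, "finetune-delta.pth")

# Anyone with legitimate access to the base weights can reconstruct the model.
restored = {name: base[name] + delta[name] for name in delta}
torch.save(restored, "reconstructed-7b.pth")
```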

2

MysteryInc152 t1_jcrnqc8 wrote

You can try training ChatGLM. 6B parameters, initially trained on 1T English/Chinese tokens, and completely open source. However, it has already been fine-tuned and had RLHF, but that was optimized for Chinese Q/A. It could use some English work.

Another option is RWKV. There are 7B and 14B models (I would go with the 14B, it's the better of the two) fine-tuned to a context length of 8196 tokens. He plans on increasing the context further too.

17

Craiglbl t1_jcrxjy2 wrote

ChatGLM is really good. I sometimes have a hard time distinguishing its Chinese outputs from those of ChatGPT.

Sadly, its English could use some improvement, as it often uses Chinese adjectives when similar words are lacking in English.

8

cthorrez t1_jcvhg41 wrote

RWKV is recurrent, right? Why is it token-limited?

1

noobgolang t1_jcrvlfl wrote

Someone needs to take the plunge and release all of this to the wild, rather than keeping it closed source.

6

ReasonablyBadass t1_jcs32ea wrote

Careful. That MIT license won't work, I think, thanks to ClosedAI's licences.

6

ninjasaid13 t1_jcsth4w wrote

>Careful. That MIT license won't work, I think, thanks to ClosedAI's licences.

Generally, copyright requires human authorship. If the output of an AI model is solely generated by a machine without human input, it may not be eligible for copyright protection and may fall into the public domain.

4

ReasonablyBadass t1_jcsu1yv wrote

Not sure how much this is established law.

Anyway, Alpaca says so themselves on their website: https://crfm.stanford.edu/2023/03/13/alpaca.html

1

ninjasaid13 t1_jcsv2oi wrote

It's what the copyright office said regarding that Midjourney comic someone tried to register for copyright.

Since it was created by an AI, the output cannot be registered for copyright, and a license holds no power over something that's in the public domain.

2

[deleted] OP t1_jcsjd9y wrote

For those who wish for LLaMA to become truly open source, please vote on this:

https://github.com/facebookresearch/llama/pull/184

5

VertexMachine t1_jct3jwb wrote

And what does voting there do to make it open source? LeCun already knows that the majority of people don't like this licensing; people have been tweeting that at him since the LLaMA release...

0

[deleted] OP t1_jct6a1x wrote

It provides a clear and formal way for the community to express its opinion. You know, as opposed to tweeting at one person who does not have absolute control over Meta AI. Notable people have brought attention to that pull request, and it is currently gaining traction.

2

Either-Job-341 t1_jcrhe3r wrote

RemindMe! 2 days

4

Euphoric-Escape-9492 t1_jcvdk04 wrote

Very sad, considering his account was deleted. I hope he still finds a way to post his results (if he decides to still go through with the idea).

3

RoyalCities t1_jcrxlvr wrote

I was talking to GPT-4 about this and it said that it seems plausible and can dramatically bring down costs.

It called it "knowledge distillation".

It also mentioned that if we had access to the weights from OpenAI, you could use a process called model compression to scale down the hardware requirements and run it on less powerful GPUs or distributed GPUs (like how render farms work).

This also explains why OpenAI is so cagey about releasing weights: the initial training cost is where the money sink is, but once the weights are out there are ways to make the model run on cheaper hardware.

But I'm wondering: does this mean the smaller model can ONLY respond to the questions you're generating, or will it have latent knowledge beyond just the knowledge transfer? Like, would a smaller model trained with this approach also be able to answer questions on topics that are "restricted" in OpenAI's view, ones you couldn't ask it about, or do you absolutely need to get an initial answer for such restricted content for it to be able to produce a response?

I'm talking about things like writing malicious code or whatnot. I don't plan on doing that, obviously, but I'm curious whether this means these smaller models will basically be totally unrestricted now, or whether, if one is just trained on, say, tons of Python code, it could create said malicious code from scratch without actually being exposed to examples of "how" to make it (since it has a greater knowledge of the underlying principles of Python).

Edit: Okay, guess it can, per GPT-4.

Damn, these things are fascinating.

>Yes, the same concerns can apply to a smaller model being trained from a larger one via knowledge distillation. Knowledge distillation is a technique where the smaller model learns to mimic the larger model's behavior by training on a dataset generated using the larger model's outputs. The smaller model effectively learns from the larger model's knowledge and understanding of language patterns and concepts.

>As a result, the smaller model can also gain latent knowledge about various topics and domains, even if it hasn't been explicitly exposed to specific examples during training. This means that the smaller model could potentially generate undesirable content based on its understanding of the relationships between words and concepts, similar to the larger model.
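For anyone who wants to poke at this themselves, the data-generation half of the distillation loop is fairly simple. A rough sketch (this uses the pre-1.0 openai Python client that was current at the time of this thread; the prompts, model name, and file names are illustrative, not the OP's actual pipeline):

```python
# Rough sketch: collect instruction/response pairs from a teacher model,
# then fine-tune a smaller student model on them (self-instruct style).
import json
import openai

seed_instructions = [
    "Explain what a hash map is in one paragraph.",
    "Write a Python function that reverses a string.",
]

pairs = []
for instruction in seed_instructions:
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": instruction}],
        temperature=0.7,
    )
    pairs.append({
        "instruction": instruction,
        "output": resp["choices"][0]["message"]["content"],
    })

# The resulting JSONL file becomes the student's fine-tuning dataset.
with open("distilled_pairs.jsonl", "w") as f:
    for p in pairs:
        f.write(json.dumps(p) + "\n")
```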

3

Smallpaul t1_jcsah9r wrote

I think the new model gets most of its knowledge from its original base model, and the training is mostly about how to act like an RLHF model.

3

starstruckmon t1_jct06xj wrote

• There are already a couple of high-quality instruction datasets/compilations, like FLAN, that I think should also be mixed in.

• Be sure to check the generated dataset for issues. It might require some cleanup like the original did (a rough sketch of that is below).
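A rough sketch of the kind of cleanup pass that might help (the field names and thresholds are assumptions, not the project's actual schema):

```python
# Rough sketch: drop empty, duplicate, and suspiciously short/long generations
# from an instruction dataset stored as JSONL. Field names and thresholds are
# illustrative assumptions.
import json

seen = set()
kept = []

with open("distilled_pairs.jsonl") as f:
    for line in f:
        ex = json.loads(line)
        instruction = ex.get("instruction", "").strip()
        output = ex.get("output", "").strip()

        if not instruction or not output:
            continue                              # empty generations
        if len(output) < 20 or len(output) > 4000:
            continue                              # likely truncated or runaway outputs
        key = (instruction.lower(), output.lower())
        if key in seen:
            continue                              # exact duplicates
        seen.add(key)
        kept.append(ex)

with open("cleaned_pairs.jsonl", "w") as f:
    for ex in kept:
        f.write(json.dumps(ex) + "\n")
```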

3

Stock-Nebula2185 t1_jcsuzb5 wrote

You can also query Codex for free. It might not be as good as ChatGPT, but perhaps still worth trying?

2

raduqq t1_jcslcbz wrote

I thought their ToS doesn't allow you to train another model on the output of ChatGPT, GPT-4, or their other models.

1

FaceDeer t1_jcsot55 wrote

All these weird restrictions and regulations seem pretty squirrelly to me.

Maybe this could be "laundered" by doing two separate projects. Have one project gather the 2 million question/response interactions into a big archive, which is then released publicly. Then some other project comes along and uses it for training, without directly interacting with ChatGPT itself.

I'm sure this won't really stop a lawsuit, but the more complicated it can be made for OpenAI to pursue it the less likely they are to go ahead.

5

asraniel t1_jcsr22m wrote

So how does that work? Soon a good chunk of the internet will be text written by GPT (including Wikipedia). Does that mean that going forward you can't legally use the internet as a data source to train an LLM?

5

Long19980 t1_jcsm4ni wrote

Can I see your python script? How did you balance your programming language data between the various languages?

1

assimil8or t1_jcsnwuh wrote

Would UL2 be a good basis?

1

Seromelhor t1_jcsov3a wrote

You can use NLLB from Facebook to translate the sentences from English to more than 200 other languages. That would be interesting.
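If anyone wants to try that, a minimal sketch with the transformers translation pipeline (the distilled 600M checkpoint and the target language are just examples; language codes follow FLORES-200):

```python
# Minimal sketch: translate instruction/response pairs with NLLB-200.
# Checkpoint and language codes are examples, not a recommendation.
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="spa_Latn",   # Spanish; swap for any of the 200+ supported codes
)

pair = {
    "instruction": "Explain what a hash map is in one paragraph.",
    "output": "A hash map stores key-value pairs using a hash function...",
}

translated = {
    k: translator(v, max_length=512)[0]["translation_text"]
    for k, v in pair.items()
}
print(translated)
```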

1

lxe t1_jcsqmdi wrote

You should try fine-tuning OpenChatKit; it's Apache 2 licensed afaik. Or GPT-NeoX-20B if you have the hardware.

1

hapliniste t1_jcsxpna wrote

Nice 👍 good project, I'm eager to see the result. It would be great to make a torrent of the dataset too, to avoid unnecessary costs in the future.

1

Euphoric-Escape-9492 t1_jcvcyoj wrote

Very sad that the post and his account were deleted. I wonder whether he did this intentionally or not.

1

wywywywy t1_jct2wjz wrote

Are you doing a LoRA or full weights?

> I wanted to train Meta's LLaMA model on this data, but considering their license, I'm not sure if that is the best way. Suggestions will be appreciated.

If we ignore OpenAI's licence, is it OK to perhaps ignore Meta's licence as well? Or is that going too far?

> The trained model will be open source, under MIT License.

Is the dataset going to be open source as well, so that other people can use it to train other models?
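(On the LoRA question, for anyone unfamiliar: a minimal sketch of a LoRA fine-tune using the peft library. The model name, target modules, and hyperparameters here are illustrative assumptions, not the OP's actual setup.)

```python
# Minimal LoRA sketch with peft + transformers. Model name, target modules,
# and hyperparameters are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "decapoda-research/llama-7b-hf"  # assumed HF conversion of LLaMA
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections in LLaMA
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA adapters are trainable
# ...train as usual (e.g. with transformers.Trainer) on the instruction dataset,
# then share just the adapter weights instead of the full model.
```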

1

baffo32 t1_jcsy5mb wrote

Maybe set up the training code so different foundation models can be plugged in for fine-tuning, and then it's just compute if somebody wants a different starting model.

Note there are free interfaces to these models such as https://spellbook.scale.com/ . Also note there is a lot of data collected out there already.

0