Submitted by coconautico t3_11c1hzc in MachineLearning

Hey Reddit,

tl;dr: To democratize the technology behind virtual assistants, we can play a Q&A game to build a collaborative dataset that will enable the creation of culturally and politically unbiased virtual assistants.

As AI becomes more ubiquitous in our lives, we need to democratize it and ensure that the next generation of virtual assistants, such as ChatGPT or BingChat, is not controlled solely by one company, group, or country. That kind of control would make it easy to skew our reality by deploying politically and culturally biased assistants at scale, as we have already seen with OpenAI.

While one could argue that, over time, companies and startups will emerge and build their own alternatives, these could be few: creating such virtual assistants is not only a matter of massive raw data and computation, but also requires very specific datasets (many of them written by experts from multiple fields) for "fine-tuning" Large Language Models (LLMs) into virtual assistants.

Because of this, there is an international collaborative effort to create a public, multilingual, high-quality dataset through a Q&A game, which will enable the creation of other virtual assistants outside the control of these companies.

At this very moment, we already have more data than OpenAI had when it launched the first version of ChatGPT. However, the current dataset is strongly biased towards Spanish and English speakers, as they are the only ones who have contributed to it so far. Therefore, we need to encourage people from other countries and cultures to play this Q&A game in order to create a truly multilingual dataset with expert knowledge of all kinds, from all over the world. (This would even allow the virtual assistant to answer questions that have never been answered in a given language.)

For Spanish and English this is already a reality. Let's make it a reality for other languages too by writing a few questions and answers in the OpenAssistant game!

Link: https://open-assistant.io/

23

Comments


visarga t1_ja2r2fe wrote

Wouldn't it be better if people could donate their interactions with ChatGPT, BingChat and other models? Make a scraping extension that collects chat logs and anonymises them. Then you'd have a diverse distribution of real-life tasks.
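The anonymisation part wouldn't have to be fancy. A rough sketch of the kind of scrubbing such an extension could do before uploading (the patterns, field names and log format here are made up for illustration):

```python
import re

# Hypothetical PII patterns to scrub from donated chat logs.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "IP": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def anonymise(text: str) -> str:
    """Replace common PII patterns with placeholder tokens."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

# Toy chat log in an assumed role/text format.
chat_log = [
    {"role": "user", "text": "Email me at jane.doe@example.com about the report."},
    {"role": "assistant", "text": "Sure, I'll send it over."},
]
cleaned = [{**turn, "text": anonymise(turn["text"])} for turn in chat_log]
print(cleaned)
```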

I suspect this is the reason OpenAI and Bing offered their models for free to the public - to find the real distribution of tasks people want to solve with AI bots.

9

avocadoughnut t1_ja35pg6 wrote

There's a risk of breaking OpenAI's TOS by training on their models' output. It's a hard no for this project, to ensure legal safety.

8

coconautico OP t1_ja3nvs7 wrote

I have manually copy-pasted a few interesting questions (i.e., my input) that I had previously asked ChatGPT and that encouraged lateral thinking or required specialized knowledge.

However, I'm not so sure it would be a good idea to load thousands of questions indiscriminately. Just as we wouldn't phrase a question on Reddit the same way we would in person, when we ask ChatGPT (or Google) a question we slightly change the way we talk to account for the weaknesses of the system. And given that we are looking for a high-quality dataset of natural conversations, I don't think this would be a very good strategy in the short term.

Moreover, we also have to consider that the project prioritizes quality above all else, and unless the number of volunteers ranking questions/replies increases considerably, the ratio of conversation trees ready to be exported wouldn't increase much either.

3

LetterRip t1_ja3rzqk wrote

> I have manually copy-pasted a few interesting questions (i.e., my input) that I had previously asked ChatGPT and that encouraged lateral thinking or required specialized knowledge.

Don't do that - it violates ChatGPT's TOS which could result in a lawsuit against the model developers.

0

coconautico OP t1_ja3ujgs wrote

According to OpenAI's terms of service, I'm the owner of the input (i.e., my question), which implies that they can use, modify, and distribute my input for the purpose of operating and improving the ChatGPT system, but they can't do anything to prevent me from using my data in other systems.
Link: https://openai.com/terms/

6

LetterRip t1_ja4d12c wrote

It appears they have changed the ToS. It used to restrict usage of output.

2

sebzim4500 t1_ja87cym wrote

> You may not [...] (iii) use the Services to develop foundation models or other large scale models that compete with OpenAI

1

coconautico OP t1_ja8abnh wrote

I can't use the output of ChatGPT to train other systems, but I can use my input however I want because, according to the TOS, I'm the owner of it.

3

sebzim4500 t1_ja8agwp wrote

Are you using the output of ChatGPT to determine which inputs you copy across and which ones you don't? If not, I agree that you are probably in the clear. Otherwise idk.

1

coconautico OP t1_ja8dbew wrote

No, I don't, because even if chatGPT could answer my question correctly, that doesn't mean that another assistant could.

Therefore, when I come up with a question that, from my point of view, could be challenging for a virtual assistant to answer, I end up typing it into OpenAssistant regardless of whether I have searched Google/Reddit/StackOverflow/ChatGPT/... for the answer (again, just my question).

2

firejak308 t1_ja16y0h wrote

My main concern with this is how the "Reply as Assistant" texts are generated. That task is orders of magnitude more difficult than labeling an existing reply/prompt or coming up with a new prompt, because it often requires doing background research about the question and summarizing it effectively. If I were to actually try to fill out one of the Reply as Assistant tasks, I would much rather just copy-paste the Google Knowledge Panel or the Wikipedia summary or the ChatGPT output. How do we know that people aren't doing those kinds of things, which could introduce plagiarism concerns?

5

coconautico OP t1_ja1gd4g wrote

Indeed! Many of them are just copying and pasting answers out of laziness or because they don't know they're not supposed to. But you know what? That's okay! It doesn't matter. And it's all thanks to the magic of large-scale ranking! Let me explain.

If we had an LLM that just "reads" text indiscriminately, we would end up with a model that could hardly be better than the average human (...as the average human is, well, just average). However, the moment we have multiple answers per question, and hundreds of people upvoting/downvoting and ranking them by quality (...plus a few moderators, like on Reddit), we end up with a set of fairly high-quality question-answer pairs that are better than the average human answer, in the same way that a set of weak classifiers can combine into a strong classifier (i.e., AdaBoost).
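To give a concrete (toy) picture of that aggregation, here is a rough sketch using a simple Borda count over made-up volunteer rankings; the project has its own ranking scheme, so treat this purely as an illustration of why many noisy rankings tend to surface the best answer:

```python
from collections import defaultdict

def borda_aggregate(rankings):
    """rankings: list of lists, each ordering answer ids from best to worst."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, answer_id in enumerate(ranking):
            scores[answer_id] += n - position  # top place earns the most points
    return sorted(scores, key=scores.get, reverse=True)

# Five volunteers rank three candidate answers to the same prompt.
volunteer_rankings = [
    ["a2", "a1", "a3"],
    ["a2", "a3", "a1"],
    ["a1", "a2", "a3"],
    ["a2", "a1", "a3"],
    ["a3", "a2", "a1"],
]
print(borda_aggregate(volunteer_rankings))  # ['a2', 'a1', 'a3']
```

Individually each ranking is noisy, but the consensus reliably puts the strongest answer on top, which is what makes the exported pairs better than the average contribution.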

10

topcodemangler t1_ja1pm3i wrote

Question: how much data do you already have and how much more do you need?

3

currentscurrents t1_ja1vjfi wrote

It looks like they currently have ~50k responses, which is around the same amount used to train the reward model for ChatGPT.
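For context, those ranked replies are what the reward model is trained on, roughly via a pairwise preference loss as in the InstructGPT-style setup. A toy sketch (the tiny scorer and the embedding size are placeholders, not anyone's actual model):

```python
import torch
import torch.nn as nn

# Stand-in reward model: scores a reply embedding with a single scalar.
reward_model = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 1))

def pairwise_loss(emb_chosen: torch.Tensor, emb_rejected: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    r_chosen = reward_model(emb_chosen)
    r_rejected = reward_model(emb_rejected)
    return -nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# Pretend embeddings of a preferred and a rejected reply to the same prompt.
loss = pairwise_loss(torch.randn(4, 768), torch.randn(4, 768))
loss.backward()
```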

More data is always better though.

8

Taenk t1_ja4jjxn wrote

Having conversation trees in multiple languages is especially valuable.

1

Taenk t1_ja4jcxm wrote

Subreddit: /r/openassistant

2

photosandphotons t1_ja1ibvm wrote

Just so I understand, is this supposed to be any different than ChatGPT? Or is it just that it’s an open source implementation?

1

coconautico OP t1_ja1kdu6 wrote

Neither. OpenAssistant is the initiative to build an open-source version of ChatGPT that will fit on a consumer GPU.

However, the goal of this website is to collaboratively create the specific type of dataset needed to transform an LLM such as GPT, OPT, Galactica, LLaMA, ... into a virtual assistant that we can talk to, like ChatGPT.
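In case it helps, that "transform an LLM into an assistant" step is essentially supervised fine-tuning on prompt/reply pairs. A very rough sketch (the model name, the <human>/<bot> prompt format and the hyperparameters are placeholders, not what OpenAssistant actually uses):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# (prompter turn, assistant turn) pairs taken from the collected trees.
pairs = [
    ("What is lateral thinking?", "Lateral thinking means approaching a problem indirectly..."),
]

model.train()
for prompt, reply in pairs:
    text = f"<human>: {prompt}\n<bot>: {reply}{tokenizer.eos_token}"
    batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    # Standard causal LM loss over the whole sequence (real setups usually mask the prompt tokens).
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```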

7