visarga t1_j6c2fd0 wrote
Reply to comment by Superschlenz in Google not releasing MusicLM by Sieventer
The question is: is it illegal in itself, simply for existing, or only illegal to publish but fine to train on, since it has no copyright and does not closely resemble the originals? That could be a technical way to reduce exact copyright infringement.
visarga t1_j6c1rmo wrote
Reply to comment by currentscurrents in [N] OpenAI has 1000s of contractors to fine-tune codex by yazriel0
Humans are harder to scale, and it took billions of years of evolution, with enormous resource and energy usage, to get here. A brain trained by evolution is already fit for the environmental niche it has to inhabit. An AI model has none of that: no evolution selecting its internal structure to be optimal. So it has to compensate by learning these things from tons of raw data. We are great at some tasks that relate to our survival, but bad at other tasks, sometimes even worse than other animals or AIs. We are not generally intelligent either.
Also, most AIs don't have real-time interaction with the world. They only have restricted text interfaces or APIs: no robotic bodies, no way to perform interventions that would distinguish causal relations from correlations. When an AI has a feedback loop with the environment, it gets much better at solving tasks.
visarga t1_j6c0o3e wrote
Reply to comment by golongandprosper in Few questions about scalability of chatGPT [D] by besabestin
I very much doubt they do this in real time. The model is responding too fast for that.
They are probably used for RLHF model alignment (keeping it polite, helpful, and harmless) and for generating more samples of tasks being solved: vetting our ChatGPT interaction logs, using the model from the console like the rest of us to solve tasks, or simply writing the answers themselves where the model fails.
visarga t1_j6c0e8m wrote
Reply to comment by besabestin in Few questions about scalability of chatGPT [D] by besabestin
They might use a second model to flag abuse, not once per token but once per line or phrase. Their main models are already trained to resist abuse, but this second model acts as insurance in case the first line of defence fails.
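A minimal sketch of that pattern, with a hypothetical `flag_abuse` keyword check standing in for the second, learned classifier:

```python
def flag_abuse(text):
    # Hypothetical stand-in for a learned abuse classifier;
    # here just a keyword check for illustration.
    return any(word in text.lower() for word in ("attack", "exploit"))

def moderate_stream(lines):
    # Run the checker once per line rather than once per token,
    # trading detection granularity for far fewer classifier calls.
    return [("[filtered]" if flag_abuse(line) else line) for line in lines]
```

Running the checker per line instead of per token cuts the number of classifier calls by an order of magnitude while still catching abusive spans early.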
visarga t1_j6c01ua wrote
Reply to comment by vivehelpme in Few questions about scalability of chatGPT [D] by besabestin
> But yeah there's really no secret sauce to it.
Of course there is: it's data. They keep their mix of primary training sets of organic text, multi-task fine-tuning, code training, and RLHF secret. We know only in general terms what they are doing, but the details matter. How much code did they train on? It matters. How many tasks? 1,800 like Flan-T5, or far more, like 10,000? We have no idea. Do they reuse the prompts to generate more training data? Possibly. Others don't have their API logs because they had no demo.
visarga t1_j6bzixy wrote
Reply to comment by andreichiffa in Few questions about scalability of chatGPT [D] by besabestin
> without increasing the dataset, bigger model do nothing better
Wrong, bigger models are better than small models even when both are trained on exactly the same data. Bigger models reach the same accuracy using fewer examples. Sometimes using a bigger model is the solution to having too little data.
visarga t1_j6bz9e7 wrote
Reply to comment by binheap in Few questions about scalability of chatGPT [D] by besabestin
Model security is the security of Google's revenues if they release the model. chatGPT is very insecure for their ad clicks, it will crash their income. /s
visarga t1_j6atn4r wrote
Three generations ago, people managed without electricity, refrigerators, TV, or running water. Two generations ago we got TVs and computers, but no internet. The last generation grew up with the internet, and now kids can have AI. Physical changes dominated in the first half and informational changes in the second.
But some products are mature and excellent, so we can't expect much progress there. You can't improve audio quality with a higher sampling rate; 44 kHz is sufficient. Retina displays are already at the limit of visual acuity, and video above 60-120 fps is already too smooth to notice any improvement. Other devices have been great for decades, like household appliances. Food can't be improved much, since we've been optimising it for so long. Digital content is already post-scarcity: we can find anything, and now we can generate anything. So AGI will have to deliver something else on top of all this; the low-hanging fruit has been picked.
visarga t1_j6arwxp wrote
Reply to comment by genshiryoku in Why did 2003 to 2013 feel like more progress than 2013 to 2023? by questionasker577
Generating data through RL, as in AlphaGo or "Evolution through Large Models" (ELM), seems to show a way out. Not all data is equally useful to the model; for example, problem and task solving is more important than raw organic text.
Basically, use an LLM to generate and another system to evaluate, in order to filter for the useful data examples.
visarga t1_j6aeq98 wrote
Reply to comment by mocny-chlapik in [N] OpenAI has 1000s of contractors to fine-tune codex by yazriel0
Scaling model size continues, but the supply of new organic data is nearly exhausted; we are at the limit. So the only way is to generate more, with humans in the loop to check quality. It's also possible to generate data and verify it with math, code execution, simulation, or other means. And Anthropic showed a pure-LLM way to bootstrap more data (RLAIF, or Constitutional AI).
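For arithmetic-style tasks, that verification step might look like this sketch; the sample format here is made up for illustration:

```python
def verify_sample(sample):
    # Check a generated (question, answer) pair by actually
    # executing the expression instead of trusting the model.
    try:
        return eval(sample["question"], {"__builtins__": {}}) == int(sample["answer"])
    except Exception:
        return False

def filter_generated(samples):
    # Keep only generated samples that pass programmatic verification.
    return [s for s in samples if verify_sample(s)]
```

The same shape works with a code sandbox or a simulator in place of `eval`: the generator proposes, an independent checker disposes.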
I bet OpenAI is just taking the quickest route now. For example, we know that using 1,800 tasks in pre-training makes the model generalise to many more unseen tasks (Flan-T5). But OpenAI might have 10,000 tasks to train their model on, hence its superior abilities. They also put more effort into RLHF, so they got a more helpful model.
Besides pure organic text, there are other sources; transcribed or described videos are a big one. They released the Whisper model, and it's possible they are using it to transcribe massive video datasets. Then there are walled gardens: social networks generate tons of text, though not of the best quality. There is also the possibility of packaging data collection as gameplay and getting people to buy into providing exactly what they need.
visarga t1_j68zvbi wrote
Reply to comment by ElvinRath in MULTI·ON: an AI Web Co-Pilot powered by ChatGPT that browses the web and automates the tasks by Schneller-als-Licht
> Writing a description of every step instead of just clicking seems like a downgrade to me.
Use an LLM to write the step-by-step prompts as well, like SayCan:
> We show how low-level tasks can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally extended instructions, while value functions associated with these tasks provide the grounding necessary to connect this knowledge to a particular physical environment.
visarga t1_j68znkm wrote
Reply to comment by manubfr in MULTI·ON: an AI Web Co-Pilot powered by ChatGPT that browses the web and automates the tasks by Schneller-als-Licht
> Automating entire workflows is, to me, the most exciting and realistic outcome of LLMs in the next few years.
They can also use YouTube screencasts, of which there are millions, to learn about solving tasks with desktop and web apps. YT is a treasure trove of procedural data: how to do things step by step, with commentary.
visarga t1_j68th6d wrote
Reply to comment by GodOfThunder101 in Google not releasing MusicLM by Sieventer
Let me show how you can sidestep copyright.
> In December 2014, the United States Copyright Office stated that works created by a non-human, such as a photograph taken by a monkey, are not copyrightable.
Since AI-generated content is public domain, an AI trained only on AI-generated content is free from any liability. This second-generation AI cannot replicate any original human work because it never saw one in its training set.
By training on variations, we can cleanly separate expression from idea. Copyright covers only expression, not the ideas themselves, and a variation in the same style captures just the style, not the contents of the original.
So a second-generation AI can learn from what is allowed to be learned (ideas) and avoid learning what is protected (expression).
visarga t1_j68rznt wrote
Reply to comment by BigZaddyZ3 in Google not releasing MusicLM by Sieventer
> If people can just use AI to design their own art. There’s no need to ever hire “artists” as we know them.
So naive. The competition will not fire their artists; they will keep them and use AI as well. Guess who will win? They might end up with so much volume that they need to hire more.
visarga t1_j68qfux wrote
Reply to comment by wavefxn22 in Google not releasing MusicLM by Sieventer
Styles, by definition, are broad categories. If they were copyrightable, the same rule would have to apply to both humans and AI. We can never know when a human has used AI, or merely looked at AI output for inspiration, so we would have to assume any human work might have AI in it.
And if human works were exempt from the strict rules AI has to follow, what's to stop big companies from hiring people to whitewash the style copyrights? All a company would need is to license some images in that style, and those images could be produced for hire at the lowest price.
visarga t1_j68oh3j wrote
Reply to comment by TwitchTvOmo1 in Google not releasing MusicLM by Sieventer
I work in NLP, on simpler tasks like information extraction from forms. My model was based on years of painstaking labelling and architecture tweaking. But last year I put an invoice into GPT-3 and it just spat out the fields in JSON, nicely formatted. No training; it just works.
At first I panicked: here is our own replacement! What do I do now? But then I realised it was not so simple. To make it work, you need to massage the input to fit into about 2,000 tokens and reserve the rest of the context window for the response.
I need to check that the extracted fields really match the document and are not hallucinated. I have to run it again to extract the few fields that came out empty for some reason. And I have to work on the evaluation of prompts; a prompt isn't just written, it has to be tested as well. Now I have so much work ahead of me that I don't know what to do first.
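The anti-hallucination check can start as simply as requiring every extracted value to appear verbatim in the source document. A sketch, with made-up field names; real invoices need fuzzier matching for dates and amounts:

```python
import json

def check_extraction(document, model_output):
    # Parse the model's JSON and return any field whose value
    # cannot be found verbatim in the source document.
    fields = json.loads(model_output)
    return {k: v for k, v in fields.items()
            if v and str(v) not in document}

doc = "Invoice 1042 from Acme Corp, total 99.50 EUR"
out = '{"invoice_no": "1042", "vendor": "Acme Corp", "total": "104.50"}'
suspicious = check_extraction(doc, out)  # flags the hallucinated total
```

Fields flagged this way go back for a second extraction pass or to a human reviewer.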
I believe most AI adoptions will be similar. They will solve some tasks but need help, or create new capabilities that need new development. Almost no AI works without a human in the loop today: not even ChatGPT is useful until someone vets its output, and certainly not Tesla's or Waymo's self-driving cars.
visarga t1_j68nu2s wrote
Reply to comment by SurroundSwimming3494 in Google not releasing MusicLM by Sieventer
AI will make some things easier and create more expectations and work around it.
visarga t1_j67sivp wrote
Reply to comment by TankAttack in [D] Best large language model for Named Entity Extraction? by TankAttack
My task uses sentence pairs, and I have an efficient prompt that produces many pairs in one go. In 5 hours I generated 230K pairs at a cost of about $10. I plan to generate millions, to "exfiltrate" more domain knowledge into the small, efficient models I am training downstream.
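The batching trick is mostly prompt format plus a tolerant parser. A sketch for a completion that returns one tab-separated pair per line (the format here is hypothetical):

```python
def parse_pairs(completion):
    # Split a batched completion into (sentence_a, sentence_b) pairs,
    # skipping malformed lines rather than failing the whole batch.
    pairs = []
    for line in completion.strip().splitlines():
        if "\t" in line:
            a, b = line.split("\t", 1)
            if a.strip() and b.strip():
                pairs.append((a.strip(), b.strip()))
    return pairs
```

Skipping bad lines instead of raising keeps one garbled generation from costing you the other fifty pairs in the same call.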
visarga t1_j67q45m wrote
Reply to comment by madmax_br5 in [D] Moving away from Unicode for more equal token representation across global languages? by madmax_br5
The solution is to put more text in the other languages into the corpus and re-train the tokeniser; it will adapt to the larger corpus by assigning more tokens to those languages.
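A toy BPE trainer illustrates the point: merges go to whatever is frequent in the corpus, so adding more text in a language earns that language more (and longer) tokens. This is a from-scratch sketch, not any production tokeniser:

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    # Represent each word as a tuple of single-character symbols.
    vocab = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge everywhere it occurs.
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges
```

Feed it a corpus dominated by one language and nearly every merge serves that language; rebalance the corpus and the merge budget rebalances with it.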
visarga t1_j67pv49 wrote
Reply to comment by HateRedditCantQuitit in [D] Moving away from Unicode for more equal token representation across global languages? by madmax_br5
It's also the fact that content in English dwarfs content in other languages. Languages similar to English benefit too, but not languages with different scripts and fewer cognates.
visarga t1_j65kutg wrote
Reply to comment by muchcharles in ⭕ What People Are Missing About Microsoft’s $10B Investment In OpenAI by LesleyFair
It's also possible that models will become 2x or 10x more efficient. GPT-3 was not optimised for cost, only for performance.
visarga t1_j65iwit wrote
I am using GPT-3 for this kind of stuff, and fine-tuning small models on the data.
visarga t1_j5xze4f wrote
Reply to comment by iNstein in Humanity May Reach Singularity Within Just 7 Years, Trend Shows by Shelfrock77
You will rent everything and be at the mercy of your providers.
visarga t1_j5xza2a wrote
Reply to comment by Shelfrock77 in Humanity May Reach Singularity Within Just 7 Years, Trend Shows by Shelfrock77
20 therapeutic SD images and 20 chatGPT prompts per day, what the doctor ordered.
visarga t1_j6c2z5z wrote
Reply to comment by wavefxn22 in Google not releasing MusicLM by Sieventer
I disagree; copyrighting styles is absurd. Countless possibilities banned in one go? We'd get to the point where humans fear creating anything, because it will inevitably resemble some style somewhere.