Submitted by elonmusk12345_ t3_ydjtz5 in singularity
Where does the model accuracy increase due to increasing the model's parameters stop? Is AGI possible by just scaling models with the current transformer architecture?
Indeed. The few tweaks I'd say are needed are continual learning and longer short-term memory, both of which are active research subfields. Beyond that, all that's left is scaling model size, which I consider far more important than data. Human beings understand basic concepts without needing to read the entire internet, because we have evolved bigger brains.
>Human beings understand basic concepts and don’t need to read the entire internet for that.
We have years of training data via multiple high input channels before we reach that level though.
There is a convincing argument that many of the most fundamental, first-principles things we understand about the world are ingrained in us at birth.
A good article: https://www.scientificamerican.com/article/born-ready-babies-are-prewired-to-perceive-the-world/
It's actually already stopping: the engineering challenges are getting too big (trends predict 5-10 trillion parameter dense models by now, and you can bet they don't exist), the available data is getting too scarce, and the other ways to increase performance are far too easy and far too cheap not to focus on instead.
>trends predict 5-10 trillion parameter dense models by now, and you can bet they don't exist), the available data is getting too scarce
I beg to differ. Indeed, we should expect to see 10 to 20 trillion parameter models this year. Based on industry movements, I'm expecting Meta or OpenAI to produce such a model by the end of this year, if not Q1 2023. We don't have enough data for Chinchilla compute-optimal models, and DeepMind's scaling laws are flawed in a number of fundamental ways. One of them is that sample efficiency, generality, and intelligence all increase with scale: large vanilla models require less data to achieve better performance. We can train multi-trillion parameter dense models with the same data it took to train GPT-3, or better yet, less. It is certainly possible to train such a model with massive compute clusters running thousands of A100 GPUs, which is exactly what is being done right now. The cheap methods being focused on at the moment are a temporary crutch which I project will be put away once firms are able to adopt new GPUs such as the H100.
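To put the data point in perspective, here's the back-of-the-envelope Chinchilla arithmetic (the ~20 tokens per parameter figure is the Chinchilla paper's rule of thumb; the model sizes below are just round illustrative numbers):

```python
# Rough Chinchilla arithmetic: ~20 training tokens per parameter.
# Parameter counts below are illustrative round numbers, not real models.
TOKENS_PER_PARAM = 20

for params in (70e9, 500e9, 1e12, 10e12):
    optimal_tokens = TOKENS_PER_PARAM * params
    print(f"{params/1e12:5.2f}T params -> ~{optimal_tokens/1e12:6.1f}T tokens "
          f"(GPT-3 used roughly 0.3T)")
```

A compute-optimal 10T-parameter model would want on the order of 200T tokens, which is why I'm arguing scaling will have to come from large models trained on far fewer tokens than Chinchilla prescribes.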
Wow, you're seriously questioning DeepMind's scaling laws and going back to the OpenAI ones, which have been demonstrated to be false?
Chain-of-thought prompting, self-consistency, reinforcement learning from human feedback, and data scaling are what have been driving LLM performance lately, noticeably more than parameter scaling has (while being significantly cheaper).
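For anyone unfamiliar with the first two items, here's roughly what they look like in practice. This is a toy sketch: the prompt text and the `ask_llm` helper are made up, not any particular API.

```python
# Chain-of-thought prompting: show the model a worked, step-by-step example,
# then ask it to reason the same way on a new question.
cot_prompt = """Q: A cafeteria had 23 apples. It used 20 and bought 6 more. How many are left?
A: Let's think step by step. 23 - 20 = 3, then 3 + 6 = 9. The answer is 9.

Q: I have 3 boxes of 12 pencils and give away 7. How many pencils remain?
A: Let's think step by step."""

# Self-consistency: sample several reasoning chains and keep the majority answer.
# `ask_llm` is a hypothetical helper standing in for whatever LLM API you use.
def majority_answer(ask_llm, prompt, n_samples=5):
    from collections import Counter
    answers = [ask_llm(prompt).strip().split()[-1] for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```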
Why do you expect such a jump when the industry has been stuck at half a trillion for the past year? All previous jumps were smaller and cost significantly less.
>Why do you expect such a jump when the industry has been stuck at half a trillion for the past year? All previous jumps were smaller and cost significantly less.
A combination of software and hardware improvements currently being worked on using Nvidia GPUs. https://azure.microsoft.com/en-us/blog/azure-empowers-easytouse-highperformance-and-hyperscale-model-training-using-deepspeed/
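Roughly what that looks like in code, as a sketch (the config values and the toy model here are placeholders I picked for illustration, not the settings from the Azure post):

```python
import torch
import deepspeed

model = torch.nn.Linear(4096, 4096)  # stand-in for a large transformer

ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                          # shard params, grads, optimizer state
        "offload_param": {"device": "cpu"},  # spill parameters to CPU memory
        "offload_optimizer": {"device": "cpu"},
    },
}

# ZeRO-3 partitions model state across GPUs, which is what lets
# trillion-scale dense models fit on clusters of A100s/H100s at all.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```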
With regard to Chinchilla, I don't think they disproved anything. See my comment history if you care enough. I've debated quite extensively on this topic.
All I see is comparisons to humans that are by and large unfounded.
It's simply going to be both scenarios in 2023, quantity and quality: synthetic data variations from existing corpora with better training distributions (pseudo-sparsity) on optimized hardware. Maybe even some novel chips, like photonic or analog, later next year. It's like CPUs 20 years ago: optimizations all around!
I'm curious what trends have been predicting 5-10 trillion parameter models?
And additionally, more recent work has fundamentally increased the value of scaling.
https://twitter.com/YiTayML/status/1583514524836978689?t=Xxm_NYIQvGr5743ZdQzaqA&s=19
You can see that here for example.
But I have heard that finding data is the hard part now, and that inference speeds on models in the trillions are going to restrict their capabilities, though there is a lot of great work being done on inference speed-ups.
The size of language models has been growing exponentially. We should expect 100 trillion parameter dense models by next year. https://i0.wp.com/silvertonconsulting.com/wp-content/uploads/2021/04/Screen-Shot-2021-04-15-at-3.18.31-PM.png?ssl=1
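Writing out the extrapolation behind that (parameter counts are the publicly reported ones; the per-year growth rate is a rough eyeball of the chart, so take it as a sketch of the trend rather than a forecast):

```python
import math

# Publicly reported dense-model sizes (parameters).
gpt1_2018, gpt3_2020, mt_nlg_2021 = 0.117e9, 175e9, 530e9

# GPT-1 -> GPT-3: ~1500x over two years, i.e. roughly 40x per year.
growth_per_year = (gpt3_2020 / gpt1_2018) ** 0.5

# At that rate, how long from MT-NLG (530B, late 2021) to 100T parameters?
years_to_100T = math.log(100e12 / mt_nlg_2021) / math.log(growth_per_year)
print(f"~{growth_per_year:.0f}x per year -> ~{years_to_100T:.1f} years to 100T")
```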
I think that is possible once firms begin using H100 GPUs.
With the H100, training time optimistically only improves by a factor of 9. That's not nearly enough to bridge the roughly 200x gap between the current largest models and a 100 trillion parameter model, and that's parameter scaling alone, ignoring data scaling. PaLM training took 1,200 hours on 6,144 TPU v4 chips, plus an additional 336 hours on 3,072 chips. A 100 trillion parameter model would simply be too big to train before the end of 2023.
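Spelling out the arithmetic (the figures are the ones quoted above, plus the rough assumption that training compute scales linearly with parameter count at a fixed token count):

```python
palm_params = 540e9           # PaLM
target_params = 100e12        # hypothetical 100T-parameter model
h100_speedup = 9              # optimistic per-chip improvement quoted above

param_gap = target_params / palm_params   # ~185x more parameters
slowdown = param_gap / h100_speedup       # net factor vs. PaLM's run

# PaLM took ~1200h on 6144 TPU v4 chips (plus 336h on 3072 chips);
# a ~20x longer run on similar-sized hardware is years of wall-clock time,
# before even accounting for the extra training data a bigger model wants.
print(f"~{param_gap:.0f}x the parameters / ~{h100_speedup}x faster chips "
      f"=> ~{slowdown:.0f}x longer training run")
```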
100 billion parameter models seemed impossible too, back when neural networks were only a few million parameters. I'm expecting 10 trillion parameters to be human-level AGI.
Networks at a few million parameters weren't just one year away from a hundred billion, though. I'm not doubting that such models will come; I'm doubting the timeline.
Interested in why you think a 10 trillion parameter model would be human-level AGI.
Artificial neural networks are sufficient mathematical representations of biological cortices; there is a huge amount of evidence that this is the case. All that's left to do is compare human and animal brains to our AI models. The human brain doesn't use all 100 trillion parameters on any one task. In fact, the brain is divided into regions that allocate compute resources to vision, language, audio, etc. Not even half of our brain devotes its resources to any one major region, so the upper bound would be 50 trillion parameters, while 1 trillion is too small. There aren't 100 different major cortical regions; there are about 10, all running on the same architecture but processing different modalities. Conservatively, 10 trillion parameters are allocated to each major region. Take a language model with 10 trillion weights: at that capacity it should understand language completely. Then, having read all of PubMed for example, it would be more knowledgeable than all the medical professionals on the planet. A 100 trillion parameter model, I've calculated, would be more than a billion times more intelligent than the 10 trillion parameter one in terms of IQ, while also having the benefit of all human knowledge, never being tired, and being immortal.
What study shows the equivalence of neural network parameters and connections in the brain? What calculations did you do to get to "a billion times more intelligent"?
https://ai.facebook.com/blog/studying-the-brain-to-build-ai-that-processes-language-as-people-do/
Here is a link to one of the most recent developments. There are plenty more.
>What calculations did you do to to get to "a billion times more intelligent"?
That's a long discussion based on assumptions I find to be very reasonable. If you insist, I can go into it at length. To simplify, consider the empirical fact that the second most intelligent species, the chimpanzee, has a cortex just 3x smaller than a human's. The gap in intelligence resulting from that increase is breathtaking. Indeed, quantity leads to vast qualitative leaps. Chimpanzees and gorillas, even given trillions of years, have no chance of inventing even the simplest tools. If 3x above a chimpanzee is human intelligence, what is 10x above human?
They actually do invent tools, but that's not the important thing. What made humans intelligent is having a big brain, and having lots of time. If we were to put a newborn and a baby chimpanzee in a jungle and monitor them, they wouldn't seem all that different regarding intelligence.
That's fine if you take it into your calculations, but it can't be attributed to just the bigger brain. The problem is, the 100 trillion parameter model won't have hundreds of thousands of years, or billions of copies of itself.
Cool reference, though! Interesting work
[deleted]
Having difficulty getting the data or physically building the model doesn't mean that the accuracy gains from such a model are diminishing.
That is like asking how fast cars can theoretically go before they fall apart, and responding that the speed limit is 65.
It may be difficult to build a 10 trillion parameter model but that doesn't mean it wouldn't be more effective.
You didn't answer the question.
It's an implicit no, in the sense that scaling is already slowing.
No? There have been a lot of developments in getting results with smaller models, though. Basically, people figured out ways to avoid needing to train such huge models, which means the bigger models will now be even better. But the focus currently is on figuring out how to get the most out of current sizes.
We don't have enough data and compute to make 5 trillion parameter models economically feasible. It just doesn't make sense. It's better to create a 500B model and train it properly.
I agree, but you'll find yourself to be a stranger in this thread
I don't know.
Once all the useful representations from the training data have been extracted and learned. Beyond that, increasing model size will overfit the training data. Only language tasks might be solvable by naively scaling current techniques.
Overfitting isn't an issue anymore due to the discovery of double descent/grokking.
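For anyone who hasn't seen it, here's a minimal self-contained sketch of double descent with random-feature regression (everything here is a toy I set up for illustration): test error typically spikes near the point where the model can just barely interpolate the training set, then falls again as the model keeps growing.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, noise=0.3):
    x = rng.uniform(-1, 1, size=(n, 1))
    y = np.sin(3 * x[:, 0]) + noise * rng.normal(size=n)
    return x, y

def rbf_features(x, centers):
    # One random RBF bump per column; more columns = a "bigger model".
    return np.exp(-((x - centers.T) ** 2) / 0.1)

x_train, y_train = make_data(40)
x_test, y_test = make_data(500)

for width in (5, 10, 20, 40, 80, 200, 800):
    centers = rng.uniform(-1, 1, size=(width, 1))
    # Min-norm least squares; the interpolation threshold is width == 40 here.
    w, *_ = np.linalg.lstsq(rbf_features(x_train, centers), y_train, rcond=None)
    mse = np.mean((rbf_features(x_test, centers) @ w - y_test) ** 2)
    print(f"width {width:4d}  test MSE {mse:.3f}")
```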
Basically, nobody knows, but there are signs it may be possible. It's called the scaling hypothesis; see https://www.gwern.net/Scaling-hypothesis
ReasonablyBadass t1_itts24o wrote
Current transformer architecture may need a few more tweaks for AGI to work, but I'd say it's close already.