Submitted by Far_Pineapple770 t3_zc5sg6 in MachineLearning
PromiseChain t1_iyw1m45 wrote
Wait until you find out it can simulate entire Linux machines.
It's confounding to watch everyone play with something so powerful and yet so little understood.
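If you haven't seen it, the whole trick is just a prompt. The version going around is roughly this (paraphrased from memory, so treat the exact wording as approximate):

```
I want you to act as a Linux terminal. I will type commands and you will
reply with what the terminal should show. Only reply with the terminal
output inside one unique code block, and nothing else. Do not write
explanations. Do not type commands unless I instruct you to do so. When I
need to tell you something in English, I will put it in curly brackets
{like this}. My first command is pwd.
```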
RomanRiesen t1_iyw8l9k wrote
That's quite funny.
CryptogeniK_ t1_iyyh76h wrote
That would make the coolest honeypot
yaosio t1_iyxrw70 wrote
They'll need to train an AI that can explain how it works.
VitaminD263 t1_iyzzhv1 wrote
Does anybody have thoughts on how they might have created the data for this? I'm completely stumped by the tech knowledge it has, and I don't think there's even remotely sufficient data on the web that would let it generate this kind of content. Did they use some self-learning environment in a terminal?!
PromiseChain t1_iz0a07b wrote
>and don't think there's even remotely sufficient data on the web that would allow it to generate this kind of content
Why do you think that
VitaminD263 t1_iz0dmzt wrote
Because it's making almost no errors on basically any kind of shell input, and there just isn't enough data on the web to allow current language models to generate such accurate output imo.
liquiddandruff t1_iz3c7zc wrote
Uh, how about all those guides and blogs on any number of command line utilities?
VitaminD263 t1_iz3p5wq wrote
There's still not enough data. I believe it must have had access to some environment in which it could execute commands.

Compare how well ChatGPT performs on computing topics with how badly it performs elsewhere. Is there really significantly more data on the web about some specific shell command (note that it generates the correct shell output for pretty much any input) than there is about, say, real analysis? If you query ChatGPT about real analysis definitions it fails abysmally, yet there should be far more text available on that topic than on some random shell command, and there definitely isn't enough data to cover every possible input. I really don't believe current-generation language models are capable of learning the semantics of terminal commands from text alone.
vino_and_data t1_iz957hr wrote
OMG.

>I believe it must have had access to some environment in which it could have executed commands.

Calm down maybe??!
VitaminD263 t1_izax31y wrote
Calm down?
It's not as if I'm the only one claiming that. https://twitter.com/yoavgo/status/1599886211656491008
baconninja0 t1_iz4uwd3 wrote
The shell commands found on websites are probably more similar from site to site than non-code topics, especially since I'm pretty sure a lot of code content-farm sites just steal each other's code anyway. That makes it much easier for the bot to learn than other topics, because it sees the exact same command many times instead of merely similar commands (which it would have to learn are similar).
VitaminD263 t1_iz4w5i9 wrote
Yeah, or you know, you could just make up inputs, execute them in a real environment, and collect the outputs to create your training data...
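A totally hypothetical sketch of what I mean (the command list, the pair format, and every name here are made up for illustration; a real pipeline would sample commands far more broadly and run them inside an isolated container, not on the host):

```python
import subprocess

# Hypothetical seed commands; a real pipeline would generate these at scale.
COMMANDS = [
    "pwd",
    "ls -la /tmp",
    "echo hello world",
    "uname -a",
    "df -h",
]

def make_pair(cmd: str, timeout: int = 5) -> dict:
    """Run one command and record its output as a (prompt, completion) pair."""
    result = subprocess.run(
        cmd, shell=True, capture_output=True, text=True, timeout=timeout
    )
    # Capture both streams so error behavior is also part of the training data.
    return {
        "prompt": f"$ {cmd}\n",
        "completion": result.stdout + result.stderr,
    }

if __name__ == "__main__":
    dataset = [make_pair(c) for c in COMMANDS]
    for pair in dataset:
        print(pair["prompt"], pair["completion"], sep="", end="")
```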
PromiseChain t1_iz90i6y wrote
You're just demonstrating you don't understand this technology. It is not piping anything into a terminal anywhere, and there is no 3080 that OpenAI actually installed to provide this data; they explain their data transparently. Most likely this is modeled from Stack Overflow answers.
VitaminD263 t1_izb0tz8 wrote
I'm not saying it's using a terminal behind the scenes; I'm saying the data used to train this was likely generated using an execution environment. There are serious NLP researchers who believe this as well: https://twitter.com/yoavgo/status/1599886211656491008
2b100k t1_izri10k wrote
Agree with you here. There are resources available online, but I wouldn't think they're enough to train an AI on their own.
I'm very impressed by ChatGPT: it immediately gave me the correct answer on how to resolve an issue when I accidentally skipped a step while installing Gentoo Linux. It also gives really detailed answers on troubleshooting all sorts of Linux programs.
It's hard to explain, but it feels too accurate a lot of the time for answers that would have to be trained from relatively small amounts of data (for an AI).
vino_and_data t1_iz9533f wrote
Hey!! This is the training data they used to train GPT-3. It's only 45 TB of web data: https://imgur.com/rktCj8q
YellowChickn t1_iyyn2tf wrote
Wow ok, was it also trained on code (snippets)?
vino_and_data t1_iz95cfc wrote
Here is the training dataset it was trained on. You can find the details in the GPT-3 paper: https://arxiv.org/pdf/2005.14165.pdf
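For convenience, the dataset mix from Table 2.2 of that paper (figures quoted from memory, so double-check against the paper itself):

| Dataset | Tokens | Weight in training mix |
|---|---|---|
| Common Crawl (filtered) | 410 billion | 60% |
| WebText2 | 19 billion | 22% |
| Books1 | 12 billion | 8% |
| Books2 | 55 billion | 8% |
| Wikipedia | 3 billion | 3% |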