satireplusplus t1_jcp6bu4 wrote

This model uses a "trick" to efficiently train RNNs at scale, and I still have to take a look to understand how it works. Hopefully the paper is out soon!

Otherwise, size is what matters! To get there it's a combination of factors: the transformer architecture scales well and was the first architecture that made it practical to train these LLMs cranked up to enormous sizes, enterprise GPU hardware with lots of memory (40GB, 80GB), and frameworks like PyTorch that make parallelizing training across multiple GPUs easy.
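To give a sense of that last point, here's a minimal sketch of data-parallel training in PyTorch; it assumes a `torchrun --nproc_per_node=8 train.py` launch, and the model/sizes are just placeholders:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets the env vars that init_process_group reads.
dist.init_process_group("nccl")
rank = dist.get_rank()  # one process per GPU on a single node

model = torch.nn.Linear(1024, 1024).to(rank)  # placeholder model
model = DDP(model, device_ids=[rank])
# ...train as usual; each backward() all-reduces gradients across the 8 GPUs.
```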

And OP's 14B model might be "small" by today's standards, but it's still gigantic compared to a few years ago. It's ~27GB of FP16 weights.
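The back-of-the-envelope math behind that figure (FP16 is 2 bytes per parameter):

```python
params = 14e9                 # 14B parameters
gib = params * 2 / 2**30      # 2 bytes per FP16 weight
print(f"{gib:.1f} GiB")       # ~26.1 GiB, i.e. the ~27GB figure
```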

Having access to 1TB of preprocessed text data that you can download right away without doing your own crawling is also neat (The Pile).

3

satireplusplus t1_jcbq2ik wrote

> most notably dropout.

Probably unenforceable, and math shouldn't be patentable anyway. Might as well try to patent matrix multiplication (I'm sure someone tried). Also, dropout isn't even complex math: it's an elementwise multiplication with randomized 1's and 0's, that's all it is.
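That really is the whole operation; here's a minimal sketch in PyTorch (the "inverted dropout" variant most frameworks use, which rescales by 1/(1-p)):

```python
import torch

def dropout(x: torch.Tensor, p: float = 0.5, training: bool = True) -> torch.Tensor:
    # The entire "invention": multiply elementwise by a random 0/1 mask.
    if not training or p == 0.0:
        return x
    mask = (torch.rand_like(x) > p).float()
    # Rescale so expected activations match at eval time ("inverted dropout").
    return x * mask / (1.0 - p)
```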

17

satireplusplus t1_ja9vh89 wrote

It's not the same, but somewhat similar. While they were not entirely sterile, it's likely that first-generation Neanderthal + sapiens offspring had trouble making (male) babies as well:

https://www.smithsonianmag.com/smart-news/humans-and-neanderthals-may-have-had-trouble-making-male-babies-180958701/

8

satireplusplus t1_ja9nry8 wrote

Could be very similar to the concept of a mule; these offspring would be called hybrids:

> The mule is a domestic equine hybrid between a donkey and a horse. It is the offspring of a male donkey (a jack) and a female horse (a mare).[1][2] The horse and the donkey are different species, with different numbers of chromosomes; of the two possible first-generation hybrids between them, the mule is easier to obtain and more common than the hinny, which is the offspring of a female donkey (a jenny) and a male horse (a stallion).

−8

satireplusplus t1_j5v24u2 wrote

If you don't have 8 GPUs you can always run the same computation 8x in series on one GPU, then merge the results the same way the parallel implementation would. In most cases that's probably gonna end up being a form of gradient accumulation. Think of it this way: you basically compute your distances on a subset of n, but since there are far fewer pairs of distances, the gradient would be noisy. So you just run it a couple of times and average the result to get an approximation of the real thing. Very likely this is what the parallel implementation does too.
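A minimal sketch of that pattern in PyTorch; the model and loss here are stand-ins, the point is the accumulation:

```python
import torch

model = torch.nn.Linear(128, 64)          # stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
accum_steps = 8                           # simulate 8 GPUs in series

optimizer.zero_grad()
for step in range(accum_steps * 4):
    x = torch.randn(32, 128)              # one GPU's share of the batch
    loss = model(x).pow(2).mean()         # stand-in for the pairwise-distance loss
    (loss / accum_steps).backward()       # gradients add up across calls
    if (step + 1) % accum_steps == 0:
        optimizer.step()                  # one update ~ the averaged 8-GPU update
        optimizer.zero_grad()
```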

1

satireplusplus t1_j1afqub wrote

This ^^

Compared to GPT-3, ChatGPT is a huge step up. There is basically an entire new reward network, as large as the LM, that is able to judge the quality of the answers. See https://cdn.openai.com/chatgpt/draft-20221129c/ChatGPT_Diagram.svg
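A minimal sketch of what such a reward model could look like, assuming a scalar head on top of an LM backbone (all names here are hypothetical, not OpenAI's actual code):

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Hypothetical sketch: LM backbone + scalar head scoring an answer."""
    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone                 # pretrained LM trunk
        self.score = nn.Linear(hidden_size, 1)   # scalar "how good is this answer"

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h = self.backbone(tokens)                # (batch, seq, hidden)
        return self.score(h[:, -1, :]).squeeze(-1)  # one score per sequence

# Toy usage with a stand-in backbone; trained on human preference comparisons
# (e.g. a pairwise loss like -log sigmoid(r_chosen - r_rejected)):
backbone = nn.Sequential(nn.Embedding(1000, 64), nn.Linear(64, 64))
rm = RewardModel(backbone, hidden_size=64)
scores = rm(torch.randint(0, 1000, (2, 16)))     # two candidate answers -> two scores
```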

That said, I'd welcome a community effort to build an open source version of this.

81

satireplusplus t1_iznuvy5 wrote

Generating this takes a couple of seconds and can probably be done on a single high-end GPU (for example, eleuther.ai models run just fine on one GPU). Ever played a video game? You probably "wasted" 1000x as much energy in just one hour.

The real advantage is that this can really speed up your programming, and it can program small functions all by itself. It's much better than Stack Overflow.

5

satireplusplus t1_iznkxx5 wrote

I've actually had it explain an obscure warning; it was faster than googling it, and it tells you right away what to do to get rid of the warning.

I've also found ChatGPT super useful for mundane stuff too: create a regex for a certain pattern given just a description and one example, create a Flask API endpoint from a description of what it does, etc. The code often works out of the box, and sometimes needs minor tweaks. But it's much easier to correct a regex with one minor issue than to write it from scratch.
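For a sense of scale, the kind of endpoint meant here is something like this (a hypothetical example I wrote for illustration, not actual ChatGPT output):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/words", methods=["POST"])
def count_words():
    # Toy endpoint: return the word count of the posted text.
    text = request.get_json(force=True).get("text", "")
    return jsonify({"words": len(text.split())})

if __name__ == "__main__":
    app.run(debug=True)
```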

10