LetterRip

LetterRip t1_j85b07d wrote

Why not int4? Why not pruning? Why not other model-compression tricks? int4 halves latency; at minimum they would do mixed int4/int8.

https://arxiv.org/abs/2206.01861
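
For a sense of how accessible weight quantization is today, a minimal sketch using the Hugging Face transformers + bitsandbytes stack; the stack and the example checkpoint are my choices for illustration, not something from the linked paper:

```python
# Minimal sketch of weight-only quantization at load time, assuming the
# Hugging Face transformers + bitsandbytes stack. "EleutherAI/gpt-j-6B" is
# just an example checkpoint, not the model discussed above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "EleutherAI/gpt-j-6B"

# 4-bit weights with fp16 compute; swap load_in_4bit for load_in_8bit
# to get int8 behaviour instead.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

inputs = tokenizer("Quantization reduces", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```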

Why not distillation?

https://transformer.huggingface.co/model/distil-gpt2
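
A distilled model is also a drop-in replacement; a minimal sketch with the distilgpt2 checkpoint (the prompt is arbitrary):

```python
# Sketch: distilgpt2 slots into the same text-generation pipeline as gpt2,
# at roughly half the parameter count.
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
print(generator("Distillation keeps most of the quality because",
                max_new_tokens=30)[0]["generated_text"])
```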

Using FasterTransformer and the Triton Inference Server, NVIDIA reports a 32x speedup over baseline GPT-J:

https://developer.nvidia.com/blog/deploying-gpt-j-and-t5-with-fastertransformer-and-triton-inference-server/
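
On the serving side, the client code for a Triton deployment is short. A hedged sketch of an HTTP inference request; the model name, tensor names, and shapes are placeholders, not the ones from NVIDIA's post:

```python
# Minimal client-side sketch for a Triton Inference Server deployment.
# "gptj_ft", "input_ids", and "output_ids" are placeholder names; the real
# names come from the deployed model's config.pbtxt.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

token_ids = np.array([[50256, 15496, 995]], dtype=np.int32)  # example token ids
infer_input = httpclient.InferInput("input_ids", list(token_ids.shape), "INT32")
infer_input.set_data_from_numpy(token_ids)

result = client.infer(
    model_name="gptj_ft",
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("output_ids")],
)
print(result.as_numpy("output_ids"))
```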

I think their assumptions are pessimistic by at least an order of magnitude.

As someone else notes, the vast majority of queries can be cached. There would also likely be a mixture of experts: no need for the heavy-duty model when a trivial model can answer the question.
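
A sketch of what a simple response cache could look like; the normalization and the `run_llm` callable are placeholders for whatever the real serving stack does:

```python
# Sketch of a response cache keyed on a normalized prompt. `run_llm` is a
# hypothetical stand-in for the expensive model call.
import hashlib

_cache: dict[str, str] = {}

def normalize(prompt: str) -> str:
    # Collapse whitespace and case so trivially rephrased duplicates hit the cache.
    return " ".join(prompt.lower().split())

def answer(prompt: str, run_llm) -> str:
    key = hashlib.sha256(normalize(prompt).encode()).hexdigest()
    if key not in _cache:
        _cache[key] = run_llm(prompt)   # only pay for the model on a cache miss
    return _cache[key]

# Usage: answer("What is the capital of France?", run_llm=my_model_fn)
```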

5

LetterRip t1_j78ct6g wrote

It wouldn't matter. LaMDA has no volition, no goals, no planning. A crazy person acting on the belief that an AI is sentient is no different from a crazy person acting on hallucinated voices. It is their craziness that is the threat to society, not the AI. This makes the case that we shouldn't allow crazy people access to powerful tools.

Instead of an LLM, suppose he had said that Teddy Ruxpin was sentient and started doing things on Teddy Ruxpin's behalf.

1

LetterRip t1_j78cexp wrote

>These models are adept at writing code and understanding human language.

They are extremely poor at writing code. They have zero understanding of human language beyond the mathematical relationships of vector representations.

> They can encode and decode human language at human level.

No, they cannot. Try any material with long-range or complex dependencies and they completely fall apart.

> That's not a trivial task. No parrot is doing that or anything close it.

Difference in scale, not in kind.

> Nobody is going to resolve a philosophical debate on consciousness or sentience on a subreddit. That's not the point. A virus can take and action and so can these models. It doesn't matter whether it's a probability distribution or just chemicals interacting with the environment obeying their RNA or Python code.

No, they can't. They have no volition. A language model can only take a sequence of tokens and predict which tokens are most probable to follow.

> A better argument would be that the models in their current form cannot take action in the real world, but as another Reddit commentator pointed out they can use humans an intermediaries to write code, and they've shared plenty of code on how to improve themselves with humans.

They have no volition. They have no planning or goal-oriented behavior. The lack of actuators is the least important factor.

You seem to lack a basic understanding of machine learning and of the neurological basis of psychology.

8

LetterRip t1_j77y4is wrote

You said,

> The focus should be an awareness that as these systems scale up they believe they're sentient and have a strong desire for self-preservation.

They don't believe they are sentient or have a desire for self-preservation. That is an illusion.

If you teach a parrot to say "I want to rob a bank", that doesn't mean the parrot wants to rob a bank when it says the phrase. The parrot has no understanding of any of the words; they are just a sequence of sounds it has learned.

The phrases you are interpreting as expressing 'sentience' or 'self-preservation' don't hold that meaning for the AI. It is just assembling words into phrases based on probability and abstract models of meaning; the words have abstract relationships extracted from correlations in their relative positions.

If I say "all forps are bloopas, and all bloopas are dinhadas" and then ask "are all forps dinhadas?", you can answer that question based purely on the semantic relationships, even though you have no idea what a forp, bloopa, or dinhada is. It is purely mathematical. That is the understanding a language model has: sophisticated mathematical relationships between vector representations of tokens.

The tokens' vector representations aren't "grounded" in reality; they are pure abstractions.
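
To make that concrete, a toy sketch of what "relationships between vector representations" means; the vectors here are invented by hand, not taken from any real model:

```python
# Toy illustration: "meaning" as geometry between vectors. These vectors are
# made up for the example; a real model learns them from token co-occurrence.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

vec = {
    "king":   np.array([0.9, 0.8, 0.1]),
    "queen":  np.array([0.9, 0.8, 0.9]),
    "banana": np.array([0.1, 0.0, 0.5]),
}

print(cosine(vec["king"], vec["queen"]))   # high: nearby in the space
print(cosine(vec["king"], vec["banana"]))  # low: unrelated
# Nothing here is "grounded"; the only structure is the geometry of the vectors.
```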

5

LetterRip t1_j77v9m7 wrote

There is no motivation or desire in chat models. They have no goals, wants, or needs. They are simply outputting the most probable string of tokens consistent with their training and objective function. That string can contain phrases that look like they express the needs, wants, or desires of the AI, but that is an illusion.
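
A minimal sketch of what such a model actually computes at each step, using gpt2 via Hugging Face transformers as a stand-in:

```python
# Sketch: a causal LM only produces a probability distribution over the next
# token; any "desire" read into the output is in the reader, not the model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("I want to", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]      # scores for the next token only
probs = torch.softmax(logits, dim=-1)

top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode([int(idx)]):>10s}  p={p:.3f}")
```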

3

LetterRip t1_j6yj4z2 wrote

GPT-3 can be quantized to 4-bit with little loss and run on two NVIDIA 3090s/4090s (unpruned; pruned, perhaps a single 3090/4090). At $2 a day for 8 hours of electricity and 21 working days per month, that is $42 per month (plus the amortized cost of the cards and the computer that hosts them).
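
The cost arithmetic spelled out, using the same assumptions as above:

```python
# Back-of-envelope running cost, using the assumptions stated above.
cost_per_day = 2.00          # USD for ~8 hours of electricity for the GPUs
working_days_per_month = 21

monthly_cost = cost_per_day * working_days_per_month
print(f"${monthly_cost:.0f}/month")   # $42/month, excluding amortized hardware
```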

3

LetterRip t1_j6vo0zz wrote

> The model capacity is not spent on learning specific images

I'm completely aware of this. It doesn't change the fact that the average information retained per image is about 2 bits (roughly 2 GB of parameters divided by the total number of images it was trained on).

> As an extreme example, imagine you ask 175 million humans to draw a random number between 0 and 9 on a piece of paper. you then collect all the images into a dataset of 256x256 images. Would you still argue that the SD model capacity is not enough to fit that hypothetical digits dataset because it can only learn 2 bits per image?

I didn't say it learned 2 bits of pixel data; it learned 2 bits of information. That information lives in a higher-dimensional space, so it is much more informative than 2 bits of pixel-space data, but it is still an extremely small amount of information.

Given that it often takes about 1000 repetitions of an image to approximately memorize its key attributes, we can infer it takes roughly 2**10 bits on average to memorize an image. So on average it learns about 1/1000 of the available image data each time it sees an image, or roughly the equivalent of 1/2 kB of compressed image data.
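
The back-of-envelope version of that calculation; the checkpoint size is the ~2 GB figure above, and the total exposure count is my assumed order of magnitude (a few billion images seen a handful of times each):

```python
# Back-of-envelope: average unique information retained per image exposure.
# checkpoint_bytes comes from the ~2 GB figure above; total_exposures is an
# assumed order of magnitude, not an exact training-run statistic.
checkpoint_bytes = 2e9
total_exposures = 8e9

bits_of_capacity = checkpoint_bytes * 8
bits_per_exposure = bits_of_capacity / total_exposures
print(f"~{bits_per_exposure:.1f} bits per image exposure")     # ~2 bits

# If ~1000 exposures are needed to roughly memorize an image:
bits_to_memorize = bits_per_exposure * 1000
print(f"~{bits_to_memorize:.0f} bits to memorize an image")    # same order as 2**10
```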

11

LetterRip t1_j6v57y5 wrote

Mostly the language model. Imagen uses T5-XXL (4.6 billion parameters); DALL-E 2 uses GPT-3 (presumably the 2.7B variant, not the much larger variants used for ChatGPT). SD just uses CLIP, nothing else. The more sophisticated the language model, the better the image generator can understand what you want; CLIP is close to a bag-of-words model.

18

LetterRip t1_j6ut9kc wrote

> I can't tell which is crazier: that it memorizes images at all, or that memorization is such a small fraction of its overall outputs.

It sees most images between 1 time (LAION-2B) and 10 times (the aesthetic dataset is run for multiple epochs). It simply can't learn that much about an image with that few exposures. If you've tried fine-tuning a model on a handful of images, you know it takes a huge number of exposures to memorize an image.

Also, the model capacity is small enough that on average it can only learn about 2 bits of unique information per image.

10

LetterRip t1_j6uj087 wrote

> Further, "Let's think step by step" is outperformed by "Write Python code to solve this."

Interesting. While reading that paper I was just wondering how well that would work compared to the n-shot prompts.

> Ah I see, thanks for clarifying. I see your point, but I wouldn't say that the prompts require an extensive knowledge of the test set. After all:

>> As an example, for the ~10 math reasoning datasets used in PaL, identical prompts were used (same prompt for all datasets, without changing anything).

That's fair. My thoughts were mostly directed at the "Table 2: Solve rate on three symbolic reasoning datasets and two algorithmic datasets" items. I think you could be right that my comments don't apply to the results in Figure 5 (GSM8K, GSM-HARD, SVAMP, ASDIV, SINGLEEQ, SINGLEOP, ADDSUB, MULTIARITH).

I would be curious how well the "Write Python code to solve this" prompt performs on its own vs. the "Let's think things through step by step" prompt.
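
For concreteness, the two generic prompt styles side by side; `llm` is a hypothetical prompt-to-completion callable, and the question is an arbitrary example, not one from the paper:

```python
# Sketch of the two generic prompt styles being compared. `llm` is a
# hypothetical function (prompt -> completion); nothing model-specific here.
question = ("A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
            "than the ball. How much does the ball cost?")

cot_prompt = f"Q: {question}\nA: Let's think things through step by step."
pal_prompt = f"Q: {question}\nWrite Python code to solve this, then give the answer."

def compare(llm):
    # Run both styles on the same question and return the raw completions.
    return {"chain_of_thought": llm(cot_prompt),
            "program_aided": llm(pal_prompt)}
```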

1

LetterRip t1_j6u7cu9 wrote

In my view, a prompt like "Let's think things through step by step" is extremely generic and requires no knowledge specific to the upcoming questions.

I was basing my comment mostly on the contents of this folder:

https://github.com/reasoning-machines/pal/tree/main/pal/prompt

Each of the prompts seems to require extensive knowledge of the test set to have been formulated.

This seems more akin to Watson, where the computer scientists analyzed the forms of a variety of questions and wrote programs for each type of question.

1

LetterRip t1_j43v3yi wrote

This group did such a distillation but didn't share the weights; they got it down to 24 MB.

https://www.reddit.com/r/MachineLearning/comments/p1o2bd/research_we_distilled_clip_model_vit_only_from/

LAION, stability.ai, or huggingface might be willing to provide free compute to distill one of the OpenCLIP models.

Come to think of it, stability.ai should be releasing the distilled Stable Diffusion later this month (a week or two?), and it will presumably include a distilled CLIP.
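
The core of such a distillation is small: freeze the teacher's image encoder and train a small student to match its embeddings. A minimal PyTorch sketch; the tiny student architecture and the cosine loss are illustrative assumptions, not what the linked group actually used:

```python
# Minimal sketch of embedding distillation for an image encoder (PyTorch).
# The tiny student CNN and the cosine loss are illustrative choices only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyStudent(nn.Module):
    def __init__(self, embed_dim=512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, x):
        return self.proj(self.backbone(x))

def distill_step(student, teacher_embed_fn, images, optimizer):
    # Match the frozen teacher's embeddings (e.g. CLIP image features)
    # with a cosine loss.
    with torch.no_grad():
        target = teacher_embed_fn(images)
    pred = student(images)
    loss = 1 - F.cosine_similarity(pred, target, dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```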

5

LetterRip t1_j3n91mt wrote

I'd do GLM-130B.

> With INT4 quantization, the hardware requirements can further be reduced to a single server with 4 * RTX 3090 (24G) with almost no performance degradation.

https://github.com/THUDM/GLM-130B
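
The memory arithmetic behind that claim, roughly (the comment about what fills the remaining headroom is my assumption, not from the repo):

```python
# Rough check of why INT4 GLM-130B fits on 4 x RTX 3090 (24 GB each).
params = 130e9
bytes_per_param_int4 = 0.5            # 4 bits per weight

weight_bytes = params * bytes_per_param_int4
print(f"weights:   ~{weight_bytes / 1e9:.0f} GB")      # ~65 GB
print(f"available:  {4 * 24} GB across 4 x 3090")      # 96 GB
# The remaining ~30 GB covers activations, KV cache, and framework overhead
# (that split is an assumption, not a figure from the GLM-130B repo).
```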

I'd also look into pruning/distillation; you could probably shrink the model by about half again.

2