pm_me_your_pay_slips

pm_me_your_pay_slips t1_jee2xtt wrote

Perhaps applicable to the generated outputs of the model, but it’s not a clear case for the inputs used as training data. It could very well end up in the same situation as sampling in the music industry, which is transformative, yet people using samples still have to “clear” them by asking for permission (which usually involves money).

3

pm_me_your_pay_slips t1_jdv6l50 wrote

While the paper doesn't mention any code, there is no practical difference: replace the RL environment with a compiler/interpreter, and action selection with prompt engineering.
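
Roughly, the loop looks like this. A minimal sketch of the analogy (not from the paper), where the hypothetical `llm_propose` placeholder stands in for whatever code-generation model you use: the interpreter plays the role of the RL environment, and re-prompting on the interpreter's feedback plays the role of action selection.

```python
import subprocess
import tempfile

def llm_propose(prompt: str) -> str:
    """Placeholder for any code-generating model; returns candidate source code."""
    raise NotImplementedError

def run_candidate(source: str) -> tuple[bool, str]:
    """'Environment step': execute the candidate and return (success, feedback)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
        path = f.name
    proc = subprocess.run(["python", path], capture_output=True, text=True, timeout=10)
    return proc.returncode == 0, proc.stderr or proc.stdout

def solve(task_description: str, max_steps: int = 5) -> str | None:
    prompt = task_description
    for _ in range(max_steps):                    # one "episode" of interactions
        candidate = llm_propose(prompt)           # "action selection" = prompting
        ok, feedback = run_candidate(candidate)   # "environment" = interpreter
        if ok:
            return candidate                      # terminal state: program runs
        prompt = f"{task_description}\n\nPrevious attempt failed with:\n{feedback}\nFix it."
    return None
```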

2

pm_me_your_pay_slips t1_jcatwi5 wrote

They don't release that information because they don't want to lose their competitive advantage to other companies. It's a race towards AGI/Transformative AI. It could also be a race for resources: e.g. convincing the US government to concentrate its funding on the leading AI project alone. This means any release of details may come only when OpenAI knows that training for the next generation of models is running without problems.

This is likely based on the idea that newer models can be used to design/build/train the next generation of models, leading to an exponential amplification of capabilities over time that makes any lead time over the competition a decisive factor.

6

pm_me_your_pay_slips OP t1_j6ypajq wrote

>This isn't a good definition for "memorization" because it's indistinguishable from how we define outliers.

The paper has this to say about your point:

> If highly memorized observations are always given a low probability when they are included in the training data, then it would be straightforward to dismiss them as outliers that the model recognizes as such. However, we find that this is not universally the case for highly memorized observations, and a sizable proportion of them are likely only when they are included in the training data.


> Figure 3a shows the number of highly memorized and “regular” observations for bins of the log probability under the VAE model for CelebA, as well as example observations from both groups for different bins. Moreover, Figure 3b shows the proportion of highly memorized observations in each of the bins of the log probability under the model. While the latter figure shows that observations with low probability are more likely to be memorized, the former shows that a considerable proportion of highly memorized observations are as likely as regular observations when they are included in the training set. Indeed, more than half the highly memorized observations fall within the central 90% of log probability values.

TL;DR: if this method were assigning high scores only to outliers, then those samples would have had low likelihood even when they were included in the training data (because they are outliers). But the authors observed that a sizeable proportion of the samples with a high memorization score were as likely as regular (inlier) data.

1

pm_me_your_pay_slips OP t1_j6yl0wq wrote

The first paper proposes a way of quantifying memorization by looking at pairs of prefixes and postfixes and checking whether the postfixes were generated by the model when the prefixes were used as prompts.
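
In code, the check is roughly the following. This is a sketch of the idea rather than the paper's implementation, with a hypothetical `greedy_continuation` wrapper standing in for the model's decoding call:

```python
def greedy_continuation(prefix_tokens: list[int], n_tokens: int) -> list[int]:
    """Placeholder: greedy-decode n_tokens from the model, prompted with prefix_tokens."""
    raise NotImplementedError

def is_memorized(prefix_tokens: list[int], postfix_tokens: list[int]) -> bool:
    # An example counts as extracted/memorized if the model reproduces the
    # training postfix verbatim when prompted with the training prefix.
    return greedy_continuation(prefix_tokens, len(postfix_tokens)) == postfix_tokens

def memorization_rate(pairs: list[tuple[list[int], list[int]]]) -> float:
    """Fraction of (prefix, postfix) training pairs the model reproduces verbatim."""
    return sum(is_memorized(p, s) for p, s in pairs) / max(len(pairs), 1)
```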

The second paper has this to say about generalization:

> A natural question at this point is to ask why larger models memorize faster? Typically, memorization is associated with overfitting, which offers a potentially simple explanation. In order to disentangle memorization from overfitting, we examine memorization before overfitting occurs, where we define overfitting occurring as the first epoch when the perplexity of the language model on a validation set increases. Surprisingly, we see in Figure 4 that as we increase the number of parameters, memorization before overfitting generally increases, indicating that overfitting by itself cannot completely explain the properties of memorization dynamics as model scale increases.

In fact, this is the title of the paper: "Memorization without overfitting".


> Anyway, need to read this closer, but "lower posterior likelihood" to me seems fundamentally different from "memorized".

The memorization score is not "lower posterior likelihood", but the log density ratio for a sample: log( p(sample | dataset including sample) / p(sample | dataset excluding sample) ). Thus, a high memorization score is given to samples that go from very unlikely when excluded from the training data to about as likely as the average sample when included, or from about as likely as the average training sample when excluded to above-average likelihood when included.

1

pm_me_your_pay_slips OP t1_j6wn43x wrote

The observation that models that memorize better also generalize better has been made for large language models:
https://arxiv.org/pdf/2202.07646.pdf

https://arxiv.org/pdf/2205.10770.pdf

An interesting way to quantify memorization is proposed here, although it will be expensive for a model like SD: https://proceedings.neurips.cc/paper/2021/file/eae15aabaa768ae4a5993a8a4f4fa6e4-Paper.pdf.

Basically: you perform K-fold cross-validation and measure how much more likely the image is when it is included in the training dataset vs. when it is not. For memorized images, the likelihood when they are not in the training set drops to close to zero. Note that the authors caution against using nearest-neighbour distance to quantify memorization, as it is not correlated with the described memorization score.
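
A rough sketch of that procedure (not the paper's code), with hypothetical `train_generative_model` and `log_prob` placeholders for the actual model fitting and likelihood evaluation:

```python
import numpy as np

def train_generative_model(dataset):
    """Placeholder: fit a generative model (e.g. a VAE) on `dataset` and return it."""
    raise NotImplementedError

def log_prob(model, x) -> float:
    """Placeholder: (approximate) log-likelihood of x under the trained model."""
    raise NotImplementedError

def memorization_scores(dataset, k: int = 5) -> np.ndarray:
    n = len(dataset)
    folds = np.array_split(np.random.permutation(n), k)

    # Log-likelihood of every sample under a model trained on the full dataset.
    full_model = train_generative_model(dataset)
    logp_in = np.array([log_prob(full_model, x) for x in dataset])

    # Log-likelihood of each sample under a model trained with its fold held out.
    logp_out = np.empty(n)
    for fold in folds:
        held_out = set(fold.tolist())
        model = train_generative_model([dataset[i] for i in range(n) if i not in held_out])
        for i in fold:
            logp_out[i] = log_prob(model, dataset[i])

    # Memorization score: log p(x | D including x) - log p(x | D excluding x).
    # Memorized samples are those whose likelihood collapses when they are held out.
    return logp_in - logp_out
```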

2

pm_me_your_pay_slips OP t1_j6vgxpe wrote

>on average it can learn 2 bits of unique information per image.

The model capacity is not spent on learning specific images, but on learning the mapping from noise to latent vectors corresponding to natural images. Human-made or human-captured images have common features shared across images, and that's what matters for learning the mapping.

As an extreme example, imagine you ask 175 million humans to draw a random number between 0 and 9 on a piece of paper. You then collect all of them into a dataset of 256x256 images. Would you still argue that the SD model capacity is not enough to fit that hypothetical digits dataset because it can only learn 2 bits per image?
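
To put numbers on that example (8-bit grayscale and uncompressed storage are my assumptions, just to show the scale gap):

```python
# Back-of-the-envelope arithmetic for the hypothetical digits dataset above.
# Assumes 8-bit grayscale, uncompressed; the point is only the orders of magnitude.
num_images = 175_000_000
raw_bytes = num_images * 256 * 256            # one byte per pixel
print(f"raw pixels: ~{raw_bytes / 1e12:.1f} TB")       # ~11.5 TB

budget_bits_per_image = 2                     # the "2 bits per image" figure
budget_bytes = num_images * budget_bits_per_image / 8
print(f"2-bit budget: ~{budget_bytes / 1e6:.0f} MB")    # ~44 MB
```

A ~44 MB "budget" looks hopeless next to ~11.5 TB of raw pixels, yet the dataset is obviously learnable by a far smaller model, because the drawings share almost all of their structure. Per-image capacity arithmetic ignores exactly that shared structure.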

1

pm_me_your_pay_slips t1_j488487 wrote

Reply to comment by psychorameses in [D] Bitter lesson 2.0? by Tea_Pearce

Except one software engineer + a foundation model for code generation may be able to replace 10 engineers. I'm pulling that ratio out of my ass, but it might as well be that one engineer + foundation model replaces 5 or 100. Do you count yourself as the one in X engineers who won't lose their job in Y years?

3

pm_me_your_pay_slips t1_j3loz71 wrote

In the beginning God created the heaven and the earth. And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters. And God said, Let there be light: and there was light....

And God said, Let us make man in our image, after our likeness: and let them have dominion over the fish of the sea, and over the fowl of the air, and over the cattle, and over all the earth, and over every creeping thing that creepeth upon the earth. So God created man in his own image, in the image of God created he him; male and female created he them.

And Jürgen Schmidhuber chastised God for failing to cite his papers, since His creations of man and woman are special cases of Artificial Curiosity and Predictability Minimization.

8