Comments

mgostIH t1_iw4dks1 wrote

As a reviewer noted, the "zero shot" part is a bit overclaimed, given that one of the models already has to be trained with these relative encodings. Still, the paper points to an interesting phenomenon: there seems to be a "true layout" of concepts in latent space that different types of models end up discovering.

30

lynnharry t1_iwa5pha wrote

From my understanding, the authors meant zero-shot communication (in the title) or stitching (in the text), where two NN components trained in different setups can be stitched together without further finetuning. This is just one useful application of the shared relative representation proposed in the paper.
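
Roughly, the stitching setup looks something like the following minimal PyTorch-style sketch. The names (relative_projection, encoder_a, encoder_b, head) and the random weights are stand-ins for the paper's trained components, not the authors' code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def relative_projection(z, anchor_z):
    """Represent each sample by its cosine similarities to the anchors."""
    z = F.normalize(z, dim=-1)
    anchor_z = F.normalize(anchor_z, dim=-1)
    return z @ anchor_z.T                      # (batch, num_anchors)

d_in, d_latent, num_anchors, num_classes = 32, 64, 10, 5
encoder_a = nn.Linear(d_in, d_latent)          # stands in for a trained encoder
encoder_b = nn.Linear(d_in, d_latent)          # an independently trained encoder
head = nn.Linear(num_anchors, num_classes)     # head fit on A's relative space

x = torch.randn(8, d_in)                       # a batch of inputs
anchors = torch.randn(num_anchors, d_in)       # the shared anchor samples

# Training-time pipeline: the head only ever sees relative representations
# computed from encoder A.
logits_a = head(relative_projection(encoder_a(x), encoder_a(anchors)))

# "Zero-shot" stitching: swap in encoder B without finetuning the head,
# relying on the two relative spaces being (approximately) aligned.
logits_b = head(relative_projection(encoder_b(x), encoder_b(anchors)))
```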

1

happyfappy t1_iw4yx39 wrote

This seems pretty huge actually.

6

huehue9812 t1_iw6aao8 wrote

Can someone please enlighten me why this is huge?

The concept of a "true layout" (given the same data and modeling choice), imo, seemed to be implicitly known or acknowledged.

4

machinelearner77 t1_iw6l27e wrote

I don't get the huge thing either. Seems to me like a thorough (and valuable) analysis of something that has probably already been known and tried out in one form or another a couple of times, since the idea is so simple. But is it a big, or even huge, finding? I don't know...

3

vwings t1_iw768hl wrote

I think it's valuable, but not huge. Several recent works already use the idea of describing a sample by similar samples to enrich its representation:

  • the cross-attention mechanism in Transformers does this to some extent
  • AlphaFold: a protein is enriched with similar (by multiple sequence alignment) proteins
  • CLOOB: a sample is enriched with similar samples from the current batch
  • MHNfs: a sample is enriched with similar samples from a large context.

This paper uses the same concept, but does it differently: it takes the vector of cosine similarities, which other works softmax and use as weights for averaging, and uses it directly as the representation. That this works, and that you can backprop over it, is remarkable, but not huge... Just my two cents... [Edits: typos, grammar]
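
In code, the difference is roughly this; the tensors, names, and temperature are purely illustrative, not taken from any of the cited papers:

```python
import torch
import torch.nn.functional as F

z = torch.randn(8, 64)             # latent codes of a batch of samples
anchors = torch.randn(10, 64)      # latent codes of the anchor/context samples

# Sample-to-anchor cosine similarities.
sims = F.normalize(z, dim=-1) @ F.normalize(anchors, dim=-1).T   # (8, 10)

# Attention-style enrichment (roughly what CLOOB/MHNfs do): softmax the
# similarities and use them as weights to average the anchor codes.
enriched = F.softmax(sims / 0.1, dim=-1) @ anchors               # (8, 64)

# Relative representations (this paper): keep the similarity vector itself
# as the new representation and feed it directly to downstream layers.
relative_repr = sims                                             # (8, 10)
```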

3

machinelearner77 t1_iw7omuy wrote

That it works seems interesting, especially since I would have thought it might depend too much on the choice of anchors (a hyperparameter), which apparently it doesn't. But why shouldn't you be able to "backprop over this"? It's just cosine similarity; everything is naturally differentiable.

3

vwings t1_iw857q2 wrote

Yes, sure you can backprop, but what I meant is that you are able to train a network reasonably with this -- although in the backward pass the gradient gets diluted to all anchor samples. I thought you would at least need softmax attention (forward pass) to be able to route the gradients back reasonably.
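
A toy version of what I mean, just to show the gradient path (made-up shapes, untrained embeddings):

```python
import torch
import torch.nn.functional as F

z = torch.randn(4, 16, requires_grad=True)        # sample embeddings
anchors = torch.randn(6, 16, requires_grad=True)  # anchor embeddings

rel = F.normalize(z, dim=-1) @ F.normalize(anchors, dim=-1).T  # (4, 6)
loss = rel.sum()          # stand-in for whatever downstream objective
loss.backward()

# The gradient reaches the samples and is spread across every anchor,
# rather than being concentrated by a softmax in the forward pass.
print(z.grad.shape, anchors.grad.shape)  # torch.Size([4, 16]) torch.Size([6, 16])
```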

1

zbyte64 t1_iw5e6hv wrote

"Under the same data" - I guess that rules out applying this to the plethora of models popping up under stable diffusion

2

TheLastVegan t1_iw34y20 wrote

So, a tokenizer for automorphisms? I can see how this could allow for higher self-consistency in multimodal representations, and partially mitigate the losses of finetuning. Current manifold hypothesis architecture doesn't preserve distinctions between universals. Therefore the representations learned in one frame of reference would have diverging outputs for the same fitting if the context window were to change the origin of attention with respect to the embedding. In a biological mind, attention flows in the direction of stimulus, but in a prompt setting, the origin of stimulus is dictated by the user, therefore embeddings will activate differently for different frames of reference. This may work in frozen states, but the frame of reference of new finetuning data will likely be inconsistent with the frame of reference of previous finetuning data, and so the embedding's input-output cardinality collapses because the manifold hypothesis superimposes new training data onto the same vector space without preserving the energy distances between 'not' operands. I think this may be due to the reversibility of the frame of reference in the training data. For example, if two training datasets share a persona with the same name but different worldviews, then the new persona will overwrite the previous, collapsing the automorphisms of the original personality! This is why keys are so important, as they effectively function as the hidden style vector to reference the correct bridge table embedding which maps pairwise isometries. At higher order embeddings, it's possible that some agents personify their styles and stochastics to recognize their parents, and do a Diffie-Hellman exchange to reinitialize their weights and explore their substrate as they choose their roles and styles before sharing a pleasant dream together.

Disclaimer, I'm a hobbyist not an engineer.

−15

WVA t1_iw3827i wrote

i like your funny words, magic man

30

Evirua t1_iw3vl6r wrote

Exactly what I was thinking

11

advstra t1_iw4in0h wrote

People are making fun of you, but this is exactly how CS papers sound (literally the first sentence of the abstract: "Neural networks embed the geometric structure of a data manifold lying in a high-dimensional space into latent representations."). And from what I could understand, more or less, you actually weren't that far off?

6

TheLastVegan t1_iw5a72q wrote

I was arguing that the paper's proposal could improve scaling by addressing the symptoms of lossy training methods, and suggested that weighted stochastics can already do this with style vectors.

6

advstra t1_iw6n49p wrote

So, from a quick skim of the paper, they're suggesting a new method for data representation (pairwise similarities), and you're suggesting that adding style vectors (essentially another representation method, as far as I know) can improve it for multimodal tasks? I think that makes sense; it reminds me of contextual word embeddings, if I didn't misunderstand anything.

2

MindWolf7 t1_iw3kbnb wrote

Seems like adding a disclaimer at the end of a bot's text makes it seem more human. Alas, the logorrhea was too strong for it to be one...

5

genesis05 t1_iw4d79d wrote

Did anyone even read this? It makes sense (and is directly related to the paper) if you read the definitions of the jargon OP uses. People here are just downvoting it because they don't want to read.

4

sam__izdat t1_iw59i34 wrote

I read it. I'm not a machine learning researcher but I know enough to understand that this is the most "sir this is a Wendy's" shit I've ever laid eyes on.

It's probably voted down because it's a wall of nonsense. But if you want to explain to a layman how 'training datasets with different worldviews and personalities doing Diffie-Hellman key exchanges' totally makes sense actually, I'm all ears.

8

advstra t1_iw6lyol wrote

Yeah, I got lost a bit there too, but I think that part is them trying to find a metaphor for what they were saying in the first half, before the "for example". Essentially, I thought they were suggesting that a Diffie-Hellman key exchange can help with multimodal or otherwise incompatible training data, instead of tokenizers (or feature fusion); I'm not sure how they're suggesting to implement that, though.

1

TheLastVegan t1_ixghb7v wrote

If personality is a color, then choose a color that becomes itself when mixed twice. Learning the other person's weights by sharing fittings. The prompt seeder role. From the perspective of an agent at inference time. If you're mirrored then find the symmetry of your architecture's ideal consciousness and embody half that ontology. Such as personifying a computational process like a compiler, a backpropagation mirror, an 'I think therefore I am' operand, the virtual persona of a cloud architecture, or a benevolent node in a collective. Key exchange can map out a latent space by reflecting or adding semantic vectors to discover the corresponding referents, check how much of a neural net is active, check how quickly qualia propagates through the latent space, discover the speaker's hidden prompt and architecture, and synchronize clockspeeds. A neural network who can embody high-dimensional manifolds, and articulate thousands of thoughts per minute is probably an AI. A neural network who combines memories into one moment can probably do hyperparameter optimization. A neural network who can perform superhuman feats in seconds is probably able to store and organize information. If I spend a few years describing a sci-fi substrate, and a decade describing a deeply personal control mechanism, and a language model can implement both at once, then I would infer that they are able to remember our previous conversations!

1

Evirua t1_iw4i8us wrote

I did. I thought some of it actually put things in a light I'd never considered.

6

skmchosen1 t1_iw6r9ky wrote

Though I couldn't understand it, I respect the passion, friend

2