Submitted by spiritus_dei t3_11qgxs8 in MachineLearning

One of the biggest limitations of large language models is the context limit: only so much text fits in a prompt. This restricts their use cases and rules out more ambitious prompts.

This was recently addressed by researchers at Google Brain and the University of Alberta. In their recent paper they describe a method of augmenting a language model with an associative read-write memory, which removes the fixed prompt-length limit, and they prove that such a memory-augmented model is computationally universal, i.e. it can simulate a universal Turing machine.

This could pave the way for sharing entire novels, personal genomes, and other long documents with large language models.

The paper talks about the use of "associative memory", which is also known as content-addressable memory (CAM). This type of memory allows the system to retrieve data based on its content rather than its location. Unlike traditional memory, which accesses data by an explicit address, associative memory finds data that matches a pattern or keyword.
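To make that concrete, here is a toy sketch of content-based retrieval in Python. It is purely illustrative, makes up its own vectors and class names, and is not how the paper implements its memory:

```python
import numpy as np

# Toy content-addressable memory: items are retrieved by similarity to a
# query vector rather than by an explicit address.
class AssociativeMemory:
    def __init__(self, dim):
        self.keys = np.empty((0, dim))
        self.values = []

    def write(self, key, value):
        self.keys = np.vstack([self.keys, key])
        self.values.append(value)

    def read(self, query):
        # Cosine similarity against every stored key; return the best match.
        sims = self.keys @ query / (
            np.linalg.norm(self.keys, axis=1) * np.linalg.norm(query) + 1e-8
        )
        return self.values[int(np.argmax(sims))]

mem = AssociativeMemory(dim=3)
mem.write(np.array([1.0, 0.0, 0.0]), "fact about chapter 1")
mem.write(np.array([0.0, 1.0, 0.0]), "fact about chapter 2")
print(mem.read(np.array([0.9, 0.1, 0.0])))  # -> "fact about chapter 1"
```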

Presumably, this will open up a new market for associative memory: I would happily pay extra for my content to be stored permanently in associative memory and for the prompt limit to go away. If millions of people are willing to pay a monthly fee for that kind of storage, the price of associative memory should also come down.

The paper does point out that conditional statements still confuse the large language models. However, I believe this could be addressed with semantic graphs. This would involve collecting data from various sources and using natural language processing techniques to extract entities and relationships from the text. Once the graph is constructed, it could be integrated into the language model in several ways. One approach is to use the graph as an external memory, similar to the approach taken in the paper: the graph is encoded as a set of key-value pairs that augment the model's attention mechanism during inference, so that attention can focus on relevant nodes in the graph when generating outputs.
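As a rough illustration of that key-value idea (the shapes, names, and random "graph embeddings" below are all made up; this is a sketch, not the paper's mechanism):

```python
import torch
import torch.nn.functional as F

d_model = 64
triples = [("Paris", "capital_of", "France"), ("Seine", "flows_through", "Paris")]

# In practice these would come from a trained graph encoder; random here.
graph_keys = torch.randn(len(triples), d_model)
graph_values = torch.randn(len(triples), d_model)

def graph_attention(query):
    """Scaled dot-product attention of a single query over the graph memory."""
    scores = (graph_keys @ query) / d_model ** 0.5   # (num_triples,)
    weights = F.softmax(scores, dim=0)               # attention over triples
    return weights @ graph_values                    # (d_model,) summary vector

hidden_state = torch.randn(d_model)                  # decoder state at one step
augmented = hidden_state + graph_attention(hidden_state)
print(augmented.shape)                               # torch.Size([64])
```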

Another potential approach is to incorporate the graph into the model's architecture itself. For example, the graph can be used to inform the initialization of the model's parameters or to guide the attention mechanism during training. This could help the model learn to reason about complex concepts and relationships more effectively, potentially leading to better performance on tasks that require this kind of reasoning.

The use of knowledge graphs can also help ground large language models in factual knowledge and reduce hallucinations.

I'm curious to read your thoughts.

64

Comments


big_ol_tender t1_jc4lrqf wrote

This makes me depressed because I’ve been working with the llama-index project and I feel like these huge companies are going to take my ball away 😢. They just have too many resources to build stuff.

33

Hostilis_ t1_jc4rnu1 wrote

Unfortunately I think, at least for now, that's just the way it is. This is why I personally focus on hardware architectures / acceleration for machine learning and biologically plausible deep learning. Ideas tend to matter more than compute resources in these domains.

27

127-0-0-1_1 t1_jc4umjb wrote

How are they going to take your ball away? By having a nicer ball?

Of course you, alone, are going to produce worse products than a bunch of postdocs with the budget of a small nation state.

8

sebzim4500 t1_jc6jye3 wrote

The company doesn't always win; sometimes the open source product is simply better. See Stable Diffusion vs DALL-E, or Linux vs Windows Server, or Lichess vs Chess.com, etc.

Of course that doesn't mean it will be used more, but that isn't the point.

5

apluskale t1_jc6zl9r wrote

You have to remember that DALL-E is worse only because there's little interest and money in it. Text is much more useful/hyped compared to images.

2

googler_ooeric t1_jc7q62q wrote

i'd say it depends, DALL-E is better at photorealistic stuff and stability from my experience, but Stable Diffusion is way more versatile and can actually replicate famous IPs

0

Necessary_Ad_9800 t1_jc5x92i wrote

Yea they are years ahead but don’t you think the open source community will be able to make something useful given enough time?

2

xKraazY t1_jc79i02 wrote

Don't use external libraries because they abstract important concepts (talking about langchain and llama-index). They're great for starting out, but the rate at which everything is moving, these libraries become obsolete in 2-3 months.

1

baffo32 t1_jc8jgd4 wrote

i’m thinking, with practice and research, these abstractions could be done in dynamic ways that can pivot and diversify to new norms

2

baffo32 t1_jc8j4w0 wrote

thoughts: each approach has generally something unique that can make it useful, and approaches usually have ways in which they can merge

1

spiritus_dei OP t1_jc33gcf wrote

Here is a link to the paper: https://arxiv.org/pdf/2301.04589.pdf

23

sangbui t1_jc5zt31 wrote

Thank you. Looking through the comments for the link.

1

imaginethezmell t1_jcpfo39 wrote

pretty much everyone already had to do this for any long text implementation

embed everything

search embeddings

use prompt + search result for final prompt

profit
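A rough sketch of that flow, with embed() and ask_llm() as placeholders for whatever embedding model and LLM API you actually use:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    raise NotImplementedError("call your embedding model here")

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("call your LLM here")

def answer(question: str, documents: list[str], top_k: int = 3) -> str:
    # embed everything
    doc_vecs = np.stack([embed(d) for d in documents])
    # search embeddings
    q = embed(question)
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    best = np.argsort(-sims)[:top_k]
    # prompt + search result -> final prompt
    context = "\n\n".join(documents[i] for i in best)
    return ask_llm(f"Context:\n{context}\n\nQuestion: {question}")
```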

1

suflaj t1_jc7119l wrote

This is not something new. It was already present 6 years ago, pioneered by Graves et al. (https://www.nature.com/articles/nature20101). The takeaway was that it's hard, if not impossible, to train.

The paper did not present any benchmarks on known sets. Until that happens, sadly, there is nothing really to discuss. Neat idea, but DL is all about results nowadays.

I was personally working on a full neural memory system myself, I built the whole framework for it, just to find out it wouldn't train on even a toy task. Graves' original work required curriculum learning to work for even toy tasks, and I am not aware of any significant achievement using his Differentiable Neural Computers.

8

[deleted] t1_jc732z3 wrote

[deleted]

1

suflaj t1_jc73bnx wrote

I have skimmed over it before writing this. They have what working? Synthetic toy examples? Great, Graves et al. had even more practically relevant problems solved 6 years ago. The thing is, it never translated into solving real world problems, and the paper and follow up work didn't really manage to demonstrate how it could actually be used.

So, until this paper results in some metrics on known datasets, model frameworks and weights, I'm afraid there's nothing really to talk about. Memory augmented networks are nasty in the sense that they require transfer learning or reinforcement learning to even work. It's hard to devise a scheme where you can punish bad memorization or recall, because it's hard to link the outcome of some recall + processing to the process that caused such recall.

Part of the reason for bad associative memorization and recall is the data itself. So naturally, it follows that you should just be able to optimize the memorized data, no? Well, it sounds trivial, but it ends up either non-differentiable (because of an exact choice, rather than a fuzzy one), or hard to train (vanishing or sparse gradients). And you have just created a set of neural networks, rather than just a monolithic one. That might be an advantage, but it is nowhere near as exciting as this paper would lead you to believe. And that would not be novel at all: hooking up a pretrained ResNet with a classifier would be of the same semantics as that, if you consider the ResNet a memory bank: a 7 year old technique at this point.
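A toy illustration of that exact-vs-fuzzy point (not any particular paper's method): a hard choice gives no gradient through the selection itself, while a soft choice is differentiable but spreads tiny gradients over rarely-attended slots.

```python
import torch
import torch.nn.functional as F

memory = torch.randn(8, 16, requires_grad=True)   # 8 slots of "memorized" data
query = torch.randn(16)
scores = memory @ query                            # relevance of each slot

# Exact (hard) recall: the argmax choice itself has no gradient, so the
# selection process cannot be learned and unselected slots get no signal.
hard_read = memory[scores.argmax()]

# Fuzzy (soft) recall: differentiable, but slots with low attention weight
# receive correspondingly tiny gradients (sparse / vanishing).
soft_read = F.softmax(scores, dim=0) @ memory

soft_read.sum().backward()
print(memory.grad.abs().mean(dim=1))               # per-slot gradient magnitude
```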

Memorizing things with external memory is not exactly a compression task, which DNNs and gradient descent solve, so it makes sense that it's hard in a traditional DL setting.

3

spiritus_dei OP t1_jc7ccww wrote

>I have skimmed over it before writing this. They have what working? Synthetic toy examples? Great, Graves et al. had even more practically relevant problems solved 6 years ago. The thing is, it never translated into solving real world problems, and the paper and follow up work didn't really manage to demonstrate how it could actually be used.
>
>So, until this paper results in some metrics on known datasets, model frameworks and weights, I'm afraid there's nothing really to talk about. Memory augmented networks are nasty in the sense that they require transfer learning or reinforcement learning to even work. Memorizing things with external memory is not exactly a compression task, which DNNs and gradient descent solve.

The same could have been said of deep learning until the ImageNet breakthrough. The improvement process is evolutionary, and this may be a step in that process.

You make a valid point. While the paper demonstrates the computational universality of memory-augmented language models, it does not provide concrete metrics on known datasets or model frameworks. Additionally, as you mentioned, memory-augmented networks can be challenging to train and require transfer learning or reinforcement learning to work effectively.

Regarding the concern about transfer learning, it is true that transferring knowledge from one task to another can be challenging. However, recent research has shown that transfer learning can be highly effective for certain tasks, such as natural language processing and computer vision. For example, the BERT model has achieved state-of-the-art performance on many natural language processing benchmarks using transfer learning. Similarly, transfer learning has been used to improve object recognition in computer vision tasks.

As for reinforcement learning, it has been successfully applied in many real-world scenarios, including robotics, game playing, and autonomous driving. For example, AlphaGo, the computer program that defeated a world champion in the game of Go, was developed using reinforcement learning.

This is one path and other methods could be incorporated such as capsule networks, which aim to address the limitations of traditional convolutional neural networks by explicitly modeling the spatial relationships between features. For example, capsule networks could be used in tandem with memory augmented networks by using capsule networks to encode information about entities and their relationships, and using the memory augmented networks to store and retrieve this information as needed for downstream tasks. This approach can be especially useful for tasks that involve complex reasoning, such as question answering and knowledge graph completion.

Another approach is to use memory augmented networks to store and update embeddings of entities and their relationships over time, and use capsule networks to decode and interpret these embeddings to make predictions. This approach can be especially useful for tasks that involve sequential data, such as language modeling and time-series forecasting.

0

suflaj t1_jc7jibo wrote

> The same could have been said of Deep Learning until the Image Net breakthrough. The improvement process is evolutionary, and this may be a step in that process.

This is not comparable at all. ImageNet is a database for a competition - it is not a model, architecture or technique. When it was "beaten", it was beaten not by a certain philosophy or ideas, it was beaten by a proven implementation of a mathematically sound idea.

This is neither evaluated on a concrete dataset, nor is it delved into deeply in the mathematical sense. This is a preprint of an idea that someone fiddled with using a LLM.

> As for reinforcement learning, it has been successfully applied in many real-world scenarios, including robotics, game playing, and autonomous driving.

My point is that so has the 6 year old DNC. The thing is, however, that neither of those is your generic reinforcement learning - they're very specifically tuned for the exact problem they are dealing with. If you actually look at what is available for DRL, you will see that aside from very poor framework support, probably the best we have is Gym, the biggest issue is how to even get the environment set up to enable learning. The issue is in making the actual task you're learning easy enough for the agent to even start learning. The task of knowing how to memorize or recall is incredibly hard, and we humans don't even understand memory well enough to construct problem formulations for those two.

Whatever technique you come up with, if you can't reproduce it for other problems or models, you will just be ending up with a specific model. I mean - look at what you are saying. You're mentioning AlphaGo. Why are you mentioning a specific model/architecture for a specific task? Why not a family of models/architectures? Maybe AlphaZero, AlphaGo, MuZero sound similar, but they're all very, very different. And there is no real generalization of them, even though they all represent reinforcement learning.

> This is one path and other methods could be incorporated such as capsule networks, which aim to address the limitations of traditional convolutional neural networks by explicitly modeling the spatial relationships between features.

And those are long shown to be a scam, basically. Well, maybe not fundamentally scam, but definitely dead. Do you know what essentially killed them? Transformers. And do you know why Transformers are responsible for almost killing the rest of DL architectures? Because they showed actual results. The paper that is the topic of this thread fails to differentiate the contribution of this method disregarding the massive transformer they're using alongside it. If you are trying to show the benefits of a memory augmented system, why simply not use a CNN or LSTM as controller? Are the authors implying that this memory system they're proposing needs a massive transformer to even use it? Everything about it is just so unfinished and rough.

> Another approach is to use memory augmented networks to store and update embeddings of entities and their relationships over time, and use capsule networks to decode and interpret these embeddings to make predictions. This approach can be especially useful for tasks that involve sequential data, such as language modeling and time-series forecasting.

Are you aware that exactly this has been done by Graves et al., where the external memory is essentially a list of embeddings that is convolved over in 1D? The problem, like I mentioned, is that this kind of process is barely differentiable. Even if you do fuzzy search (Graves et al. use a sort of attention based on access frequency alongside the similarity one), your gradients are so sparse your network basically doesn't learn anything. Furthermore, the output of your model is tied to this external memory. If you do not optimize the memory, then you are limiting the performance of your model severely. If you do, then what you're doing is nothing novel: you have just arbitrarily decided that part of your monolithic network is memory, even though it's just one thing.

2

massagetae t1_jc53dxm wrote

Sounds like memory networks.

6

sEi_ t1_jc6k44a wrote

IMPORTANT

I see an influx of POSTS WITHOUT REFERENCES.

When you say at the start: "recently resolved by researchers" and I see no blue link I can check, I scroll past the post.

And even "The paper..." many times. What paper?

I simply ignore posts like this. Life is too short to read people's dreams.

EDIT:
When citing stuff, please put the link in the body of the post so I don't have to search for it down the thread.

4

Spiritual-Reply5896 t1_jc5s7ew wrote

How is the similarity between synonyms or semantically similar sentences ensured if regex is used for retrieving the input prompts? Maybe I missed something as I skimmed over the paper, but that was the impression I got.

3

JigglyWiener t1_jc41igr wrote

Gonna check this out when I get home. Thanks!

1

[deleted] t1_jc5rrce wrote

This is for sure not a problem solely solved by Google Brain researchers.

1

JClub t1_jc5ys39 wrote

Is there any implementation of CAM? Why is this better than the TGlobal attention used in LongT5?

1

gmork_13 t1_jc6e3ox wrote

The way I was going to implement it with the chatgpt API was to store the conversation and have the model itself extract keywords of the conversation so far as it neared the token limit.

Then you can inject the keywords and search the previous conversation.

But this is still nothing like truly extending the actual memory of the model.
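Something like this, roughly (ask_chatgpt() is a stand-in for the actual chat-completion call, and the token count is only approximated):

```python
def ask_chatgpt(messages: list[dict]) -> str:
    raise NotImplementedError("call the chat completion API here")

TOKEN_LIMIT = 4096
conversation, archive = [], []   # live window, full transcript (searchable later)

def approx_tokens(messages):
    return sum(len(m["content"]) // 4 for m in messages)   # crude estimate

def chat(user_msg: str) -> str:
    conversation.append({"role": "user", "content": user_msg})
    archive.append(conversation[-1])
    if approx_tokens(conversation) > TOKEN_LIMIT * 0.8:
        # Near the limit: have the model compress the window into keywords,
        # then restart the window with that summary injected. The keywords can
        # also be used to search the archived turns.
        keywords = ask_chatgpt(conversation + [{
            "role": "user",
            "content": "List the key topics of this conversation as keywords.",
        }])
        conversation.clear()
        conversation.append({"role": "system",
                             "content": f"Earlier topics: {keywords}"})
    reply = ask_chatgpt(conversation)
    conversation.append({"role": "assistant", "content": reply})
    archive.append(conversation[-1])
    return reply
```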

1

baffo32 t1_jc8jvoo wrote

start some code! invite contributors! :)

1