Submitted by olegranmo t3_102bm7w in MachineLearning


Logical Word Embedding with Tsetlin Machine Autoencoder

Here is a new self-supervised machine learning approach that captures word meaning with concise logical expressions. The logical expressions consist of contextual words like “black,” “cup,” and “hot” that define other words like “coffee,” making them human-understandable. I raise the question in the heading because our logical embedding performs competitively on several intrinsic and extrinsic benchmarks, matching pre-trained GloVe embeddings on six downstream classification tasks. You can find the paper here: https://arxiv.org/abs/2301.00709, an implementation of the Tsetlin Machine Autoencoder here: https://github.com/cair/tmu, and a simple word embedding demo here: https://github.com/cair/tmu/blob/main/examples/IMDbAutoEncoderDemo.py

314

Comments


Mental-Swordfish7129 t1_j2s6xlg wrote

Interesting. I've had success encoding the details of words (anything, really) using high-dimensional binary vectors. I use about 2000 bits for each code, which is usually plenty, as it is often difficult to find 2000 relevant binary features of a word. This is very efficient for my model, allows for similarity metrics, and instantiates a truly enormous latent space.
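Roughly, the idea looks like this toy sketch (not my actual model; the codes, sparsity level, and similarity metric here are made up purely for illustration):

```python
import numpy as np

DIM = 2000                        # bits per code, as above
rng = np.random.default_rng(0)

def random_code(n_active=40):
    """Sparse binary code: DIM bits with n_active of them set to 1."""
    v = np.zeros(DIM, dtype=np.uint8)
    v[rng.choice(DIM, size=n_active, replace=False)] = 1
    return v

coffee = random_code()
tea = coffee.copy()
tea[rng.choice(np.flatnonzero(coffee), size=5, replace=False)] = 0       # drop a few shared features
tea[rng.choice(np.flatnonzero(coffee == 0), size=5, replace=False)] = 1  # add a few new ones

def overlap(a, b):
    """Shared active bits: a simple similarity metric for sparse binary codes."""
    return int(np.count_nonzero(a & b))

print(overlap(coffee, tea))            # high, because the codes share most features
print(overlap(coffee, random_code()))  # near zero for an unrelated random code
```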

52

clauwen t1_j2sucur wrote

Maybe I'm an idiot, but depending on precision, this isn't a much smaller encoding than what a lot of other models use, right? And none of the state-of-the-art embedding models are optimized for space at all, right?

24

Mental-Swordfish7129 t1_j2t17wy wrote

Idk much about other encoding systems. This works well for my purposes, and it's scalable. I look at my data and ask, "How many binary features of each datum are salient, and which features are important to the model for judging similarities?" 2000 may be too much sometimes. Also, remember that a binary vector is often handled as an integer array indicating the indices of bits set to 1. If your vectors are sparse, this can be very efficient. For the AI models I build, my vectors are often quite sparse, because I often use a scheme like a "slider" of activations for integer data: sort of like one-hot, but with three or more consecutive bits set to encode associativity.
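A toy version of the index-array storage and the "slider" scheme (tiny dimensions and widths chosen just for readability, not what I actually use):

```python
import numpy as np

def to_index_form(dense_bits):
    """Store a sparse binary vector as just the indices of its 1-bits."""
    return np.flatnonzero(dense_bits)

def slider_encode(value, width=3, dim=64):
    """
    'Slider' code for a small integer: set `width` consecutive bits starting at
    index `value`, so adjacent values share width-1 bits and their overlap falls
    off with distance (a bit like one-hot, but with built-in similarity).
    """
    code = np.zeros(dim, dtype=np.uint8)
    code[value:value + width] = 1
    return code

print(to_index_form(slider_encode(10)))                         # [10 11 12]
print(np.count_nonzero(slider_encode(10) & slider_encode(11)))  # 2 shared bits
print(np.count_nonzero(slider_encode(10) & slider_encode(20)))  # 0 shared bits
```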

10

Mental-Swordfish7129 t1_j2t22qq wrote

The biggest reason I use this encoding is the latent space it creates. My AI models are of the SDM (sparse distributed memory) variety, with a predictive processing architecture computing something very similar to active inference. This encoding allows for complete universality, and the latent space provides for the generation of semantically relevant memory abstractions.

8

maizeq t1_j2to73g wrote

What type of predictive processing architecture exactly, if you don’t mind saying?

5

Mental-Swordfish7129 t1_j2tubij wrote

It's pretty vanilla.

Message passing up is prediction error.

Down is prediction used as follows:

I use the bottom prediction to characterize external behavior.

Prediction at higher levels characterizes attentional masking and other alterations to the ascending error signals.
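In rough, made-up Python (random placeholder predictions, learning rule omitted), the message flow is something like:

```python
import numpy as np

DIM = 256
rng = np.random.default_rng(1)

class Layer:
    """One level of the hierarchy; holds its current (binary) descending prediction."""
    def __init__(self):
        self.prediction = rng.integers(0, 2, DIM, dtype=np.uint8)

layers = [Layer() for _ in range(3)]
x = rng.integers(0, 2, DIM, dtype=np.uint8)        # bottom-level input

# Upward pass: each layer forwards its prediction error (input vs. its own
# prediction); the prediction coming down from the layer above is applied here
# as an attention mask that alters the ascending error signal.
signal = x
for i, layer in enumerate(layers):
    error = signal ^ layer.prediction              # what this layer got wrong
    if i + 1 < len(layers):
        error &= layers[i + 1].prediction          # mask out unattended bits
    signal = error

# The bottom layer's prediction is, separately, what gets read out as external
# behavior (e.g., which way to move a reticle).
```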

3

maizeq t1_j2w7p8k wrote

Is this following a pre-existing methodology in the literature or something custom for your usage? I usually see attention in PP implemented, conceptually at least, as variance parameterisation/optimisation over a continuous space. How do you achieve something similar in your binary latent space?

Sorry for all the questions!

2

Mental-Swordfish7129 t1_j2x3juw wrote

Idk if it's in the literature. At this point, I can't tell what I've read from what has occurred to me.

I keep track of the error each layer generates and also a brief history of its descending predictions. Then, I simply reinforce the generation of predictions that favor the highest rate of reduction in subsequent error. I think this amounts to a modulation of attention (manifested as a pattern of bit masking of the ascending error signal) that in effect ignores the portions of the signal with low information and high variance.

At the bottom layer, this is implemented as choosing behaviors (moving a reticle over an image up, down, left, or right) that likewise avoid high variance, and thus high noise, while seeking high information gain.
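In spirit, the behavior selection is something like this (all numbers and structure are made up; the real version is more involved):

```python
import numpy as np

rng = np.random.default_rng(2)
ACTIONS = ["up", "down", "left", "right"]

# Made-up history: the last few error counts observed after taking each action.
error_history = {a: list(rng.integers(20, 60, size=4)) for a in ACTIONS}

def error_reduction_rate(errors):
    """Average per-step decrease in error over the window (positive = shrinking)."""
    return -float(np.diff(errors).mean())

# Reinforce / choose whatever has recently reduced error the fastest.
best_action = max(ACTIONS, key=lambda a: error_reduction_rate(error_history[a]))
print(best_action)
```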

The end result is a reticle which behaves like a curious agent attempting to track new, interesting things and study them a moment before getting bored.

The highest layers seem to be forming composite abstractions on what is happening below, but I have yet to try to understand.

I'm fine with questions.

3

Mental-Swordfish7129 t1_j2xqlwa wrote

The really interesting thing as of late is that if I "show" the agent its global error metric as part of its input, while also forcing it (moving the reticle directly) out of boredom toward higher information gain, I can eventually stop the forcing, because it learns to force itself out of boredom. It seems to learn the association between a rapidly declining error and a shift to a more interesting input. I just have to facilitate the bootstrapping.

It eventually exhibits more and more sophisticated behavioral sequences (longer cycles before repeating), and the same happens at higher levels with the attentional changes.

All layers perform the same function. They only differ because of the very different "world" to which they are exposed.

3

Mental-Swordfish7129 t1_j2xrr7a wrote

>How do you achieve something similar in your binary latent space?

All data coming in is encoded into these high-dimensional binary vectors where each index in a vector corresponds to a relevant feature in the real world. Then, computing error is as simple as XOR(actual incoming data, prediction). This preserves the semantic details of how the prediction was wrong.

There is no fancy activation function, just a simple sum over all connected synapses that land on an active element.

Synapses are binary. Connected or not. They decay over time and their permanence is increased if they're useful often enough.
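Put together as a heavily simplified sketch (all constants invented for illustration):

```python
import numpy as np

DIM = 2000
rng = np.random.default_rng(3)

# Error is literally XOR between what arrived and what was predicted,
# so it keeps track of *which* bits were wrong.
actual = rng.integers(0, 2, DIM, dtype=np.uint8)
predicted = rng.integers(0, 2, DIM, dtype=np.uint8)
error = actual ^ predicted

# "Activation" is a plain count of connected synapses that land on active
# input bits: no weights, no fancy nonlinearity.
permanence = rng.random(DIM)            # one scalar per potential synapse
connected = permanence >= 0.5           # binary synapse: connected or not
activation = int(np.count_nonzero(connected & (actual == 1)))

# Maintenance (numbers illustrative): everything decays a little, and synapses
# that were connected to active bits this step get their permanence boosted.
permanence *= 0.99
permanence[connected & (actual == 1)] += 0.02
np.clip(permanence, 0.0, 1.0, out=permanence)
```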

3

Mental-Swordfish7129 t1_j2y3l18 wrote

>I usually see attention in PP implemented, conceptually at least, as variance parameterisation/optimisation over a continuous space.

Continuous spaces are simply not necessary for what I'm doing. I avoid infinite precision because there is little need for precision beyond a certain threshold.

Also, I'm just a regular guy. I do this in my limited spare time, and I only have relatively weak computational resources and hardware. I'm trying to be more efficient anyway, like the brain. It all stays very efficient because there is not a floating-point operation in sight.

Discrete space works just fine and there is no ambiguity possible for what a particular index of the space represents. In a continuous space, you'd have to worry that something has been truncated or rounded away.

Idk. Maybe my reasons are ridiculous.

2

t98907 t1_j2thvf2 wrote

The interpretability is excellent. I think the performance is likely to be lower than other state-of-the-art embedding vectors, since it looks like the context is handled as a bag of words (BoW).

43

Mental-Swordfish7129 t1_j2twm92 wrote

This is the big deal. Interpretability is so important and I think it will only become more desirable to understand the details of these models we're building. This has been an important design criterion for me as well. I feel like I have a deep intuitive understanding of the models I've built recently and it has helped me improve them rapidly.

20

currentscurrents t1_j2uwlrh wrote

I think interpretability will help us build better models too. For example, in this paper they deeply analyzed a model trained on a toy problem: addition mod 113.

They found that it was actually working by doing a Discrete Fourier Transform to turn the numbers into sine waves. Sine waves are great for gradient descent because they're easily differentiable (unlike modular addition on the natural numbers, which is not), and if you choose the right frequency they repeat every 113 numbers. The learned algorithm then performed a bunch of additions and multiplications on these sine waves, which gave the same result as modular addition.

This lets you answer an important question: why didn't the network generalize to bases other than 113? Well, the frequency of the sine waves was hardcoded into the network, so it couldn't work for any other base.

This opens up the possibility of doing neural network surgery and changing the frequency to work with any base.
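You can check the core trick numerically: encode residues as points on a circle at some frequency, combine them with the angle-addition identities, and the result depends only on (a + b) mod n. (Here n = 113 and k = 5 are just example values, not the exact frequencies the network learned.)

```python
import numpy as np

n, k = 113, 5                       # modulus and an example frequency
w = 2 * np.pi * k / n

def encode(x):
    """A residue as a point on the circle at frequency k."""
    return np.cos(w * x), np.sin(w * x)

a, b = 71, 90
ca, sa = encode(a)
cb, sb = encode(b)

# Angle-addition identities: cos(w(a+b)) = cos(wa)cos(wb) - sin(wa)sin(wb),
# sin(w(a+b)) = sin(wa)cos(wb) + cos(wa)sin(wb). Only multiply-adds needed.
c_sum = ca * cb - sa * sb
s_sum = sa * cb + ca * sb

print(np.allclose((c_sum, s_sum), encode((a + b) % n)))   # True: it wraps at n
```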

34

Mental-Swordfish7129 t1_j2v20d2 wrote

That's amazing. We probably haven't fully realized the great powers of analysis available to us through the Fourier transform, the wavelet transform, and other similar strategies.

9

[deleted] t1_j2zn5o5 wrote

I think that's primarily how neural networks do their magic really. It's frequencies and probabilities all the way down

5

Mental-Swordfish7129 t1_j310xxm wrote

Yes! I'm currently playing around with modifying a Kuramoto model to function as a neural network and it seems very promising.
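For anyone unfamiliar, the vanilla Kuramoto model is just coupled phase oscillators that tend to synchronize; a minimal simulation of the base model (not my network) looks like this:

```python
import numpy as np

# Vanilla Kuramoto: dtheta_i/dt = omega_i + (K/N) * sum_j sin(theta_j - theta_i).
N, K, dt, steps = 100, 2.0, 0.01, 2000
rng = np.random.default_rng(4)
theta = rng.uniform(0, 2 * np.pi, N)    # oscillator phases
omega = rng.normal(0, 1, N)             # natural frequencies

for _ in range(steps):
    coupling = np.sin(theta[None, :] - theta[:, None]).sum(axis=1)
    theta += dt * (omega + (K / N) * coupling)

# Order parameter r in [0, 1]: higher means the phases have (partially) locked.
r = abs(np.exp(1j * theta).mean())
print(round(r, 2))
```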

3

[deleted] t1_j3152ys wrote

Wellllll that seems cool as hell... Seems like steam punk neuroscience hahaha. I love it!

3

DeMorrr t1_j2t6lbq wrote

Long before word2vec by Mikolov et al., people in computational linguistics were using context distribution vectors to measure word similarity. Look into distributional semantics, especially the work of Hinrich Schütze in the 90s.

36

Mental-Swordfish7129 t1_j2ta6bw wrote

I know, right? It happens over and over. Someone's great idea gets overlooked or forgotten, and then later some people declare the idea "new" and the fanfare ensues. If you're not paying close attention, you won't notice that the true innovation is often very subtle. I'm not trying to put anyone down. It's common for innovation to be subtle and to rest on many other people's work. My model rests on a lot of brilliant people's work going all the way back to the early 1900s.

20

SoulCantBeCut t1_j2tfupv wrote

paging jurgen schmidhuber

18

unkz t1_j2ujn5z wrote

Please don’t, I think we have all heard enough from him.

2

currentscurrents t1_j2tuq1a wrote

There are a lot of old ideas that are a ton more useful now that we have more compute in one GPU than their biggest supercomputers had.

17

Mental-Swordfish7129 t1_j2tc2qj wrote

The Tsetlin machine really is a marvel. I've often wanted to spend more time analyzing automata and FSMs like this.

17

Think_Olive_1000 t1_j2t8rie wrote

Surprised no one embeds it like CLIP, but for word-definition pairs rather than word-image pairs. I'm thinking take word2vec as a starting point.
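Sketching the idea: a CLIP-style symmetric contrastive loss over a word encoder and a definition encoder. The embeddings below are random stand-ins for the two encoders' outputs, and every number is made up; only the loss computation is shown.

```python
import numpy as np

rng = np.random.default_rng(5)
batch, dim = 4, 32
word_emb = rng.normal(size=(batch, dim))   # stand-in for a word encoder's output
def_emb = rng.normal(size=(batch, dim))    # stand-in for a definition encoder's output

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Cosine similarities scaled by a temperature, as in CLIP.
logits = normalize(word_emb) @ normalize(def_emb).T / 0.07

def cross_entropy(logits, labels):
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

# The i-th word matches the i-th definition; the rest of the batch are negatives.
labels = np.arange(batch)
loss = (cross_entropy(logits, labels) + cross_entropy(logits.T, labels)) / 2
print(loss)
```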

6

Academic-Persimmon53 t1_j2w1h9y wrote

If I didn’t understand anything that just happened, where do I start to learn?

1

olegranmo OP t1_j2w2ywn wrote

Hi u/Academic-Persimmon53! If you would like to learn more about Tsetlin machines, the first chapter of the book I am currently writing is a great place to start: https://tsetlinmachine.org

Let me know if you have any questions!

4

SatoshiNotMe t1_j2wb9xi wrote

Intrigued by this. Any chance you could give a one paragraph summary of what a Tsetlin machine is?

2

olegranmo OP t1_j2wc7vz wrote

Hi u/SatoshiNotMe! To relate the Tsetlin machine to well-known techniques and challenges, I guess the following excerpt from the book could work:

"Recent research has brought increasingly accurate learning algorithms and powerful computation platforms. However, the accuracy gains come with escalating computation costs, and models are getting too complicated for humans to comprehend. Mounting computation costs make AI an asset for the few and impact the environment. Simultaneously, the obscurity of AI-driven decision-making raises ethical concerns. We are risking unfair, erroneous, and, in high-stakes domains, fatal decisions. Tsetlin machines address the following key challenges:

  • They are universal function approximators, like neural networks.
  • They are rule-based, like decision trees.
  • They are summation-based, like Naive Bayes classifier and logistic regression.
  • They are hardware-near, with low energy- and memory footprint.

As such, the Tsetlin machine is a general-purpose, interpretable, and low-energy machine learning approach."
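For a concrete, if oversimplified, picture of the rule-based and summation-based points above: once trained, a Tsetlin machine classifies roughly as in the toy sketch below. The clause contents are invented for illustration, and the Tsetlin-automata learning procedure that decides which literals each clause includes is omitted (that is what the book covers).

```python
import numpy as np

x = np.array([1, 0, 1, 1], dtype=np.uint8)   # binarized input features
literals = np.concatenate([x, 1 - x])        # [x1..x4, NOT x1..NOT x4]

# Each clause is an AND over a subset of the literals.
positive_clauses = [[0, 2], [3, 5]]          # vote +1 when satisfied
negative_clauses = [[1], [4, 7]]             # vote -1 when satisfied

def clause_output(clause):
    return int(all(literals[i] == 1 for i in clause))

votes = sum(clause_output(c) for c in positive_clauses) \
      - sum(clause_output(c) for c in negative_clauses)
prediction = int(votes >= 0)                 # summation-based decision
print(prediction)
```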

7