Logical Word Embedding with Tsetlin Machine Autoencoder

Here is a new self-supervised machine learning approach that captures word meaning with concise logical expressions. The logical expressions consist of contextual words like “black,” “cup,” and “hot” to define other words like “coffee,” thus being human-understandable. I raise the question in the heading because our logical embedding performs competitively on several intrinsic and extrinsic benchmarks, matching pre-trained GLoVe embeddings on six downstream classification tasks. You find the paper here: https://arxiv.org/abs/2301.00709, an implementation of the Tsetlin Machine Autoencoder here: https://github.com/cair/tmu, and a simple word embedding demo here: https://github.com/cair/tmu/blob/main/examples/IMDbAutoEncoderDemo.py

Comments

You must log in or register to comment.

Mental-Swordfish7129 t1_j2s6xlg wrote on January 3, 2023 at 4:16 PM

#1,270,758

Interesting. I've had success encoding the details of words (anything, really) using high-dimensional binary vectors. I use about 2000 bits for each code. It's usually plenty as it is often difficult to find 2000 relevant binary features of a word. This is very efficient for my model and allows for similarity metrics and instantiates a truly enormous latent space.

[deleted] t1_j2s7oa7 wrote on January 3, 2023 at 4:21 PM

#1,270,795

[removed]

clauwen t1_j2sucur wrote on January 3, 2023 at 6:44 PM

#1,271,792

Replying to Mental-Swordfish7129 (#1,270,758)

Maybe im an idiot, but depending on precision, this is not much smaller of an encoding, as a lot of other model use, right? And none of the state of the art embedding models are at all optimized for space, right?

Mental-Swordfish7129 t1_j2t17wy wrote on January 3, 2023 at 7:26 PM

#1,272,080

Replying to clauwen (#1,271,792)

Idk much about other encoding systems. This works well for my purposes. It's scalable. I look at my data and ask, "how many binary features of each datum are salient and also which features are important to the model for judging similarities"? 2000 may be too much sometimes. Also, remember that a binary vector is often handled as an integer array indicating the index of bits set to 1. If your vectors are sparse it can be very efficient. For the AI models I build, my vectors are often quite sparse because I often use a scheme like a "slider" of activations for integer data; sort of like "one hot", but I'll do three or more consecutive to encode associativity.

Mental-Swordfish7129 t1_j2t22qq wrote on January 3, 2023 at 7:31 PM

#1,272,114

Replying to clauwen (#1,271,792)

The biggest reason I use this encoding is because of the latent space it creates. My AI models are of the SDM variety with a predictive processing architecture computing something very similar to active inference. This encoding allows for complete universality and the latent space provides for the generation of semantically relevant memory abstractions.

DeMorrr t1_j2t6lbq wrote on January 3, 2023 at 7:58 PM

#1,272,312

long before word2vec by mikolov et al, people in computational linguistics have been using context distribution vectors to measure word similarity. look into distributional semantics, especially the work of Hinrich Schutze in the 90s

Think_Olive_1000 t1_j2t8rie wrote on January 3, 2023 at 8:12 PM

#1,272,402

Surprised no one embeds it like CLIP but for word definition pairs rather than word image. I'm thinking take word2vec as starting point.

Mental-Swordfish7129 t1_j2ta6bw wrote on January 3, 2023 at 8:20 PM

#1,272,464

Replying to DeMorrr (#1,272,312)

I know right. It happens over and over. Someone's great idea gets overlooked or forgotten and then later some people declare the idea "new" and the fanfare ensues. If you're not paying close attention, you won't notice that often the true innovation is very subtle. I'm not trying to put anyone down. It's common for innovation to be subtle and to rest on many other people's work. My model rests on a lot of brilliant people's work going all the way back the early 1900s

Mental-Swordfish7129 t1_j2tc2qj wrote on January 3, 2023 at 8:31 PM

#1,272,579

The Tsetlin machine really is a marvel. I've often wanted to spend more time analyzing automata and FSMs like this.

[deleted] t1_j2tcmou wrote on January 3, 2023 at 8:35 PM

#1,272,614

Replying to Mental-Swordfish7129 (#1,272,464)

[deleted]

SoulCantBeCut t1_j2tfupv wrote on January 3, 2023 at 8:54 PM

#1,272,768

Replying to Mental-Swordfish7129 (#1,272,464)

paging jurgen schmidhuber

t98907 t1_j2thvf2 wrote on January 3, 2023 at 9:06 PM

#1,272,850

The interpretability is excellent. I think the performance is likely to be lower than other state-of-the-art embedded vectors, since it looks like the context is handled by BoW.

maizeq t1_j2to73g wrote on January 3, 2023 at 9:44 PM

#1,273,142

Replying to Mental-Swordfish7129 (#1,272,114)

What type of predictive processing architecture exactly if you don’t mind saying?

Mental-Swordfish7129 t1_j2tubij wrote on January 3, 2023 at 10:22 PM

#1,273,485

Replying to maizeq (#1,273,142)

It's pretty vanilla.

Message passing up is prediction error.

Down is prediction used as follows:

I use the bottom prediction to characterize external behavior.

Prediction at higher levels characterizes attentional masking and other alterations to the ascending error signals.

currentscurrents t1_j2tuq1a wrote on January 3, 2023 at 10:25 PM

#1,273,505

Replying to Mental-Swordfish7129 (#1,272,464)

There's a lot of old ideas that are a ton more useful now that we have more compute in one GPU than in their biggest supercomputers.

Mental-Swordfish7129 t1_j2twm92 wrote on January 3, 2023 at 10:37 PM

#1,273,583

Replying to t98907 (#1,272,850)

This is the big deal. Interpretability is so important and I think it will only become more desirable to understand the details of these models we're building. This has been an important design criterion for me as well. I feel like I have a deep intuitive understanding of the models I've built recently and it has helped me improve them rapidly.

unkz t1_j2ujn5z wrote on January 4, 2023 at 1:13 AM

#1,274,758

Replying to SoulCantBeCut (#1,272,768)

Please don’t, I think we have all heard enough from him.

currentscurrents t1_j2uwlrh wrote on January 4, 2023 at 2:47 AM

#1,275,479

Replying to Mental-Swordfish7129 (#1,273,583)

I think interpretability will help us build better models too. For example, in this paper they deeply analyzed a model trained to do a toy problem - addition mod 113.

They found that it was actually working by doing a Discrete Fourier Transform to turn the numbers into sinewaves. Sinewaves are great for gradient descent because they're easily differentiable (unlike modular addition on the natural numbers, which is not differentiable), and if you choose the right frequency it'll repeat every 113 numbers. The modular addition algorithm worked by doing a bunch of addition and multiplication operations on these sinewaves, which gave the same result as modular addition.

This lets you answer an important question; why wasn't the network generalizable to other bases other than mod 113? Well, the frequency of the sinewaves was hardcoded into the network, so it couldn't work for any other bases.

The opens the possibility to do neural network surgery, and change the frequency to work with any base.

Mental-Swordfish7129 t1_j2v20d2 wrote on January 4, 2023 at 3:27 AM

#1,275,808

Replying to currentscurrents (#1,275,479)

That's amazing. We probably haven't fully realized the great powers of analysis we have available using Fourier transform and wavelet transform and other similar strategies.

Academic-Persimmon53 t1_j2w1h9y wrote on January 4, 2023 at 9:42 AM

#1,277,301

If I didn’t understand anything what just happened where do I start to learn ?

olegranmo OP t1_j2w2ywn wrote on January 4, 2023 at 10:02 AM

#1,277,350

Replying to Academic-Persimmon53 (#1,277,301)

Hi u/Academic-Persimmon53! If you would like to learn more about Tsetlin machines, the first chapter of the book I am currently writing is a great place to start: https://tsetlinmachine.org

Let me know if you have any questions!

maizeq t1_j2w7p8k wrote on January 4, 2023 at 11:04 AM

#1,277,526

Replying to Mental-Swordfish7129 (#1,273,485)

Is this following a pre-existing methodology in the literature or something custom for your usage? I usually see attention in PP implemented, conceptually at least, as variance parameterisation/optimisation over a continuous space. How do you achieve something similar in your binary latent space?

Sorry for all the questions!

SatoshiNotMe t1_j2wb9xi wrote on January 4, 2023 at 11:47 AM

#1,277,686

Replying to olegranmo (#1,277,350)

Intrigued by this. Any chance you could give a one paragraph summary of what a Tsetlin machine is?

olegranmo OP t1_j2wc7vz wrote on January 4, 2023 at 11:57 AM

#1,277,736

Replying to SatoshiNotMe (#1,277,686)

Hi u/SatoshiNotMe! To relate the Tsetlin machine to well-known techniques and challenges, I guess the following excerpt from the book could work:

"Recent research has brought increasingly accurate learning algorithms and powerful computation platforms. However, the accuracy gains come with escalating computation costs, and models are getting too complicated for humans to comprehend. Mounting computation costs make AI an asset for the few and impact the environment. Simultaneously, the obscurity of AI-driven decision-making raises ethical concerns. We are risking unfair, erroneous, and, in high-stakes domains, fatal decisions. Tsetlin machines address the following key challenges:

They are universal function approximators, like neural networks.
They are rule-based, like decision trees.
They are summation-based, like Naive Bayes classifier and logistic regression.
They are hardware-near, with low energy- and memory footprint.

As such, the Tsetlin machine is a general-purpose, interpretable, and low-energy machine learning approach."

SatoshiNotMe t1_j2wotox wrote on January 4, 2023 at 1:57 PM

#1,278,343

Replying to olegranmo (#1,277,736)

Appreciate this! Will have to dig into your book

Mental-Swordfish7129 t1_j2x3juw wrote on January 4, 2023 at 3:43 PM

#1,279,129

Replying to maizeq (#1,277,526)

Idk if it's in the literature. At this point, I can't tell what I've read from what has occurred to me.

I keep track of the error each layer generates and also a brief history of its descending predictions. Then, I simply reinforce the generation of predictions which favor the highest rate of reduction in subsequent error. I think this amounts to a modulation of attention (manifested as a pattern of bit masking of the ascending error signal) which amounts to ignoring the portions of the signal which have low information and high variance.

At the bottom layer, this is implemented as choosing behaviors (moving a reticle over an image u,d,l,r) which accomplish the same avoidance of high variance and thus high noise, but seeking high information gain.

The end result is a reticle which behaves like a curious agent attempting to track new, interesting things and study them a moment before getting bored.

The highest layers seem to be forming composite abstractions on what is happening below, but I have yet to try to understand.

I'm fine with questions.

Mental-Swordfish7129 t1_j2xqlwa wrote on January 4, 2023 at 6:09 PM

#1,280,308

Replying to maizeq (#1,277,526)

The really interesting thing as of late is that if I "show" the agent, as part of its input, its global error metric alongside forcing (moving the reticle directly) it out of boredom toward higher information gain, I can eventually stop the forcing because it learns to force itself out of boredom. It seems to learn the association between a rapidly declining error and a shift to a more interesting input. I just have to facilitate the bootstrapping.

It eventually exhibits more and more sophisticated behavioral sequences (higher cycle before repeating) and the same at higher levels with the attentional changes.

All layers perform the same function. They only differ because of the very different "world" to which they are exposed.

Mental-Swordfish7129 t1_j2xrr7a wrote on January 4, 2023 at 6:16 PM

#1,280,355

Replying to maizeq (#1,277,526)

>How do you achieve something similar in your binary latent space?

All data coming in is encoded into these high-dimensional binary vectors where each index in a vector corresponds to a relevant feature in the real world. Then, computing error is as simple as XOR(actual incoming data, prediction). This preserves the semantic details of how the prediction was wrong.

There is no fancy activation function. A simple sum of all connected synapses which connect to an active element.

Synapses are binary. Connected or not. They decay over time and their permanence is increased if they're useful often enough.

Mental-Swordfish7129 t1_j2y3l18 wrote on January 4, 2023 at 7:27 PM

#1,280,848

Replying to maizeq (#1,277,526)

>I usually see attention in PP implemented, conceptually at least, as variance parameterisation/optimisation over a continuous space.

Continuous spaces are simply not necessary for what I'm doing. I avoid infinite precision because there is little need for precision beyond a certain threshold.

Also, I'm just a regular guy. I do this in my limited spare time and I only have relatively weak computational resources and hardware. I'm trying to be more efficient anyway; like the brain. It makes it all very efficient because there is not a floating point operation in sight.

Discrete space works just fine and there is no ambiguity possible for what a particular index of the space represents. In a continuous space, you'd have to worry that something has been truncated or rounded away.

Idk. Maybe my reasons are ridiculous.

[deleted] t1_j2zn5o5 wrote on January 5, 2023 at 1:21 AM

#1,283,482

Replying to Mental-Swordfish7129 (#1,275,808)

I think that's primarily how neural networks do their magic really. It's frequencies and probabilities all the way down

Mental-Swordfish7129 t1_j310xxm wrote on January 5, 2023 at 8:46 AM

#1,285,912

Replying to [deleted] (#1,283,482)

Yes! I'm currently playing around with modifying a Kuramoto model to function as a neural network and it seems very promising.

[deleted] t1_j3152ys wrote on January 5, 2023 at 9:41 AM

#1,286,063

Replying to Mental-Swordfish7129 (#1,285,912)

Wellllll that seems cool as hell... Seems like steam punk neuroscience hahaha. I love it!