Submitted by xutw21 t3_yjdt78 in MachineLearning

Paper: https://www.biorxiv.org/content/10.1101/2022.07.20.500902v2

Meta's Tweet: https://twitter.com/MetaAI/status/1587467591068459008

Abstract

>Artificial intelligence has the potential to open insight into the structure of proteins at the scale of evolution. It has only recently been possible to extend protein structure prediction to two hundred million cataloged proteins. Characterizing the structures of the exponentially growing billions of protein sequences revealed by large scale gene sequencing experiments would necessitate a breakthrough in the speed of folding. Here we show that direct inference of structure from primary sequence using a large language model enables an order of magnitude speed-up in high resolution structure prediction. Leveraging the insight that language models learn evolutionary patterns across millions of sequences, we train models up to 15B parameters, the largest language model of proteins to date. As the language models are scaled they learn information that enables prediction of the three-dimensional structure of a protein at the resolution of individual atoms. This results in prediction that is up to 60x faster than state-of-the-art while maintaining resolution and accuracy. Building on this, we present the ESM Metagenomic Atlas. This is the first large-scale structural characterization of metagenomic proteins, with more than 617 million structures. The atlas reveals more than 225 million high confidence predictions, including millions whose structures are novel in comparison with experimentally determined structures, giving an unprecedented view into the vast breadth and diversity of the structures of some of the least understood proteins on earth.

117

Comments

timy2shoes t1_iunym8r wrote

We've been testing out their embeddings for transfer learning tasks and they've been performing quite well. Better than previous embeddings that we have tested. The 15B parameter model though is a pain in the ass. Getting the embeddings requires a workaround that is difficult to implement. Probably not worth it in my opinion.
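For context, the usual way to turn a protein language model's per-residue embeddings into fixed-length features for a downstream model is mean pooling over the sequence dimension. A toy sketch of that step, with random arrays standing in for real embeddings (the 1280-dim size matches ESM-2 650M if I remember right; everything here is illustrative, not the ESM API):

```python
import numpy as np

def mean_pool(per_residue_emb: np.ndarray) -> np.ndarray:
    """Collapse a (seq_len, emb_dim) matrix of per-residue embeddings
    into a single (emb_dim,) protein-level embedding by averaging."""
    return per_residue_emb.mean(axis=0)

# Stand-ins for per-residue embeddings from a protein language model;
# proteins have different lengths, so pooling is what gives you a
# fixed-size feature vector per protein.
rng = np.random.default_rng(0)
protein_a = rng.normal(size=(120, 1280))  # 120 residues
protein_b = rng.normal(size=(87, 1280))   # 87 residues

features = np.stack([mean_pool(protein_a), mean_pool(protein_b)])
print(features.shape)  # (2, 1280)
```

From there the pooled features just feed a standard regressor or classifier for whatever property you're predicting.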

16

nivrams_brain t1_iuo3l24 wrote

What kind of downstream tasks are you looking at?

7

timy2shoes t1_iuo4xa4 wrote

ML-guided protein engineering.

8

ROFLLOLSTER t1_iupsskm wrote

> requires a workaround that is difficult to implement

What workaround? I've also been working with ESM and tried the 15B parameter variant. It seemed worse than the 3B in my tests, but maybe I just missed the problem?

2

timy2shoes t1_iuptv7y wrote

We had to do a workaround to fit the 15B parameter model on a p3.8xlarge instance.
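Roughly: the full model doesn't fit on a single 16 GB V100, so you end up doing something like layer-wise CPU offloading, where each block is moved onto the GPU only while it runs. A toy sketch of that pattern, with stand-in linear layers rather than the actual ESM code:

```python
import torch
import torch.nn as nn

# Stand-in for a big transformer: a stack of linear layers playing the
# role of blocks that don't all fit in GPU memory at once.
layers = nn.ModuleList([nn.Linear(512, 512) for _ in range(8)])

@torch.no_grad()
def forward_with_offload(x: torch.Tensor, device: torch.device) -> torch.Tensor:
    """Run the stack one layer at a time, moving each layer onto the
    compute device just before use and back to CPU right after, so
    only one layer's weights are resident on the device at a time."""
    x = x.to(device)
    for layer in layers:
        layer.to(device)
        x = layer(x)
        layer.to("cpu")
    return x

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
out = forward_with_offload(torch.randn(4, 512), device)
print(out.shape)  # torch.Size([4, 512])
```

The trade-off is much lower peak memory at the cost of host-device transfers on every forward pass, which is part of why it's a pain and arguably not worth it.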

> I've also been working with ESM and tried the 15B parameter variant.

Huh, we noticed the same thing. Interesting that others are seeing it too.

2

Mister_Abc t1_iur4gme wrote

First author here. We've had some indication that the 15B model may be overfit. It seemed to slightly improve on a few important metrics (CASP14), which is why we included it.

2

farmingvillein t1_iuowrov wrote

I'm not sad that they are doing this, in the sense that it is almost certainly net-good for humanity, but it is bizarre to me that MetaAI is investing here.

6

OnceReturned t1_iupkp7o wrote

This is all working towards engineering proteins from scratch to do whatever you want. The potential impact of engineered proteins over the next hundred years is on the order of the impact of computers over the past hundred years. Meta and Alphabet and some others get this. The problem boils down to two basic challenges. Pick a biochemical function you want:

  1. What structure provides that function?

  2. What amino acid sequence yields that structure?

We're getting closer to figuring out the second thing with these structure prediction models. Once you can reliably answer those two questions, the world is your oyster. Want to catalyze hundreds of the most valuable reactions used in industrial chemical production, thereby lowering cost, increasing efficiency, increasing yield, and even opening entirely new avenues of chemical engineering? You can. Want to develop new classes of drugs to effectively treat hundreds of the highest priority diseases? You can. Want cheap sensors that can detect anything? Want to engineer perfect crops? Want to turn waste into fuel? Want to cheaply and easily construct and repair polymers? Want to make complex metamaterials? Want real, sophisticated nanotechnology? The list goes on, well into the unimaginable. And, once you can answer the two questions, it's super cheap to make arbitrary amino acid sequences.

Figuring it out would be like discovering fire for the first time. It's especially interesting because it will almost certainly happen and be virtually perfected within the next couple decades (at the latest, IMO).

19

farmingvillein t1_iupoett wrote

To be super clear, I'm not questioning the overall utility! Strictly a statement that I can't square this with Meta's mission statement.

6

OnceReturned t1_iupwvi9 wrote

That's fair.

If I were someone with billions of dollars to burn on whatever moonshot R&D I could think of, it would, at least in large part, be on this stuff. So, I'm more inclined to wonder why everybody isn't working on it.

4

le4mu t1_iuqqb4g wrote

How is progress on the first question? It seems like a fairy tale IMHO, but maybe that's because I'm not in this domain. Could you provide more insight?

1

ynonym00s t1_iusru1v wrote

@OnceReturned: These are naturally occurring proteins, no? For 2 to be solved, wouldn't we need to be able to predict structures for artificial sequences too? Moreover, don't we still need to predict structures in vivo (inside the organism/environment where they are used)?

1

seraschka t1_iusvmp1 wrote

This is super awesome stuff! But I would put a little asterisk on this for now. To get an idea of its real, unbiased accuracy, I wonder if they participated in CASP15, which is essentially the gold standard for assessing structure predictions. I think results will be released in December, so I guess we'll know more next month.

2

Lone-Pine t1_iusirtw wrote

How is this different from AlphaFold?

1

gwyddonydd t1_iut86s5 wrote

It's quicker to run than AlphaFold but produces significantly less accurate models on average. For the very easiest cases they're probably roughly on par, though. To be honest, the speedup isn't really worth the loss in accuracy, especially when we already have a database of 230 million or so AlphaFold models to refer to.

2