Everything we do is gonna be zero-shot from now on

Ami’s paper is now published! A protein language model has learned that proteins are made of functional units, often known as protein domains or regions.

Tens of thousands of protein domains have been identified over the years by some of our favourite bioinformatics teams (e.g., Pfam). But Ami shows that by looking at the internal representation of a protein language model, we can automatically extract the boundaries between functional units in proteins. Ami also introduces what we think is an ingenious way to visualize the automatically defined functional regions within proteins using colours. Check out the colours Owen Zhang in Li-En’s lab got for these kinesins:

Super funky. You can try it yourself here.

We use the buzzword “zero-shot” to describe this because the protein language model was trained only to fill in missing amino acids in whole protein sequences, and was never given any information about the internal functional structure of protein sequences. When language models start to do things that they weren’t trained to do, we get excited because they seem to be showing something like “intelligence”.
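
To give a feel for what "reading boundaries out of an internal representation" could look like, here is a toy sketch. It is not the method from Ami's paper. It assumes we already have one embedding vector per residue, and it fakes those vectors with synthetic data instead of running a real protein language model. The function names (`make_toy_embeddings`, `boundary_scores`, `call_boundaries`), the window size and the threshold are all made up for illustration. The idea is simply that if residues in the same functional unit have similar embeddings, then a boundary shows up as a position where the average embedding just before it looks very different from the average embedding just after it.

```python
import math
import random


def make_toy_embeddings(region_lengths=(20, 20, 20), dim=8, noise=0.05, seed=0):
    """Synthetic stand-in for per-residue embeddings (NOT real model output).

    Each region gets its own prototype vector; residues are noisy copies of it.
    """
    rng = random.Random(seed)
    embeddings = []
    for k, length in enumerate(region_lengths):
        prototype = [0.0] * dim
        prototype[(2 * k) % dim] = 1.0
        prototype[(2 * k + 1) % dim] = 1.0
        for _ in range(length):
            embeddings.append([x + rng.gauss(0.0, noise) for x in prototype])
    return embeddings


def mean_vector(vectors):
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def boundary_scores(embeddings, window=5):
    """Score each position i by how dissimilar the `window` residues before i
    are to the `window` residues starting at i (1 - cosine similarity)."""
    scores = {}
    for i in range(window, len(embeddings) - window + 1):
        left = mean_vector(embeddings[i - window:i])
        right = mean_vector(embeddings[i:i + window])
        scores[i] = 1.0 - cosine(left, right)
    return scores


def call_boundaries(scores, threshold=0.9):
    """Keep positions that are local maxima of the score and above threshold."""
    boundaries = []
    for i, s in scores.items():
        left_ok = s >= scores.get(i - 1, float("-inf"))
        right_ok = s > scores.get(i + 1, float("-inf"))
        if s >= threshold and left_ok and right_ok:
            boundaries.append(i)
    return boundaries


toy = make_toy_embeddings()
print(call_boundaries(boundary_scores(toy)))  # region starts in the toy data: 20 and 40
```

On real proteins you would swap the synthetic vectors for hidden states from an actual model, and the boundaries would be much fuzzier than in this toy, so the window and threshold would need tuning.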