RebornSage

RebornSage t1_j69hlhe wrote

Well, that's a question I can answer! I have been working for more than ten years in the field of ML and omics data.

First, definitions (aka nitpicking):

  • Machine Learning is a part of Artificial Intelligence.
  • Genomics is the study of the DNA and all genes (the genome) of an organism. There are other "omics" like Transcriptomics and Metabolomics.

One straightforward example is the detection of changes in the DNA (mutations) that cause a certain disease or lead to a phenotype of interest. Possible research questions are: Which mutations allow modern humans to digest milk as adults? Or: What is the ideal combination of gene variants for breeding a drought-resistant corn variety?

A popular approach to find those mutations is called a genome-wide association study (GWAS). Basically, you collect a big table of every mutation you can detect in thousands of individuals. Additionally, you record the disease state (digest milk yes/no) or observed phenotype (corn yield under moderate drought):

Individual  Mutation 1  Mutation 2  ...  Phenotype
A123        A           A           ...  23
B456        G           A           ...  42
...         ...         ...         ...  ...
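
As a rough sketch of how such a table can be turned into numbers for a model: the snippet below stores the two example rows and encodes each allele as 0 (matches a reference allele) or 1 (differs). The names `REFERENCE_ALLELES`, `table` and `encode` are made up for this illustration, and the choice of "A" as the reference at both sites is an assumption, not something the table itself tells us.

```python
# Toy version of the table above: one row per individual, one allele per site.
# REFERENCE_ALLELES is a hypothetical choice made only for this illustration.
REFERENCE_ALLELES = ["A", "A"]

table = [
    ("A123", ["A", "A"], 23.0),
    ("B456", ["G", "A"], 42.0),
]


def encode(alleles, reference):
    """Return 0 where the allele matches the reference and 1 where it differs."""
    return [0 if a == r else 1 for a, r in zip(alleles, reference)]


# Feature matrix (one row per individual) and phenotype vector.
X = [encode(alleles, REFERENCE_ALLELES) for _, alleles, _ in table]
y = [phenotype for _, _, phenotype in table]

print(X)  # [[0, 0], [1, 0]]
print(y)  # [23.0, 42.0]
```

Real pipelines usually count alternate alleles per individual (0, 1 or 2 for a diploid organism) instead of a plain 0/1 flag, but the idea is the same.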

One option is to run a statistical test separately for each mutation and see if there is any significant relation to the phenotype. This has been done in the past with success. However, this approach misses cases where two or more mutations are necessary to produce a phenotype. Say, humans have two copies of a gene, and for an observable disease to occur both need to be damaged by a mutation (e.g. introduction of a stop codon).

To catch such cases, we can use (simple) Machine Learning models. Again, there are many options. One is a constrained linear model called Ridge regression. We encode the mutations using numbers and train a model to predict the phenotype based on those numbers. During the training, the model finds patterns of mutations that are best suited to predict the phenotype as accurately as possible. There are many caveats I skip over here.

Afterwards, we can inspect the model to extract these patterns and thereby find the responsible mutations (and the genes they affect). However, these may or may not be the causal mutations. They are "just" predictive - or correlating with the phenotype if you like. Nevertheless, these genes could be a starting point to develop a new drug or serve as biomarkers for a diagnostic test.
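
To make the Ridge regression step a bit more tangible, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the genotypes (counts of the alternate allele, 0/1/2), the phenotype values, the penalty `alpha=1.0` and the helper names `solve` and `ridge_fit` are all made up, and the phenotype was constructed so that only the first mutation matters. In practice you would use an existing library implementation on thousands of individuals and tune the penalty by cross-validation.

```python
def solve(a, b):
    """Solve the linear system a @ w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (m[r][n] - s) / m[r][r]
    return w


def ridge_fit(X, y, alpha=1.0):
    """Return ridge weights: solve (Xc^T Xc + alpha * I) w = Xc^T yc on centered data."""
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    gram = [
        [sum(Xc[i][j] * Xc[i][k] for i in range(n)) + (alpha if j == k else 0.0)
         for k in range(p)]
        for j in range(p)
    ]
    xty = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    return solve(gram, xty)


# Made-up genotypes: rows are individuals, columns are mutations (0, 1 or 2 copies).
genotypes = [
    [0, 1, 2], [1, 0, 2], [2, 1, 0], [0, 2, 1],
    [1, 2, 0], [2, 0, 1], [0, 0, 0], [2, 2, 2],
]
# Made-up phenotype: roughly 20 + 3 * (copies of mutation 1) plus a little noise.
phenotype = [20.5, 22.7, 26.2, 19.6, 23.1, 25.8, 20.3, 25.8]

weights = ridge_fit(genotypes, phenotype, alpha=1.0)

# "Inspecting the model": rank mutations by the absolute size of their weight.
ranking = sorted(range(len(weights)), key=lambda j: abs(weights[j]), reverse=True)
print(weights)
print(ranking)
```

On this toy data the first mutation gets by far the largest weight, and the penalty shrinks that weight somewhat below the value of 3 used to build the data. As said above, a large weight only means "predictive", not "causal".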

That's very briefly ML in genomics.

Throwing in an example paper for good measure: Genome-wide association study and genomic selection for yield and related traits in soybean - this is basically the corn example from above.
