Predicting Diseases Before They Happen

Leveraging Genetics via AI

May 2, 2020

Humans constantly strive for certainty. Where are we getting our food? What university will I go to? When will I be able to live my normal life again? All questions we want to be certain about.

Recently, we have realized how much we rely on certainty, freaking out in the face of the uncertainty brought on by COVID. We only become uncertain when we are exposed to something we don’t know about. If we are never exposed to something, we don’t worry about it.

You have a large mole on your body. It’s there. You're constantly wondering if it is cancer. Uncertainty.
Your life is normal. Certainty. Three weeks later, your doctor tells you, “There has been cancer growing in your body for the past month.” How long do I have to live? Could we have recognized it earlier? What will I do with the rest of my time? Uncertainty.

For our mental health, it’s great we don’t question everything constantly. At the same time, being cautious can lead to early intervention, saving lives. Complete certainty regarding life and death will never exist. But…

We have the ability to be proactive. Recognize diseases early and start treatment. We have just been extremely reactive. I’m certain about that.

Our Approach for Disease Treatment

We have always been reactive. Only when a symptom shows up will we take action, costing lives.

The answers already exist. We don’t have to wait for surprises to come to us. It’s all in our own genetic data. We must start analyzing our individual data, draw conclusions, and become more certain about future challenges our body will throw at us.

The future: “Heart disease. Toooootally didn’t see that coming.”

A Genomic Overview

All of us have a specific genome containing our DNA, which is wrapped up in chromosomes. DNA is made up of four letters (nucleotides) that pair up (A-T, C-G). Read three at a time, these letters code for amino acids. Amino acids are chained together into proteins. Proteins carry out all our bodily functions.
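Here is a small Python sketch of those two ideas: base pairing and reading the letters three at a time. The `toy_gene` sequence is made up, and `CODONS` is only a five-entry excerpt of the standard 64-codon genetic code, so treat it as an illustration rather than a real translation tool.

```python
# Watson-Crick base pairing: A pairs with T, C pairs with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Excerpt of the standard genetic code (DNA codon -> amino acid).
# Only the codons used in this example are listed.
CODONS = {"ATG": "Met", "CCT": "Pro", "GAG": "Glu", "GTG": "Val", "TAA": "STOP"}


def complement(strand: str) -> str:
    """Return the base-by-base partner strand."""
    return "".join(PAIRS[base] for base in strand)


def translate(coding_strand: str) -> list[str]:
    """Read the strand three letters at a time and look up each amino acid."""
    protein = []
    for i in range(0, len(coding_strand) - 2, 3):
        amino_acid = CODONS[coding_strand[i:i + 3]]
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein


toy_gene = "ATGCCTGAGTAA"  # made-up sequence for illustration
print(complement(toy_gene))  # TACGGACTCATT
print(translate(toy_gene))   # ['Met', 'Pro', 'Glu']
```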

Everyone’s DNA is about 99.9% identical. What makes us different are small variations in our code (e.g. a T instead of a G). Our DNA may also differ when mutations occur: insertions, deletions, and duplications of a letter when copying DNA can cause our body to make unwanted proteins.
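To make these variation types concrete, here is a toy Python sketch that treats DNA as a plain string. The function names and the `reference` sequence are made up for illustration; real variant calling works on aligned sequencing reads, not hand-typed strings.

```python
def substitute(seq: str, pos: int, base: str) -> str:
    """Swap the letter at pos for a different one (e.g. a T instead of an A)."""
    return seq[:pos] + base + seq[pos + 1:]


def insert(seq: str, pos: int, bases: str) -> str:
    """Add extra letters in front of position pos."""
    return seq[:pos] + bases + seq[pos:]


def delete(seq: str, pos: int, length: int = 1) -> str:
    """Drop `length` letters starting at pos."""
    return seq[:pos] + seq[pos + length:]


def duplicate(seq: str, pos: int, length: int = 1) -> str:
    """Repeat `length` letters starting at pos right after themselves."""
    return seq[:pos + length] + seq[pos:pos + length] + seq[pos + length:]


reference = "ATGCCTGAG"  # made-up reference sequence
print(substitute(reference, 7, "T"))  # ATGCCTGTG
print(insert(reference, 3, "A"))      # ATGACCTGAG
print(delete(reference, 5))           # ATGCCGAG
print(duplicate(reference, 5))        # ATGCCTTGAG
```

Notice that the substitution keeps the length the same, while the other three shift every letter that follows.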

We are interested in these genomic variations and mutations.

Left: Mutation, Middle: Deletion, Right: Duplication

SNPs: Single Nucleotide Polymorphisms

A SNP is a single-letter difference at one position in the genome that is present in more than 1% of the population. For example, a cytosine at a certain position may exist in 20% of the population, whereas the other 80% have an adenine. This could be the difference between having blue or brown eyes.
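The 1% rule is easy to check once you have counted the letters people carry at one position. The sketch below is a simplification that assumes one observed letter per person; `allele_frequencies`, `is_snp` and the sample lists are hypothetical names and made-up data.

```python
from collections import Counter


def allele_frequencies(alleles: list[str]) -> dict[str, float]:
    """Fraction of observations carrying each letter at one genome position."""
    counts = Counter(alleles)
    return {allele: n / len(alleles) for allele, n in counts.items()}


def is_snp(alleles: list[str], threshold: float = 0.01) -> bool:
    """True if some letter other than the most common one exceeds the threshold."""
    frequencies = sorted(allele_frequencies(alleles).values(), reverse=True)
    return any(f > threshold for f in frequencies[1:])


common_variant = ["A"] * 80 + ["C"] * 20  # the 80% / 20% split from the example
rare_mutation = ["G"] * 999 + ["T"] * 1   # made-up: only 0.1% carry the T

print(allele_frequencies(common_variant))  # {'A': 0.8, 'C': 0.2}
print(is_snp(common_variant))              # True
print(is_snp(rare_mutation))               # False
```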

On average, we have one SNP per 1,000 nucleobases. Most SNPs occur in between our genes, where we believe they have no direct effect on our bodies. Instead, they act as biological markers, helping researchers locate genes associated with diseases.

We attempt to discover new SNPs and assess their effect on certain diseases through Genome-Wide Association Studies (GWAS).

The cost to sequence genomes (read the order of all the nucleobases) is decreasing significantly, leading to more studies and greater advancements in precision medicine: tailoring medicine to how your body specifically reacts.

There are Two Types of Diseases

1. Simple Diseases: Only one variant (often a single SNP) causes the disease (e.g. sickle cell anemia, cystic fibrosis). Since there is only one variant to target, emerging technologies like CRISPR-Cas9 could one day be used to cure the disease.

2. Complex Diseases: Multiple SNPs contribute to one disease (e.g. cancers, heart disease, diabetes, schizophrenia, Alzheimer’s). It’s much harder to track down which SNPs are causing the problem, and therefore to treat it.

We’re Interested in Complex Diseases

Deaths worldwide each year from:

Heart Disease — 17.9 million
Cancer — 10 million
Diabetes — 1.6 million

That’s a lot of lives we can save. For comparison: 4.4 million people have sickle cell anemia.

Current Approach of Looking at SNPs to Diagnose Cancer

We have noticed mutations in the TP53 and VHL genes. Both are tumour suppressors, which help control the rate of cell division and growth. Specific SNPs affect certain diseases: NCOR1 and GATA3 are genes correlated with breast cancer. It’s the same for non-cancer diseases.

Problem: As we look at more and more cases involving SNPs and haplotypes (groups of SNPs on the same chromosome that are inherited together), we are unable to conclude that a mutation/variant in these genes causes cancer on its own.

Environmental factors and SNP combinations are contributing to cancer more and more, even in high-risk patients. Again, complex diseases are based on multiple SNPs and their combined effect. We can’t look at individual mutations to assess someone’s risk of developing any complex disease.

There may be a mutation in a cancer-correlated gene. But the gene may never be expressed unless the patient lives in a highly polluted environment. It may also depend on whether someone has another specific SNP, or on their diet.

Nutrition, Lifestyle and Location can all play a role in developing a complex disease

The New Kid on the Block: Polygenic Risk Scoring

Polygenic risk scoring harnesses genomic data and a few other factors to build a fixed algorithm that assesses someone’s risk of getting a complex disease. Illumina and 23andMe are both looking into polygenic risk scoring to aid with early diagnosis.

Stage 1: Design

Given a control and experimental group, data on SNPs is collected via GWAS or a biobank. The algorithm then sums the contribution of specific SNPs to a certain disease, observed through the data.

A fixed model can only take in so much data, which brings us to problem #1: we have to pick and choose which SNPs to include and which to cut. This is done through an odds ratio. SNPs that have an odds ratio > 1.3 are included in the study, meaning we can miss out on possibly important SNPs.

The odds ratio tells us how large a role a variant plays in getting a certain disease (a value of 1 means no association). Again, it only measures a variant’s individual impact.
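As a rough sketch of that filtering step, the odds ratio for one SNP can be computed from four counts: people with and without the variant, among those with the disease and among those without it. The SNP names and counts below are invented, and `odds_ratio` is a hypothetical helper; only the 1.3 cutoff comes from the text above.

```python
def odds_ratio(cases_with: int, cases_without: int,
               controls_with: int, controls_without: int) -> float:
    """Odds of carrying the variant among cases divided by the odds among controls."""
    return (cases_with * controls_without) / (cases_without * controls_with)


# Made-up counts per SNP:
# (cases with, cases without, controls with, controls without)
counts = {
    "snp_a": (60, 40, 40, 60),
    "snp_b": (52, 48, 50, 50),
    "snp_c": (30, 70, 20, 80),
}

THRESHOLD = 1.3  # inclusion cutoff mentioned in the text

kept = []
for snp, table in counts.items():
    ratio = odds_ratio(*table)
    print(f"{snp}: odds ratio {ratio:.2f}")
    if ratio > THRESHOLD:
        kept.append(snp)

print(kept)  # ['snp_a', 'snp_c']
```

Here `snp_b` is dropped because its odds ratio is barely above 1, even though it could still matter in combination with other SNPs.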

With the key SNPs and their impact, we can create a receiver operating characteristic (ROC) curve. This plots the true-positive rate against the false-positive rate, one point per decision threshold, with each point computed from a confusion matrix🤔.

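Key Takeaway: Given a ROC curve, we can determine the area under the curve (AUC), which tells us how well the score separates people who develop the disease from people who don’t (0.5 is a coin flip, 1.0 is perfect). On the distribution of risk scores, the farther to the right you are, the higher chance you have of developing that disease.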
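For readers who want to see where the ROC points and the AUC come from, here is a minimal pure-Python sketch. The `scores` and `labels` are invented (1 = developed the disease, 0 = did not), and the AUC uses a simple trapezoidal rule; libraries such as scikit-learn offer equivalent functions.

```python
def roc_points(scores: list[float], labels: list[int]) -> list[tuple[float, float]]:
    """One (false-positive rate, true-positive rate) point per score threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # Everyone scoring at or above the threshold is flagged as "at risk".
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points: list[tuple[float, float]]) -> float:
    """Area under the ROC curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up risk scores and outcomes for eight people.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

curve = roc_points(scores, labels)
print(curve[:3])   # [(0.0, 0.0), (0.0, 0.25), (0.0, 0.5)]
print(auc(curve))  # 0.8125
```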

An ROC curve — By taking exact data points, we can more accurately predict diseases

Gray line: AUC representing someone’s risk of getting coronary artery disease. The extremes contain far fewer people; the majority of the world will not be at high risk.

Algorithms can either be:

Weighted (the better one): Certain SNPs have a larger contribution to getting a disease
Unweighted: All SNPs are considered to have the same impact
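The difference between the two is easiest to see side by side. In this sketch the unweighted score just counts risk alleles, while the weighted score multiplies each count by the natural log of that SNP's odds ratio, a common weighting choice and an assumption here rather than something the studies above specify. The SNP names, odds ratios and genotype are all made up.

```python
import math

# Hypothetical odds ratios for three SNPs.
odds_ratios = {"snp_a": 2.25, "snp_b": 1.4, "snp_c": 1.7}

# One person's genotype: how many copies (0, 1 or 2) of each risk allele they carry.
genotype = {"snp_a": 1, "snp_b": 2, "snp_c": 0}


def unweighted_score(genotype: dict[str, int]) -> int:
    """Every risk allele counts the same."""
    return sum(genotype.values())


def weighted_score(genotype: dict[str, int], odds_ratios: dict[str, float]) -> float:
    """Each risk allele counts in proportion to the log of its odds ratio."""
    return sum(math.log(odds_ratios[snp]) * copies for snp, copies in genotype.items())


print(unweighted_score(genotype))                     # 3
print(f"{weighted_score(genotype, odds_ratios):.3f}")  # 1.484
```

Two people with the same unweighted score can end up with very different weighted scores, depending on which SNPs their risk alleles sit on.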

Stage 2: Validation

Algorithm Designed ✅, Represented by ROC Curve ✅. Now we test on the experimental group, see how well the algorithm did and make tweaks.

Problems

The current accuracy for assessing complex diseases with polygenic risk scoring is 60–70%. That Sucks.

The algorithm can’t take in multi-dimensional data: SNP combinations, environment, nutrition and family history are excluded from the prediction
Only so many SNPs can be included in the algorithm

The Bigger and Better Kid on the Block: Machine Learning

Machine Learning, a subfield of AI, is all about letting machines learn patterns from data rather than follow hand-written rules. Instead of coding for every scenario, we want the model to learn the key features that contribute to certain diseases, mapping the connection between certain SNPs and factors and their impact on complex diseases.

Machine learning is able to take in multi-dimensional data (SNPs + other factors), which can lead to much more accurate predictions. Deep Learning and Support Vector Machines (SVM) have been the two most promising models.

SVMs had an 84% accuracy at diagnosing diabetes compared to polygenic risk scoring’s 71%! An artificial neural network (ANN) was able to diagnose obesity with 99% accuracy. CRAZY!

The Machine Learning Process

1. Data Collection: From GWAS or a biobank
2. Data Cleaning: Remove extremely rare SNPs and perform feature selection (deciding which inputs/data points should be considered in our prediction)
3. Select a Model: SVMs, neural networks, logistic regression, etc.
4. Create the Predictor: Take a neural network as an example. Individual SNPs, SNP combinations, and other factors are fed in as inputs, and the model predicts whether the patient has the disease or not. Constantly adjusting, the NN determines the most important factors and SNPs and weights them accordingly. The results are summarized with a ROC curve and its AUC.
5. Test: Once again we want to validate our neural network, and assess whether there are inconsistencies in the data or whether another model would perform better.
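The whole process can be sketched end to end in plain Python. To keep it short, this example swaps in logistic regression (one of the model options listed above) trained by gradient descent instead of a neural network, and it runs on simulated people rather than real GWAS or biobank data. Every name, effect size and threshold in it is an assumption made for illustration.

```python
import math
import random

random.seed(0)  # reproducible toy data


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


# 1. Data collection: simulated stand-in for GWAS / biobank records.
def simulate_person() -> tuple[dict, int]:
    person = {
        "snp_a": sum(random.random() < 0.3 for _ in range(2)),  # 0, 1 or 2 risk alleles
        "snp_b": sum(random.random() < 0.4 for _ in range(2)),
        "snp_rare": 0,
        "polluted_area": int(random.random() < 0.5),  # environmental factor
    }
    # Made-up effect sizes that decide who develops the disease.
    logit = -2.0 + 0.8 * person["snp_a"] + 0.5 * person["snp_b"] + 1.0 * person["polluted_area"]
    return person, int(random.random() < sigmoid(logit))


people, labels = zip(*(simulate_person() for _ in range(400)))
people[0]["snp_rare"] = 1  # exactly one carrier of the rare variant


# 2. Data cleaning: drop SNPs whose risk-allele frequency is below 1%.
def allele_frequency(snp: str) -> float:
    return sum(p[snp] for p in people) / (2 * len(people))


snps = [s for s in ("snp_a", "snp_b", "snp_rare") if allele_frequency(s) >= 0.01]


def features(person: dict) -> list[float]:
    # Bias term, each kept SNP, one SNP combination, one environmental factor.
    combo = person["snp_a"] * person["snp_b"]
    return [1.0] + [float(person[s]) for s in snps] + [float(combo), float(person["polluted_area"])]


X = [features(p) for p in people]
train_X, test_X = X[:300], X[300:]
train_y, test_y = labels[:300], labels[300:]


# 3 and 4. Select a model and create the predictor: logistic regression.
def predict_probability(weights: list[float], row: list[float]) -> float:
    return sigmoid(sum(w * x for w, x in zip(weights, row)))


def log_loss(weights, rows, targets) -> float:
    total = 0.0
    for row, y in zip(rows, targets):
        p = predict_probability(weights, row)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(rows)


def train(rows, targets, steps: int = 500, learning_rate: float = 0.1) -> list[float]:
    weights = [0.0] * len(rows[0])
    for _ in range(steps):
        gradient = [0.0] * len(weights)
        for row, y in zip(rows, targets):
            error = predict_probability(weights, row) - y
            for j, x in enumerate(row):
                gradient[j] += error * x / len(rows)
        weights = [w - learning_rate * g for w, g in zip(weights, gradient)]
    return weights


weights = train(train_X, train_y)

# 5. Test: score the model on people it has never seen.
predictions = [int(predict_probability(weights, row) >= 0.5) for row in test_X]
accuracy = sum(p == y for p, y in zip(predictions, test_y)) / len(test_y)
print(f"kept SNPs: {snps}")
print(f"held-out accuracy: {accuracy:.2f}")
```

The learned `weights` play the role the text describes: the model itself decides how much each SNP, SNP combination and environmental factor should count.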

The Future

We can use our genomic data and machine learning to predict one’s risk of getting a certain disease🤯.

Implications and Additions:

Early Treatment: In diseases like cancers, catching the spread early is key and can be the difference between life and death
Precision Medicine: We will be able to make a specific diagnosis based on your genetic code and lifestyle
Pharmacogenomics: Determining how people will react to certain medicines. Will there be harmful side effects for certain people? Will certain medicines work better on people with certain SNPs?
Epigenetics: How does the expression of different gene variants affect their impact on complex diseases?
Certainty: No more surprises!

You wake up and spit into a tube. Using next-generation sequencing or microarrays, we see any SNPs or mutations that exist in your genome. That data is analyzed by a machine learning algorithm…You don’t have cancer. Breathe.

Let’s make that a reality.

Key Takeaways:

Complex diseases are caused by multiple variations in our genome
Complex diseases cost millions of lives every year
We can analyze our genomic data with AI to assess our risk of developing a certain disease
That information can help personalize treatment, start interventions earlier and save lives
None of this has happened before because we are always reactive: we look at symptoms rather than genetic data and use inaccurate tools to read the enormous amount of data we have

Before You Go

Connect with me on LinkedIn
Check out my Personal Website
Check out my YouTube
You can reach out to me at adamomarali37@gmail.com for any questions or if you want to chat!