Machine Learning in Bioinformatics: An Intersection of Biology, Computer Science and Statistics, Explained Easy

As we enter the age of Big Data, AI, and Data Science, huge volumes of data are being generated not only by servers and people but also by sensors, mobile phones, the IT industry, the health sector, medical and biotechnology devices such as MRI scanners, biosensors, and microarray assays, high-performance technologies, scientific research, and more.

Besides digitalization, a new trend called the Internet of Things (IoT) has emerged and is growing rapidly. It networks man-made things such as home appliances, cars, weapons, and traffic lights so that they can share the data captured by their many sensors and use it to predict and make intelligent operational decisions. This creates machine-to-machine connections and an enormous amount of data.

Given the annual growth in generated data, the total is likely to reach 44 zettabytes (44 trillion gigabytes) by the end of 2020 (Source: International Data Corporation, INDC_1672).
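To make the size of that number concrete, here is a small sketch of the unit conversion. It assumes decimal (SI) units, where one zettabyte is 10^21 bytes and one gigabyte is 10^9 bytes; the function name is just an illustrative choice.

```python
# Decimal (SI) units are assumed: 1 zettabyte = 10**21 bytes, 1 gigabyte = 10**9 bytes.
BYTES_PER_ZETTABYTE = 10**21
BYTES_PER_GIGABYTE = 10**9


def zettabytes_to_gigabytes(zettabytes: int) -> int:
    """Convert zettabytes to gigabytes using integer arithmetic."""
    return zettabytes * BYTES_PER_ZETTABYTE // BYTES_PER_GIGABYTE


gigabytes = zettabytes_to_gigabytes(44)
print(f"{gigabytes:,} GB")  # 44,000,000,000,000 GB, i.e. 44 trillion gigabytes
```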


However, not all of the generated data is useful for predictive analysis. Only a small part of it is, and that part is called “Metadata” or “Target-rich data”.

The evident growth of biological datasets in bioinformatics research has increased the demand for complex, optimizable data analytics tools, standardized big data architectures, and methods that address two main challenges: storing data efficiently, and extracting valuable, medically relevant information from these ever-increasing datasets. The second challenge goes beyond annotation. It also means providing information in the form of testable models that support prognosis, prediction, decision making, and intelligent control, using fields at the intersection of computer science and statistics such as Artificial Intelligence (AI) and Machine Learning (ML).

AI, Data Science, Machine Learning, and Deep Learning have become the buzzwords of our era and are often used incorrectly and interchangeably.

Questions that come to the minds of non-computer-science graduates are: Are all these words the same? Do they have the same goals? What is the difference?

Let’s answer them all.

Artificial intelligence comprises a huge set of tools that combine computer science and robust datasets to make computers behave rationally and intelligently, solving complex and combinatorial problems by mimicking the problem-solving and decision-making skills of the human mind.

It includes various subsets such as Machine Learning, Deep Learning, Robotics, and Neural Networks, which are built on AI algorithms and are usually mentioned in conjunction with AI. Of these, Machine Learning is one of the most popular and has many applications that overlap with several other fields.

Machine learning builds a statistical representation of a real-world process from the volumes of data generated every day. It uses a set of computational tools and AI algorithms to optimize a model or performance criterion using example data or past experience. ML has two main goals: predicting future events (a weather forecast, the next move in a game, a robot deciding its path, etc.) and inferring the causes of events, their patterns of occurrence, and their behavior. It is an interdisciplinary mix of Statistics and Computer Science in which the rules are learned from data rather than explicitly programmed. ML models learn patterns from existing data and apply them to new data, and they need high-quality data to make accurate predictions.

Existing data is loaded into an ML model to generate predicted and descriptive data (Source: Image by Author)
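The idea of learning a pattern from existing data and applying it to new data can be sketched with one of the simplest possible models, a straight line fitted by ordinary least squares. This is only an illustration: the numbers are made up, and the names `fit_line` and `predict` are hypothetical helpers, not part of any library.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Existing data (made up for illustration): hours of light vs. plant growth in cm.
hours = [1, 2, 3, 4, 5]
growth = [2.1, 3.9, 6.2, 7.8, 10.1]

# "Training": the model learns the pattern from the existing data.
slope, intercept = fit_line(hours, growth)


def predict(x):
    """Apply the learned pattern to a new, unseen input."""
    return slope * x + intercept


print(f"learned pattern: growth = {slope:.2f} * hours + {intercept:.2f}")
print(f"prediction for 6 hours: {predict(6):.2f} cm")
```

Real ML models are usually far more flexible than a straight line, but the workflow is the same: fit on existing data, then predict on new data.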

AI and Data Science have different goals: AI refers to the intelligence of computers, whereas Data Science is about using data to discover and communicate insights. ML is an important tool that uses AI algorithms to perform data science work, transforming data into relevant knowledge iteratively and interactively. That is the distinction between these three terms.


There are 3 types of Machine Learning Models:

1. Reinforcement Learning: ML models decide on sequential actions in potentially complex environments. This requires complex mathematical calculations, powerful computer infrastructure, and a simulation environment prepared for the task at hand. The computer uses trial and error to come up with solutions, which lets the machine learn from its errors: the algorithm receives rewards or penalties based on the actions it performs. Examples include game-like situations, autonomous cars, and deciding chess moves.
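The reward-and-penalty loop can be sketched with tabular Q-learning, one common reinforcement learning algorithm, chosen here only for illustration. The environment is a made-up five-cell corridor in which the agent starts at the left end and is rewarded for reaching the right end; all names and parameter values are assumptions for this toy example.

```python
import random

N_STATES = 5          # positions 0..4 in a corridor; position 4 is the goal
ACTIONS = (-1, +1)    # step left or step right
GOAL = N_STATES - 1


def step(state, action):
    """Toy simulation environment: move, then return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), GOAL)
    if next_state == GOAL:
        return next_state, 1.0, True     # reward for reaching the goal
    return next_state, -0.1, False       # small penalty for every other move


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn action values by trial and error (tabular Q-learning)."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # q[state][action_index]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Sometimes explore a random action, otherwise use the best known one.
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = max(range(2), key=lambda i: q[state][i])
            next_state, reward, done = step(state, ACTIONS[a])
            # Move the estimate toward the reward plus the discounted future value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q


q_table = train()
# Best learned action for each non-goal position (+1 means "step right").
policy = [ACTIONS[max(range(2), key=lambda i: q_table[s][i])] for s in range(GOAL)]
print(policy)
```

After training, the learned policy should prefer stepping right in every position, since that is the shortest way to the reward.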

Before coming to the other two types, let us first understand how the data is characterized in a dataset:

In a dataset, the data is categorized into the Target Variable, Labels for the Target Variable, and Features. The target variable is the one we want to predict. Labels are the values it takes, which can be numbers or categories, e.g., True/False or Yes/No. Features are the different pieces of information recorded as observations and are used to predict the target variable. Machine learning models analyze many features at a time to find the relationships between different features. We input Labels and Features as data to train a model.

An example Thyroid disease dataset categorized into Features, Target Variable, and Labels (Source: Image by Author)
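The split into features, target variable, and labels can be sketched in a few lines. The records below are entirely made up, and the column names (`age`, `tsh_level`, `family_history`, `thyroid_disease`) are hypothetical; they are not taken from the dataset in the figure.

```python
# Hypothetical patient records; column names and values are made up for illustration.
records = [
    {"age": 34, "tsh_level": 1.8, "family_history": False, "thyroid_disease": "No"},
    {"age": 52, "tsh_level": 7.4, "family_history": True, "thyroid_disease": "Yes"},
    {"age": 45, "tsh_level": 6.1, "family_history": False, "thyroid_disease": "Yes"},
]

TARGET = "thyroid_disease"   # the target variable: what we want to predict

# Every other column is a feature.
feature_names = [name for name in records[0] if name != TARGET]
features = [[row[name] for name in feature_names] for row in records]

# The labels are the values the target variable takes for each observation.
labels = [row[TARGET] for row in records]

print(feature_names)   # names of the feature columns
print(features)        # one list of feature values per patient
print(labels)          # one label per patient
```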

2. Supervised Learning: In supervised machine learning models, the training data is labeled. We input labels together with features such as age, family history, and comorbid conditions to train the model and then use it for predictive studies.
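As a minimal sketch of supervised learning, here is a k-nearest-neighbours classifier, one simple supervised method picked for illustration. The training data is made up, and the two features (age and a hypothetical TSH level) are assumptions for the example.

```python
from collections import Counter
from math import dist

# Hypothetical labeled training data: (age, TSH level) -> diagnosis label.
training_features = [(25, 1.5), (31, 2.0), (40, 2.4), (48, 6.8), (55, 7.9), (62, 9.1)]
training_labels = ["No", "No", "No", "Yes", "Yes", "Yes"]


def predict_knn(sample, k=3):
    """Label a new sample by majority vote among its k nearest training points."""
    nearest = sorted(
        zip(training_features, training_labels),
        key=lambda pair: dist(pair[0], sample),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(predict_knn((58, 8.2)))   # a new patient close to the "Yes" examples
print(predict_knn((29, 1.7)))   # a new patient close to the "No" examples
```

The labels are what make this supervised: the model can only vote because each training point already carries a known answer. In practice the features would usually be scaled first so that one of them, such as age, does not dominate the distance.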

3. Unsupervised Learning: In unsupervised machine learning models, the training data is not labeled; it has features only. Consider again a dataset containing information on Thyroid disease patients. We know that every patient responds differently to different treatments. So we can use unsupervised ML models to discover different “Types/Categories” of patients by passing the feature observations to a clustering model, which groups patients based on the similarity of their features. We can then research better treatments for each category.

Now, if a new patient comes in, we can input their features into the unsupervised ML model, find out which patient “Type” they fit, and prescribe treatment accordingly without much delay. In the real world, data doesn’t always come with labels, and labeling requires a lot of manual labor. Unsupervised learning models are therefore common and often preferable when no labels are available, since the model finds the patterns on its own.
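The grouping of patients and the assignment of a new patient to a group can be sketched with k-means, one widely used clustering algorithm chosen here for illustration. The patient values are made up, no labels are used, and the starting centroids are simply picked from the data so that the result is reproducible.

```python
from math import dist


def nearest_centroid(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))


def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to centroids and recomputing centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            clusters[nearest_centroid(point, centroids)].append(point)
        # New centroid = mean of its cluster; keep the old one if the cluster is empty.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids


# Hypothetical unlabeled patients: (age, TSH level). No diagnosis column is used.
patients = [(25, 1.5), (31, 2.0), (28, 1.8), (55, 7.9), (62, 9.1), (58, 8.4)]

centroids = k_means(patients, centroids=[patients[0], patients[3]])

# A new patient is assigned to whichever patient "Type" (cluster) is closest.
print(nearest_centroid((60, 8.8), centroids))
print(nearest_centroid((27, 1.6), centroids))
```

The cluster numbers themselves carry no meaning; it is up to the researcher to look at the patients in each group and interpret what the "Type" represents.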

Unsupervised learning is mainly useful for anomaly detection and for clustering, which is the division of data into groups based on similarity.
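Anomaly detection can be sketched in its simplest form as flagging values that lie far from the rest. The z-score rule below is just one basic approach, and the readings and the threshold of 2 standard deviations are assumptions made for this example.

```python
from statistics import mean, stdev


def find_anomalies(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical readings; 14.5 is deliberately far from the rest.
readings = [1.8, 2.1, 1.9, 2.4, 2.0, 2.2, 1.7, 14.5, 2.3, 1.9]
print(find_anomalies(readings))   # [14.5]
```

No labels are needed here either: the unusual reading stands out only because of how it compares with the other observations.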

AI/ML Applications in Biology

AI/ML techniques are used for knowledge extraction and for predictive and descriptive analysis in many domains of biological research and development, such as Genomics, Proteomics, Text Mining, Systems and Structural Biology, and Microarrays.

Genomics is one of the most popular and important areas of Computational Biology. Given the increasing number of genomic sequences, a vast number of bioinformatics tools are required to process the data and generate useful information such as gene positions and structures and the identification of non-coding RNA genes and gene regulatory elements. Computational AI/ML models are employed to predict the functions of newly discovered genes and RNA secondary structures.

Proteomics is the large-scale study of proteins, the vital parts of living organisms whose varied functions turn the information contained in genes into life. It requires computational and statistical methods such as Hidden Markov Models, ML/AI, and neural networks to solve the very complex, combinatorial problem of protein 3D structure prediction, which in turn gives insight into protein function.

AI/ML also has applications in Systems Biology, where Biology and Machine Learning come together: AI/ML models are used to model the life processes taking place inside the cell, including biological, genetic, and metabolic networks and signal transduction pathways.

Various evolutionary studies make use of machine learning and statistical approaches for phylogenetic tree construction. Phylogenetic trees are schematic representations of the evolutionary relationships among organisms, reflecting how different species have evolved from a series of common ancestors, based on features such as morphology and metabolism.
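As a rough sketch of how a tree can be built from data, the example below uses UPGMA, a classic distance-based hierarchical clustering method picked only for illustration. It assumes short, already aligned sequences as the features and the number of mismatches as the distance; the species names and sequences are made up and have no biological meaning.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def upgma(sequences):
    """Build a tree (nested tuples) by repeatedly merging the two closest clusters."""
    # Each key is a (sub)tree; each value lists the species it contains.
    clusters = {name: [name] for name in sequences}

    def cluster_distance(t1, t2):
        # UPGMA distance: average distance over all cross-cluster pairs of species.
        pairs = [(a, b) for a in clusters[t1] for b in clusters[t2]]
        return sum(hamming(sequences[a], sequences[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        t1, t2 = min(combinations(clusters, 2), key=lambda pair: cluster_distance(*pair))
        members = clusters.pop(t1) + clusters.pop(t2)
        clusters[(t1, t2)] = members
    return next(iter(clusters))


# Made-up aligned sequences for four hypothetical species.
toy_sequences = {
    "species_A": "ACGTACGTAC",
    "species_B": "ACGTACGTAT",
    "species_C": "ACGAACCTAT",
    "species_D": "TCGAGCCTTT",
}

tree = upgma(toy_sequences)
print(tree)
```

Species that end up in the same inner tuple were merged early, meaning their sequences are the most similar, which the tree presents as sharing a more recent common ancestor.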

Computational models are also used to manage and analyze the large, complex experimental datasets produced by microarray assays. This entails pre-processing the data to form training data (existing data to learn from), training and building the ML models, and then performing the analysis. Microarray data has applications in expression pattern identification, genetic and metabolic network induction, and classification.

Text data mining involves transforming unstructured text into a standardized, structured format that is easier to store and to process with analysis and machine learning algorithms. Advanced analytical techniques such as deep learning algorithms, Support Vector Machines (SVM), and Naive Bayes are then used to find meaningful patterns and explore hidden relationships within the unstructured data.
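The first step, turning unstructured text into a structured format, can be sketched with a bag-of-words table, one simple and common representation chosen here for illustration. The two sentences and the gene names in them are made up.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Made-up sentences standing in for unstructured text such as article abstracts.
documents = [
    "geneA activates geneB in liver cells",
    "geneB inhibits geneC in liver cells",
]

counts = [Counter(tokenize(doc)) for doc in documents]

# Structured form: one column per word, one row per document, cells are word counts.
vocabulary = sorted({word for doc_counts in counts for word in doc_counts})
matrix = [[doc_counts[word] for word in vocabulary] for doc_counts in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

A numeric table like this is the kind of structured input that classifiers such as Naive Bayes or SVMs can then be trained on.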

Concluding Remarks

One of the challenges faced by bioinformaticians like us is that present-day big data analysis tools operate in batch mode, are very slow, have almost no optimization for iterative processing, and have high data dependency among operations.

But the introduction of multi-view machine learning algorithms has greatly reduced these limitations and the I/O cost, and has improved iterative processing. ML tools are among the most frequently used and most promising tools in bioinformatics data analysis, handling data at both small and large scale with techniques such as sampling, distributed computation, and feature selection. A lot of effort is being made to extend Big Data in Biology using incremental, parallel, and multi-view clustering machine learning models to handle complex bioinformatics problems and make research efficient and cost-effective.

I personally feel that we biology researchers need to learn the main concepts behind how AI/ML models work and get our hands on Computer Science and Statistics in order to gain valuable insights and tackle present problems with an interdisciplinary approach. I came across a lot of articles on the web about AI/ML, but most of them were loaded with technical words and complex definitions. After a good amount of time, searching, and YouTube videos, I at last understood the basic concepts and wanted to share them in easy language. This article was meant to introduce everyone to the basics of AI/ML and the differences between these terms.

Attaching links to my favorite articles and YouTube videos below:

1. Machine Learning Basics | What is Machine Learning | Introduction to Machine Learning by Simplilearn
2. AI VS ML VS DL VS Data Science by Krish Naik
3. Larranaga P et al. (2006), Machine Learning in Bioinformatics, Briefings in Bioinformatics, Vol 7, Issue 1
4. Naresh E et al. (2020), Impact of Machine Learning in Bioinformatics Research, e-Chapter

