Identifying Antibiotic-Resistant Bacteria (ML binary problem)

Identifying Antibiotic-Resistant for different Bacterian diseases is one of the most interesting real-world problems, which in most cases is a binary classification with zero and one value as the target (whether there is a resistance or not). If we want to solve this problem with machine learning models, we need a clean dataset with class labels, which would be a supervised binary classification problem. In this study, I will introduce an example of this problem and try to predict the resistance with a classifier.

For this study, I consider an example of Neisseria gonorrhoeae bacteria. Most bacteria diseases are transmitted infection easily. In this specific case, the disease mostly has no symptoms, and it helps the disease to spread. The resistance of different bacteria to their antibiotics is rising over time and makes the treatment more difficult.

There are different available antibiotics for this treatment such as ciprofloxaxcin, azithromycin, and ceftriaxone. These antibiotics during times failed too many times due to resistance to the drug becoming too common. In the following, you can see the collected data distribution in years.

Data distribution

To build a proper dataset and predict the resistance, you need to have the antibiotics and their DNA sequences. To build the model dataset, we use unitigs [1], which is an approach to present the DNA variation in bacteria.

As for the class labels, I mentioned that we have a binary class classification problem with zero and one.

class target distribution

We can build the dataset for a different antibiotic, for instance, Azithromycin with azm_sr as the target class. For instance, in the following, you can find the resistance of Azithromycin for different continents and features.

Feature relationship

To build a dataset, we have to apply data wrangling and combine the target labels with DNA sequences of bacteria, which is a logical matrix with zero and one value for each feature and each feature represents the DNA character.

To download the related python code to this article, you can check my kaggle profile, here.

The Machine Learning goal would be to build a model to predict the resistance of a bacteria to a specific antibiotic. Another important aspect of this kind of study, besides the target prediction, is to know the importance of each feature for further analysis. For instance, in the following, you can observe the importance of different DNA sequences with regard to the azithromycin antibiotic as the class label. Note that the feature importance analysis will be after training the machine learning model. You can apply any classifier for this problem.

Feature importance

A simple flowchart of the told process is projected as follows.



1- Jaillard, Magali, et al. “A fast and agnostic method for bacterial genome-wide association studies: Bridging the gap between k-mers and genetic events.” PLoS genetics 14.11 (2018): e1007758.



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store