Breast Cancer Microarray Predictor

Breast cancer is an uncontrolled growth of breast cells that occurs mostly in women and rarely in men. In Latin America and the Caribbean, there were about 200,000 cases in 2018, of which approximately 50,000 were fatal.

This is a machine learning project that uses a microarray dataset with more than 54,000 gene expressions of breast cancer. The goal was to develop an algorithm capable of analyzing and predicting different breast cancer types: luminal A, luminal B, HER, basal, cell line, and normal tissue.

Exploratory Data Analysis (EDA)

For this case I used dimensionality reduction techniques, such as Principal Component Analysis (PCA), to plot the dispersion of the 6 classes in 2 dimensions.

PCA
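To make the projection step concrete, here is a minimal sketch of a 2-D PCA computed with NumPy's SVD. It is not the project's actual code: the function name `pca_2d` and the synthetic data (60 samples, 500 features, 6 classes) are assumptions used only for illustration.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X (samples x genes) onto the first two principal components."""
    X_centered = X - X.mean(axis=0)            # centre every gene at zero mean
    # SVD of the centred matrix: the rows of Vt are the principal axes
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    coords = X_centered @ Vt[:2].T             # (n_samples, 2) coordinates to plot
    explained = (S ** 2) / np.sum(S ** 2)      # share of variance per component
    return coords, explained[:2]

# Synthetic stand-in for the microarray (assumption): 6 classes, 10 samples each
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(6), 10)
centers = rng.normal(scale=3.0, size=(6, 500))
X = centers[labels] + rng.normal(size=(60, 500))

coords, ratio = pca_2d(X)
print(coords.shape, ratio)
```

The two columns of `coords` can then be passed to any scatter plot, colored by `labels`, to see how well the classes separate.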

Model performance

In this section I evaluated two different cases. The first used all the gene expressions in the dataset. The second used feature selection: I tested the performance of different machine learning models using only 153 gene expressions, which resulted in better evaluation metrics and a remarkable reduction in training time.
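The text does not say which feature selection method produced the 153 genes, so the following is only a sketch of one common option: ranking genes by a one-way ANOVA F-score and keeping the top 153. The helper names `anova_f_scores` and `select_top_genes`, and the synthetic data, are assumptions.

```python
import numpy as np

def anova_f_scores(X, y):
    """One-way ANOVA F statistic of every column of X against the class labels y."""
    classes = np.unique(y)
    n, k = X.shape[0], classes.size
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        ss_between += Xc.shape[0] * (mean_c - grand_mean) ** 2
        ss_within += ((Xc - mean_c) ** 2).sum(axis=0)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def select_top_genes(X, y, n_genes=153):
    """Return the sorted column indices of the n_genes highest-scoring genes."""
    scores = anova_f_scores(X, y)
    top = np.argsort(scores)[::-1][:n_genes]
    return np.sort(top)

# Synthetic data (assumption): 60 samples, 2000 genes, only the first 153 informative
rng = np.random.default_rng(0)
y = np.repeat(np.arange(6), 10)
X = rng.normal(size=(60, 2000))
X[:, :153] += 2.0 * y[:, None]     # informative genes shift with the class index

selected = select_top_genes(X, y, n_genes=153)
X_small = X[:, selected]           # reduced matrix used to train the models
print(X_small.shape)
```

Training on `X_small` instead of `X` is what shortens training time: each model sees 153 columns rather than tens of thousands.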

54,675 Genes

In this case, the model with the best performance was the Random Forest Classifier. Its confusion matrix is shown below.

Figure: confusion matrix of the Random Forest Classifier (54,675 genes)

153 Genes

The models with the lowest error were the SVM and the Multilayer Perceptron (MLP); both predict the different types of cancer with high accuracy. It’s important to mention that in this case the classes were balanced using the bootstrapping technique. The confusion matrix below corresponds to the MLP.

Figure: confusion matrix of the Multilayer Perceptron (153 genes)

As you can see in the figure, this model predicts 70 instances correctly and only 2 incorrectly, which means it is capable of classifying different types of breast cancer with high accuracy by analyzing gene expressions.
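The class balancing mentioned in this subsection can be sketched as a per-class bootstrap: every class is resampled with replacement up to the size of the largest class. This is an illustration under assumptions, not the project's code; `bootstrap_balance`, the sample IDs, and the class counts are hypothetical.

```python
import random
from collections import Counter, defaultdict

def bootstrap_balance(samples, labels, seed=0):
    """Resample every class with replacement up to the size of the largest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        # bootstrap draw: `target` picks from this class, with replacement
        for sample in rng.choices(group, k=target):
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels

# Hypothetical, imbalanced toy data: sample IDs stand in for expression profiles
samples = [f"s{i}" for i in range(10)]
labels = ["luminal A"] * 5 + ["basal"] * 3 + ["HER"] * 2

balanced_samples, balanced_labels = bootstrap_balance(samples, labels)
print(Counter(balanced_labels))
```

Resampling of this kind is usually applied to the training split only, so that duplicated samples do not end up in both training and test data.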

Scores

  • Random Forest, F1-score: 86.95%
  • Multilayer Perceptron, accuracy: 97.22%
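As a quick check of how these scores relate to a confusion matrix, the sketch below computes accuracy from the 70 correct and 2 incorrect MLP predictions reported above, and a macro-averaged F1 from a small matrix. The 3x3 matrix is hypothetical, and macro averaging is an assumption, since the text does not say how the Random Forest F1-score was averaged.

```python
def accuracy(correct, total):
    """Fraction of instances classified correctly."""
    return correct / total

def macro_f1(cm):
    """Macro-averaged F1, where cm[i][j] counts true class i predicted as class j."""
    n = len(cm)
    f1_scores = []
    for i in range(n):
        tp = cm[i][i]
        fp = sum(cm[r][i] for r in range(n)) - tp   # column total minus the diagonal
        fn = sum(cm[i]) - tp                        # row total minus the diagonal
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / n

# MLP figures from the text: 70 correct out of 70 + 2 = 72 instances
print(f"{accuracy(70, 72):.2%}")   # 97.22%

# Hypothetical 3-class confusion matrix, for illustration only
cm = [[5, 1, 0],
      [0, 4, 0],
      [1, 0, 4]]
print(round(macro_f1(cm), 4))
```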

Link to the GitHub Repository

Download my Article in Spanish