Breast Cancer Tumor

This is a machine learning project built on the Breast Cancer Wisconsin dataset. The objective is to predict the diagnosis from features of the tumor such as area, perimeter, radius, smoothness, compactness, concavity, symmetry and texture, among others.

The dataset used is the Breast Cancer Wisconsin dataset, which can be downloaded from Kaggle. The project produces a tool that estimates whether a tumor is malignant or benign by analyzing its features. Logistic Regression, Support Vector Machines, K-Nearest Neighbors and Decision Trees obtained the best performance, with accuracy above 90 % and F1-score above 93 %.

Features of the data

The Breast Cancer Wisconsin dataset contains the following columns:

Id
Diagnosis
Radius_mean
Texture_mean
Perimeter_mean
Area_mean
Smoothness_mean
Compactness_mean
Concavity_mean
Concave points_mean
Symmetry_mean
Fractal_dimension_mean
Radius_se
Texture_se
Perimeter_se
Area_se
Smoothness_se
Compactness_se
Concavity_se
Concave points_se
Symmetry_se
Fractal_dimension_se
Radius_worst
Texture_worst
Perimeter_worst
Area_worst
Smoothness_worst
Compactness_worst
Concavity_worst
Concave points_worst
Symmetry_worst
Fractal_dimension_worst
Unnamed: 32

Data Cleaning

Drop the Id and Unnamed: 32 columns, since they do not provide useful information.
Replace the categorical values of Diagnosis, ‘M’ (malignant) and ‘B’ (benign), with the numerical values 0 and 1 respectively.
Scale the values in the feature columns to a range of 0 to 100 so that all features are on a comparable scale.
Finally, save the processed data.
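
The steps above could be sketched with pandas roughly as follows. This is a minimal sketch under stated assumptions, not the project's actual code: the function name `clean`, the lowercase column names (`id`, `diagnosis`), the output file name in the final comment, and the use of min-max scaling are all assumptions, and the tiny DataFrame only stands in for the real CSV.

```python
import pandas as pd


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps listed above (hypothetical helper)."""
    # 1. Drop the columns that carry no predictive information.
    #    errors="ignore" keeps this working if a column is already absent.
    df = df.drop(columns=["id", "Unnamed: 32"], errors="ignore")

    # 2. Encode the diagnosis as described above: 'M' -> 0, 'B' -> 1.
    df["diagnosis"] = df["diagnosis"].map({"M": 0, "B": 1})

    # 3. Min-max scale every feature column to the range 0..100.
    #    (A constant column would give 0/0 = NaN and need special handling.)
    features = df.columns.drop("diagnosis")
    lo, hi = df[features].min(), df[features].max()
    df[features] = (df[features] - lo) / (hi - lo) * 100
    return df


# Toy stand-in for the real CSV; the values are made up for illustration.
raw = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [20.0, 10.0, 15.0],
    "texture_mean": [30.0, 10.0, 20.0],
    "Unnamed: 32": [None, None, None],
})
cleaned = clean(raw)
print(cleaned)

# 4. With the real data, the result could then be saved, for example:
# cleaned.to_csv("data_processed.csv", index=False)
```

Note that scaling the whole table before splitting lets test-set statistics influence the scaler. If that matters, the scaler can instead be fitted on the training split only.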

Exploratory Data Analysis (EDA)

Check the correlation between each feature and the diagnosis

[Figure: correlation between the features and the diagnosis]
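
A correlation check of this kind might look like the sketch below. It is an illustration rather than the project's code: to stay runnable without the Kaggle CSV it loads the copy of the Wisconsin diagnostic data bundled with scikit-learn, whose column names differ from the ones listed above (for example `mean radius` instead of `Radius_mean`), and renaming `target` to `diagnosis` is an assumption made for readability.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Bundled copy of the dataset; in it, 0 = malignant and 1 = benign.
data = load_breast_cancer(as_frame=True)
df = data.frame.rename(columns={"target": "diagnosis"})

# Pearson correlation of every feature with the diagnosis column.
corr = df.corr()["diagnosis"].drop("diagnosis")

# Rank by absolute value: the sign only reflects the 0/1 encoding.
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked.head(10))
```

Features near the top of `ranked` are the ones most strongly (linearly) associated with the diagnosis.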

Scatter plots of different features

[Figure: scatter plots of different features]
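
Scatter plots like these could be produced along the following lines. This is only a sketch: the feature pairs, the output file name, and the use of scikit-learn's bundled copy of the data (with its own column names such as `mean radius`) are assumptions chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target  # y: 0 = malignant, 1 = benign

# Feature pairs picked only as examples.
pairs = [("mean radius", "mean texture"),
         ("mean concavity", "mean symmetry")]

fig, axes = plt.subplots(1, len(pairs), figsize=(10, 4))
for ax, (fx, fy) in zip(axes, pairs):
    # One scatter call per class so each class gets its own colour and legend entry.
    for label, name in enumerate(data.target_names):
        mask = y == label
        ax.scatter(X.loc[mask, fx], X.loc[mask, fy], s=10, alpha=0.6, label=name)
    ax.set_xlabel(fx)
    ax.set_ylabel(fy)
    ax.legend()
fig.tight_layout()
fig.savefig("scatter_features.png")  # or plt.show() in an interactive session
```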

Model performance

The models I tried are the following:

Logistic Regression
Support Vector Machines
K-Nearest Neighbors
Decision Trees
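
A comparison of these four models might be set up as in the sketch below. The hyperparameters, the 80/20 stratified split, the random seed, and the use of scikit-learn's bundled copy of the dataset are assumptions, so the numbers it prints will not necessarily match the ones reported in this README.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # y: 0 = malignant, 1 = benign
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=5000),
    "Support Vector Machine": SVC(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

results = {}
for name, model in models.items():
    # The scaler lives inside the pipeline, so it is fitted on the training split only.
    clf = make_pipeline(MinMaxScaler(feature_range=(0, 100)), model)
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    # f1_score treats label 1 (benign here) as the positive class by default.
    results[name] = (accuracy_score(y_test, pred), f1_score(y_test, pred))

for name, (acc, f1) in results.items():
    print(f"{name:24s} accuracy={acc:.3f}  F1={f1:.3f}")
```

If the F1-score should be reported for the malignant class instead, `f1_score(y_test, pred, pos_label=0)` does that.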

The model with the best performance is Logistic Regression. Its confusion matrix is shown below: the model classifies 112 instances correctly and only 2 incorrectly, which is about 98.2 % accuracy on those 114 instances.

[Figure: confusion matrix of the Logistic Regression model]
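
For reference, a confusion matrix like this one could be computed as sketched below. The split, seed, and scaler settings are assumptions: a 20 % test split of the 569 samples gives 114 test instances, which is consistent with the 112 + 2 reported above, but the exact counts depend on the split and may differ from the project's.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)  # y: 0 = malignant, 1 = benign
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# Rows are true labels, columns are predicted labels, in the order [0, 1].
cm = confusion_matrix(y_test, clf.predict(X_test), labels=[0, 1])
correct = cm.trace()            # diagonal: correctly classified
incorrect = cm.sum() - correct  # off-diagonal: misclassified
print(cm)
print(f"correct={correct}, incorrect={incorrect}")
```

With this encoding, the top-right cell counts malignant tumors predicted as benign, which is usually the more costly kind of error.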

Link to GitHub Repository