Medical Symptom Analyzer

The goal of the application is to predict a disease based on the symptoms selected by a user. The prediction is made using machine learning models trained on data stored in a SQLite database.

Here are some key concepts involved in this code:

Machine Learning: It refers to the practice of training algorithms to learn from data, rather than relying on explicit programming to perform tasks. In this case, the algorithm is trained to predict the disease based on a user's symptoms.
SQLite: A lightweight, serverless database engine that stores an entire database in a single file, making it easy to use in small-scale applications.
Scikit-learn: A Python library for machine learning that provides various algorithms and tools for data preprocessing, model training, and evaluation. It is used here to apply several models to predict the disease.
Statistical Methods: These are used to summarize results, for example taking the mode (most common value) of the models' predictions.

1) Loading Data from SQLite Database

SQLite Connection: The function load_data_from_sqlite() connects to a local SQLite database file (training_data.db), retrieves data from the table training_table, and stores it in a Pandas DataFrame.

DataFrame: A 2D structure that holds data in a table format with rows and columns, allowing easy data manipulation and analysis.

2) Preprocessing the Data

Drop Missing Data: dropna(axis=1) removes any columns that contain missing values (NaN).

Label Encoding: The LabelEncoder is used to convert the categorical values in the prognosis column into numerical values, making them suitable for machine learning models.

Feature Matrix and Target Vector: X contains the features (symptoms), and y contains the target (disease outcomes).
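
The loading and preprocessing steps can be sketched as follows. This is a minimal illustration, not the project's actual code: the parameters db_path, table, and target, and the helper name preprocess(), are assumptions introduced here; only load_data_from_sqlite(), training_data.db, training_table, and the prognosis column come from the description above.

```python
import sqlite3

import pandas as pd
from sklearn.preprocessing import LabelEncoder


def load_data_from_sqlite(db_path="training_data.db", table="training_table"):
    """Read the whole training table into a pandas DataFrame."""
    conn = sqlite3.connect(db_path)
    try:
        # Table names cannot be bound as SQL parameters, so the name is quoted inline.
        return pd.read_sql_query(f'SELECT * FROM "{table}"', conn)
    finally:
        conn.close()


def preprocess(df, target="prognosis"):
    """Hypothetical helper: drop incomplete columns, encode the target, split X and y."""
    # axis=1 drops every column that contains at least one missing value.
    df = df.dropna(axis=1)
    # Map disease names to integer class labels (classes are sorted alphabetically).
    encoder = LabelEncoder()
    y = encoder.fit_transform(df[target])
    # Everything except the target column is treated as a symptom feature.
    X = df.drop(columns=[target])
    return X, y, encoder
```

Returning the fitted encoder alongside X and y keeps the mapping from integer labels back to disease names available at prediction time (encoder.classes_).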

3) Training the Machine Learning Models

Here, three machine learning models are trained on the data:

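SVC (Support Vector Classifier): A classification model that finds the maximum-margin hyperplane separating the classes. scikit-learn's SVC uses an RBF kernel by default, so the resulting boundary can be non-linear in the original feature space.
GaussianNB (Naive Bayes): A probabilistic model based on Bayes' theorem that assumes features are conditionally independent given the class and models each feature with a Gaussian distribution.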
Random Forest: An ensemble model that combines multiple decision trees to improve prediction accuracy.
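
A minimal sketch of training the three models, assuming default hyperparameters. The function name train_models(), the dictionary keys, and the random_state value are assumptions made for illustration; the original project may configure the models differently.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC


def train_models(X, y, random_state=0):
    """Fit the three classifiers on the same feature matrix and target vector."""
    models = {
        "svm": SVC(),
        "naive_bayes": GaussianNB(),
        # A fixed seed makes the forest reproducible; the value itself is arbitrary.
        "random_forest": RandomForestClassifier(random_state=random_state),
    }
    for model in models.values():
        model.fit(X, y)
    return models
```

Each fitted model exposes the same predict() method, which is what makes combining their outputs straightforward later on.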

4) Creating a Symptom Dictionary

Formatting: The symptom names are formatted (replacing underscores with spaces and capitalizing the first letter of each word) to make them more readable.

Symptom Dictionary: A dictionary symptom_index is created to map each symptom (a column name in X) to a unique index, namely that column's position in X.
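
The formatting and indexing described above might look like this. The function name build_symptom_index() is hypothetical, and keying the dictionary by the formatted name (rather than the raw column name) is an assumption.

```python
def build_symptom_index(columns):
    """Map a readable symptom name to the position of its column in X."""
    symptom_index = {}
    for index, column in enumerate(columns):
        # "skin_rash" -> "Skin Rash": split on underscores, capitalize each word.
        name = " ".join(word.capitalize() for word in column.split("_"))
        symptom_index[name] = index
    return symptom_index
```

For example, build_symptom_index(["itching", "skin_rash"]) returns {"Itching": 0, "Skin Rash": 1}.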

5) Predicting Disease

Input Data Creation: The function predictDisease() creates a binary input vector with one entry per symptom: 1 if the user selected that symptom, otherwise 0.

Prediction: The predictions from all three models (SVC, Naive Bayes, and Random Forest) are collected.
Final Prediction: The mode (most common prediction) of the three models is used as the final disease prediction.
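
The prediction step can be sketched as below. This is an illustration under assumptions, not the project's predictDisease(): the name predict_disease(), its signature, and the classes argument (for example encoder.classes_ from a fitted LabelEncoder) are introduced here.

```python
from statistics import mode

import numpy as np


def predict_disease(symptoms, models, symptom_index, classes):
    """Return the majority-vote disease and the individual model votes."""
    # One slot per known symptom, all zeros to start with.
    input_vector = np.zeros(len(symptom_index), dtype=int)
    for symptom in symptoms:
        input_vector[symptom_index[symptom]] = 1
    # scikit-learn models expect a 2D array: one row per sample.
    row = input_vector.reshape(1, -1)
    # Each model returns an integer label, which is mapped back to a disease name.
    votes = [classes[int(model.predict(row)[0])] for model in models]
    return mode(votes), votes
```

With three models, a three-way disagreement has no true majority. On Python 3.8 and later, statistics.mode() then returns the first vote in the list, so the order of models acts as a tie-breaker; earlier versions raise StatisticsError instead.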

View the project on GitHub