Friday, July 21, 2023

THE APPLIED DATA SCIENCE WORKSHOP: Urinary Biomarkers Based Pancreatic Cancer Classification and Prediction Using Machine Learning with Python GUI --- SECOND EDITION (VIVIAN SIAHAAN)

 Dataset

Google Play Book

Amazon Kindle

Amazon Paperback

Kobo Store



The Applied Data Science Workshop on "Urinary Biomarkers-Based Pancreatic Cancer Classification and Prediction Using Machine Learning with Python GUI" embarks on a comprehensive journey, commencing with an in-depth exploration of the dataset. During this initial phase, the structure and size of the dataset are thoroughly examined, and the various features it contains are meticulously studied. The principal objective is to understand the relationship between these features and the target variable, which, in this case, is the diagnosis of pancreatic cancer. The distribution of each feature is analyzed, and potential patterns, trends, or outliers that could significantly impact the model's performance are identified.
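
As a first orientation, a minimal sketch along these lines loads the Debernardi et al. 2020 CSV (the same file the full script below reads) and inspects its shape, feature types, and class balance:

# Quick first look at the urinary-biomarker dataset
import pandas as pd

df = pd.read_csv("Debernardi et al 2020 data.csv")
print(df.shape)                          # number of rows and columns
print(df.dtypes)                         # feature types
print(df['diagnosis'].value_counts())    # class balance of the target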


To ensure the data is in optimal condition for model training, preprocessing steps are undertaken. This involves handling missing values through imputation techniques, such as mean, median, or interpolation, depending on the nature of the data. Additionally, feature engineering is performed to derive new features or transform existing ones, with the aim of enhancing the model's predictive power. In preparation for model building, the dataset is split into training and testing sets. This division is crucial to assess the models' generalization performance on unseen data accurately. To maintain a balanced representation of classes in both sets, stratified sampling is employed, mitigating potential biases in the model evaluation process.
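
As a minimal sketch of these steps (assuming the DataFrame df loaded above, with diagnosis as the target), mean imputation followed by a stratified split might look like this:

# Impute a missing biomarker with its mean, then split with stratification
from sklearn.model_selection import train_test_split

df['plasma_CA19_9'] = df['plasma_CA19_9'].fillna(df['plasma_CA19_9'].mean())
X = df.drop(columns=['diagnosis'])
y = df['diagnosis']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2021)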


The workshop explores an array of machine learning classifiers suitable for pancreatic cancer classification, such as Support Vector Classifier (SVC), Logistic Regression, K-Nearest Neighbors, Decision Trees, Random Forests, Gradient Boosting, AdaBoost, Naive Bayes, Extreme Gradient Boosting (XGBoost), Light Gradient Boosting (LightGBM), and Multi-Layer Perceptron (MLP). For each classifier, three different preprocessing techniques are applied to investigate their impact on model performance: raw (unprocessed data), normalization (scaling data to a similar range), and standardization (scaling data to have zero mean and unit variance), as sketched below.
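
Continuing from the split above (and assuming all features have already been encoded numerically, as the full script does with sex), the three variants can be sketched with the same scalers the full script uses, both fit on the training set only:

# Raw data is left untouched; the two scaled variants are derived from it
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train_raw, X_test_raw = X_train, X_test            # raw (unprocessed)
norm = MinMaxScaler().fit(X_train)                   # normalization to [0, 1]
X_train_norm, X_test_norm = norm.transform(X_train), norm.transform(X_test)
stand = StandardScaler().fit(X_train)                # zero mean, unit variance
X_train_stand, X_test_stand = stand.transform(X_train), stand.transform(X_test)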


To optimize the classifiers' hyperparameters and boost their predictive capabilities, GridSearchCV, a technique for hyperparameter tuning, is employed. GridSearchCV conducts an exhaustive search over a specified hyperparameter grid, evaluating different combinations to identify the optimal settings for each model and preprocessing technique.
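
An illustrative GridSearchCV run for one classifier follows; the grid values are examples rather than the book's exact settings:

# Exhaustive search over a small hyperparameter grid with 3-fold CV
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {'n_estimators': [100, 200], 'max_depth': [10, 20]}
grid = GridSearchCV(RandomForestClassifier(random_state=2021),
                    param_grid, cv=3, scoring='accuracy', n_jobs=-1)
grid.fit(X_train_stand, y_train)
print(grid.best_params_)     # best hyperparameter combination
print(grid.best_score_)      # mean cross-validated accuracy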


During the model evaluation phase, multiple performance metrics are utilized to gauge the efficacy of the classifiers. Commonly used metrics include accuracy, recall, precision, and F1-score. By comprehensively assessing these metrics, the strengths and weaknesses of each model are revealed, enabling a deeper understanding of their performance across different classes of pancreatic cancer. Classification reports are generated to present a detailed breakdown of the models' performance, including precision, recall, F1-score, and support for each class. These reports serve as valuable tools for interpreting model outputs and identifying areas for potential improvement.
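
Continuing the sketch, the tuned model's metrics can be computed as shown; average='weighted' accounts for the three diagnosis classes:

# Evaluate the best estimator from the grid search on the held-out test set
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, classification_report)

y_pred = grid.best_estimator_.predict(X_test_stand)
print('accuracy :', accuracy_score(y_test, y_pred))
print('recall   :', recall_score(y_test, y_pred, average='weighted'))
print('precision:', precision_score(y_test, y_pred, average='weighted'))
print('f1       :', f1_score(y_test, y_pred, average='weighted'))
print(classification_report(y_test, y_pred))   # per-class breakdown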


The workshop highlights the significance of graphical user interfaces (GUIs) in facilitating user interactions with machine learning models. By integrating PyQt, a powerful GUI development library for Python, participants create a user-friendly interface that enables users to interact with the models effortlessly. The GUI provides options to select different preprocessing techniques, visualize model outputs such as confusion matrices and decision boundaries, and gain insights into the models' classification capabilities. One of the primary advantages of the graphical user interface is its ability to offer users a seamless and intuitive experience in predicting and classifying pancreatic cancer based on urinary biomarkers. The GUI empowers users to make informed decisions by allowing them to compare the performance of different classifiers under various preprocessing techniques.
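
The book's GUI is far richer than what fits here, but a minimal PyQt5 sketch illustrates the wiring pattern: a combo box selects the preprocessing variant and a button triggers the handler. The widget names and the handler body are illustrative, not the book's actual code.

# Minimal PyQt5 skeleton: preprocessing selector plus a run button
import sys
from PyQt5.QtWidgets import (QApplication, QWidget, QComboBox,
                             QPushButton, QLabel, QVBoxLayout)

class ClassifierGUI(QWidget):
    def __init__(self):
        super().__init__()
        self.setWindowTitle('Pancreatic Cancer Classifier')
        self.combo = QComboBox()
        self.combo.addItems(['Raw', 'Normalization', 'Standardization'])
        self.button = QPushButton('Run Classifier')
        self.result = QLabel('Accuracy: -')
        self.button.clicked.connect(self.on_run)
        layout = QVBoxLayout(self)
        for widget in (self.combo, self.button, self.result):
            layout.addWidget(widget)

    def on_run(self):
        # A real handler would train and evaluate the chosen model here
        self.result.setText('Selected preprocessing: ' + self.combo.currentText())

app = QApplication(sys.argv)
gui = ClassifierGUI()
gui.show()
sys.exit(app.exec_())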


Throughout the workshop, a strong emphasis is placed on the significance of proper data preprocessing, hyperparameter tuning, and robust model evaluation. These crucial steps contribute to building accurate and reliable machine learning models for pancreatic cancer prediction. By the culmination of the workshop, participants have gained valuable hands-on experience in data exploration, machine learning model building, hyperparameter tuning, and GUI development, all geared towards addressing the specific challenge of pancreatic cancer classification and prediction.


In conclusion, the Applied Data Science Workshop on "Urinary Biomarkers-Based Pancreatic Cancer Classification and Prediction Using Machine Learning with Python GUI" embarks on a comprehensive and transformative journey, bringing together data exploration, preprocessing, machine learning model selection, hyperparameter tuning, model evaluation, and GUI development. The project's focus on pancreatic cancer prediction using urinary biomarkers aligns with the pressing need for early detection and treatment of this deadly disease. As participants delve into the intricacies of machine learning and medical research, they contribute to the broader scientific community's ongoing efforts to combat cancer and improve patient outcomes. Through the integration of data science methodologies and powerful visualization tools, the workshop exemplifies the potential of machine learning in revolutionizing medical diagnostics and healthcare practices.

#pancreatic.py
import numpy as np 
import pandas as pd 
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')
import os
import plotly.graph_objs as go
import joblib
import itertools
from sklearn.metrics import roc_auc_score,roc_curve
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV,StratifiedKFold
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, precision_score
from sklearn.metrics import classification_report, f1_score  # plot_confusion_matrix was removed in scikit-learn 1.2 and is unused below
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import learning_curve
from mlxtend.plotting import plot_decision_regions

#Reads dataset
curr_path = os.getcwd() 
df = pd.read_csv(curr_path+"/Debernardi et al 2020 data.csv")
print(df.iloc[:,0:8].head().to_string())
print(df.iloc[:,8:14].head().to_string())

#Checks shape
print(df.shape)

#Reads columns
print("Data Columns --> ",df.columns)

#Checks dataset information
print(df.info())

#Drops irrelevant columns
df = df.drop(columns=['sample_id','patient_cohort','sample_origin','stage','benign_sample_diagnosis'])

#Checks null values
print(df.isnull().sum())
print('Total number of null values: ', df.isnull().sum().sum())

#Imputes missing values in plasma_CA19_9 with mean
df['plasma_CA19_9'].fillna((df['plasma_CA19_9'].mean()), inplace=True)

#Imputes missing value in REG1A with mean
df['REG1A'].fillna((df['REG1A'].mean()), inplace=True)

#Checks null values
print(df.isnull().sum())
print('Total number of null values: ', df.isnull().sum().sum())

#Looks at statistical description of data
print(df.describe().iloc[:,0:5].to_string())
print(df.describe().iloc[:,5:10].to_string())

#Defines function to create pie chart and bar plot as subplots
def plot_piechart(df, var, title=''):
    plt.figure(figsize=(25, 10))
    plt.subplot(121)
    label_list = list(df[var].value_counts().index)
    colors = sns.color_palette("husl", len(label_list))
    df[var].value_counts().plot.pie(autopct="%1.1f%%", \
         colors=colors, \
         startangle=60, labels=label_list, \
         wedgeprops={"linewidth": 3, "edgecolor": "k"}, \
         shadow=True, textprops={'fontsize': 20})
    plt.title("Distribution of " + var + " variable " + title, fontsize=25)

    value_counts = df[var].value_counts()
    # Print percentage values
    percentages = value_counts / len(df) * 100
    print("Percentage values:")
    print(percentages)

    plt.subplot(122)
    ax = df[var].value_counts().plot(kind="barh")

    for i, j in enumerate(df[var].value_counts().values):
        ax.text(.7, i, j, weight="bold", fontsize=20)

    plt.title("Count of " + var + " cases " + title, fontsize=25)
    # Print count values
    print("Count values:")
    print(value_counts)
    plt.show()

plot_piechart(df,'diagnosis')

# Looks at distribution of all features in the whole original dataset
columns = list(df.select_dtypes(include='number').columns)  # 'sex' is still a string column here, so keep numeric features only
columns.remove('diagnosis')
plt.subplots(figsize=(45, 50))
length = len(columns)
color_palette = sns.color_palette("Set3", n_colors=length)  # Define color palette

for i, j in itertools.zip_longest(columns, range(length)):
    plt.subplot((length // 2), 4, j + 1)
    plt.subplots_adjust(wspace=0.2, hspace=0.5)
    ax = df[i].hist(bins=10, edgecolor='black', color=color_palette[j])  # Set color for each histogram
    for p in ax.patches:
        ax.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center',
                    va='center', xytext=(0, 10), weight="bold", fontsize=17, textcoords='offset points')

    plt.title(i, fontsize=30)  # Adjust title font size
plt.show()

from tabulate import tabulate
def another_versus_diagnosis(feat, num_bins):
    fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(30, 22))
    plt.subplots_adjust(wspace=0.5, hspace=0.25)
    
    colors = sns.color_palette("Set2")
    diagnosis_labels = {1: 'Control (No Pancreatic Disease)',
                        2: 'Benign Hepatobiliary Disease',
                        3: 'Pancreatic Cancer'}
    
    data = {}
    
    for diagnosis_code, ax in zip([1, 2, 3], axes):
        subset_data = df[df['diagnosis'] == diagnosis_code][feat]
        subset_data.plot(ax=ax, kind='hist', bins=num_bins, edgecolor='black', color=colors[diagnosis_code-1])
        
        ax.set_title(diagnosis_labels[diagnosis_code], fontsize=30)
        ax.set_xlabel(feat, fontsize=30)
        ax.set_ylabel('Count', fontsize=30)
        
        patch_data = []
        for p in ax.patches:
            x = p.get_x() + p.get_width() / 2.
            y = p.get_height()
            ax.annotate(format(y, '.0f'), (x, y), ha='center', va='center', xytext=(0, 10),
                         weight="bold", fontsize=25, textcoords='offset points')
            patch_data.append([x, y])
        
        data[diagnosis_labels[diagnosis_code]] = patch_data
    
    plt.show()

    for diagnosis_label, patch_data in data.items():
        print(diagnosis_label + ":")
        print(tabulate(patch_data, headers=[feat, diagnosis_label]))
        print()
    
#Looks at plasma_CA19_9 feature distribution by diagnosis feature
another_versus_diagnosis("plasma_CA19_9", 10)

#Looks at creatinine feature distribution by diagnosis feature
another_versus_diagnosis("creatinine", 10)

#Looks at LYVE1 feature distribution by diagnosis feature
another_versus_diagnosis("LYVE1", 10)

#Looks at REG1B feature distribution by diagnosis feature
another_versus_diagnosis("REG1B", 10)

#Looks at TFF1 feature distribution by diagnosis feature
another_versus_diagnosis("TFF1", 10)

#Looks at REG1A feature distribution by diagnosis feature
another_versus_diagnosis("REG1A", 10)

#Creates a dummy dataframe for visualization
df_dummy=df.copy()

#Categorizes diagnosis feature
def cat_diagnosis(n):
    if n == 1:
        return 'Control (No Pancreatic Disease)'
    if n == 2:
        return 'Benign Hepatobiliary Disease'    
    else:
        return 'Pancreatic Cancer'
    
df_dummy['diagnosis'] = df_dummy['diagnosis'].apply(lambda x: cat_diagnosis(x))

def put_label_stacked_bar(ax,fontsize):
    #patches is everything inside of the chart
    for rect in ax.patches:
        # Find where everything is located
        height = rect.get_height()
        width = rect.get_width()
        x = rect.get_x()
        y = rect.get_y()
    
        # The height of the bar is the data value and can be used as the label
        label_text = f'{height:.0f}'  
    
        # ax.text(x, y, text)
        label_x = x + width / 2
        label_y = y + height / 2

        # plots only when height is greater than specified value
        if height > 0:
            ax.text(label_x, label_y, label_text, \
                ha='center', va='center', \
                weight = "bold",fontsize=fontsize)
    
#Plots one variable against another variable
def dist_one_vs_another_plot(df, cat1, cat2):
    fig = plt.figure(figsize=(25, 15))
    ax1 = fig.add_subplot(111)
    group_by_stat = df.groupby([cat1, cat2]).size()
    stacked_data = group_by_stat.unstack()
    group_by_stat.unstack().plot(kind='bar', stacked=True, ax=ax1, grid=True)
    ax1.set_title('Stacked Bar Plot of ' + cat1 + ' (number of cases)', fontsize=30)
    ax1.set_ylabel('Number of Cases', fontsize=20)
    ax1.set_xlabel(cat1, fontsize=20)
    put_label_stacked_bar(ax1,15)
    plt.show()

    # Group values by cat2
    sentiment_groups = stacked_data.groupby(level=0, axis=0)

    # Create table headers
    headers = [cat2 for cat2 in stacked_data.columns]

    # Create table rows with data
    rows = []
    for cat, group_data in sentiment_groups:
        row_values = [str(val) for val in group_data.values.flatten()]
        rows.append([cat] + row_values)

    # Print the table
    print(tabulate(rows, headers=headers, tablefmt='grid'))

#Categorizes age feature
labels = ['0-40', '40-50', '50-60','60-90']
df_dummy['age'] = pd.cut(df_dummy['age'], [0, 40, 50, 60, 90], labels=labels)

#Plots the distribution of age feature in pie chart and bar plot
plot_piechart(df_dummy,'age')

#Plots diagnosis variable against age variable in stacked bar plots
dist_one_vs_another_plot(df_dummy,'age', 'diagnosis')

#Plots the distribution of sex feature in pie chart and bar plot
plot_piechart(df_dummy,'sex')

#Plots diagnosis variable against sex variable in stacked bar plots
dist_one_vs_another_plot(df_dummy,'sex', 'diagnosis')

#Categorizes plasma_CA19_9 feature
labels = ['0-100', '100-1000', '1000-10000','10000-35000']
df_dummy['plasma_CA19_9'] = pd.cut(df_dummy['plasma_CA19_9'], [0, 100, 1000, 10000, 35000], labels=labels)

#Plots the distribution of plasma_CA19_9 feature in pie chart and bar plot
plot_piechart(df_dummy,'plasma_CA19_9')

#Plots diagnosis variable against plasma_CA19_9 variable in stacked bar plots
dist_one_vs_another_plot(df_dummy,'plasma_CA19_9', 'diagnosis')

#Categorizes creatinine feature
labels = ['0-0.5', '0.5-1', '1-2','2-5']
df_dummy['creatinine'] = pd.cut(df_dummy['creatinine'], [0, 0.5, 1, 2, 5], labels=labels)

#Plots the distribution of creatinine feature in pie chart and bar plot
plot_piechart(df_dummy,'creatinine')

#Plots diagnosis variable against creatinine variable in stacked bar plots
dist_one_vs_another_plot(df_dummy,'creatinine', 'diagnosis')

#Checks dataset information
print(df_dummy.info())

#Extracts categorical and numerical columns
cat_cols = [col for col in df_dummy.columns if (df_dummy[col].dtype == 'object' or df_dummy[col].dtype.name == 'category')]
num_cols = [col for col in df_dummy.columns if (df_dummy[col].dtype != 'object' and df_dummy[col].dtype.name != 'category')]

print(cat_cols)
print(num_cols)

#Checks numerical features density distribution
# Define a custom color palette
colors = sns.color_palette("husl", len(num_cols))

# Checks numerical features density distribution
fig = plt.figure(figsize=(30, 20))
plotnumber = 1

for i, column in enumerate(num_cols):
    if plotnumber <= 6:
        ax = plt.subplot(2, 2, plotnumber)
        sns.histplot(df_dummy[column], kde=True, stat='density', color=colors[i])  # seaborn removed distplot; histplot(kde=True) is its replacement
        plt.xlabel(column, fontsize=40)
        for p in ax.patches:
            ax.annotate(format(p.get_height(), '.2f'), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 10), weight="bold", fontsize=30, textcoords='offset points')
    plotnumber += 1

fig.suptitle('The density of numerical features', fontsize=50)
plt.tight_layout()
plt.show()

#Checks categorical features distribution
fig=plt.figure(figsize = (35, 25))
plotnumber = 1
for column in cat_cols:
    if plotnumber <= 6:
        ax = plt.subplot(2, 3, plotnumber)
        sns.countplot(x=column, data=df_dummy, palette='Spectral_r')  # seaborn >= 0.13 requires keyword arguments here
        plt.xlabel(column,fontsize=40)
        for p in ax.patches:
            ax.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), ha = 'center', va = 'center', xytext = (0, 10), weight = "bold",fontsize=30, textcoords = 'offset points')

    plotnumber += 1
fig.suptitle('The distribution of categorical features', fontsize=50)
plt.tight_layout()
plt.show()

def plot_four_versus_one(df, column_names, feat):
    num_plots = len(column_names)
    num_rows = num_plots // 2 + num_plots % 2
    fig, ax = plt.subplots(num_rows, 2, figsize=(20, 13), facecolor='#fbe7dd')

    for i, column in enumerate(column_names):
        current_ax = ax[i // 2, i % 2]
        g = sns.countplot(x=column, hue=feat, data=df, palette='Spectral_r', ax=current_ax)
        
        for p in g.patches:
            g.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 10), weight="bold", fontsize=20, textcoords='offset points')
        
        current_ax.set_xlabel(column, fontsize=20)
        current_ax.set_ylabel("Count", fontsize=20)
        current_ax.tick_params(axis='x', labelsize=15)
        current_ax.tick_params(axis='y', labelsize=15)
        
    plt.tight_layout()
    plt.show()
    
#Plots distribution of number of cases of four categorical features versus diagnosis
column_names = ["age", "sex", "plasma_CA19_9", "creatinine"]
plot_four_versus_one(df_dummy, column_names, "diagnosis")


#Plots distribution of number of cases of four categorical features versus creatinine
column_names = ["age", "sex", "plasma_CA19_9", "diagnosis"]
plot_four_versus_one(df_dummy, column_names, "creatinine")

#Plots distribution of number of cases of four categorical features versus age
column_names = ["creatinine", "sex", "plasma_CA19_9", "diagnosis"]
plot_four_versus_one(df_dummy, column_names, "age")

#Plots distribution of number of cases of four categorical features versus sex
column_names = ["creatinine", "age", "plasma_CA19_9", "diagnosis"]
plot_four_versus_one(df_dummy, column_names, "sex")

#Plots distribution of number of cases of four categorical features versus plasma_CA19_9
column_names = ["creatinine", "age", "sex", "diagnosis"]
plot_four_versus_one(df_dummy, column_names, "plasma_CA19_9")

  
#Plots distribution of age and sex versus diagnosis in pie chart
def plot_piechart_diagnosis(df, feat1, feat2):
    gs0 = df[df.diagnosis == 'Control (No Pancreatic Disease)'][feat1].value_counts()
    gs1 = df[df.diagnosis == 'Benign Hepatobiliary Disease'][feat1].value_counts()
    gs2 = df[df.diagnosis == 'Pancreatic Cancer'][feat1].value_counts()
    ss0 = df[df.diagnosis == 'Control (No Pancreatic Disease)'][feat2].value_counts()
    ss1 = df[df.diagnosis == 'Benign Hepatobiliary Disease'][feat2].value_counts()
    ss2 = df[df.diagnosis == 'Pancreatic Cancer'][feat2].value_counts()

    label_gs0=list(gs0.index)
    label_gs1=list(gs1.index)
    label_gs2=list(gs2.index)
    label_ss0=list(ss0.index)
    label_ss1=list(ss1.index)
    label_ss2=list(ss2.index)

    fig, ax = plt.subplots(2, 3, figsize=(35, 20), facecolor='#fbe7dd')

    def print_percentage_table(data, labels, title):
        percentages = [f'{(value / sum(data)) * 100:.1f}%' for value in data]
        table_data = list(zip(labels, percentages))
        headers = [feat1, 'Percentage']
        print(f"\n{title}:")
        print(tabulate(table_data, headers=headers, tablefmt='grid'))

    def plot_pie(ax, data, labels, title):
        ax.pie(data, labels=labels, shadow=True, autopct='%1.1f%%', textprops={'fontsize': 32})
        ax.set_xlabel(title, fontsize=30)

    plot_pie(ax[0, 0], gs0, label_gs0, f"{feat1} feature")
    print_percentage_table(gs0, label_gs0, 'diagnosis = Control (No Pancreatic Disease)')

    plot_pie(ax[0, 1], gs1, label_gs1, f"{feat1} feature")
    print_percentage_table(gs1, label_gs1, 'diagnosis = Benign Hepatobiliary Disease')

    plot_pie(ax[0, 2], gs2, label_gs2, f"{feat1} feature")
    print_percentage_table(gs2, label_gs2, 'diagnosis = Pancreatic Cancer')
    
    plot_pie(ax[1, 0], ss0, label_ss0, f"{feat2} feature")
    print_percentage_table(ss0, label_ss0, 'diagnosis = Control (No Pancreatic Disease)')

    plot_pie(ax[1, 1], ss1, label_ss1, f"{feat2} feature")
    print_percentage_table(ss1, label_ss1, 'diagnosis = Benign Hepatobiliary Disease')

    plot_pie(ax[1, 2], ss2, label_ss2, f"{feat2} feature")
    print_percentage_table(ss2, label_ss2, 'diagnosis = Pancreatic Cancer')
    
    ax[0][0].set_title('diagnosis = Control (No Pancreatic Disease)',fontsize= 30)
    ax[0][1].set_title('diagnosis = Benign Hepatobiliary Disease',fontsize= 30)
    ax[0][2].set_title('diagnosis = Pancreatic Cancer',fontsize= 30)
    plt.tight_layout()
    plt.show()

#Plots distribution of age and sex versus diagnosis in pie chart  
plot_piechart_diagnosis(df_dummy, "age", "sex")

#Plots distribution of plasma_CA19_9 and creatinine versus diagnosis in pie chart
plot_piechart_diagnosis(df_dummy, "plasma_CA19_9", "sex")

plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10,5))
gs = fig.add_gridspec(2, 2)
gs.update(wspace=0.15, hspace=0.25)

background_color = "#fbe7dd"
sns.set_palette(['#ff355d','#ffd514'])

def feat_versus_other(feat,another,legend,ax0,label):
    for s in ["right", "top"]:
        ax0.spines[s].set_visible(False)

    ax0.set_facecolor(background_color)
    ax0_sns = sns.histplot(data=df, x=feat,ax=ax0,zorder=2,kde=False,hue=another,multiple="stack", shrink=.8
                      ,linewidth=0.3,alpha=1)

    put_label_stacked_bar(ax0_sns,5)
    ax0_sns.set_xlabel('',fontsize=4, weight='bold')
    ax0_sns.set_ylabel('',fontsize=4, weight='bold')

    ax0_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    ax0_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)

    ax0_sns.tick_params(labelsize=3, width=0.5, length=1.5)
    ax0_sns.legend(legend, ncol=2, facecolor='#D8D8D8', edgecolor=background_color, fontsize=3, bbox_to_anchor=(1, 0.989), loc='upper right')
    ax0.set_facecolor(background_color)
    ax0_sns.set_xlabel(label)
    plt.tight_layout()

def prob_feat_versus_other(feat,another,legend,ax0,label):
    for s in ["right", "top"]:
        ax0.spines[s].set_visible(False)

    ax0.set_facecolor(background_color)
    ax0_sns = sns.kdeplot(x=feat,ax=ax0,hue=another,linewidth=0.3,fill=True,cbar='g',zorder=2,alpha=1,multiple='stack')

    ax0_sns.set_xlabel('',fontsize=4, weight='bold')
    ax0_sns.set_ylabel('',fontsize=4, weight='bold')

    ax0_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    ax0_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)

    ax0_sns.tick_params(labelsize=3, width=0.5, length=1.5)
    ax0_sns.legend(legend, ncol=2, facecolor='#D8D8D8', edgecolor=background_color, fontsize=3, bbox_to_anchor=(1, 0.989), loc='upper right')
    ax0.set_facecolor(background_color)
    ax0_sns.set_xlabel(label)
    plt.tight_layout()
    
label_diag = list(df_dummy["diagnosis"].value_counts().index)
label_age = list(df_dummy["age"].value_counts().index)
label_plas = list(df_dummy["plasma_CA19_9"].value_counts().index)
label_sex = list(df_dummy["sex"].value_counts().index)    
    
def hist_feat_versus_four_cat(feat,label):
    ax0 = fig.add_subplot(gs[0, 0])
    feat_versus_other(feat,df_dummy["diagnosis"],label_diag,ax0,"diagnosis versus " + label)

    ax1 = fig.add_subplot(gs[0, 1])
    feat_versus_other(feat,df_dummy["age"],label_age,ax1,"age versus " + label)

    ax2 = fig.add_subplot(gs[1, 0])
    feat_versus_other(feat,df_dummy["plasma_CA19_9"],label_plas,ax2,"plasma_CA19_9 versus " + label)

    ax3 = fig.add_subplot(gs[1, 1])
    feat_versus_other(feat,df_dummy["creatinine"],label_sex,ax3,"sex versus " + label)

def prob_feat_versus_four_cat(feat,label):
    ax0 = fig.add_subplot(gs[0, 0])
    prob_feat_versus_other(feat,df_dummy["diagnosis"],label_diag,ax0,"diagnosis versus " + label)

    ax1 = fig.add_subplot(gs[0, 1])
    prob_feat_versus_other(feat,df_dummy["age"],label_age,ax1,"age versus " + label)

    ax2 = fig.add_subplot(gs[1, 0])
    prob_feat_versus_other(feat,df_dummy["plasma_CA19_9"],label_plas,ax2,"plasma_CA19_9 versus " + label)

    ax3 = fig.add_subplot(gs[1, 1])
    prob_feat_versus_other(feat,df_dummy["creatinine"],label_sex,ax3,"sex versus " + label)    
    

#hist_feat_versus_four_cat(df_dummy["LYVE1"],"LYVE1") 
prob_feat_versus_four_cat(df_dummy["LYVE1"],"LYVE1")   

hist_feat_versus_four_cat(df_dummy["REG1B"],"REG1B") 
prob_feat_versus_four_cat(df_dummy["REG1B"],"REG1B")   

hist_feat_versus_four_cat(df_dummy["TFF1"],"TFF1") 
prob_feat_versus_four_cat(df_dummy["TFF1"],"TFF1")   

#hist_feat_versus_four_cat(df_dummy["REG1A"],"REG1A") 
prob_feat_versus_four_cat(df_dummy["REG1A"],"REG1A")     

#Converts sex feature to {0,1}
def map_sex(n):
    if n == "F":
        return 0
    
    else:
        return 1   
df['sex'] = df['sex'].apply(lambda x: map_sex(x))

#Converts diagnosis feature to {0,1,2}
def map_diagnosis(n):
    if n == 1:
        return 0
    if n == 2:
        return 1    
    else:
        return 2   
df['diagnosis'] = df['diagnosis'].apply(lambda x: map_diagnosis(x))

#Extracts output and input variables
y = df['diagnosis'].values # Target for the model
X = df.drop(['diagnosis'], axis = 1)

#Feature Importance using RandomForest Classifier
names = X.columns
rf = RandomForestClassifier()
rf.fit(X, y)

result_rf = pd.DataFrame()
result_rf['Features'] = X.columns
result_rf ['Values'] = rf.feature_importances_
result_rf.sort_values('Values', inplace = True, ascending = False)

plt.figure(figsize=(25,25))
sns.set_color_codes("pastel")
sns.barplot(x = 'Values',y = 'Features', data=result_rf, color="Blue")
plt.xlabel('Feature Importance',  fontsize=30) 
plt.ylabel('Feature Labels',  fontsize=30) 
plt.tick_params(axis='x', labelsize=20)
plt.tick_params(axis='y', labelsize=20)
plt.show()

# Print the feature importance table
print("Feature Importance:")
print(result_rf)       

#Feature Importance using ExtraTreesClassifier   
model = ExtraTreesClassifier()
model.fit(X, y)

result_et = pd.DataFrame()
result_et['Features'] = X.columns
result_et ['Values'] = model.feature_importances_
result_et.sort_values('Values', inplace=True, ascending =False)

plt.figure(figsize=(25,25))
sns.set_color_codes("pastel")
sns.barplot(x = 'Values',y = 'Features', data=result_et, color="red")
plt.xlabel('Feature Importance',  fontsize=30) 
plt.ylabel('Feature Labels',  fontsize=30) 
plt.tick_params(axis='x', labelsize=20)
plt.tick_params(axis='y', labelsize=20)
plt.show()   

# Print the feature importance table
print("Feature Importance:")
print(result_et)    

#Feature Importance using RFE      
from sklearn.feature_selection import RFE
model = LogisticRegression()
# create the RFE model
rfe = RFE(model)
rfe = rfe.fit(X, y)

result_lg = pd.DataFrame()
result_lg['Features'] = X.columns
result_lg ['Ranking'] = rfe.ranking_
result_lg.sort_values('Ranking', inplace=True , ascending = False)

plt.figure(figsize=(25,25))
sns.set_color_codes("pastel")
sns.barplot(x = 'Ranking',y = 'Features', data=result_lg, color="orange")
plt.ylabel('Feature Labels',  fontsize=30) 
plt.tick_params(axis='x', labelsize=20)
plt.tick_params(axis='y', labelsize=20)
plt.show()  

print("Feature Ranking:")
print(result_lg)    

#Splits the data into training and testing sets, then balances only the
#training set with SMOTE (oversampling before the split would leak synthetic
#samples into the test set and inflate the evaluation scores)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 2021, stratify=y)

sm = SMOTE(random_state=42)
X_train, y_train = sm.fit_resample(X_train, y_train)

X_train_raw = X_train.copy()
X_test_raw = X_test.copy()
y_train_raw = y_train.copy()
y_test_raw = y_test.copy()

X_train_norm = X_train.copy()
X_test_norm = X_test.copy()
y_train_norm = y_train.copy()
y_test_norm = y_test.copy()
norm = MinMaxScaler()
X_train_norm = norm.fit_transform(X_train_norm)
X_test_norm = norm.transform(X_test_norm)

X_train_stand = X_train.copy()
X_test_stand = X_test.copy()
y_train_stand = y_train.copy()
y_test_stand = y_test.copy()
scaler = StandardScaler()
X_train_stand = scaler.fit_transform(X_train_stand)
X_test_stand = scaler.transform(X_test_stand)

def plot_learning_curve(estimator, title, X, y, axes=None, ylim=None, cv=None,
                        n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):
    if axes is None:
        _, axes = plt.subplots(3, 1, figsize=(50, 50))

    axes[0].set_title(title)
    if ylim is not None:
        axes[0].set_ylim(*ylim)
    axes[0].set_xlabel("Training examples")
    axes[0].set_ylabel("Score")

    train_sizes, train_scores, test_scores, fit_times, _ = \
        learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs,
                       train_sizes=train_sizes,
                       return_times=True)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    fit_times_mean = np.mean(fit_times, axis=1)
    fit_times_std = np.std(fit_times, axis=1)

    # Plot learning curve
    axes[0].grid()
    axes[0].fill_between(train_sizes, train_scores_mean - train_scores_std,
                         train_scores_mean + train_scores_std, alpha=0.1,
                         color="r")
    axes[0].fill_between(train_sizes, test_scores_mean - test_scores_std,
                         test_scores_mean + test_scores_std, alpha=0.1,
                         color="g")
    axes[0].plot(train_sizes, train_scores_mean, 'o-', color="r",
                 label="Training score", lw=10)
    axes[0].plot(train_sizes, test_scores_mean, 'o-', color="g",
                 label="Cross-validation score", lw=10)
    axes[0].legend(loc="best")
    axes[0].set_title('Learning Curve', fontsize=50)
    axes[0].set_xlabel('Training Examples', fontsize=40)
    axes[0].set_ylabel('Score', fontsize=40)
    axes[0].tick_params(labelsize=30)

    # Plot n_samples vs fit_times
    axes[1].grid()
    axes[1].plot(train_sizes, fit_times_mean, 'o-', lw=10)
    axes[1].fill_between(train_sizes, fit_times_mean - fit_times_std,
                         fit_times_mean + fit_times_std, alpha=0.1)
    axes[1].set_xlabel("Training examples", fontsize=40)
    axes[1].set_ylabel("fit_times", fontsize=40)
    axes[1].set_title("Scalability of the model", fontsize=50)
    axes[1].tick_params(labelsize=30)
    
    # Plot fit_time vs score
    axes[2].grid()
    axes[2].plot(fit_times_mean, test_scores_mean, 'o-', lw=10)
    axes[2].fill_between(fit_times_mean, test_scores_mean - test_scores_std,
                         test_scores_mean + test_scores_std, alpha=0.1)
    axes[2].set_xlabel("fit_times", fontsize=40)
    axes[2].set_ylabel("Score", fontsize=40)
    axes[2].set_title("Performance of the model", fontsize=50)

    return plt  

def plot_real_pred_val(Y_test, ypred, name):
    plt.figure(figsize=(20,12))
    acc=accuracy_score(Y_test,ypred)
    plt.scatter(range(len(ypred)),ypred,color="blue",lw=5,label="Predicted")
    plt.scatter(range(len(Y_test)), Y_test,color="red",label="Actual")
    plt.title("Predicted Values vs True Values of " + name, fontsize=30)
    plt.xlabel("Accuracy: " + str(round((acc*100),3)) + "%", fontsize=30)
    plt.legend()
    plt.grid(True, alpha=0.75, lw=1, ls='-.')
    plt.show()

def plot_cm(Y_test, ypred, name):
    fig, ax = plt.subplots(figsize=(25, 15))
    cm = confusion_matrix(Y_test, ypred)
    sns.heatmap(cm, annot=True, linewidth=0.7, linecolor='red', fmt='g', cmap="YlOrBr", annot_kws={"size": 30})
    plt.title(name + ' Confusion Matrix', fontsize=30)
    ax.xaxis.set_ticklabels(['Control (No Pancreatic Disease)', 'Benign Hepatobiliary Disease', 'Pancreatic Cancer'], fontsize=20);   
    ax.yaxis.set_ticklabels(['Control (No Pancreatic Disease)', 'Benign Hepatobiliary Disease', 'Pancreatic Cancer'], fontsize=20); 
    plt.xlabel('Y predict', fontsize=30)
    plt.ylabel('Y test', fontsize=30)
    plt.show()
    return cm
    
#Plots ROC
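#Note: as written, this assumes a binary target; for the three-class diagnosis
#problem, a one-vs-rest ROC curve per class would be needed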
def plot_roc(model,X_test, y_test, title):
    Y_pred_prob = model.predict_proba(X_test)
    Y_pred_prob = Y_pred_prob[:, 1]

    fpr, tpr, thresholds = roc_curve(y_test, Y_pred_prob)
    plt.figure(figsize=(25,15))
    plt.plot([0,1],[0,1], color='navy', lw=10, linestyle='--')
    plt.plot(fpr,tpr, color='red', lw=10)
    plt.xlabel('False Positive Rate', fontsize=30)
    plt.ylabel('True Positive Rate', fontsize=30)
    plt.title('ROC Curve of ' + title, fontsize=30)
    plt.grid(True)
    plt.show()
    
def plot_decision_boundary(model,xtest, ytest, name):
    plt.figure(figsize=(25, 15))     
    #Refits the model on just the two selected features of the test set (for visualization only)
    model.fit(xtest, ytest)

    plot_decision_regions(xtest.values, ytest.ravel(), \
        clf=model, legend=2)
    plt.title("Decision boundary for " + name + " (Test)", fontsize=30)
    plt.xlabel("creatinine", fontsize=25)
    plt.ylabel("LYVE1", fontsize=25)
    plt.legend(fontsize=25)    
    plt.show()        

#Chooses two features for decision boundary
feat_boundary = ['creatinine','LYVE1']
X_feature = X[feat_boundary]
X_train_feat, X_test_feat, y_train_feat, y_test_feat = train_test_split(X_feature, y, test_size = 0.2, random_state = 2021, stratify=y)  
    
def train_model(model, X, y):
    model.fit(X, y)
    return model

def predict_model(model, X, proba=False):
    if not proba:  # 'if ~proba' was a bug: ~False == -1 is truthy, so the else branch was unreachable
        y_pred = model.predict(X)
    else:
        y_pred_proba = model.predict_proba(X)
        y_pred = np.argmax(y_pred_proba, axis=1)

    return y_pred

list_scores = []

def run_model(name, model, X_train, X_test, y_train, y_test, fc, proba=False):
    print(name)
    print(fc)
    
    model = train_model(model, X_train, y_train)
    y_pred = predict_model(model, X_test, proba)
    
    accuracy = accuracy_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred, average='weighted')
    precision = precision_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')
    
    print('accuracy: ', accuracy)
    print('recall: ',recall)
    print('precision: ', precision)
    print('f1: ', f1)
    print(classification_report(y_test, y_pred))
        
    plot_cm(y_test, y_pred, name)
    plot_real_pred_val(y_test, y_pred, name)
    plot_decision_boundary(model,X_test_feat, y_test_feat, name)
    plot_learning_curve(model, name, X_train, y_train, cv=3);    
    plt.show()
    
    list_scores.append({'Model Name': name, 'Feature Scaling':fc, 'Accuracy': accuracy, 'Recall': recall, 'Precision': precision, 'F1':f1})

feature_scaling = {
    #'Raw':(X_train_raw, X_test_raw, y_train_raw, y_test_raw),
    #'Normalization':(X_train_norm, X_test_norm, y_train_norm, y_test_norm),
    'Standardization':(X_train_stand, X_test_stand, y_train_stand, y_test_stand),
}

#Support Vector Classifier
# Define the parameter grid for the Grid Search
param_grid = {
    'C': [0.1, 1, 10],        # Regularization parameter
    'kernel': ['linear', 'rbf'],  # Kernel type
}

# Create the SVC model with probability=True
model_svc = SVC(random_state=2021, probability=True)

# Perform Grid Search for each feature scaling method
for fc_name, value in feature_scaling.items():
    X_train, X_test, y_train, y_test = value
    
    # Initialize GridSearchCV
    grid_search = GridSearchCV(estimator=model_svc, param_grid=param_grid, cv=3, scoring='accuracy', n_jobs=-1)
    
    # Perform Grid Search and fit the model
    grid_search.fit(X_train, y_train)
    
    # Get the best parameters and best model from the Grid Search
    best_params = grid_search.best_params_
    best_model = grid_search.best_estimator_
    
    # Evaluate the best model
    run_model('SVC with ' + fc_name, best_model, X_train, X_test, y_train, y_test, fc_name, proba=True)

    # Print the best hyperparameters found
    print(f"Best Hyperparameters for {fc_name}:")
    print(grid_search.best_params_)
   
#Logistic Regression Classifier
# Define the parameter grid for the grid search
param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l1', 'l2'],
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'saga'],
}

# Initialize the Logistic Regression model
logreg = LogisticRegression(max_iter=5000, random_state=2021)

# Perform the grid search for each feature scaling method
for fc_name, value in feature_scaling.items():
    X_train, X_test, y_train, y_test = value
    
    # Create GridSearchCV with the Logistic Regression model and the parameter grid
    grid_search = GridSearchCV(logreg, param_grid, cv=3, scoring='accuracy', n_jobs=-1)
    
    # Train and perform grid search
    grid_search.fit(X_train, y_train)
    
    # Get the best Logistic Regression model from the grid search
    best_model = grid_search.best_estimator_
    
    # Evaluate and plot the best model (setting proba=True for probability prediction)
    run_model('Logistic Regression', best_model, X_train, X_test, y_train, y_test, fc_name, proba=True)
    
    # Print the best hyperparameters found
    print(f"Best Hyperparameters for {fc_name}:")
    print(grid_search.best_params_)   

#KNN Classifier
# Define the parameter grid for the grid search
param_grid = {
    'n_neighbors': list(range(2, 10))
}

# KNN Classifier Grid Search
for fc_name, value in feature_scaling.items():
    X_train, X_test, y_train, y_test = value
    
    # Initialize the KNN Classifier
    knn = KNeighborsClassifier()
    
    # Create GridSearchCV with the KNN model and the parameter grid
    grid_search = GridSearchCV(knn, param_grid, cv=3, scoring='accuracy', n_jobs=-1)
    
    # Train and perform grid search
    grid_search.fit(X_train, y_train)
    
    # Get the best KNN model from the grid search
    best_model = grid_search.best_estimator_
    
    # Evaluate and plot the best model (setting proba=True for probability prediction)
    run_model(f'KNeighbors Classifier n_neighbors = {grid_search.best_params_["n_neighbors"]}',
              best_model, X_train, X_test, y_train, y_test, fc_name, proba=True)
    
    # Print the best hyperparameters found
    print(f"Best Hyperparameters for {fc_name}:")
    print(grid_search.best_params_)

#Decision Tree Classifier      
for fc_name, value in feature_scaling.items():
    X_train, X_test, y_train, y_test = value
    
    # Initialize the DecisionTreeClassifier model
    dt_clf = DecisionTreeClassifier(random_state=2021)
    
    # Define the parameter grid for the grid search
    param_grid = {
        'max_depth': np.arange(1, 51, 1),
        'criterion': ['gini', 'entropy'],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4],
    }
    
    # Create GridSearchCV with the DecisionTreeClassifier model and the parameter grid
    grid_search = GridSearchCV(dt_clf, param_grid, cv=3, scoring='accuracy', n_jobs=-1)
    
    # Train and perform grid search
    grid_search.fit(X_train, y_train)
    
    # Get the best DecisionTreeClassifier model from the grid search
    best_model = grid_search.best_estimator_
    
    # Evaluate and plot the best model (setting proba=True for probability prediction)
    run_model(f'DecisionTree Classifier (Best Depth: {grid_search.best_params_["max_depth"]})',
              best_model, X_train, X_test, y_train, y_test, fc_name, proba=True)
    
    # Print the best hyperparameters found
    print(f"Best Hyperparameters for {fc_name}:")
    print(grid_search.best_params_) 

#Random Forest Classifier    
# Define the parameter grid for the grid search
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30, 40, 50],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize the RandomForestClassifier model
rf = RandomForestClassifier(random_state=2021)

# RandomForestClassifier Grid Search
for fc_name, value in feature_scaling.items():
    X_train, X_test, y_train, y_test = value
    
    # Create GridSearchCV with the RandomForestClassifier model and the parameter grid
    grid_search = GridSearchCV(rf, param_grid, cv=3, scoring='accuracy', n_jobs=-1)
    
    # Train and perform grid search
    grid_search.fit(X_train, y_train)
    
    # Get the best RandomForestClassifier model from the grid search
    best_model = grid_search.best_estimator_
    
    # Evaluate and plot the best model (setting proba=True for probability prediction)
    run_model(f'RandomForest Classifier (Best Estimators: {grid_search.best_params_["n_estimators"]})',
              best_model, X_train, X_test, y_train, y_test, fc_name, proba=True)
    
    # Print the best hyperparameters found
    print(f"Best Hyperparameters for {fc_name}:")
    print(grid_search.best_params_) 

#Gradient Boosting Classifier      
# Initialize the GradientBoostingClassifier model
gbt = GradientBoostingClassifier(random_state=2021)

# Define the parameter grid for the grid search
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30],
    'subsample': [0.6, 0.8, 1.0],
    'max_features': [0.2, 0.4, 0.6, 0.8, 1.0],
}

# GradientBoosting Classifier Grid Search
for fc_name, value in feature_scaling.items():
    X_train, X_test, y_train, y_test = value

    # Create GridSearchCV with the GradientBoostingClassifier model and the parameter grid
    grid_search = GridSearchCV(gbt, param_grid, cv=3, scoring='accuracy', n_jobs=-1)

    # Train and perform grid search
    grid_search.fit(X_train, y_train)

    # Get the best GradientBoostingClassifier model from the grid search
    best_model = grid_search.best_estimator_

    # Evaluate and plot the best model (setting proba=True for probability prediction)
    run_model(f'GradientBoosting Classifier (Best Estimators: {grid_search.best_params_["n_estimators"]})',
              best_model, X_train, X_test, y_train, y_test, fc_name, proba=True)

    # Print the best hyperparameters found
    print(f"Best Hyperparameters for {fc_name}:")
    print(grid_search.best_params_)

#Extreme Gradient Boosting Classifier   
# XGBoost Classifier Grid Search
for fc_name, value in feature_scaling.items():
    X_train, X_test, y_train, y_test = value

    # Define the parameter grid for the grid search
    param_grid = {
        'n_estimators': [100, 200, 300],
        'max_depth': [10, 20, 30],
        'learning_rate': [0.01, 0.1, 0.2],
        'subsample': [0.6, 0.8, 1.0],
        'colsample_bytree': [0.6, 0.8, 1.0],
    }

    # Initialize the XGBoost classifier
    xgb = XGBClassifier(random_state=2021, use_label_encoder=False, eval_metric='mlogloss')

    # Create GridSearchCV with the XGBoost classifier and the parameter grid
    grid_search = GridSearchCV(xgb, param_grid, cv=3, scoring='accuracy', n_jobs=-1)

    # Train and perform grid search
    grid_search.fit(X_train, y_train)

    # Get the best XGBoost classifier model from the grid search
    best_model = grid_search.best_estimator_

    # Evaluate and plot the best model (setting proba=True for probability prediction)
    run_model(f'XGB Classifier (Best Estimators: {grid_search.best_params_["n_estimators"]})',
              best_model, X_train, X_test, y_train, y_test, fc_name, proba=True)

    # Print the best hyperparameters found
    print(f"Best Hyperparameters for {fc_name}:")
    print(grid_search.best_params_) 

# MLP Classifier Grid Search
for fc_name, value in feature_scaling.items():
    X_train, X_test, y_train, y_test = value

    # Define the parameter grid for the grid search
    param_grid = {
        'hidden_layer_sizes': [(50,), (100,), (50, 50), (100, 50), (100, 100)],
        'activation': ['logistic', 'relu'],
        'solver': ['adam', 'sgd'],
        'alpha': [0.0001, 0.001, 0.01],
        'learning_rate': ['constant', 'invscaling', 'adaptive'],
    }

    # Initialize the MLP Classifier
    mlp = MLPClassifier(random_state=2021)

    # Create GridSearchCV with the MLP Classifier and the parameter grid
    grid_search = GridSearchCV(mlp, param_grid, cv=3, scoring='accuracy', n_jobs=-1)

    # Train and perform grid search
    grid_search.fit(X_train, y_train)

    # Get the best MLP Classifier model from the grid search
    best_model = grid_search.best_estimator_

    # Evaluate and plot the best model (setting proba=True for probability prediction)
    run_model('MLP Classifier', best_model, X_train, X_test, y_train, y_test, fc_name, proba=True)

    # Print the best hyperparameters found
    print(f"Best Hyperparameters for {fc_name}:")
    print(grid_search.best_params_)
   
#LGBM Classifier
# Define the parameter grid for grid search
param_grid = {
    'max_depth': [10, 20, 30],
    'n_estimators': [100, 200, 300],
    'subsample': [0.6, 0.8, 1.0],
    'random_state': [2021]
}

# Initialize the LightGBM classifier
lgbm = LGBMClassifier()

# Grid Search
for fc_name, value in feature_scaling.items():
    X_train, X_test, y_train, y_test = value

    # Create GridSearchCV with the LightGBM classifier and the parameter grid
    grid_search = GridSearchCV(lgbm, param_grid, cv=3, scoring='accuracy', n_jobs=-1)

    # Train and perform grid search
    grid_search.fit(X_train, y_train)

    # Get the best LightGBM classifier model from the grid search
    best_model = grid_search.best_estimator_

    # Evaluate and plot the best model (setting proba=True for probability prediction)
    run_model('LGBM Classifier', best_model, X_train, X_test, y_train, y_test, fc_name, proba=True)

    # Print the best hyperparameters found
    print(f"Best Hyperparameters for {fc_name}:")
    print(grid_search.best_params_)

