Software Developer and Writer: TKINTER AND DATA SCIENCE: PART 1 (FULL SOURCE CODE)

If you find our source code useful, please support us by subscribing to our channel.

This source code is divided into four separate classes, each with its own purpose. This code performs preprocessing to prepare data for machine learning and deep learning algorithms (which will be provided in part 2).

The code displays the distribution of each feature, including the categorized features. It also displays the correlation matrix and the feature importances.

BALIGE ACADEMY TEAM

VIVIAN SIAHAAN

RISMON HASIHOLAN SIANIPAR

HORAS!!!

EXPLANATION:

The class named Design_Window is intended for designing and setting up a graphical user interface (GUI) using the Tkinter library. The class consists of methods to add various widgets, labels, buttons, canvases, listboxes, and comboboxes to the GUI.

The add_widgets() method adds different types of widgets to the root window. It calls other methods to add buttons, canvases, labels, listboxes, and comboboxes.
The add_buttons() method adds a button labeled "LOAD DATASET" to the GUI. It is positioned at row 0, column 0 and is given specific dimensions.
The add_labels() method adds six labels to the GUI, one above each selection widget. The labels are placed in rows 1, 3, 5, 7, 9, and 11 of column 0, and each has its own text color (red, blue, black, or green).
The add_canvas() method adds two canvases (canvas1 and canvas2) to the GUI. These canvases are used to display visual outputs such as plots. Each canvas is created using the Figure class from Matplotlib and is placed in different columns of the GUI.
The add_listboxes() method adds a listbox widget to the GUI. It allows users to select items from a predefined list. The listbox is filled with items related to various feature names and categories.
The add_comboboxes() method adds five combobox widgets to the GUI. Each combobox allows users to select predefined options from dropdown menus. The options in the comboboxes relate to different types of plots, regressors, machine learning models, and deep learning models.
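
The canvases described above wrap Matplotlib Figure objects, and every plot in the application follows the same cycle: clear the figure, add a subplot, draw on it, then refresh the canvas. The following is a minimal sketch of that cycle. It assumes Matplotlib's non-interactive Agg canvas as a stand-in for FigureCanvasTkAgg so that it runs without a display; the helper name redraw_bar and the sample counts are made up for illustration.

```python
from matplotlib.figure import Figure
from matplotlib.backends.backend_agg import FigureCanvasAgg

def redraw_bar(figure, canvas, labels, counts, title):
    """Clear the figure, draw a horizontal bar chart, and refresh the canvas."""
    figure.clear()                      # remove whatever was drawn before
    ax = figure.add_subplot(1, 1, 1)    # one plot filling the figure
    ax.barh(labels, counts)
    ax.set_title(title, fontsize=10)
    figure.tight_layout()
    canvas.draw()                       # render the updated figure
    return ax

# Same size and background color as the canvases in Design_Window
figure = Figure(figsize=(6.2, 6), dpi=100)
figure.patch.set_facecolor("lightgray")
canvas = FigureCanvasAgg(figure)        # assumption: stand-in for FigureCanvasTkAgg

# Two consecutive selections reuse the same figure and canvas
ax = redraw_bar(figure, canvas, ["Married", "Single"], [3, 2], "first plot")
ax = redraw_bar(figure, canvas, ["PhD", "Master", "Basic"], [4, 1, 2], "second plot")
```

Because the figure is cleared on every call, only the most recent plot remains, which is why one canvas can serve many listbox and combobox choices.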

Overall, the Design_Window class is aimed at creating a user-friendly GUI for interactive data analysis and visualization, along with the selection of different algorithms and models for further analysis.

The code also defines a class called Process_Data, which contains methods to prepare a dataset for machine learning tasks. This class focuses on steps like data preprocessing, feature engineering, categorical feature encoding, and calculating feature importance.

First, the read_dataset method reads a CSV file named "marketing_data.csv" and drops the 'ID' column, returning the cleaned DataFrame. Next, the preprocess method carries out preprocessing steps such as renaming columns, handling date formats, dealing with missing 'Income' values, and generating new features like 'Customer_Age', 'Num_Dependants', 'Num_TotalPurchases', and 'TotalAmount_Spent'.
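
The preprocessing steps above can be sketched on a few made-up rows. The column names follow the ones mentioned in the text, but the values are invented for illustration and the snippet covers only part of what preprocess() does.

```python
import pandas as pd

# Made-up rows that mimic the raw columns named in the text
raw = pd.DataFrame({
    " Income ": ["$84,835.00", None, "$21,474.00"],
    "Dt_Customer": ["6/16/14", "5/13/13", "1/2/12"],
    "Year_Birth": [1970, 1961, 1985],
    "Kidhome": [0, 1, 0],
    "Teenhome": [0, 1, 2],
})

df = raw.rename(columns={" Income ": "Income"})

# Strip the currency formatting; regex=False treats "$" as a literal character
df["Income"] = (df["Income"].str.replace("$", "", regex=False)
                            .str.replace(",", "", regex=False)
                            .astype(float))

# Fill the missing income with the median of the known incomes
df["Income"] = df["Income"].fillna(df["Income"].median())

# Parse the enrollment date, then derive the new features
df["Dt_Customer"] = pd.to_datetime(df["Dt_Customer"], format="%m/%d/%y")
df["Customer_Age"] = df["Dt_Customer"].dt.year - df["Year_Birth"]
df["Num_Dependants"] = df["Kidhome"] + df["Teenhome"]
```

Note that Customer_Age here is the customer's age in the year of enrollment, not the age today, because it is computed from Dt_Customer.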

The categorize method categorizes numerical features into predefined intervals to simplify data representation and analysis. The extract_cat_num_cols() method identifies categorical and numerical columns based on their data types, aiding in subsequent processing.
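
As a rough illustration of these two steps, the sketch below bins one numerical column with pd.cut and then splits the columns by type. The data is invented, and split_columns is a hypothetical helper that uses select_dtypes rather than the dtype comparison in the original method.

```python
import pandas as pd

df = pd.DataFrame({
    "Education": ["PhD", "Master", "Basic", "PhD"],
    "Income": [15000.0, 42000.0, 65000.0, 250000.0],
    "Recency": [5, 40, 80, 99],
})

# Bin Income into labelled intervals; each interval is (left, right]
labels = ["0-20k", "20k-30k", "30k-50k", "50k-70k", "70k-700k"]
binned = df.copy()
binned["Income"] = pd.cut(binned["Income"],
                          [0, 20000, 30000, 50000, 70000, 700000],
                          labels=labels)

def split_columns(frame):
    """Return (categorical, numerical) column names based on dtype."""
    num_cols = frame.select_dtypes(include="number").columns.tolist()
    cat_cols = [c for c in frame.columns if c not in num_cols]
    return cat_cols, num_cols

cat_cols, num_cols = split_columns(binned)
```

With the default settings of pd.cut, a value equal to the lowest edge (0 here) falls outside the first interval and becomes NaN; passing include_lowest=True changes that.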

The encode_categorical_feats() method uses label encoding to convert categorical features into numerical values. The extract_input_output_vars() method separates input features (X) and the target variable (y), excluding specific columns for modeling. Lastly, the feat_importance_rf(), feat_importance_et(), and feat_importance_rfe() methods assess feature importance using different algorithms and provide ranked lists of features.
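
The encoding and importance steps can be sketched as follows. The data is synthetic: the target is constructed to depend on Income only, so the ranking is easy to interpret. The estimator settings (n_estimators, random_state) are choices made for this example, not values taken from the original code.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Education": rng.choice(["Basic", "Master", "PhD"], size=n),
    "Income": rng.normal(50000, 15000, size=n),
    "Recency": rng.integers(0, 100, size=n),
})
# Synthetic target that depends on Income only
df["Response"] = (df["Income"] > 55000).astype(int)

# Label-encode the categorical column on a copy of the data
encoded = df.copy()
encoded["Education"] = LabelEncoder().fit_transform(list(encoded["Education"].astype(str)))

# Separate the input features from the target
X = encoded.drop(columns="Response")
y = encoded["Response"].values

# Fit a random forest and rank the features by importance
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
ranking = (pd.DataFrame({"Features": X.columns, "Values": rf.feature_importances_})
             .sort_values("Values", ascending=False)
             .reset_index(drop=True))
```

The importances of a fitted forest sum to 1, so each value can be read as that feature's share of the total.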

Overall, the Process_Data class streamlines data preparation for machine learning, encompassing tasks like preprocessing, categorization, encoding, and importance evaluation.

The code also defines a class called Helper_Plot designed to assist in creating various types of plots for data visualization. This class contains methods for generating different types of plots, including pie charts, histograms, bar plots, stacked bar plots, box plots, and correlation matrices.

One of the methods, named plot_piechart(), creates a pair of subplots: a pie chart and a horizontal bar plot. These plots help visualize the distribution and count of a categorical variable within the dataset. Another method, another_versus_response(), generates two histograms to compare the distribution of a numerical feature for responsive and non-responsive cases.
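
To make the numbers behind these plots concrete, here is a small sketch with invented data. It computes the counts and percentage shares that a pie chart and bar plot would show, and the per-bin counts of a numerical feature split by Response. The shared bin edges are a choice made for this toy example; the original method passes a bin count to the plotting call instead.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Education": ["PhD", "Master", "PhD", "Basic", "PhD", "Master"],
    "Income": [30000.0, 52000.0, 75000.0, 18000.0, 61000.0, 44000.0],
    "Response": [0, 1, 1, 0, 1, 0],
})

# Counts feed the bar plot; the pie chart shows the same counts as shares
counts = df["Education"].value_counts()
shares = (counts / counts.sum() * 100).round(1)

# Histogram counts of one numerical feature, split by the Response flag
edges = np.linspace(0, 80000, 5)    # 4 equal-width bins, chosen for this toy data
hist_no, _ = np.histogram(df.loc[df["Response"] == 0, "Income"], bins=edges)
hist_yes, _ = np.histogram(df.loc[df["Response"] == 1, "Income"], bins=edges)
```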

The class also provides methods for labeling segments in stacked bar plots (put_label_stacked_bar()), creating stacked bar plots for comparing two categorical variables (dist_one_vs_another_plot()), and generating box plots to show the distribution of a numerical variable against two categorical variables (box_plot()). The choose_plot() method enables users to select and create different plots based on their preferences. Similarly, the choose_category() method is tailored for exploring relationships between categorized data, particularly connections between categorical variables and the response variable.
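
The table behind a stacked bar plot, and the position of the label inside each segment, can be sketched like this. The data is invented; the label position follows the rule used in put_label_stacked_bar(), namely the bottom of the segment plus half its height.

```python
import pandas as pd

df = pd.DataFrame({
    "Education": ["PhD", "PhD", "Master", "Master", "Master", "Basic"],
    "Response": [0, 1, 0, 0, 1, 0],
})

# One row per Education level (one bar), one column per Response value (one segment)
table = df.groupby(["Education", "Response"]).size().unstack(fill_value=0)

# Each segment starts where the previous segments of the same bar end
bottoms = table.cumsum(axis=1) - table

# A label is centered vertically inside its segment
centers = bottoms + table / 2
```

For the "Master" bar, the Response 0 segment spans 0 to 2 and the Response 1 segment spans 2 to 3, so their labels sit at heights 1.0 and 2.5.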

Moreover, the class includes methods for generating a heatmap to visualize the correlation matrix of numerical features (plot_corr_mat()), and for creating bar plots to display feature importances obtained from Random Forest, Extra Trees, and Recursive Feature Elimination (RFE) methods (plot_rf_importance(), plot_et_importance(), plot_rfe_importance()).
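
A correlation matrix of this kind can be computed as in the sketch below. The values are invented so that the result is easy to check, and the snippet only computes the matrix; how plot_corr_mat() draws it is not shown in this listing.

```python
import pandas as pd

df = pd.DataFrame({
    "Income": [20000.0, 40000.0, 60000.0, 80000.0],
    "TotalAmount_Spent": [100.0, 300.0, 500.0, 700.0],
    "Recency": [90, 60, 30, 0],
    "Country": ["SP", "CA", "US", "SP"],
})

# Pearson correlation of the numerical columns only
corr = df.select_dtypes(include="number").corr()
```

In this toy data, spending rises in step with income (correlation 1) and recency falls in step with it (correlation -1). One common way to draw such a matrix on an existing axes is seaborn's heatmap function, for example sns.heatmap(corr, annot=True, ax=ax).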

In summary, the Helper_Plot class offers a range of functions to simplify data visualization tasks, aiding in the understanding and interpretation of relationships within the dataset.

Lastly, the code defines a Main_Class that acts as the main interface for a data visualization application. This class interacts with various other components to read and preprocess data, generate visualizations, and display them using the Tkinter library.

Upon initialization, the Main_Class sets up the Tkinter root window and dimensions. It creates instances of other classes (Design_Window, Process_Data, Helper_Plot) to manage the user interface, data processing, and visualization tasks.

The initialize method carries out a series of steps:

Reads and preprocesses the dataset using the Process_Data class.
Categorizes the dataset into bins using the categorize method of Process_Data.
Extracts categorical and numerical column names from the dataset.
Encodes categorical features and extracts input-output variables using methods from the Process_Data class.

The place_widgets() method arranges the widgets within the main window and associates a function with the "LOAD DATASET" button to display the dataset as a table when clicked.

Event binding is set up in the binds_event() method. The choose_list_widget(), choose_combobox1(), and choose_combobox2() methods are called when the user interacts with the listbox and comboboxes, respectively. These methods delegate visualization tasks to the Helper_Plot class based on the user's selection.
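
The delegation step amounts to routing a selection string to the matching plotting routine. The original code does this with if/elif chains; the sketch below shows one alternative, a dictionary-based dispatcher, purely as an illustration. The names make_dispatcher and handlers are hypothetical, and the lambdas stand in for Helper_Plot methods.

```python
def make_dispatcher(handlers):
    """Return a function that routes a selection string to its handler."""
    def dispatch(chosen, *args):
        handler = handlers.get(chosen)
        if handler is None:
            return None          # unknown or empty selection: do nothing
        return handler(*args)
    return dispatch

# Hypothetical handlers standing in for Helper_Plot methods
calls = []
handlers = {
    "Marital Status": lambda df: calls.append(("pie", "Marital_Status")),
    "Income": lambda df: calls.append(("hist", "Income")),
}

dispatch = make_dispatcher(handlers)
dispatch("Income", None)             # routed to the histogram handler
dispatch("Not in the list", None)    # silently ignored
```

A table like this keeps the mapping between menu text and plotting call in one place, which makes it harder for the strings in the widgets and the strings in the handlers to drift apart.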

The shows_table() method opens a new Toplevel window and displays the dataset in it as a table using the pandastable library.

Finally, the script checks if the module is run directly and initializes the application by creating an instance of Main_Class and starting the Tkinter event loop with root.mainloop().

In essence, the Main_Class integrates various components to provide an interactive data visualization application where users can explore and analyze the dataset using a graphical interface.

FULL SOURCE CODE:

#main_class.py
import tkinter as tk
from tkinter import *
from pandastable import Table
from design_window import Design_Window
from process_data import Process_Data
from helper_plot import Helper_Plot

class Main_Class:
    def __init__(self, root):
        #Keeps a reference to the root window instead of relying on a global variable
        self.root = root
        self.initialize()

    def initialize(self):
        lebar = 1500
        tinggi = 650
        self.root.geometry(f"{lebar}x{tinggi}")
        self.root.title("TKINTER AND DATA SCIENCE")
        
        #Creates necessary objects
        self.obj_window = Design_Window()
        self.obj_data = Process_Data()
        self.obj_plot = Helper_Plot()

        #Reads dataset
        self.df = self.obj_data.preprocess()

        #Categorize dataset
        self.df_dummy = self.obj_data.categorize(self.df)

        #Extracts input and output variables
        self.cat_cols, self.num_cols = self.obj_data.extract_cat_num_cols(self.df)
        self.df_final = self.obj_data.encode_categorical_feats(self.df, self.cat_cols)
        self.X, self.y = self.obj_data.extract_input_output_vars(self.df_final)

        #Places widgets in root
        self.place_widgets()      

        #Binds event
        self.binds_event() 

    def binds_event(self):
        #Binds listbox to a function
        self.obj_window.listbox.bind("<<ListboxSelect>>", self.choose_list_widget)

        # Binds combobox1 to a function
        self.obj_window.combo1.bind("<<ComboboxSelected>>", self.choose_combobox1)

        # Binds combobox2 to a function
        self.obj_window.combo2.bind("<<ComboboxSelected>>", self.choose_combobox2)

    def place_widgets(self):    
        self.obj_window.add_widgets(self.root) 
        
        #Shows table if user clicks LOAD DATASET 
        self.obj_window.tombol.config(command=self.shows_table)       

    def shows_table(self):
        frame = Toplevel(self.root) #new window
        self.table = Table(frame, dataframe=self.df, showtoolbar=True, showstatusbar=True)

        # Sets dimension of Toplevel
        frame.geometry("1300x500")
        self.table.show()

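    def choose_list_widget(self, event):
        #Ignores the event when nothing is selected in the listbox
        #(e.g. when the listbox loses its selection to another widget)
        selection = self.obj_window.listbox.curselection()
        if not selection:
            return
        chosen = self.obj_window.listbox.get(selection[0])
        print(chosen)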
        self.obj_plot.choose_plot(self.df, self.df_dummy, chosen, 
            self.obj_window.figure1, self.obj_window.canvas1, 
            self.obj_window.figure2, self.obj_window.canvas2)

    def choose_combobox1(self, event):
        chosen = self.obj_window.combo1.get()
        self.obj_plot.choose_category(self.df_dummy, chosen, 
            self.obj_window.figure1, self.obj_window.canvas1, 
            self.obj_window.figure2, self.obj_window.canvas2)

    def choose_combobox2(self, event):
        chosen = self.obj_window.combo2.get()
        self.obj_plot.choose_plot_more(self.df_final, chosen, 
            self.X, self.y,
            self.obj_window.figure1, 
            self.obj_window.canvas1, self.obj_window.figure2, 
            self.obj_window.canvas2)
        
if __name__ == "__main__":
    root = tk.Tk()
    app = Main_Class(root)
    root.mainloop()

#design_window.py
import tkinter as tk
from tkinter import ttk
from matplotlib.figure import Figure
from matplotlib.backends.backend_tkagg import FigureCanvasTkAgg

class Design_Window:
    def add_widgets(self, root):
        #Adds button(s)
        self.add_buttons(root)

        #Adds canvases
        self.add_canvas(root)

        #Adds labels
        self.add_labels(root)

        #Adds listbox widget
        self.add_listboxes(root)

        #Adds combobox widget
        self.add_comboboxes(root)

    def add_buttons(self, root):
        #Adds button
        self.tombol = tk.Button(root, height=2, width=30, text="LOAD DATASET")
        self.tombol.grid(row=0, column=0, padx=5, pady=5, sticky="w")

    def add_labels(self, root):
        #Adds labels
        self.label1 = tk.Label(root, text = "CHOOSE PLOT", fg = "red")
        self.label1.grid(row=1, column=0, padx=5, pady=1, sticky="w")

        self.label2 = tk.Label(root, text = "CHOOSE CATEGORIZED PLOT", fg = "blue")
        self.label2.grid(row=3, column=0, padx=5, pady=1, sticky="w")

        self.label3 = tk.Label(root, text = "CHOOSE FEATURES", fg = "black")
        self.label3.grid(row=5, column=0, padx=5, pady=1, sticky="w")

        self.label4 = tk.Label(root, text = "CHOOSE REGRESSORS", fg = "green")
        self.label4.grid(row=7, column=0, padx=5, pady=1, sticky="w")

        self.label5 = tk.Label(root, text = "CHOOSE MACHINE LEARNING", fg = "blue")
        self.label5.grid(row=9, column=0, padx=5, pady=1, sticky="w")

        self.label6 = tk.Label(root, text = "CHOOSE DEEP LEARNING", fg = "red")
        self.label6.grid(row=11, column=0, padx=5, pady=1, sticky="w")

    def add_canvas(self, root):
        #Adds canvas1 widget to root to display results
        self.figure1 = Figure(figsize=(6.2, 6), dpi=100)
        self.figure1.patch.set_facecolor("lightgray")
        self.canvas1 = FigureCanvasTkAgg(self.figure1, master=root)
        self.canvas1.get_tk_widget().grid(row=0, column=1, columnspan=1, rowspan=25, padx=5, pady=5, sticky="n")

        #Adds canvas2 widget to root to display results
        self.figure2 = Figure(figsize=(6.2, 6), dpi=100)
        self.figure2.patch.set_facecolor("lightgray")
        self.canvas2 = FigureCanvasTkAgg(self.figure2, master=root)
        self.canvas2.get_tk_widget().grid(row=0, column=2, columnspan=1, rowspan=25, padx=5, pady=5, sticky="n")

    def add_listboxes(self, root):
        #Adds listbox widget
        self.listbox = tk.Listbox(root, selectmode=tk.SINGLE, width=35)
        self.listbox.grid(row=2, column=0, sticky='n', padx=5, pady=1)

        # Inserts items into the listbox widget
        items = ["Marital Status", "Education", "Country", 
                 "Age Group", "Education with Response 0", "Education with Response 1",
                 "Country with Response 0", "Country with Response 1", 
                 "Customer Age", "Income", "Mount of Wines",
                 "Education versus Response", "Age Group versus Response",
                 "Marital Status versus Response", "Country versus Response",
                 "Number of Dependants versus Response",
                 "Country versus Customer Age Per Education",
                 "Num_TotalPurchases versus Education Per Marital Status"]
        for item in items:
            self.listbox.insert(tk.END, item)

        self.listbox.config(height=len(items)) 

    def add_comboboxes(self, root):
        # Create ComboBoxes
        self.combo1 = ttk.Combobox(root, width=32)
        self.combo1["values"] = ["Categorized Income versus Response", 
            "Categorized Total Purchase versus Categorized Income",
            "Categorized Recency versus Categorized Total Purchase",
            "Categorized Customer Month versus Categorized Customer Age",
            "Categorized Mount of Gold Products versus Categorized Income",
            "Categorized Mount of Fish Products versus Categorized Total AmountSpent",
            "Categorized Mount of Meat Products versus Categorized Recency",
            "Distribution of Numerical Columns"]
        self.combo1.grid(row=4, column=0, padx=5, pady=1, sticky="n")

        self.combo2 = ttk.Combobox(root, width=32)
        self.combo2["values"] = ["Correlation Matrix", "RF Features Importance",
            "ET Features Importance", "RFE Features Importance"]
        self.combo2.grid(row=6, column=0, padx=5, pady=1, sticky="n")

        self.combo3 = ttk.Combobox(root, width=32)
        self.combo3["values"] = ["Linear Regression", "RF Regression",
            "Decision Trees Regression", "KNN Regression",
            "AdaBoost Regression", "Gradient Boosting Regression",
            "XGB Regression", "LGB Regression", "CatBoost Regression",
            "SVR Regression", "Lasso Regression", "Ridge Regression"]
        self.combo3.grid(row=8, column=0, padx=5, pady=1, sticky="n")

        self.combo4 = ttk.Combobox(root, width=32)
        self.combo4["values"] = ["Logistic Regression", "Random Forest",
            "Decision Trees", "KNN",
            "AdaBoost", "Gradient Boosting",
            "Extreme Gradient Boosting", "Light Gradient Boosting", 
            "Multi-Layer Perceptron", "Support Vector Machine"]
        self.combo4.grid(row=10, column=0, padx=5, pady=1, sticky="n")

        self.combo5 = ttk.Combobox(root, width=32)
        self.combo5["values"] = ["Long-Short Term", "Convolutional NN", "Recurrent NN", "Feed-Forward NN", "Artifical NN"]
        self.combo5.grid(row=12, column=0, padx=5, pady=1, sticky="n")

#process_data.py
import os
import numpy as np 
import pandas as pd 
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

class Process_Data:
    def read_dataset(self):
        #Reads dataset
        curr_path = os.getcwd() 
        df = pd.read_csv(curr_path+"/marketing_data.csv")

        #Drops ID column
        df = df.drop("ID", axis = 1)

        return df
    
    def preprocess(self):
        df = self.read_dataset()

        #Renames column name and corrects data type
        df.rename(columns={' Income ':'Income'},inplace=True)
        df["Dt_Customer"] = pd.to_datetime(df["Dt_Customer"], format='%m/%d/%y')  
        #regex=False makes the literal match on "$" explicit
        df["Income"] = df["Income"].str.replace("$","",regex=False).str.replace(",","",regex=False)
        df["Income"] = df["Income"].astype(float)

        #Checks null values
        print(df.isnull().sum())
        print('Total number of null values: ', df.isnull().sum().sum())

        #Imputes Income column with median values
        df['Income'] = df['Income'].fillna(df['Income'].median())
        print(f'Number of Null values in "Income" after Imputation: {df["Income"].isna().sum()}')

        #Transforms Dt_Customer
        df['Dt_Customer'] = pd.to_datetime(df['Dt_Customer'])
        print(f'After Transformation:\n{df["Dt_Customer"].head()}')
        df['Customer_Age'] = df['Dt_Customer'].dt.year - df['Year_Birth']

        #Creates number of children/dependents in home by adding 'Kidhome' and 'Teenhome' features
        #Creates number of Total_Purchases by adding all the purchases features
        #Creates TotalAmount_Spent by adding all the Mnt* features
        df['Dt_Customer_Month'] = df['Dt_Customer'].dt.month
        df['Dt_Customer_Year'] = df['Dt_Customer'].dt.year
        df['Num_Dependants'] = df['Kidhome'] + df['Teenhome']    

        purchase_features = [c for c in df.columns if 'Purchase' in str(c)]
        #Removes 'NumDealsPurchases' from the list above
        purchase_features.remove('NumDealsPurchases')
        df['Num_TotalPurchases'] = df[purchase_features].sum(axis = 1)

        amt_spent_features = [c for c in df.columns if 'Mnt' in str(c)]
        df['TotalAmount_Spent'] = df[amt_spent_features].sum(axis = 1)  

        #Creates a categorical feature by binning the customer's age,
        #to help understand purchasing behaviour
        print(f'Min. Customer Age: {df["Customer_Age"].min()}')
        print(f'Max. Customer Age: {df["Customer_Age"].max()}')
        df['AgeGroup'] = pd.cut(df['Customer_Age'], bins = [6, 24, 29, 40, 56, 75], 
             labels = ['Gen-Z', 'Gen-Y.1', 'Gen-Y.2', 'Gen-X', 'BBoomers'])

        return df  

    def categorize(self, df):
        #Creates a dummy dataframe for visualization
        df_dummy=df.copy()

        #Categorizes Income feature
        labels = ['0-20k', '20k-30k', '30k-50k','50k-70k','70k-700k']
        df_dummy['Income'] = pd.cut(df_dummy['Income'], 
            [0, 20000, 30000, 50000, 70000, 700000], labels=labels)        

        #Categorizes TotalAmount_Spent feature
        labels = ['0-200', '200-500', '500-800','800-1000','1000-3000']
        df_dummy['TotalAmount_Spent'] = pd.cut(df_dummy['TotalAmount_Spent'], 
            [0, 200, 500, 800, 1000, 3000], labels=labels)

        #Categorizes Num_TotalPurchases feature
        labels = ['0-5', '5-10', '10-15','15-25','25-35']
        df_dummy['Num_TotalPurchases'] = pd.cut(df_dummy['Num_TotalPurchases'], 
            [0, 5, 10, 15, 25, 35], labels=labels)

        #Categorizes Dt_Customer_Year feature
        labels = ['2012', '2013', '2014']
        df_dummy['Dt_Customer_Year'] = pd.cut(df_dummy['Dt_Customer_Year'], 
            [0, 2012, 2013, 2014], labels=labels)

        #Categorizes Dt_Customer_Month feature
        labels = ['0-3', '3-6', '6-9','9-12']
        df_dummy['Dt_Customer_Month'] = pd.cut(df_dummy['Dt_Customer_Month'], 
            [0, 3, 6, 9, 12], labels=labels)

        #Categorizes Customer_Age feature
        labels = ['0-30', '30-40', '40-50', '50-60','60-120']
        df_dummy['Customer_Age'] = pd.cut(df_dummy['Customer_Age'], 
            [0, 30, 40, 50, 60, 120], labels=labels)

        #Categorizes MntGoldProds feature
        labels = ['0-30', '30-50', '50-80', '80-100','100-400']
        df_dummy['MntGoldProds'] = pd.cut(df_dummy['MntGoldProds'], 
            [0, 30, 50, 80, 100, 400], labels=labels)

        #Categorizes MntSweetProducts feature
        labels = ['0-10', '10-20', '20-40', '40-100','100-300']
        df_dummy['MntSweetProducts'] = pd.cut(df_dummy['MntSweetProducts'], 
            [0, 10, 20, 40, 100, 300], labels=labels)

        #Categorizes MntFishProducts feature
        labels = ['0-10', '10-20', '20-40', '40-100','100-300']
        df_dummy['MntFishProducts'] = pd.cut(df_dummy['MntFishProducts'], 
            [0, 10, 20, 40, 100, 300], labels=labels)

        #Categorizes MntMeatProducts feature
        labels = ['0-50', '50-100', '100-200', '200-500','500-2000']
        df_dummy['MntMeatProducts'] = pd.cut(df_dummy['MntMeatProducts'], 
            [0, 50, 100, 200, 500, 2000], labels=labels)

        #Categorizes MntFruits feature
        labels = ['0-10', '10-30', '30-50', '50-100','100-200']
        df_dummy['MntFruits'] = pd.cut(df_dummy['MntFruits'], 
            [0, 10, 30, 50, 100, 200], labels=labels)

        #Categorizes MntWines feature
        labels = ['0-100', '100-300', '300-500', '500-1000','1000-1500']
        df_dummy['MntWines'] = pd.cut(df_dummy['MntWines'], 
            [0, 100, 300, 500, 1000, 1500], labels=labels)

        #Categorizes Recency feature
        labels = ['0-10', '10-30', '30-50', '50-80','80-100']
        df_dummy['Recency'] = pd.cut(df_dummy['Recency'], 
            [0, 10, 30, 50, 80, 100], labels=labels)

        return df_dummy

    def extract_cat_num_cols(self, df):
        #Extracts categorical and numerical columns in dummy dataset
        cat_cols = [col for col in df.columns if 
            (df[col].dtype == 'object') or (df[col].dtype.name == 'category')]
        num_cols = [col for col in df.columns if 
            (df[col].dtype != 'object') and (df[col].dtype.name != 'category')]
        
        return cat_cols, num_cols
    
    def encode_categorical_feats(self, df, cat_cols):
        #Encodes categorical features on a copy, so the original dataset stays unchanged
        df = df.copy()
        print(f'Features that need to be Label Encoded: \n{cat_cols}')

        for c in cat_cols:
            lbl = LabelEncoder()
            lbl.fit(list(df[c].astype(str).values))
            df[c] = lbl.transform(list(df[c].astype(str).values))
        print('Label Encoding done..')  
        return df  

    def extract_input_output_vars(self, df): 
        #Extracts output and input variables
        y = df['Response'].values # Target for the model
        X = df.drop(['Dt_Customer', 'Year_Birth', 'Response'], axis = 1)  

        return X, y     

    def feat_importance_rf(self, X, y):
        names = X.columns
        rf = RandomForestClassifier()
        rf.fit(X, y)

        result_rf = pd.DataFrame()
        result_rf['Features'] = X.columns
        result_rf ['Values'] = rf.feature_importances_
        result_rf.sort_values('Values', inplace = True, ascending = False)

        return result_rf
    
    def feat_importance_et(self, X, y):
        model = ExtraTreesClassifier()
        model.fit(X, y)

        result_et = pd.DataFrame()
        result_et['Features'] = X.columns
        result_et ['Values'] = model.feature_importances_
        result_et.sort_values('Values', inplace=True, ascending =False)

        return result_et    
    
    def feat_importance_rfe(self, X, y):
        model = LogisticRegression()
        #Creates the RFE model
        rfe = RFE(model)
        rfe = rfe.fit(X, y)

        result_lg = pd.DataFrame()
        result_lg['Features'] = X.columns
        result_lg ['Ranking'] = rfe.ranking_
        result_lg.sort_values('Ranking', inplace=True , ascending = False)

        return result_lg      

#helper_plot.py
import seaborn as sns
import numpy as np 
from process_data import Process_Data

class Helper_Plot:
    def __init__(self):
        self.obj_data = Process_Data()

    # Defines function to create pie chart and bar plot as subplots   
    def plot_piechart(self, df, var, figure, canvas, title=''):
        figure.clear()

        # Pie chart (top subplot)
        plot1 = figure.add_subplot(2,1,1)        
        label_list = list(df[var].value_counts().index)
        colors = sns.color_palette("deep", len(label_list))  
        _, _, autopcts = plot1.pie(df[var].value_counts(), autopct="%1.1f%%", colors=colors,
            startangle=30, labels=label_list,
            wedgeprops={"linewidth": 2, "edgecolor": "white"},  # Add white edge
            shadow=True, textprops={'fontsize': 7})
        plot1.set_title("Distribution of " + var + " variable " + title, fontsize=10)

        # Bar plot (bottom subplot)
        plot2 = figure.add_subplot(2,1,2)
        ax = df[var].value_counts().plot(kind="barh", color=colors, alpha=0.8, ax = plot2) 
        for i, j in enumerate(df[var].value_counts().values):
            ax.text(.7, i, j, weight="bold", fontsize=7)

        plot2.set_title("Count of " + var + " cases " + title, fontsize=10)

        figure.tight_layout()
        canvas.draw()

    def another_versus_response(self, df, feat, num_bins, figure, canvas):
        figure.clear()
        plot1 = figure.add_subplot(2,1,1)

        colors = sns.color_palette("Set2")
        df[df['Response'] == 0][feat].plot(ax=plot1, kind='hist', bins=num_bins, edgecolor='black', color=colors[0])
        plot1.set_title('Not Responsive', fontsize=15)
        plot1.set_xlabel(feat, fontsize=10)
        plot1.set_ylabel('Count', fontsize=10)
        data1 = []
        for p in plot1.patches:
            x = p.get_x() + p.get_width() / 2.
            y = p.get_height()
            plot1.annotate(format(y, '.0f'), (x, y), ha='center',
                     va='center', xytext=(0, 10),
                     weight="bold", fontsize=7, textcoords='offset points')
            data1.append([x, y])

        plot2 = figure.add_subplot(2,1,2)
        df[df['Response'] == 1][feat].plot(ax=plot2, kind='hist', bins=num_bins, edgecolor='black', color=colors[1])
        plot2.set_title('Responsive', fontsize=15)
        plot2.set_xlabel(feat, fontsize=10)
        plot2.set_ylabel('Count', fontsize=10)
        data2 = []
        for p in plot2.patches:
            x = p.get_x() + p.get_width() / 2.
            y = p.get_height()
            plot2.annotate(format(y, '.0f'), (x, y), ha='center',
                     va='center', xytext=(0, 10),
                     weight="bold", fontsize=7, textcoords='offset points')
            data2.append([x, y])

        figure.tight_layout()
        canvas.draw()

    #Puts label inside stacked bar
    def put_label_stacked_bar(self, ax,fontsize):
        #patches is everything inside of the chart
        for rect in ax.patches:
            # Find where everything is located
            height = rect.get_height()
            width = rect.get_width()
            x = rect.get_x()
            y = rect.get_y()
    
            # The height of the bar is the data value and can be used as the label
            label_text = f'{height:.0f}'  
    
            # ax.text(x, y, text)
            label_x = x + width / 2
            label_y = y + height / 2

            # plots only when height is greater than specified value
            if height > 0:
                ax.text(label_x, label_y, label_text, \
                    ha='center', va='center', \
                    weight = "bold",fontsize=fontsize)
    
    #Plots one variable against another variable
    def dist_one_vs_another_plot(self, df, cat1, cat2, figure, canvas, title):
        figure.clear()
        plot1 = figure.add_subplot(1,1,1)

        group_by_stat = df.groupby([cat1, cat2]).size()
        stacked_data = group_by_stat.unstack()
        #One color per stacked segment, i.e. per category of cat2
        colors = sns.color_palette("Set2", len(stacked_data.columns))
        stacked_data.plot(kind='bar', stacked=True, ax=plot1, grid=True, color=colors)
        plot1.set_title(title, fontsize=12)
        plot1.set_ylabel('Number of Cases', fontsize=10)
        plot1.set_xlabel(cat1, fontsize=10)
        self.put_label_stacked_bar(plot1,7)
        # Set font for tick labels
        plot1.tick_params(axis='both', which='major', labelsize=8)
        plot1.tick_params(axis='both', which='minor', labelsize=8)    
        plot1.legend(fontsize=8)    
        figure.tight_layout()
        canvas.draw()

    def box_plot(self, df, x, y, hue, figure, canvas, title):
        figure.clear()
        plot1 = figure.add_subplot(1,1,1)

        #Creates boxplot of y against x, grouped by hue
        sns.boxplot(data = df, x = x, y = y, hue = hue, ax=plot1)
        plot1.set_title(title, fontsize=14)
        plot1.set_xlabel(x, fontsize=10)
        plot1.set_ylabel(y, fontsize=10)
        figure.tight_layout()
        canvas.draw()

    def choose_plot(self, df1, df2, chosen, figure1, canvas1, figure2, canvas2):
        print(chosen)
        if chosen == "Marital Status":
            self.plot_piechart(df2, "Marital_Status", figure1, canvas1)

        elif chosen == "Education":
            self.plot_piechart(df2, "Education", figure2, canvas2)

        elif chosen == "Country":
            self.plot_piechart(df2, "Country", figure1, canvas1)            

        elif chosen == "Age Group":
            self.plot_piechart(df2, "AgeGroup", figure2, canvas2)              

        elif chosen == "Education with Response 0":
            self.plot_piechart(df2[df2.Response==0], "Education", figure1, canvas1, " with Response 0")

        elif chosen == "Education with Response 1":
            self.plot_piechart(df2[df2.Response==1], "Education", figure2, canvas2, " with Response 1")

        elif chosen == "Country with Response 0":
            self.plot_piechart(df2[df2.Response==0], "Country", figure1, canvas1, " with Response 0")

        elif chosen == "Country with Response 1":
            self.plot_piechart(df2[df2.Response==1], "Country", figure2, canvas2, " with Response 1")       

        elif chosen == "Income":
            self.another_versus_response(df1, "Income", 32, figure1, canvas1) 

        elif chosen == "Mount of Wines":
            self.another_versus_response(df1, "MntWines", 32, figure2, canvas2) 

        elif chosen == "Customer Age":
            self.another_versus_response(df1, "Customer_Age", 32, figure1, canvas1) 

        elif chosen == "Education versus Response":
            self.dist_one_vs_another_plot(df2, "Education", "Response", figure2, canvas2, chosen) 

        elif chosen == "Age Group versus Response":
            self.dist_one_vs_another_plot(df2, "AgeGroup", "Response", figure1, canvas1, chosen)

        elif chosen == "Marital Status versus Response":
            self.dist_one_vs_another_plot(df2, "Marital_Status", "Response", figure2, canvas2, chosen)            

        elif chosen == "Country versus Response":
            self.dist_one_vs_another_plot(df2, "Country", "Response", figure1, canvas1, chosen)              

        elif chosen == "Number of Dependants versus Response":
            self.dist_one_vs_another_plot(df2, "Num_Dependants", "Response", figure2, canvas2, chosen) 

        elif chosen == "Country versus Customer Age Per Education":
            self.box_plot(df1, "Country", "Customer_Age", "Education", figure1, canvas1, chosen)

        elif chosen == "Num_TotalPurchases versus Education Per Marital Status":
            self.box_plot(df1, "Education", "Num_TotalPurchases", "Marital_Status", figure2, canvas2, chosen)

    def choose_category(self, df, chosen, figure1, canvas1, figure2, canvas2):
        # The labels are mutually exclusive, so an if/elif chain stops at the first match.
        if chosen == "Categorized Income versus Response":
            self.dist_one_vs_another_plot(df, "Income", "Response", figure1, canvas1, chosen)

        elif chosen == "Categorized Total Purchase versus Categorized Income":
            self.dist_one_vs_another_plot(df, "Num_TotalPurchases", "Income", figure2, canvas2, chosen)

        elif chosen == "Categorized Recency versus Categorized Total Purchase":
            self.dist_one_vs_another_plot(df, "Recency", "Num_TotalPurchases", figure1, canvas1, chosen)

        elif chosen == "Categorized Customer Month versus Categorized Customer Age":
            self.dist_one_vs_another_plot(df, "Dt_Customer_Month", "Customer_Age", figure2, canvas2, chosen)

        elif chosen == "Categorized Mount of Gold Products versus Categorized Income":
            self.dist_one_vs_another_plot(df, "MntGoldProds", "Income", figure1, canvas1, chosen)

        elif chosen == "Categorized Mount of Fish Products versus Categorized Total AmountSpent":
            self.dist_one_vs_another_plot(df, "MntFishProducts", "TotalAmount_Spent", figure2, canvas2, chosen)

        elif chosen == "Categorized Mount of Meat Products versus Categorized Recency":
            self.dist_one_vs_another_plot(df, "MntMeatProducts", "Recency", figure1, canvas1, chosen)

    def plot_corr_mat(self, df, figure, canvas):
        figure.clear()    
        plot1 = figure.add_subplot(1,1,1)  
        categorical_columns = df.select_dtypes(include=['object', 'category']).columns 
        df_removed = df.drop(columns=categorical_columns) 
        corrdata = df_removed.corr()

        annot_kws = {"size": 5}
        # linewidths is the documented seaborn.heatmap parameter for the cell borders
        sns.heatmap(corrdata, ax = plot1, linewidths=1, annot=True, cmap="Reds", annot_kws=annot_kws)
        plot1.set_title('Correlation Matrix', fontweight ="bold",fontsize=14)

        # Set font for x and y labels
        plot1.set_xlabel('Features', fontweight="bold", fontsize=12)
        plot1.set_ylabel('Features', fontweight="bold", fontsize=12)

        # Set font for tick labels
        plot1.tick_params(axis='both', which='major', labelsize=5)
        plot1.tick_params(axis='both', which='minor', labelsize=5)

        figure.tight_layout()
        canvas.draw()

    def plot_rf_importance(self, X, y, figure, canvas):
        result_rf = self.obj_data.feat_importance_rf(X, y)
        figure.clear()    
        plot1 = figure.add_subplot(1,1,1)  
        sns.set_color_codes("pastel")
        sns.barplot(x = 'Values',y = 'Features', data=result_rf, color="Blue", ax=plot1)
        plot1.set_title('Random Forest Features Importance', fontweight ="bold",fontsize=14)

        plot1.set_xlabel('Features Importance',  fontsize=10) 
        plot1.set_ylabel('Feature Labels',  fontsize=10) 
        # Set font for tick labels
        plot1.tick_params(axis='both', which='major', labelsize=5)
        plot1.tick_params(axis='both', which='minor', labelsize=5)
        figure.tight_layout()
        canvas.draw()

    def plot_et_importance(self, X, y, figure, canvas):
        # Extra Trees result, named result_et to avoid confusion with the Random Forest one
        result_et = self.obj_data.feat_importance_et(X, y)
        figure.clear()
        plot1 = figure.add_subplot(1,1,1)
        sns.set_color_codes("pastel")
        sns.barplot(x = 'Values',y = 'Features', data=result_et, color="Red", ax=plot1)
        plot1.set_title('Extra Trees Features Importance', fontweight ="bold",fontsize=14)

        plot1.set_xlabel('Features Importance',  fontsize=10) 
        plot1.set_ylabel('Feature Labels',  fontsize=10) 
        # Set font for tick labels
        plot1.tick_params(axis='both', which='major', labelsize=5)
        plot1.tick_params(axis='both', which='minor', labelsize=5)
        figure.tight_layout()
        canvas.draw()        

    def plot_rfe_importance(self, X, y, figure, canvas):
        result_lg = self.obj_data.feat_importance_rfe(X, y)
        figure.clear()    
        plot1 = figure.add_subplot(1,1,1)  
        sns.set_color_codes("pastel")
        sns.barplot(x = 'Ranking',y = 'Features', data=result_lg, color="orange", ax=plot1)
        plot1.set_title('RFE Features Importance', fontweight ="bold",fontsize=14)

        # The x-axis shows the 'Ranking' column, not an importance score
        plot1.set_xlabel('Ranking',  fontsize=10)
        plot1.set_ylabel('Feature Labels',  fontsize=10)
        # Set font for tick labels
        plot1.tick_params(axis='both', which='major', labelsize=5)
        plot1.tick_params(axis='both', which='minor', labelsize=5)
        figure.tight_layout()
        canvas.draw()   

    def choose_plot_more(self, df, chosen, X, y, figure1, canvas1, figure2, canvas2):
        # Only one label can match, so an if/elif chain stops at the first hit.
        if chosen == "Correlation Matrix":
            self.plot_corr_mat(df, figure1, canvas1)

        elif chosen == "RF Features Importance":
            self.plot_rf_importance(X, y, figure2, canvas2)

        elif chosen == "ET Features Importance":
            self.plot_et_importance(X, y, figure1, canvas1)

        elif chosen == "RFE Features Importance":
            self.plot_rfe_importance(X, y, figure1, canvas1)
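
The long if/elif chains in the choose_* methods above could also be written as a lookup table that maps each menu label to a ready-to-call action. The following is only a minimal sketch of that idea, not the code used in this project: RecordingPlotter, build_dispatch, and choose are hypothetical names introduced for illustration, and the stand-in class merely records calls instead of drawing anything.

```python
from functools import partial


class RecordingPlotter:
    """Hypothetical stand-in for the plotting class; it only records calls."""

    def __init__(self):
        self.calls = []

    def plot_piechart(self, df, column, figure, canvas, suffix=""):
        self.calls.append(("pie", column, figure, suffix))

    def dist_one_vs_another_plot(self, df, col1, col2, figure, canvas, title):
        self.calls.append(("dist", col1, col2, figure))


def build_dispatch(plotter, df, figure1, canvas1, figure2, canvas2):
    """Return a dict mapping each menu label to a zero-argument plot action."""
    return {
        "Education": partial(plotter.plot_piechart, df, "Education", figure2, canvas2),
        "Country": partial(plotter.plot_piechart, df, "Country", figure1, canvas1),
        "Age Group": partial(plotter.plot_piechart, df, "AgeGroup", figure2, canvas2),
        "Country versus Response": partial(
            plotter.dist_one_vs_another_plot,
            df, "Country", "Response", figure1, canvas1, "Country versus Response",
        ),
    }


def choose(dispatch, chosen):
    """Run the action for `chosen`; return False when the label is unknown."""
    action = dispatch.get(chosen)
    if action is None:
        return False
    action()
    return True
```

With this layout, adding a new menu entry means adding one dictionary item, and an unknown label is reported through the return value instead of being silently ignored.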

Thursday, August 31, 2023

© Copyright (2017),VIVIAN SIAHAAN,All Rights Reserved.
