If you find our source code useful, please support us by subscribing to our channel.
This source code is divided into four separate classes, each with its own purpose. The code preprocesses the data to prepare it for machine learning and deep learning algorithms (which will be provided in part 2).
The code displays the distribution of each feature and of the categorized features. It also displays the correlation matrix and the feature importances.
BALIGE ACADEMY TEAM
VIVIAN SIAHAAN
RISMON HASIHOLAN SIANIPAR
HORAS!!!
EXPLANATION:
The class named Design_Window is intended for designing and setting up a graphical user interface (GUI) using the Tkinter library. The class consists of methods to add various widgets, labels, buttons, canvases, listboxes, and comboboxes to the GUI.
- The add_widgets() method adds different types of widgets to the root window. It calls other methods to add buttons, canvases, labels, listboxes, and comboboxes.
- The add_buttons() method adds a button labeled "LOAD DATASET" to the GUI. It is positioned at row 0, column 0, with a height of 2 and a width of 30.
- The add_labels() method adds six labels to the GUI, one above each selection widget. The labels are placed in rows 1, 3, 5, 7, 9, and 11 of column 0, and each has its own text color.
- The add_canvas() method adds two canvases (canvas1 and canvas2) to the GUI. These canvases display visual outputs such as plots. Each canvas is a FigureCanvasTkAgg object that wraps a Matplotlib Figure; canvas1 is placed in column 1 and canvas2 in column 2.
- The add_listboxes() method adds a listbox widget to the GUI. It lets users select one item from a predefined list of plot names, such as "Marital Status", "Education", and "Country versus Response".
- The add_comboboxes() method adds five combobox widgets to the GUI. Each combobox lets users select a predefined option from a dropdown menu: combo1 lists the categorized plots, combo2 the correlation matrix and feature importance plots, combo3 the regressors, combo4 the machine learning models, and combo5 the deep learning models.
Overall, the Design_Window class is aimed at creating a user-friendly GUI for interactive data analysis and visualization, along with the selection of different algorithms and models for further analysis.
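The way add_canvas() builds its plotting surfaces can be tried without opening a window. The following is a minimal sketch, not the authors' code: it swaps FigureCanvasTkAgg for Matplotlib's off-screen FigureCanvasAgg so it runs without a display, and the helper name make_figure() and the placeholder bar chart are assumptions made for illustration.

```python
from matplotlib.figure import Figure
from matplotlib.backends.backend_agg import FigureCanvasAgg


def make_figure(width=6.2, height=6.0, dpi=100):
    """Create a Figure with a light gray background, without using pyplot."""
    figure = Figure(figsize=(width, height), dpi=dpi)
    figure.patch.set_facecolor("lightgray")
    # Assumption: FigureCanvasAgg stands in for
    # FigureCanvasTkAgg(figure, master=root), which needs a Tk root window.
    canvas = FigureCanvasAgg(figure)
    return figure, canvas


figure1, canvas1 = make_figure()
ax = figure1.add_subplot(1, 1, 1)
ax.bar(["A", "B", "C"], [3, 5, 2])   # placeholder data
ax.set_title("Placeholder plot")
canvas1.draw()                        # renders the figure into the canvas buffer
```

In the GUI itself, the same Figure object is handed to FigureCanvasTkAgg, and the widget returned by get_tk_widget() is what gets placed on the grid.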
The code also defines a class called Process_Data, which contains methods to prepare a dataset for machine learning tasks. This class focuses on steps like data preprocessing, feature engineering, categorical feature encoding, and calculating feature importance.
First, the read_dataset method reads a CSV file named "marketing_data.csv" and drops the 'ID' column, returning the cleaned DataFrame. Next, the preprocess method carries out preprocessing steps such as renaming columns, handling date formats, dealing with missing 'Income' values, and generating new features like 'Customer_Age', 'Num_Dependants', 'Num_TotalPurchases', and 'TotalAmount_Spent'.
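The preprocessing steps described above can be followed on a tiny table. This is a sketch under stated assumptions: the three rows below are made up for illustration and only imitate the column names of marketing_data.csv; they are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical three-row sample imitating the columns described above
raw = pd.DataFrame({
    "ID": [1, 2, 3],
    "Year_Birth": [1970, 1985, 1990],
    " Income ": ["$84,835.00", None, "$20,000.00"],
    "Kidhome": [0, 1, 2],
    "Teenhome": [1, 0, 0],
    "Dt_Customer": ["6/16/14", "5/13/13", "1/2/12"],
})

# Drop the identifier and fix the padded column name
df = raw.drop(columns="ID").rename(columns={" Income ": "Income"})

# Strip currency formatting; regex=False treats "$" as a literal character
df["Income"] = (
    df["Income"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

# Impute missing incomes with the median of the known values
df["Income"] = df["Income"].fillna(df["Income"].median())

# Parse the enrollment date and derive the new features
df["Dt_Customer"] = pd.to_datetime(df["Dt_Customer"], format="%m/%d/%y")
df["Customer_Age"] = df["Dt_Customer"].dt.year - df["Year_Birth"]
df["Num_Dependants"] = df["Kidhome"] + df["Teenhome"]

print(df[["Income", "Customer_Age", "Num_Dependants"]])
```

Note that Customer_Age here is the age at enrollment (enrollment year minus birth year), which is how the source code defines it.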
The categorize method categorizes numerical features into predefined intervals to simplify data representation and analysis. The extract_cat_num_cols() method identifies categorical and numerical columns based on their data types, aiding in subsequent processing.
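As a small illustration of the binning and column-splitting ideas, the sketch below bins a made-up Income column with pd.cut and then separates numerical from non-numerical columns. The sample values and the Income_Binned column name are assumptions; the source overwrites the original column instead, and it compares dtypes directly, whereas this sketch uses select_dtypes as a variant.

```python
import pandas as pd

# Hypothetical sample; not taken from the real dataset
df = pd.DataFrame({
    "Income": [15000.0, 45000.0, 90000.0],
    "Education": ["PhD", "Master", "PhD"],
})

# Bin edges and labels follow the Income categories used in categorize()
labels = ["0-20k", "20k-30k", "30k-50k", "50k-70k", "70k-700k"]
edges = [0, 20000, 30000, 50000, 70000, 700000]
df["Income_Binned"] = pd.cut(df["Income"], edges, labels=labels)

# Variant of extract_cat_num_cols(): numeric columns first, the rest second
num_cols = df.select_dtypes(include="number").columns.tolist()
cat_cols = [c for c in df.columns if c not in num_cols]

print(df)
print("categorical:", cat_cols)
print("numerical:", num_cols)
```

By default pd.cut builds right-closed intervals, so a value equal to the first edge (0 here) falls outside every bin and becomes NaN unless include_lowest=True is passed.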
The encode_categorical_feats() method uses label encoding to convert categorical features into numerical values. The extract_input_output_vars() method separates input features (X) and the target variable (y), excluding specific columns for modeling. Lastly, the feat_importance_rf(), feat_importance_et(), and feat_importance_rfe() methods assess feature importance using different algorithms and provide ranked lists of features.
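The encoding and ranking steps can be sketched on synthetic data. Everything below is an assumption made for illustration: the data is random, the target is deliberately built from Income alone so that the expected ranking is obvious, and n_estimators and random_state are fixed only to make the run repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the marketing data
df = pd.DataFrame({
    "Education": rng.choice(["Basic", "Master", "PhD"], size=n),
    "Income": rng.normal(50000, 15000, size=n),
    "Recency": rng.integers(0, 100, size=n),
})
# Synthetic target that depends on Income only
df["Response"] = (df["Income"] > 50000).astype(int)

# Label encoding: each distinct string becomes an integer code
encoder = LabelEncoder()
df["Education"] = encoder.fit_transform(df["Education"].astype(str))

# Separate inputs and target, then rank features by importance
X = df.drop(columns="Response")
y = df["Response"].values
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

ranking = (
    pd.DataFrame({"Features": X.columns, "Values": rf.feature_importances_})
    .sort_values("Values", ascending=False)
    .reset_index(drop=True)
)
print(ranking)
```

Label encoding imposes an arbitrary ordering on the categories (alphabetical in LabelEncoder), which tree-based models tolerate better than linear ones.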
Overall, the Process_Data class streamlines data preparation for machine learning, encompassing tasks like preprocessing, categorization, encoding, and importance evaluation.
The code also defines a class called Helper_Plot designed to assist in creating various types of plots for data visualization. This class contains methods for generating different types of plots, including pie charts, histograms, bar plots, stacked bar plots, box plots, and correlation matrices.
One of the methods, named plot_piechart(), creates a pair of subplots: a pie chart and a horizontal bar plot. These plots help visualize the distribution and count of a categorical variable within the dataset. Another method, another_versus_response(), generates two histograms to compare the distribution of a numerical feature for responsive and non-responsive cases.
The class also provides methods for labeling segments in stacked bar plots (put_label_stacked_bar()), creating stacked bar plots for comparing two categorical variables (dist_one_vs_another_plot()), and generating box plots to show the distribution of a numerical variable against two categorical variables (box_plot()). The choose_plot() method enables users to select and create different plots based on their preferences. Similarly, the choose_category() method is tailored for exploring relationships between categorized data, particularly connections between categorical variables and the response variable.
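The table behind a stacked bar plot such as "Education versus Response" is a two-level count. The sketch below shows that step on six made-up rows; the data is hypothetical, and fill_value=0 is an addition here (the source calls unstack() without it, which leaves NaN for missing combinations).

```python
import pandas as pd

# Hypothetical sample; not taken from the real dataset
df = pd.DataFrame({
    "Education": ["PhD", "PhD", "Master", "Master", "Master", "Basic"],
    "Response": [0, 1, 0, 0, 1, 0],
})

# Count rows per (Education, Response) pair, then pivot Response into columns
counts = df.groupby(["Education", "Response"]).size().unstack(fill_value=0)
print(counts)

# Each row of `counts` becomes one bar and each column one stacked segment:
# counts.plot(kind="bar", stacked=True, ax=some_axes)
```

The labels that put_label_stacked_bar() writes inside each segment are these same counts, placed at the center of every rectangle.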
Moreover, the class includes methods for generating a heatmap to visualize the correlation matrix of numerical features (plot_corr_mat()), and for creating bar plots to display feature importances obtained from Random Forest, Extra Trees, and Recursive Feature Elimination (RFE) methods (plot_rf_importance(), plot_et_importance(), plot_rfe_importance()).
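The matrix that the heatmap displays comes from pandas itself. A minimal sketch with made-up values, chosen so that the correlations are easy to check by eye:

```python
import pandas as pd

# Hypothetical sample; not taken from the real dataset
df = pd.DataFrame({
    "Income": [20000.0, 40000.0, 60000.0, 80000.0],
    "MntWines": [10, 20, 30, 40],
    "Recency": [4, 3, 2, 1],
    "Country": ["SP", "CA", "US", "SP"],
})

# Keep numeric columns only, then compute pairwise Pearson correlations
numeric = df.select_dtypes(include="number")
corr = numeric.corr()
print(corr)
```

Here Income and MntWines rise together (correlation 1), while Recency falls as Income rises (correlation -1). Passing corr to seaborn's heatmap() with annot=True is what produces the annotated grid.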
In summary, the Helper_Plot class offers a range of functions to simplify data visualization tasks, aiding in the understanding and interpretation of relationships within the dataset.
Lastly, the code defines a Main_Class that acts as the main interface for a data visualization application. This class interacts with various other components to read and preprocess data, generate visualizations, and display them using the Tkinter library.
Upon initialization, the Main_Class sets up the Tkinter root window and dimensions. It creates instances of other classes (Design_Window, Process_Data, Helper_Plot) to manage the user interface, data processing, and visualization tasks.
The initialize method carries out a series of steps:
- Reads and preprocesses the dataset using the Process_Data class.
- Categorizes the dataset into bins using the categorize method of Process_Data.
- Extracts categorical and numerical column names from the dataset.
- Encodes categorical features and extracts input-output variables using methods from the Process_Data class.
The place_widgets() method arranges the widgets within the main window and associates a function with the "LOAD DATASET" button to display the dataset as a table when clicked.
Event binding is set up in the binds_event() method. The choose_list_widget(), choose_combobox1(), and choose_combobox2() methods are called when the user interacts with the listbox and comboboxes, respectively. These methods delegate visualization tasks to the Helper_Plot class based on the user's selection.
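The delegation step maps the selected text to a plotting routine and a target canvas. The source does this with a chain of if statements; the sketch below shows a dictionary-based alternative, which is not the authors' implementation. The handler functions are hypothetical stand-ins that return a string instead of drawing, so the example runs without a GUI.

```python
# Hypothetical stand-ins for the Helper_Plot methods
def plot_corr_mat(target):
    return f"correlation matrix on {target}"


def plot_rf_importance(target):
    return f"random forest importance on {target}"


# Selection text -> (handler, canvas name)
HANDLERS = {
    "Correlation Matrix": (plot_corr_mat, "canvas1"),
    "RF Features Importance": (plot_rf_importance, "canvas2"),
}


def choose_plot_more(chosen):
    """Dispatch a combobox selection to its handler."""
    entry = HANDLERS.get(chosen)
    if entry is None:
        return None  # unknown selection: do nothing, like the if chain
    handler, target = entry
    return handler(target)


print(choose_plot_more("Correlation Matrix"))
```

With a table like this, adding a new plot means adding one dictionary entry rather than another branch.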
The shows_table() function opens a new window to display the dataset as a table using the pandastable library.
Finally, the script checks if the module is run directly and initializes the application by creating an instance of Main_Class and starting the Tkinter event loop with root.mainloop().
In essence, the Main_Class integrates various components to provide an interactive data visualization application where users can explore and analyze the dataset using a graphical interface.
FULL SOURCE CODE:
#main_class.py
import tkinter as tk
from tkinter import *
from pandastable import Table
from design_window import Design_Window
from process_data import Process_Data
from helper_plot import Helper_Plot

class Main_Class:
    def __init__(self, root):
        #Stores the root window on the instance instead of relying on a global
        self.root = root
        self.initialize()

    def initialize(self):
        lebar = 1500    #window width
        tinggi = 650    #window height
        self.root.geometry(f"{lebar}x{tinggi}")
        self.root.title("TKINTER AND DATA SCIENCE")

        #Creates necessary objects
        self.obj_window = Design_Window()
        self.obj_data = Process_Data()
        self.obj_plot = Helper_Plot()

        #Reads dataset
        self.df = self.obj_data.preprocess()

        #Categorize dataset
        self.df_dummy = self.obj_data.categorize(self.df)

        #Extracts input and output variables
        self.cat_cols, self.num_cols = self.obj_data.extract_cat_num_cols(self.df)
        self.df_final = self.obj_data.encode_categorical_feats(self.df, self.cat_cols)
        self.X, self.y = self.obj_data.extract_input_output_vars(self.df_final)

        #Places widgets in root
        self.place_widgets()

        #Binds event
        self.binds_event()

    def binds_event(self):
        #Binds listbox to a function
        self.obj_window.listbox.bind("<<ListboxSelect>>", self.choose_list_widget)

        #Binds combobox1 to a function
        self.obj_window.combo1.bind("<<ComboboxSelected>>", self.choose_combobox1)

        #Binds combobox2 to a function
        self.obj_window.combo2.bind("<<ComboboxSelected>>", self.choose_combobox2)

    def place_widgets(self):
        self.obj_window.add_widgets(self.root)

        #Shows table if user clicks LOAD DATASET
        self.obj_window.tombol.config(command=self.shows_table)

    def shows_table(self):
        frame = Toplevel(self.root) #new window
        self.table = Table(frame, dataframe=self.df, showtoolbar=True, showstatusbar=True)

        #Sets dimension of Toplevel
        frame.geometry("1300x500")
        self.table.show()

    def choose_list_widget(self, event):
        #The event can fire with an empty selection, so guards against it
        selection = self.obj_window.listbox.curselection()
        if not selection:
            return
        chosen = self.obj_window.listbox.get(selection)
        print(chosen)
        self.obj_plot.choose_plot(self.df, self.df_dummy, chosen,
            self.obj_window.figure1, self.obj_window.canvas1,
            self.obj_window.figure2, self.obj_window.canvas2)

    def choose_combobox1(self, event):
        chosen = self.obj_window.combo1.get()
        self.obj_plot.choose_category(self.df_dummy, chosen,
            self.obj_window.figure1, self.obj_window.canvas1,
            self.obj_window.figure2, self.obj_window.canvas2)

    def choose_combobox2(self, event):
        chosen = self.obj_window.combo2.get()
        self.obj_plot.choose_plot_more(self.df_final, chosen,
            self.X, self.y,
            self.obj_window.figure1, self.obj_window.canvas1,
            self.obj_window.figure2, self.obj_window.canvas2)

if __name__ == "__main__":
    root = tk.Tk()
    app = Main_Class(root)
    root.mainloop()

#design_window.py
import tkinter as tk
from tkinter import ttk
from matplotlib.figure import Figure
from matplotlib.backends.backend_tkagg import FigureCanvasTkAgg

class Design_Window:
    def add_widgets(self, root):
        #Adds button(s)
        self.add_buttons(root)

        #Adds canvasses
        self.add_canvas(root)

        #Adds labels
        self.add_labels(root)

        #Adds listbox widget
        self.add_listboxes(root)

        #Adds combobox widget
        self.add_comboboxes(root)

    def add_buttons(self, root):
        #Adds button
        self.tombol = tk.Button(root, height=2, width=30, text="LOAD DATASET")
        self.tombol.grid(row=0, column=0, padx=5, pady=5, sticky="w")

    def add_labels(self, root):
        #Adds labels (each label gets its own attribute name)
        self.label1 = tk.Label(root, text = "CHOOSE PLOT", fg = "red")
        self.label1.grid(row=1, column=0, padx=5, pady=1, sticky="w")

        self.label2 = tk.Label(root, text = "CHOOSE CATEGORIZED PLOT", fg = "blue")
        self.label2.grid(row=3, column=0, padx=5, pady=1, sticky="w")

        self.label3 = tk.Label(root, text = "CHOOSE FEATURES", fg = "black")
        self.label3.grid(row=5, column=0, padx=5, pady=1, sticky="w")

        self.label4 = tk.Label(root, text = "CHOOSE REGRESSORS", fg = "green")
        self.label4.grid(row=7, column=0, padx=5, pady=1, sticky="w")

        self.label5 = tk.Label(root, text = "CHOOSE MACHINE LEARNING", fg = "blue")
        self.label5.grid(row=9, column=0, padx=5, pady=1, sticky="w")

        self.label6 = tk.Label(root, text = "CHOOSE DEEP LEARNING", fg = "red")
        self.label6.grid(row=11, column=0, padx=5, pady=1, sticky="w")

    def add_canvas(self, root):
        #Adds canvas1 widget to root to display results
        self.figure1 = Figure(figsize=(6.2, 6), dpi=100)
        self.figure1.patch.set_facecolor("lightgray")
        self.canvas1 = FigureCanvasTkAgg(self.figure1, master=root)
        self.canvas1.get_tk_widget().grid(row=0, column=1, columnspan=1,
            rowspan=25, padx=5, pady=5, sticky="n")

        #Adds canvas2 widget to root to display results
        self.figure2 = Figure(figsize=(6.2, 6), dpi=100)
        self.figure2.patch.set_facecolor("lightgray")
        self.canvas2 = FigureCanvasTkAgg(self.figure2, master=root)
        self.canvas2.get_tk_widget().grid(row=0, column=2, columnspan=1,
            rowspan=25, padx=5, pady=5, sticky="n")

    def add_listboxes(self, root):
        #Adds listbox widget
        self.listbox = tk.Listbox(root, selectmode=tk.SINGLE, width=35)
        self.listbox.grid(row=2, column=0, sticky='n', padx=5, pady=1)

        #Inserts items into the listbox widget
        items = ["Marital Status", "Education", "Country",
                 "Age Group", "Education with Response 0",
                 "Education with Response 1",
                 "Country with Response 0", "Country with Response 1",
                 "Customer Age", "Income", "Mount of Wines",
                 "Education versus Response",
                 "Age Group versus Response",
                 "Marital Status versus Response",
                 "Country versus Response",
                 "Number of Dependants versus Response",
                 "Country versus Customer Age Per Education",
                 "Num_TotalPurchases versus Education Per Marital Status"]
        for item in items:
            self.listbox.insert(tk.END, item)

        self.listbox.config(height=len(items))

    def add_comboboxes(self, root):
        #Create ComboBoxes
        self.combo1 = ttk.Combobox(root, width=32)
        self.combo1["values"] = ["Categorized Income versus Response",
            "Categorized Total Purchase versus Categorized Income",
            "Categorized Recency versus Categorized Total Purchase",
            "Categorized Customer Month versus Categorized Customer Age",
            "Categorized Mount of Gold Products versus Categorized Income",
            "Categorized Mount of Fish Products versus Categorized Total AmountSpent",
            "Categorized Mount of Meat Products versus Categorized Recency",
            "Distribution of Numerical Columns"]
        self.combo1.grid(row=4, column=0, padx=5, pady=1, sticky="n")

        self.combo2 = ttk.Combobox(root, width=32)
        self.combo2["values"] = ["Correlation Matrix", "RF Features Importance",
            "ET Features Importance", "RFE Features Importance"]
        self.combo2.grid(row=6, column=0, padx=5, pady=1, sticky="n")

        self.combo3 = ttk.Combobox(root, width=32)
        self.combo3["values"] = ["Linear Regression", "RF Regression",
            "Decision Trees Regression", "KNN Regression",
            "AdaBoost Regression", "Gradient Boosting Regression",
            "XGB Regression", "LGB Regression", "CatBoost Regression",
            "SVR Regression", "Lasso Regression", "Ridge Regression"]
        self.combo3.grid(row=8, column=0, padx=5, pady=1, sticky="n")

        self.combo4 = ttk.Combobox(root, width=32)
        self.combo4["values"] = ["Logistic Regression", "Random Forest",
            "Decision Trees", "KNN", "AdaBoost", "Gradient Boosting",
            "Extreme Gradient Boosting", "Light Gradient Boosting",
            "Multi-Layer Perceptron", "Support Vector Machine"]
        self.combo4.grid(row=10, column=0, padx=5, pady=1, sticky="n")

        self.combo5 = ttk.Combobox(root, width=32)
        self.combo5["values"] = ["Long-Short Term", "Convolutional NN",
            "Recurrent NN", "Feed-Forward NN", "Artificial NN"]
        self.combo5.grid(row=12, column=0, padx=5, pady=1, sticky="n")

#process_data.py
import os
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

class Process_Data:
    def read_dataset(self):
        #Reads dataset
        curr_path = os.getcwd()
        df = pd.read_csv(curr_path+"/marketing_data.csv")

        #Drops ID column
        df = df.drop("ID", axis = 1)

        return df

    def preprocess(self):
        df = self.read_dataset()

        #Renames column name and corrects data type
        df.rename(columns={' Income ':'Income'},inplace=True)
        df["Dt_Customer"] = pd.to_datetime(df["Dt_Customer"], format='%m/%d/%y')
        #regex=False treats "$" as a literal character, not as a regex anchor
        df["Income"] = df["Income"].str.replace("$","", regex=False).str.replace(",","", regex=False)
        df["Income"] = df["Income"].astype(float)

        #Checks null values
        print(df.isnull().sum())
        print('Total number of null values: ', df.isnull().sum().sum())

        #Imputes Income column with median values
        df['Income'] = df['Income'].fillna(df['Income'].median())
        print(f'Number of Null values in "Income" after Imputation: {df["Income"].isna().sum()}')

        #Transforms Dt_Customer
        df['Dt_Customer'] = pd.to_datetime(df['Dt_Customer'])
        print(f'After Transformation:\n{df["Dt_Customer"].head()}')
        df['Customer_Age'] = df['Dt_Customer'].dt.year - df['Year_Birth']

        #Creates number of children/dependents in home by adding 'Kidhome' and 'Teenhome' features
        #Creates number of Total_Purchases by adding all the purchases features
        #Creates TotalAmount_Spent by adding all the Mnt* features
        df['Dt_Customer_Month'] = df['Dt_Customer'].dt.month
        df['Dt_Customer_Year'] = df['Dt_Customer'].dt.year
        df['Num_Dependants'] = df['Kidhome'] + df['Teenhome']

        purchase_features = [c for c in df.columns if 'Purchase' in str(c)]
        #Removes 'NumDealsPurchases' from the list above
        purchase_features.remove('NumDealsPurchases')
        df['Num_TotalPurchases'] = df[purchase_features].sum(axis = 1)

        amt_spent_features = [c for c in df.columns if 'Mnt' in str(c)]
        df['TotalAmount_Spent'] = df[amt_spent_features].sum(axis = 1)

        #Creates a categorical feature using the customer's age by binning them,
        #to help understanding purchasing behaviour
        print(f'Min. Customer Age: {df["Customer_Age"].min()}')
        print(f'Max. Customer Age: {df["Customer_Age"].max()}')
        df['AgeGroup'] = pd.cut(df['Customer_Age'], bins = [6, 24, 29, 40, 56, 75],
            labels = ['Gen-Z', 'Gen-Y.1', 'Gen-Y.2', 'Gen-X', 'BBoomers'])

        return df

    def categorize(self, df):
        #Creates a dummy dataframe for visualization
        df_dummy=df.copy()

        #Categorizes Income feature
        labels = ['0-20k', '20k-30k', '30k-50k','50k-70k','70k-700k']
        df_dummy['Income'] = pd.cut(df_dummy['Income'],
            [0, 20000, 30000, 50000, 70000, 700000], labels=labels)

        #Categorizes TotalAmount_Spent feature
        labels = ['0-200', '200-500', '500-800','800-1000','1000-3000']
        df_dummy['TotalAmount_Spent'] = pd.cut(df_dummy['TotalAmount_Spent'],
            [0, 200, 500, 800, 1000, 3000], labels=labels)

        #Categorizes Num_TotalPurchases feature
        labels = ['0-5', '5-10', '10-15','15-25','25-35']
        df_dummy['Num_TotalPurchases'] = pd.cut(df_dummy['Num_TotalPurchases'],
            [0, 5, 10, 15, 25, 35], labels=labels)

        #Categorizes Dt_Customer_Year feature
        labels = ['2012', '2013', '2014']
        df_dummy['Dt_Customer_Year'] = pd.cut(df_dummy['Dt_Customer_Year'],
            [0, 2012, 2013, 2014], labels=labels)

        #Categorizes Dt_Customer_Month feature
        labels = ['0-3', '3-6', '6-9','9-12']
        df_dummy['Dt_Customer_Month'] = pd.cut(df_dummy['Dt_Customer_Month'],
            [0, 3, 6, 9, 12], labels=labels)

        #Categorizes Customer_Age feature
        #(the fourth label read '40-60' in the original; the bin is 50-60)
        labels = ['0-30', '30-40', '40-50', '50-60','60-120']
        df_dummy['Customer_Age'] = pd.cut(df_dummy['Customer_Age'],
            [0, 30, 40, 50, 60, 120], labels=labels)

        #Categorizes MntGoldProds feature
        labels = ['0-30', '30-50', '50-80', '80-100','100-400']
        df_dummy['MntGoldProds'] = pd.cut(df_dummy['MntGoldProds'],
            [0, 30, 50, 80, 100, 400], labels=labels)

        #Categorizes MntSweetProducts feature
        labels = ['0-10', '10-20', '20-40', '40-100','100-300']
        df_dummy['MntSweetProducts'] = pd.cut(df_dummy['MntSweetProducts'],
            [0, 10, 20, 40, 100, 300], labels=labels)

        #Categorizes MntFishProducts feature
        labels = ['0-10', '10-20', '20-40', '40-100','100-300']
        df_dummy['MntFishProducts'] = pd.cut(df_dummy['MntFishProducts'],
            [0, 10, 20, 40, 100, 300], labels=labels)

        #Categorizes MntMeatProducts feature
        labels = ['0-50', '50-100', '100-200', '200-500','500-2000']
        df_dummy['MntMeatProducts'] = pd.cut(df_dummy['MntMeatProducts'],
            [0, 50, 100, 200, 500, 2000], labels=labels)

        #Categorizes MntFruits feature
        #(the second edge was 1 in the original; 10 matches the '0-10' label)
        labels = ['0-10', '10-30', '30-50', '50-100','100-200']
        df_dummy['MntFruits'] = pd.cut(df_dummy['MntFruits'],
            [0, 10, 30, 50, 100, 200], labels=labels)

        #Categorizes MntWines feature
        labels = ['0-100', '100-300', '300-500', '500-1000','1000-1500']
        df_dummy['MntWines'] = pd.cut(df_dummy['MntWines'],
            [0, 100, 300, 500, 1000, 1500], labels=labels)

        #Categorizes Recency feature
        labels = ['0-10', '10-30', '30-50', '50-80','80-100']
        df_dummy['Recency'] = pd.cut(df_dummy['Recency'],
            [0, 10, 30, 50, 80, 100], labels=labels)

        return df_dummy

    def extract_cat_num_cols(self, df):
        #Extracts categorical and numerical columns in dummy dataset
        cat_cols = [col for col in df.columns if
            (df[col].dtype == 'object') or (df[col].dtype.name == 'category')]
        num_cols = [col for col in df.columns if
            (df[col].dtype != 'object') and (df[col].dtype.name != 'category')]

        return cat_cols, num_cols

    def encode_categorical_feats(self, df, cat_cols):
        #Encodes categorical features in original dataset
        #Works on a copy so the caller's dataframe keeps its string values
        df = df.copy()
        print(f'Features that needs to be Label Encoded: \n{cat_cols}')
        for c in cat_cols:
            lbl = LabelEncoder()
            lbl.fit(list(df[c].astype(str).values))
            df[c] = lbl.transform(list(df[c].astype(str).values))
        print('Label Encoding done..')
        return df

    def extract_input_output_vars(self, df):
        #Extracts output and input variables
        y = df['Response'].values # Target for the model
        X = df.drop(['Dt_Customer', 'Year_Birth', 'Response'], axis = 1)

        return X, y

    def feat_importance_rf(self, X, y):
        rf = RandomForestClassifier()
        rf.fit(X, y)

        result_rf = pd.DataFrame()
        result_rf['Features'] = X.columns
        result_rf['Values'] = rf.feature_importances_
        result_rf.sort_values('Values', inplace = True, ascending = False)

        return result_rf

    def feat_importance_et(self, X, y):
        model = ExtraTreesClassifier()
        model.fit(X, y)

        result_et = pd.DataFrame()
        result_et['Features'] = X.columns
        result_et['Values'] = model.feature_importances_
        result_et.sort_values('Values', inplace=True, ascending =False)

        return result_et

    def feat_importance_rfe(self, X, y):
        model = LogisticRegression()
        #Creates the RFE model
        rfe = RFE(model)
        rfe = rfe.fit(X, y)

        result_lg = pd.DataFrame()
        result_lg['Features'] = X.columns
        result_lg['Ranking'] = rfe.ranking_
        result_lg.sort_values('Ranking', inplace=True , ascending = False)

        return result_lg

#helper_plot.py
import seaborn as sns
import numpy as np
from process_data import Process_Data

class Helper_Plot:
    def __init__(self):
        self.obj_data = Process_Data()

    #Defines function to create pie chart and bar plot as subplots
    def plot_piechart(self, df, var, figure, canvas, title=''):
        figure.clear()

        #Pie Chart (top subplot)
        plot1 = figure.add_subplot(2,1,1)
        label_list = list(df[var].value_counts().index)
        colors = sns.color_palette("deep", len(label_list))
        _, _, autopcts = plot1.pie(df[var].value_counts(), autopct="%1.1f%%",
            colors=colors, startangle=30, labels=label_list,
            wedgeprops={"linewidth": 2, "edgecolor": "white"}, #Add white edge
            shadow=True, textprops={'fontsize': 7})
        plot1.set_title("Distribution of " + var + " variable " + title, fontsize=10)

        #Bar Plot (bottom subplot)
        plot2 = figure.add_subplot(2,1,2)
        ax = df[var].value_counts().plot(kind="barh", color=colors, alpha=0.8, ax = plot2)
        for i, j in enumerate(df[var].value_counts().values):
            ax.text(.7, i, j, weight="bold", fontsize=7)

        plot2.set_title("Count of " + var + " cases " + title, fontsize=10)

        figure.tight_layout()
        canvas.draw()

    def another_versus_response(self, df, feat, num_bins, figure, canvas):
        figure.clear()
        plot1 = figure.add_subplot(2,1,1)

        colors = sns.color_palette("Set2")
        df[df['Response'] == 0][feat].plot(ax=plot1, kind='hist', bins=num_bins,
            edgecolor='black', color=colors[0])
        plot1.set_title('Not Responsive', fontsize=15)
        plot1.set_xlabel(feat, fontsize=10)
        plot1.set_ylabel('Count', fontsize=10)
        data1 = []
        for p in plot1.patches:
            x = p.get_x() + p.get_width() / 2.
            y = p.get_height()
            plot1.annotate(format(y, '.0f'), (x, y), ha='center',
                va='center', xytext=(0, 10),
                weight="bold", fontsize=7, textcoords='offset points')
            data1.append([x, y])

        plot2 = figure.add_subplot(2,1,2)
        df[df['Response'] == 1][feat].plot(ax=plot2, kind='hist', bins=num_bins,
            edgecolor='black', color=colors[1])
        plot2.set_title('Responsive', fontsize=15)
        plot2.set_xlabel(feat, fontsize=10)
        plot2.set_ylabel('Count', fontsize=10)
        data2 = []
        for p in plot2.patches:
            x = p.get_x() + p.get_width() / 2.
            y = p.get_height()
            plot2.annotate(format(y, '.0f'), (x, y), ha='center',
                va='center', xytext=(0, 10),
                weight="bold", fontsize=7, textcoords='offset points')
            data2.append([x, y])

        figure.tight_layout()
        canvas.draw()

    #Puts label inside stacked bar
    def put_label_stacked_bar(self, ax, fontsize):
        #patches is everything inside of the chart
        for rect in ax.patches:
            #Find where everything is located
            height = rect.get_height()
            width = rect.get_width()
            x = rect.get_x()
            y = rect.get_y()

            #The height of the bar is the data value and can be used as the label
            label_text = f'{height:.0f}'

            #ax.text(x, y, text)
            label_x = x + width / 2
            label_y = y + height / 2

            #plots only when height is greater than specified value
            if height > 0:
                ax.text(label_x, label_y, label_text,
                    ha='center', va='center',
                    weight = "bold", fontsize=fontsize)

    #Plots one variable against another variable
    def dist_one_vs_another_plot(self, df, cat1, cat2, figure, canvas, title):
        figure.clear()
        plot1 = figure.add_subplot(1,1,1)

        group_by_stat = df.groupby([cat1, cat2]).size()
        colors = sns.color_palette("Set2", len(df[cat1].unique()))
        stacked_data = group_by_stat.unstack()
        stacked_data.plot(kind='bar', stacked=True, ax=plot1, grid=True, color=colors)
        plot1.set_title(title, fontsize=12)
        plot1.set_ylabel('Number of Cases', fontsize=10)
        plot1.set_xlabel(cat1, fontsize=10)
        self.put_label_stacked_bar(plot1,7)

        #Set font for tick labels
        plot1.tick_params(axis='both', which='major', labelsize=8)
        plot1.tick_params(axis='both', which='minor', labelsize=8)
        plot1.legend(fontsize=8)
        figure.tight_layout()
        canvas.draw()

    def box_plot(self, df, x, y, hue, figure, canvas, title):
        figure.clear()
        plot1 = figure.add_subplot(1,1,1)

        #Creates boxplot of y versus x, split by hue
        sns.boxplot(data = df, x = x, y = y, hue = hue, ax=plot1)
        plot1.set_title(title, fontsize=14)
        plot1.set_xlabel(x, fontsize=10)
        plot1.set_ylabel(y, fontsize=10)
        figure.tight_layout()
        canvas.draw()

    def choose_plot(self, df1, df2, chosen, figure1, canvas1, figure2, canvas2):
        print(chosen)
        if chosen == "Marital Status":
            self.plot_piechart(df2, "Marital_Status", figure1, canvas1)

        elif chosen == "Education":
            self.plot_piechart(df2, "Education", figure2, canvas2)

        elif chosen == "Country":
            self.plot_piechart(df2, "Country", figure1, canvas1)

        elif chosen == "Age Group":
            self.plot_piechart(df2, "AgeGroup", figure2, canvas2)

        elif chosen == "Education with Response 0":
            self.plot_piechart(df2[df2.Response==0], "Education", figure1, canvas1, " with Response 0")

        elif chosen == "Education with Response 1":
            self.plot_piechart(df2[df2.Response==1], "Education", figure2, canvas2, " with Response 1")

        elif chosen == "Country with Response 0":
            self.plot_piechart(df2[df2.Response==0], "Country", figure1, canvas1, " with Response 0")

        elif chosen == "Country with Response 1":
            self.plot_piechart(df2[df2.Response==1], "Country", figure2, canvas2, " with Response 1")

        elif chosen == "Income":
            self.another_versus_response(df1, "Income", 32, figure1, canvas1)

        elif chosen == "Mount of Wines":
            self.another_versus_response(df1, "MntWines", 32, figure2, canvas2)

        elif chosen == "Customer Age":
            self.another_versus_response(df1, "Customer_Age", 32, figure1, canvas1)

        elif chosen == "Education versus Response":
            self.dist_one_vs_another_plot(df2, "Education", "Response", figure2, canvas2, chosen)

        elif chosen == "Age Group versus Response":
            self.dist_one_vs_another_plot(df2, "AgeGroup", "Response", figure1, canvas1, chosen)

        elif chosen == "Marital Status versus Response":
            self.dist_one_vs_another_plot(df2, "Marital_Status", "Response", figure2, canvas2, chosen)

        elif chosen == "Country versus Response":
            self.dist_one_vs_another_plot(df2, "Country", "Response", figure1, canvas1, chosen)

        elif chosen == "Number of Dependants versus Response":
            self.dist_one_vs_another_plot(df2, "Num_Dependants", "Response", figure2, canvas2, chosen)

        elif chosen == "Country versus Customer Age Per Education":
            self.box_plot(df1, "Country", "Customer_Age", "Education", figure1, canvas1, chosen)

        elif chosen == "Num_TotalPurchases versus Education Per Marital Status":
            self.box_plot(df1, "Education", "Num_TotalPurchases", "Marital_Status", figure2, canvas2, chosen)

    def choose_category(self, df, chosen, figure1, canvas1, figure2, canvas2):
        if chosen == "Categorized Income versus Response":
            self.dist_one_vs_another_plot(df, "Income", "Response", figure1, canvas1, chosen)

        if chosen == "Categorized Total Purchase versus Categorized Income":
            self.dist_one_vs_another_plot(df, "Num_TotalPurchases", "Income", figure2, canvas2, chosen)

        if chosen == "Categorized Recency versus Categorized Total Purchase":
            self.dist_one_vs_another_plot(df, "Recency", "Num_TotalPurchases", figure1, canvas1, chosen)

        if chosen == "Categorized Customer Month versus Categorized Customer Age":
            self.dist_one_vs_another_plot(df, "Dt_Customer_Month", "Customer_Age", figure2, canvas2, chosen)

        if chosen == "Categorized Mount of Gold Products versus Categorized Income":
            self.dist_one_vs_another_plot(df, "MntGoldProds", "Income", figure1, canvas1, chosen)

        if chosen == "Categorized Mount of Fish Products versus Categorized Total AmountSpent":
            self.dist_one_vs_another_plot(df, "MntFishProducts", "TotalAmount_Spent", figure2, canvas2, chosen)

        if chosen == "Categorized Mount of Meat Products versus Categorized Recency":
            self.dist_one_vs_another_plot(df, "MntMeatProducts", "Recency", figure1, canvas1, chosen)

    def plot_corr_mat(self, df, figure, canvas):
        figure.clear()
        plot1 = figure.add_subplot(1,1,1)
        categorical_columns = df.select_dtypes(include=['object', 'category']).columns
        df_removed = df.drop(columns=categorical_columns)
        corrdata = df_removed.corr()

        annot_kws = {"size": 5}
        sns.heatmap(corrdata, ax = plot1, lw=1, annot=True, cmap="Reds", annot_kws=annot_kws)
        plot1.set_title('Correlation Matrix', fontweight ="bold",fontsize=14)

        #Set font for x and y labels
        plot1.set_xlabel('Features', fontweight="bold", fontsize=12)
        plot1.set_ylabel('Features', fontweight="bold", fontsize=12)

        #Set font for tick labels
        plot1.tick_params(axis='both', which='major', labelsize=5)
        plot1.tick_params(axis='both', which='minor', labelsize=5)

        figure.tight_layout()
        canvas.draw()

    def plot_rf_importance(self, X, y, figure, canvas):
        result_rf = self.obj_data.feat_importance_rf(X, y)
        figure.clear()
        plot1 = figure.add_subplot(1,1,1)
        sns.set_color_codes("pastel")
        ax=sns.barplot(x = 'Values',y = 'Features', data=result_rf, color="Blue", ax=plot1)
        plot1.set_title('Random Forest Features Importance', fontweight ="bold",fontsize=14)

        plot1.set_xlabel('Features Importance', fontsize=10)
        plot1.set_ylabel('Feature Labels', fontsize=10)

        #Set font for tick labels
        plot1.tick_params(axis='both', which='major', labelsize=5)
        plot1.tick_params(axis='both', which='minor', labelsize=5)
        figure.tight_layout()
        canvas.draw()

    def plot_et_importance(self, X, y, figure, canvas):
        result_et = self.obj_data.feat_importance_et(X, y)
        figure.clear()
        plot1 = figure.add_subplot(1,1,1)
        sns.set_color_codes("pastel")
        ax=sns.barplot(x = 'Values',y = 'Features', data=result_et, color="Red", ax=plot1)
        plot1.set_title('Extra Trees Features Importance', fontweight ="bold",fontsize=14)

        plot1.set_xlabel('Features Importance', fontsize=10)
        plot1.set_ylabel('Feature Labels', fontsize=10)

        #Set font for tick labels
        plot1.tick_params(axis='both', which='major', labelsize=5)
        plot1.tick_params(axis='both', which='minor', labelsize=5)
        figure.tight_layout()
        canvas.draw()

    def plot_rfe_importance(self, X, y, figure, canvas):
        result_lg = self.obj_data.feat_importance_rfe(X, y)
        figure.clear()
        plot1 = figure.add_subplot(1,1,1)
        sns.set_color_codes("pastel")
        ax=sns.barplot(x = 'Ranking',y = 'Features', data=result_lg, color="orange", ax=plot1)
        plot1.set_title('RFE Features Importance', fontweight ="bold",fontsize=14)

        plot1.set_xlabel('Features Importance', fontsize=10)
        plot1.set_ylabel('Feature Labels', fontsize=10)

        #Set font for tick labels
        plot1.tick_params(axis='both', which='major', labelsize=5)
        plot1.tick_params(axis='both', which='minor', labelsize=5)
        figure.tight_layout()
        canvas.draw()

    def choose_plot_more(self, df, chosen, X, y, figure1, canvas1, figure2, canvas2):
        if chosen == "Correlation Matrix":
            self.plot_corr_mat(df, figure1, canvas1)

        if chosen == "RF Features Importance":
            self.plot_rf_importance(X, y, figure2, canvas2)

        if chosen == "ET Features Importance":
            self.plot_et_importance(X, y, figure1, canvas1)

        if chosen == "RFE Features Importance":
            self.plot_rfe_importance(X, y, figure1, canvas1)