This is the product of Balige Academy, North Sumatera.
Balige Academy Team
Vivian Siahaan
Rismon Hasiholan Sianipar
HORASSS!!!
The Time-Series Weather Forecasting and Prediction using Machine Learning with Tkinter project is a comprehensive endeavor aimed at providing accurate and insightful weather forecasts. Beginning with data visualization, the project employs Tkinter, a powerful Python library, to create an interactive graphical user interface (GUI) for users. This GUI allows for easy data input and visualization, enhancing user experience.
One critical aspect of this project lies in understanding the distribution of features within the weather dataset. The authors have meticulously analyzed and visualized the data, gaining valuable insights into temperature trends, precipitation, wind patterns, and more. This step is pivotal in identifying patterns and anomalies, which in turn aids in making accurate forecasts.
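As a minimal sketch of how such a feature distribution can be examined, the snippet below bins a temperature series with pandas and counts the samples per bin. The bin edges and labels are the ones used for the Temperature feature in the source code; the synthetic data and the helper name temperature_distribution are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd

# Bin edges and labels as used for the Temperature feature in process_data.py
TEMP_BINS = [-21, 0, 5, 15, 25, 45]
TEMP_LABELS = ['-21(C)-0(C)', '0(C)-5(C)', '5(C)-15(C)',
               '15(C)-25(C)', '25(C)-45(C)']

def temperature_distribution(temps):
    """Counts the readings per temperature bin (hypothetical helper)."""
    cats = pd.cut(pd.Series(temps), TEMP_BINS, labels=TEMP_LABELS)
    # sort=False keeps the bins in category order instead of sorting by count
    return cats.value_counts(sort=False)

# Synthetic temperatures standing in for the real weather dataset
rng = np.random.default_rng(0)
temps = rng.uniform(-20, 40, size=1000)
counts = temperature_distribution(temps)
print(counts)
```

The resulting counts can be passed directly to a bar or pie plot, which is the kind of distribution view the GUI offers for each categorized feature.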
A standout feature of this project is its focus on temperature feature forecasting. The project uses machine learning regressors, such as Linear Regression, Random Forest Regressor, Decision Trees Regressor, KNN regressor, Support Vector Regressor, AdaBoost regressor, Gradient Boosting Regressor, MLP regressor, Lasso regressor, and Ridge regressor, to predict temperature trends. Through a rigorous training process, these models learn from historical weather data to make forecasts. The performance of these regressors is evaluated on held-out data through metrics like Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE), so that the models can be compared on a common basis.
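The evaluation step described above can be sketched as follows. This is an illustrative example under stated assumptions, not the project's regression module: the data is synthetic (a daily temperature cycle with noise), the helper evaluate_regressor is hypothetical, and the chronological 80/20 split is one reasonable choice for time-series data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

def evaluate_regressor(model, X_train, y_train, X_test, y_test):
    """Fits the model and returns (MAE, RMSE) on the test set (hypothetical helper)."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mae = mean_absolute_error(y_test, y_pred)
    # RMSE is the square root of the mean squared error
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    return mae, rmse

# Synthetic hourly data: cyclical hour features and a noisy daily temperature cycle
rng = np.random.default_rng(42)
hours = np.arange(24 * 60)
X = np.column_stack([np.sin(2 * np.pi * hours / 24),
                     np.cos(2 * np.pi * hours / 24)])
y = 15 + 8 * X[:, 0] + rng.normal(0, 1.0, size=hours.size)

# Chronological split: the first 80% is used for training, the rest for testing
split = int(0.8 * len(hours))
model = RandomForestRegressor(n_estimators=50, random_state=0)
mae, rmse = evaluate_regressor(model, X[:split], y[:split], X[split:], y[split:])
print(f"MAE = {mae:.3f}, RMSE = {rmse:.3f}")
```

RMSE penalizes large errors more heavily than MAE, so it is never smaller than MAE; a large gap between the two suggests a few badly missed forecasts.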
The project's data visualization capabilities are not limited to historical data alone. It extends to visualizing the predicted temperature trends, allowing users to gain insights into future temperature forecasts. This dynamic feature empowers users to make informed decisions based on upcoming weather conditions.
To further enhance the predictive capabilities, the project integrates grid search optimization. This technique fine-tunes the machine learning models by systematically searching through a hyperparameter space. By selecting the best-scoring combination of hyperparameters, the models are tuned for the best forecasting results. This meticulous process improves the accuracy of the predictions.
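A minimal sketch of grid search with scikit-learn is shown below. The parameter grid, the synthetic data, and the use of TimeSeriesSplit for cross-validation are assumptions for illustration; the description above does not specify which grids or validation scheme the project uses.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.neighbors import KNeighborsRegressor

# Synthetic hourly data: cyclical hour features and a noisy daily temperature cycle
rng = np.random.default_rng(7)
hours = np.arange(24 * 30)
X = np.column_stack([np.sin(2 * np.pi * hours / 24),
                     np.cos(2 * np.pi * hours / 24)])
y = 15 + 8 * X[:, 0] + rng.normal(0, 1.0, size=hours.size)

# Hypothetical hyperparameter space: every combination is tried
param_grid = {
    "n_neighbors": [3, 5, 11, 21],
    "weights": ["uniform", "distance"],
}

search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid,
    scoring="neg_mean_absolute_error",  # higher is better, so MAE is negated
    cv=TimeSeriesSplit(n_splits=3),     # validation folds always follow training folds
)
search.fit(X, y)

best_mae = -search.best_score_
print("Best parameters:", search.best_params_)
print(f"Cross-validated MAE: {best_mae:.3f}")
```

After fitting, search.best_estimator_ holds a model refitted on all the data with the best-scoring combination, ready to be used for forecasting.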
The project also tackles the challenging task of weather summary prediction. By employing machine learning classifiers like Random Forest Classifier, Support Vector Classifier, K-Nearest Neighbors Classifier, Logistic Regression Classifier, AdaBoost Classifier, Gradient Boosting Classifier, Extreme Gradient Boosting Classifier, and Multi-Layer Perceptron Classifier, the project predicts weather conditions such as 'Clear', 'Foggy', 'Overcast', and more. Through a robust training regimen, these models learn to classify weather summaries based on a range of features including temperature, humidity, wind speed, and more.
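The classification workflow can be sketched as follows. This is an illustrative example only: the two features, the toy rule that assigns the labels, and the synthetic data are assumptions, whereas the real project encodes the Summary column of the weather dataset with LabelEncoder and trains on many more features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Synthetic features standing in for the real weather measurements
rng = np.random.default_rng(1)
n = 600
humidity = rng.uniform(0.2, 1.0, n)
visibility = rng.uniform(0.0, 16.0, n)

# Toy labelling rule, purely illustrative
summary = np.where(visibility < 3, "Foggy",
                   np.where(humidity > 0.7, "Overcast", "Clear"))

# Encodes the string labels as integers, as done for the Summary column
le = LabelEncoder()
y = le.fit_transform(summary)
X = np.column_stack([humidity, visibility])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

acc = accuracy_score(y_test, y_pred)
# Maps the integer predictions back to readable weather summaries
predicted_labels = le.inverse_transform(y_pred)
print(f"Accuracy = {acc:.3f}")
print(predicted_labels[:5])
```

Keeping the fitted LabelEncoder around matters in practice, because it is the only way to translate the classifier's integer outputs back into summaries such as 'Foggy' or 'Overcast'.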
The integration of grid search optimization into the classifier models further elevates the accuracy of weather summary predictions. This systematic hyperparameter tuning ensures that the classifiers are operating at their peak performance levels. As a result, users can rely on the system for highly accurate and reliable weather forecasts.
Incorporating a user-friendly GUI with Tkinter, this project offers an accessible platform for users to interact with the weather forecasting system. Users can input specific parameters, visualize data trends, and receive precise forecasts in real-time. The intuitive design ensures that even individuals with limited technical expertise can navigate and benefit from the application.
In summary, the Time-Series Weather Forecasting and Prediction using Machine Learning with Tkinter project stands as a testament to the power of combining machine learning, data visualization, and user-friendly interfaces to create a highly accurate and accessible weather forecasting system. From visualizing feature distributions to temperature forecasting and weather summary prediction, every facet of the project is meticulously designed to provide users with the most reliable and precise weather forecasts possible. With its emphasis on user feedback, scalability, and open-source collaboration, this project is poised to make a significant impact in the field of weather forecasting.
SOURCE CODE:
#main_program.py
import os
import tkinter as tk
from design_window import Design_Window
from process_data import Process_Data
from helper_plot import Helper_Plot
from regression import Regression
from machine_learning import Machine_Learning

class Main_Program:
    def __init__(self, root):
        #Stores the root window passed in instead of relying on a global name
        self.root = root
        self.initialize()

    def initialize(self):
        width = 1560
        height = 790
        self.root.geometry(f"{width}x{height}")
        self.root.title("TIME-SERIES WEATHER ANALYSIS, FORECASTING, AND PREDICTION USING MACHINE/DEEP LEARNING")

        #Creates necessary objects
        self.obj_window = Design_Window()
        self.obj_data = Process_Data()
        self.obj_plot = Helper_Plot()
        self.obj_reg = Regression()
        self.obj_ML = Machine_Learning()

        #Places widgets in root
        self.obj_window.add_widgets(self.root)

        #Reads dataset and categorization
        self.df, self.df_cat = self.obj_data.preprocess()

        #Binds event
        self.binds_event()

        #For machine learning
        self.df_final, self.X1, self.y1, self.X2, self.y2 = self.obj_data.encode_df(self.df)

        #Extracts input and output variables for regression
        self.obj_reg.splitting_data_regression(self.X2, self.y2)

        #Extracts input and output variables for prediction
        self.obj_ML.oversampling_splitting(self.df_final)

        #Turns off combo_reg and combo_pred until the SPLIT buttons are clicked
        self.obj_window.combo_reg['state'] = 'disabled'
        self.obj_window.combo_pred['state'] = 'disabled'

    def binds_event(self):
        #Binds btn_load to shows_table() function
        #Shows table if user clicks LOAD DATASET
        self.obj_window.btn_load.config(command = lambda:self.obj_plot.shows_table(self.root, self.df, 1250, 600, "Weather Dataset"))

        #Binds listbox to choose_list_widget() function
        self.obj_window.listbox.bind("<<ListboxSelect>>", self.choose_list_widget)

        #Binds combo_year to choose_combo_year()
        self.obj_window.combo_year.bind("<<ComboboxSelected>>", self.choose_combo_year)

        #Binds combo_month to choose_combobox_month()
        self.obj_window.combo_month.bind("<<ComboboxSelected>>", self.choose_combobox_month)

        #Binds btn_reg to split_regression() function
        self.obj_window.btn_reg.config(command = self.split_regression)

        #Binds btn_pred to split_prediction() function
        self.obj_window.btn_pred.config(command = self.split_prediction)

        #Binds combo_feat to choose_combo_feat()
        self.obj_window.combo_feat.bind("<<ComboboxSelected>>", self.choose_combo_feat)

        #Binds combo_reg to choose_combo_reg()
        self.obj_window.combo_reg.bind("<<ComboboxSelected>>", self.choose_combo_reg)

        #Binds combo_pred to choose_combo_pred()
        self.obj_window.combo_pred.bind("<<ComboboxSelected>>", self.choose_combo_pred)

    def choose_list_widget(self, event):
        #Ignores the event that fires when the listbox loses its selection
        selection = self.obj_window.listbox.curselection()
        if not selection:
            return
        chosen = self.obj_window.listbox.get(selection[0])
        print(chosen)
        self.obj_plot.choose_plot(self.df_final, self.df_cat, chosen,
            self.obj_window.figure1, self.obj_window.canvas1,
            self.obj_window.figure2, self.obj_window.canvas2)

    def choose_combo_year(self, event):
        chosen = self.obj_window.combo_year.get()
        year_data_mean, year_data_ewm, year_norm = self.obj_data.normalize_year_wise_data(self.df_final)
        print(year_data_mean)
        self.obj_plot.choose_year_wise(self.df_final, year_data_mean,
            year_data_ewm, year_norm, chosen,
            self.obj_window.figure1, self.obj_window.canvas1,
            self.obj_window.figure2, self.obj_window.canvas2)

    def choose_combobox_month(self, event):
        chosen = self.obj_window.combo_month.get()
        month_data_mean, month_data_ewm, month_norm = self.obj_data.normalize_month_wise_data(self.df_final)
        self.obj_plot.choose_month_wise(self.df_cat, month_data_mean,
            month_data_ewm, month_norm, chosen,
            self.obj_window.figure1, self.obj_window.canvas1,
            self.obj_window.figure2, self.obj_window.canvas2)

    def choose_combo_feat(self, event):
        chosen = self.obj_window.combo_feat.get()
        self.obj_plot.choose_feat_importance(chosen, self.X1, self.y1,
            self.obj_window.figure1, self.obj_window.canvas1,
            self.obj_window.figure2, self.obj_window.canvas2)

    def split_regression(self):
        file_path = os.path.join(os.getcwd(), "X_final_reg.pkl")
        if not os.path.exists(file_path):
            #Uses the same arguments as the call in initialize()
            self.obj_reg.splitting_data_regression(self.X2, self.y2)

        self.X_Ori, self.X_final_reg, self.X_train_reg, self.X_test_reg, \
            self.X_val_reg, self.y_final_reg, self.y_train_reg, \
            self.y_test_reg, self.y_val_reg = self.obj_reg.load_regression_files()
        print("Loading regression files done...")

        #Turns on combo_reg after splitting is done
        self.obj_window.combo_reg['state'] = 'normal'
        self.obj_window.btn_reg.config(state="disabled")

    def choose_combo_reg(self, event):
        chosen = self.obj_window.combo_reg.get()
        self.obj_plot.choose_plot_regression(chosen, self.X_final_reg,
            self.X_train_reg, self.X_test_reg, self.X_val_reg,
            self.y_final_reg, self.y_train_reg, self.y_test_reg, self.y_val_reg,
            self.obj_window.figure1, self.obj_window.canvas1,
            self.obj_window.figure2, self.obj_window.canvas2)

    def split_prediction(self):
        file_path = os.path.join(os.getcwd(), "X_train.pkl")
        if not os.path.exists(file_path):
            self.obj_ML.oversampling_splitting(self.df_final)

        self.X_train, self.X_test, self.y_train, self.y_test = self.obj_ML.load_files()
        print("Loading files done...")

        #Turns on combo_pred after splitting is done
        self.obj_window.combo_pred['state'] = 'normal'
        self.obj_window.btn_pred.config(state="disabled")

    def choose_combo_pred(self, event):
        chosen = self.obj_window.combo_pred.get()
        self.obj_plot.choose_plot_ML(self.root, chosen, self.X_train, self.X_test,
            self.y_train, self.y_test,
            self.obj_window.figure1, self.obj_window.canvas1,
            self.obj_window.figure2, self.obj_window.canvas2)

if __name__ == "__main__":
    root = tk.Tk()
    app = Main_Program(root)
    root.mainloop()

#design_window.py
import tkinter as tk
from tkinter import ttk
from matplotlib.figure import Figure
from matplotlib.backends.backend_tkagg import FigureCanvasTkAgg

class Design_Window:
    def add_widgets(self, root):
        #Sets styles
        self.set_style(root)

        #Adds button(s)
        self.add_buttons(root)

        #Adds canvases
        self.add_canvas(root)

        #Adds labels
        self.add_labels(root)

        #Adds listbox widget
        self.add_listbox(root)

        #Adds combobox widgets
        self.add_comboboxes(root)

    def set_style(self, root):
        #Variables created for colors
        ebg = '#404040'
        fg = '#FFFFFF'

        style = ttk.Style()

        #Be sure to include this or style.map() won't function as expected.
        style.theme_use('alt')

        #The following alters the Listbox
        root.option_add('*TCombobox*Listbox.Background', ebg)
        root.option_add('*TCombobox*Listbox.Foreground', fg)
        root.option_add('*TCombobox*Listbox.selectBackground', fg)
        root.option_add('*TCombobox*Listbox.selectForeground', ebg)

        #The following alters the Combobox entry field
        style.map('TCombobox', fieldbackground=[('readonly', ebg)])
        style.map('TCombobox', selectbackground=[('readonly', ebg)])
        style.map('TCombobox', selectforeground=[('readonly', fg)])
        style.map('TCombobox', background=[('readonly', ebg)])
        style.map('TCombobox', foreground=[('readonly', fg)])

    def add_buttons(self, root):
        #Adds buttons
        self.btn_load = tk.Button(root, height=1, width=35, text="LOAD DATASET")
        self.btn_load.grid(row=0, column=0, padx=5, pady=5, sticky="w")

        self.btn_reg = tk.Button(root, height=1, width=35, text="SPLIT DATA FOR FORECASTING")
        self.btn_reg.grid(row=9, column=0, padx=5, pady=5, sticky="w")

        self.btn_pred = tk.Button(root, height=1, width=35, text="SPLIT DATA FOR PREDICTION")
        self.btn_pred.grid(row=12, column=0, padx=5, pady=5, sticky="w")

    def add_labels(self, root):
        #Adds labels
        self.label1 = tk.Label(root, text = "CHOOSE DISTRIBUTION")
        self.label1.grid(row=1, column=0, padx=5, pady=1, sticky="w")

        self.label2 = tk.Label(root, text = "YEAR-WISE TIME-SERIES PLOT")
        self.label2.grid(row=3, column=0, padx=5, pady=1, sticky="w")

        self.label3 = tk.Label(root, text = "MONTH-WISE TIME-SERIES PLOT")
        self.label3.grid(row=5, column=0, padx=5, pady=1, sticky="w")

        self.label4 = tk.Label(root, text = "FEATURES IMPORTANCE")
        self.label4.grid(row=7, column=0, padx=5, pady=1, sticky="w")

        self.label5 = tk.Label(root, text = "CHOOSE FORECASTING")
        self.label5.grid(row=10, column=0, padx=5, pady=1, sticky="w")

        self.label6 = tk.Label(root, text = "CHOOSE ML PREDICTION")
        self.label6.grid(row=13, column=0, padx=5, pady=1, sticky="w")

    def add_canvas(self, root):
        #Adds canvas1 widget to root to display results
        self.figure1 = Figure(figsize=(6.2, 7.8), dpi=100)
        self.figure1.patch.set_facecolor('#F0F0F0')
        self.canvas1 = FigureCanvasTkAgg(self.figure1, master=root)
        self.canvas1.get_tk_widget().grid(row=0, column=1, columnspan=1,
            rowspan=25, padx=5, pady=5, sticky="n")

        #Adds canvas2 widget to root to display results
        self.figure2 = Figure(figsize=(6.2, 7.8), dpi=100)
        self.figure2.patch.set_facecolor('#F0F0F0')
        self.canvas2 = FigureCanvasTkAgg(self.figure2, master=root)
        self.canvas2.get_tk_widget().grid(row=0, column=2, columnspan=1,
            rowspan=25, padx=5, pady=5, sticky="n")

    def add_listbox(self, root):
        #Creates list widget
        self.listbox = tk.Listbox(root, height=15, selectmode=tk.SINGLE, width=40,
            fg ="black", bg="#F0F0F0", highlightcolor="black",
            selectbackground="red",relief="flat", borderwidth=5,
            highlightthickness=0)
        self.listbox.grid(row=2, column=0, sticky='n', padx=5, pady=1)

        self.scrollbar_v = tk.Scrollbar(root, orient="vertical",command=self.listbox.yview)
        self.scrollbar_v.grid(row=2, column=0, sticky='nse')

        self.scrollbar_h = tk.Scrollbar(root, orient="horizontal",command=self.listbox.xview)
        self.scrollbar_h.grid(row=2, column=0, sticky='ews')

        self.listbox.config(yscrollcommand=self.scrollbar_v.set,
            xscrollcommand=self.scrollbar_h.set)

        #Inserts items into list widget
        items = ["Missing Values", "Correlation Coefficient",
            "Year", "Day", "Month", "Quarter",
            "Summary", "Daily Summary", "Precipitation Type",
            "Temperature (C) versus Humidity versus Year",
            "Wind Speed (km/h) versus Visibility (km) versus Quarter",
            "Pressure (millibars) versus Apparent Temperature (C) versus Month",
            "Temperature (C) versus Visibility (km) versus Day",
            "Year versus Categorized Humidity",
            "Day versus Categorized Temperature",
            "Week versus Categorized Wind Bearing Speed",
            "Month versus Categorized Visibility",
            "Quarter versus Categorized Humidity",
            "Year versus Temperature Per Categorized Visibility",
            "Month versus Wind Bearing Per Categorized Pressure",
            "Quarter versus Humidity Per Categorized Wind Speed",
            "Day versus Temperature Per Categorized Temperature",
            "Distribution of Categorized Visibility",
            "Distribution of Categorized Temperature",
            "Distribution of Categorized Wind Speed",
            "Distribution of Categorized Wind Bearing",
            "Distribution of Categorized Pressure",
            "Distribution of Categorized Humidity",
            "Correlation Matrix"]
        for item in items:
            self.listbox.insert(tk.END, item)

    def add_comboboxes(self, root):
        #Creates ComboBoxes
        self.combo_year = ttk.Combobox(root, width=38)
        self.combo_year["values"] = [
            "Temperature (C) and Apparent Temperature (C)",
            "Wind Speed (km/h) and Visibility (km)",
            "Wind Bearing (degrees) and Pressure (millibars)",
            "Year-Wise Mean EWM Temperature (C) and Apparent Temperature (C)",
            "Year-Wise Mean EWM Wind Speed (km/h) and Visibility (km)",
            "Normalized Year-Wise Data",
            "Temperature (C) by Year",
            "Wind Speed (km/h) by Year",
            "Visibility (km) by Year",
            "Pressure (millibars) by Year",
            "Apparent Temperature (C) by Year",
            "Wind Bearing (degrees) by Year"]
        self.combo_year.grid(row=4, column=0, padx=5, pady=1, sticky="n")

        self.combo_month = ttk.Combobox(root, width=38, style='TCombobox')
        self.combo_month["values"] = [
            "Quarter-Wise Temperature (C) and Apparent Temperature (C)",
            "Quarter-Wise Wind Speed (km/h) and Visibility (km)",
            "Month-Wise Temperature (C) and Apparent Temperature (C)",
            "Month-Wise Mean EWM Wind Speed (km/h) and Visibility (km)",
            "Month-Wise Mean EWM Temperature (C) and Apparent Temperature (C)",
            "Month-Wise Temperature (C)",
            "Month-Wise Apparent Temperature (C)",
            "Month-Wise Wind Speed (km/h)",
            "Month-Wise Wind Bearing (degrees)",
            "Month-Wise Visibility (km)",
            "Month-Wise Pressure (millibars)",
            "Month-Wise Humidity",
            "Normalized Month-Wise Data",
            "Temperature (C) by Month",
            "Wind Speed (km/h) by Month",
            "Apparent Temperature (C) by Month",
            "Wind Bearing (degrees) by Month",
            "Visibility (km) by Month",
            "Pressure (millibars) by Month"]
        self.combo_month.grid(row=6, column=0, padx=5, pady=1, sticky="n")

        self.combo_feat = ttk.Combobox(root, width=38, style='TCombobox')
        self.combo_feat["values"] = ["RF Features Importance",
            "ET Features Importance", "RFE Features Importance"]
        self.combo_feat.grid(row=8, column=0, padx=5, pady=1, sticky="n")

        self.combo_reg = ttk.Combobox(root, width=38, style='TCombobox')
        self.combo_reg["values"] = ["Linear Regression", "RF Regression",
            "Decision Trees Regression", "KNN Regression",
            "AdaBoost Regression", "Gradient Boosting Regression",
            "MLP Regression", "SVR Regression",
            "Lasso Regression", "Ridge Regression"]
        self.combo_reg.grid(row=11, column=0, padx=5, pady=1, sticky="n")

        self.combo_pred = ttk.Combobox(root, width=38, style='TCombobox')
        self.combo_pred["values"] = ["Logistic Regression", "Random Forest",
            "Decision Trees", "K-Nearest Neighbors",
            "AdaBoost", "Gradient Boosting",
            "Extreme Gradient Boosting", "Light Gradient Boosting",
            "Multi-Layer Perceptron", "Support Vector Classifier"]
        self.combo_pred.grid(row=14, column=0, padx=5, pady=1, sticky="n")

#process_data.py
import os
import pandas as pd
from datetime import datetime
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

class Process_Data:
    def read_dataset(self, filename):
        #Reads dataset
        curr_path = os.getcwd()
        path = os.path.join(curr_path, filename)
        df = pd.read_csv(path)
        return df

    #Discretizes date to add four new features into dataset
    def discretize_date(self, current_date, t):
        #Strips the last 10 characters (fractional seconds and UTC offset)
        current_date = current_date[:-10]
        cdate = datetime.strptime(current_date, '%Y-%m-%d %H:%M:%S')

        #Encodes hour of day and day of year as points on a circle
        if t == 'hour_sin':
            return np.sin(2 * np.pi * cdate.hour/24.0)
        if t == 'hour_cos':
            return np.cos(2 * np.pi * cdate.hour/24.0)
        if t == 'day_sin':
            return np.sin(2 * np.pi * cdate.timetuple().tm_yday/365.0)
        if t == 'day_cos':
            return np.cos(2 * np.pi * cdate.timetuple().tm_yday/365.0)

    def preprocess(self):
        df = self.read_dataset("weatherHistory.csv")
        df = df.sort_values(by='Formatted Date', ascending=False)

        #Drops the column 'Loud Cover'
        df = df.drop('Loud Cover', axis=1)

        #Replaces the missing values with forward fill method
        df = df.ffill()

        date_types = ['hour_sin', 'hour_cos', 'day_sin', 'day_cos']
        for dt in date_types:
            df[dt] = df['Formatted Date'].apply(lambda x : self.discretize_date(x, dt))

        #Extracts day, month, week, quarter, and year
        df['Date'] = pd.to_datetime(df['Formatted Date'], utc=True)
        print(df['Date'].dtype)
        df['Day'] = df['Date'].dt.weekday
        df['Month'] = df['Date'].dt.month
        df['Year'] = df['Date'].dt.year
        df['Week'] = df['Date'].dt.isocalendar().week
        df['Quarter']= df['Date'].dt.quarter

        #Drops Formatted Date column
        df.drop(['Formatted Date'],axis=1,inplace=True)

        #Sets Date column as index
        df = df.set_index("Date")

        #Renames multiple columns
        df = df.rename(columns={'Precip Type': 'Precip_Type',
            'Temperature (C)': 'Temperature',
            'Apparent Temperature (C)':'Apparent_Temperature',
            'Wind Speed (km/h)': 'Wind_Speed',
            'Wind Bearing (degrees)': 'Wind_Bearing',
            'Visibility (km)': 'Visibility',
            'Pressure (millibars)': 'Pressure',
            'Daily Summary': 'Daily_Summary'})

        df_cat = df.copy()
        df_cat = self.categorize(df_cat)

        return df, df_cat

    def categorize(self, df):
        #Converts days and months from numerics to meaningful strings
        #dt.weekday numbers the days from Monday=0 to Sunday=6
        days = {0:'Monday',1:'Tuesday',2:'Wednesday',3:'Thursday',
            4:'Friday',5: 'Saturday',6:'Sunday'}
        df['Day_Cat'] = df['Day'].map(days)

        months={1:'January',2:'February',3:'March',4:'April',
            5:'May',6:'June',7:'July',8:'August',9:'September',
            10:'October',11:'November',12:'December'}
        df['Month_Cat']= df['Month'].map(months)

        quarters = {1:'Jan-March', 2:'April-June',3:'July-Sept',
            4:'Oct-Dec'}
        df['Quarter_Cat'] = df['Quarter'].map(quarters)

        #Categorizes Temperature (C) feature
        labels = ['-21(C)-0(C)', '0(C)-5(C)', '5(C)-15(C)','15(C)-25(C)','25(C)-45(C)']
        df["Temperature_Cat"] = pd.cut(df["Temperature"],
            [-21, 0, 5, 15, 25, 45], labels=labels)

        #Categorizes Humidity feature
        labels = ['0.0-0.4', '0.4-0.5', '0.5-0.75', '0.75-1.0']
        df["Humidity_Cat"] = pd.cut(df["Humidity"],
            [0.0, 0.4, 0.5, 0.75, 1.0], labels=labels)

        #Categorizes Wind Speed (km/h) feature
        labels = ['0-5', '5-10', '10-20', '20-70']
        df["Wind_Speed_Cat"] = pd.cut(df["Wind_Speed"],
            [0, 5, 10, 20, 70], labels=labels)

        #Categorizes Wind Bearing (degrees) feature
        labels = ['0-90', '90-180', '180-270', '270-360']
        df["Wind_Bearing_Cat"] = pd.cut(df["Wind_Bearing"],
            [0, 90, 180, 270, 360], labels=labels)

        #Categorizes Visibility (km) feature
        labels = ['0-5', '5-10', '10-12', '12-17']
        df["Visibility_Cat"] = pd.cut(df["Visibility"],
            [0, 5, 10, 12, 17], labels=labels)

        #Categorizes Pressure (millibars) feature
        labels = ['1000-1015', '1015-1020', '1020-1030', '1030-1100']
        df["Pressure_Cat"] = pd.cut(df["Pressure"],
            [1000, 1015, 1020, 1030, 1100], labels=labels)

        return df

    def encode_df(self, df):
        #Drops Daily Summary column
        df.drop(['Daily_Summary'],axis=1,inplace=True)

        #Controls the size of dataset for regression and prediction to suit your computing power
        #df=df[df["Year"] == 2016]
        #Takes a copy so that the replacements below do not act on a view of df
        df = df[(df["Year"] == 2016) & (df["Month"] >= 3) & (df["Month"] <= 7)].copy()

        #Selects data in year 2015-2016, because the dataset is very big
        #df = df[df['Year'].isin([2015, 2016])]

        #Merges Summary values that have 4 samples or fewer into more common ones
        print(df.Summary.unique())
        summary_map = {
            "Dangerously Windy and Partly Cloudy": "Mostly Cloudy",
            "Windy and Dry": "Breezy and Mostly Cloudy",
            "Breezy and Dry": "Breezy and Mostly Cloudy",
            "Windy and Foggy": "Foggy",
            "Breezy and Foggy": "Foggy",
            "Dry and Partly Cloudy": "Partly Cloudy",
            "Humid and Overcast": "Overcast",
            "Breezy": "Breezy and Mostly Cloudy",
            "Windy and Mostly Cloudy": "Mostly Cloudy",
            "Windy and Overcast": "Breezy and Mostly Cloudy",
            "Windy and Partly Cloudy": "Partly Cloudy",
            "Humid and Mostly Cloudy": "Mostly Cloudy",
            "Drizzle": "Rain",
            "Light Rain": "Rain"}
        df["Summary"] = df["Summary"].replace(summary_map)
        print(df.Summary.unique())

        #Encodes Precip Type and Summary
        le=LabelEncoder()
        df['Precip_Type']=le.fit_transform(df['Precip_Type'])
        df['Summary']=le.fit_transform(df['Summary'])
        print(df.head().to_string())

        #Extracts output and input variables
        y1 = df['Summary'] # Target for the classification models
        X1 = df.drop(['Summary'], axis = 1)

        y2 = df['Temperature'] # Target for the regression models
        X2 = df.drop(['Temperature'], axis = 1)

        return df, X1, y1, X2, y2

    def normalize_year_wise_data(self, df):
        #Normalizes year-wise data
        cols = list(df.columns)
        cols.remove("Month")
        cols.remove("Day")
        cols.remove("Week")
        cols.remove("Year")
        cols.remove("Quarter")

        #Newer pandas versions prefer the alias 'YE' over 'y'
        year_data_mean = df.resample('y').mean()
        year_data_ewm = year_data_mean.ewm(span=5).mean()
        year_norm = (year_data_mean[cols] - year_data_mean[cols].min()) / (year_data_mean[cols].max() - year_data_mean[cols].min())

        return year_data_mean, year_data_ewm, year_norm

    def normalize_month_wise_data(self, df):
        #Normalizes month-wise data
        cols = list(df.columns)
        cols.remove("Month")
        cols.remove("Day")
        cols.remove("Week")
        cols.remove("Year")
        cols.remove("Quarter")

        #Newer pandas versions prefer the alias 'ME' over 'm'
        month_data_mean = df[cols].resample('m').mean()
        month_data_ewm = month_data_mean.ewm(span=5).mean()
        month_norm = (month_data_mean - month_data_mean.min()) / (month_data_mean.max() - month_data_mean.min())

        return month_data_mean, month_data_ewm, month_norm

    def feat_importance_rf(self, X, y):
        #Ranks features by random forest importance
        rf = RandomForestClassifier()
        rf.fit(X, y)

        result_rf = pd.DataFrame()
        result_rf['Features'] = X.columns
        result_rf ['Values'] = rf.feature_importances_
        result_rf.sort_values('Values', inplace = True, ascending = False)

        return result_rf

    def feat_importance_et(self, X, y):
        #Ranks features by extra trees importance
        model = ExtraTreesClassifier()
        model.fit(X, y)

        result_et = pd.DataFrame()
        result_et['Features'] = X.columns
        result_et ['Values'] = model.feature_importances_
        result_et.sort_values('Values', inplace=True, ascending =False)

        return result_et

    def feat_importance_rfe(self, X, y):
        model = LogisticRegression()
        #Creates the RFE model
        rfe = RFE(model)
        rfe = rfe.fit(X, y)

        result_lg = pd.DataFrame()
        result_lg['Features'] = X.columns
        result_lg ['Ranking'] = rfe.ranking_
        result_lg.sort_values('Ranking', inplace=True , ascending = False)

        return result_lg

    def save_result(self, y_test, y_pred, fname):
        #Converts y_test and y_pred to pandas Series for easier handling
        y_test_series = pd.Series(y_test)
        y_pred_series = pd.Series(y_pred)

        #Marks each prediction as 'True' (correct) or 'False' (incorrect)
        y_result_series = pd.Series(y_pred - y_test == 0)
        y_result_series = y_result_series.map({True: 'True', False: 'False'})

        #Creates a DataFrame to hold y_test, y_pred, and y_result
        data = pd.DataFrame({'y_test': y_test_series, 'y_pred': y_pred_series, 'result': y_result_series})

        #Saves the DataFrame to a CSV file
        data.to_csv(fname, index=False)

#helper_plot.py
from tkinter import *
import seaborn as sns
import numpy as np
import pandas as pd
from pandastable import Table
from sklearn.metrics import confusion_matrix, roc_curve, accuracy_score
from sklearn.model_selection import learning_curve
from process_data import Process_Data
from machine_learning import Machine_Learning
from regression import Regression

class Helper_Plot:
    def __init__(self):
        self.obj_data = Process_Data()
        self.obj_reg = Regression()
        self.obj_ml = Machine_Learning()

    def shows_table(self, root, df, width, height, title):
        frame = Toplevel(root) #new window
        self.table = Table(frame, dataframe=df, showtoolbar=True, showstatusbar=True)

        #Sets dimension of Toplevel
        frame.geometry(f"{width}x{height}")
        frame.title(title)
        self.table.show()

    def plot_missing_values(self, df, figure, canvas):
        figure.clear()
        ax = figure.add_subplot(1,1,1)

        #Plots null values
        missing = df.isna().sum().reset_index()
        missing.columns = ['features', 'total_missing']
        missing['percent'] = (missing['total_missing'] / len(df)) * 100
        missing.index = missing['features']
        del missing['features']
        missing['total_missing'].plot(kind = 'bar', ax=ax)
        ax.set_title('Missing Values Count', fontsize = 12)
        ax.set_facecolor('#F0F0F0')

        #Sets font for tick labels
        ax.tick_params(axis='both', which='major', labelsize=5)
        ax.tick_params(axis='both', which='minor', labelsize=5)
        figure.tight_layout()
        canvas.draw()

    def plot_corr_coeffs(self, df, figure, canvas):
        figure.clear()
        ax = figure.add_subplot(1,1,1)

        #Correlation coefficient of every column with Summary column
        all_corr = df.corr().abs()['Summary'].sort_values(ascending = False)

        #Filters correlations greater than 0.01
        filtered_corr = all_corr[all_corr > 0.01]

        #Defines a custom color palette (replace with your preferred colors)
        custom_palette = sns.color_palette("Set1", len(filtered_corr))
        filtered_corr.plot(kind='barh', ax=ax, color=custom_palette)
        ax.set_title("Correlation Coefficient of Features with Summary (Threshold > 0.01)", fontsize = 9)
        #The bars are horizontal, so the coefficient lies on the x-axis
        ax.set_xlabel("Coefficient")
        ax.set_facecolor('#F0F0F0')

        #Sets font for tick labels
        ax.tick_params(axis='both', which='major', labelsize=8)
        ax.tick_params(axis='both', which='minor', labelsize=8)
        ax.grid(True)
        figure.tight_layout()
        canvas.draw()

    #Defines function to create pie chart and bar plot as subplots
    def plot_piechart(self, df, var, figure, canvas, title=''):
        figure.clear()

        #Pie Chart (top subplot)
        ax1 = figure.add_subplot(2,1,1)
        label_list = list(df[var].value_counts().index)
        colors = sns.color_palette("Set1", len(label_list))
        _, _, autopcts = ax1.pie(df[var].value_counts(), autopct="%1.1f%%",
            colors=colors, startangle=30, labels=label_list,
            wedgeprops={"linewidth": 2, "edgecolor": "white"}, #Adds white edge
            shadow=True, textprops={'fontsize': 7})
        ax1.set_title(title, fontsize=10)

        #Bar Plot (bottom subplot)
        ax2 = figure.add_subplot(2,1,2)
        ax = df[var].value_counts().plot(kind="barh", color=colors, alpha=0.8, ax = ax2)
        for i, j in enumerate(df[var].value_counts().values):
            ax.text(.7, i, j, weight="bold", fontsize=7)

        ax2.set_title(title, fontsize=10)
        ax2.set_xlabel("Count", fontsize=10)

        #Sets font for tick labels
        ax2.tick_params(axis='both', which='major', labelsize=7)
        ax2.tick_params(axis='both', which='minor', labelsize=7)
        ax2.set_facecolor('#F0F0F0')
        figure.tight_layout()
        canvas.draw()

    def plot_piechart_group(self, df, figure, canvas, title=''):
        figure.clear()

        #Pie Chart (top subplot)
        ax1 = figure.add_subplot(2,1,1)
        label_list = list(df.value_counts().index)
        colors = sns.color_palette("Set1", len(label_list))
        _, _, autopcts = ax1.pie(df.value_counts(), autopct="%1.1f%%",
            colors=colors, startangle=30, labels=label_list,
            wedgeprops={"linewidth": 2, "edgecolor": "white"}, #Adds white edge
            shadow=True, textprops={'fontsize': 7})
        ax1.set_title(title, fontsize=10)

        #Bar Plot (bottom subplot)
        ax2 = figure.add_subplot(2,1,2)
        ax = df.plot(kind="barh", color=colors, alpha=0.8, ax = ax2)
        for i, j in enumerate(df.values):
            ax.text(.7, i, j, weight="bold", fontsize=7)

        ax2.set_title(title, fontsize=10)
        ax2.set_xlabel("Count")
        ax2.set_facecolor('#F0F0F0')
        figure.tight_layout()
        canvas.draw()

    def plot_scatter(self, df, x, y, hue, figure, canvas):
        figure.clear()
ax = figure.add_subplot(1,1,1) sns.scatterplot(data=df, x=x, y=y, hue=hue, palette="Set1", ax=ax) ax.set_title(x + " versus " + y + " by " + hue) ax.set_xlabel(x) ax.set_ylabel(y) ax.grid(True) ax.legend(facecolor='#E6E6FA', edgecolor='black') ax.set_facecolor('#F0F0F0') figure.tight_layout() canvas.draw() #Puts label inside stacked bar def put_label_stacked_bar(self, ax,fontsize): #patches is everything inside of the chart for rect in ax.patches: # Find where everything is located height = rect.get_height() width = rect.get_width() x = rect.get_x() y = rect.get_y() # The height of the bar is the data value and can be used as the label label_text = f'{width:.0f}' # ax.text(x, y, text) label_x = x + width / 2 label_y = y + height / 2 # plots only when height is greater than specified value if width > 0: ax.text(label_x, label_y, label_text, \ ha='center', va='center', \ weight = "bold",fontsize=fontsize) #Plots one variable against another variable def dist_one_vs_another_plot(self, df, cat1, cat2, figure, canvas, title): figure.clear() ax1 = figure.add_subplot(1,1,1) group_by_stat = df.groupby([cat1, cat2]).size() colors = sns.color_palette("Set1", len(df[cat1].unique())) group_by_stat.unstack().plot(kind='barh', stacked=True, ax=ax1,color=colors) ax1.set_title(title, fontsize=12) ax1.set_xlabel('Number of Cases', fontsize=10) ax1.set_ylabel(cat1, fontsize=10) self.put_label_stacked_bar(ax1,7) # Set font for tick labels ax1.tick_params(axis='both', which='major', labelsize=8) ax1.tick_params(axis='both', which='minor', labelsize=8) ax1.legend(facecolor='#E6E6FA', edgecolor='black', fontsize=8) ax1.set_facecolor('#F0F0F0') figure.tight_layout() canvas.draw() def box_plot(self, df, x, y, hue, figure, canvas, title): figure.clear() ax1 = figure.add_subplot(1,1,1) sns.boxplot(data = df, x = x, y = y, hue = hue, ax=ax1) ax1.set_title(title, fontsize=10) ax1.set_xlabel(x, fontsize=10) ax1.set_ylabel(y, fontsize=10) # Set font for tick labels ax1.tick_params(axis='both', 
                        which='major', labelsize=8)
        ax1.tick_params(axis='both', which='minor', labelsize=8)
        ax1.set_facecolor('#F0F0F0')
        ax1.legend(facecolor='#E6E6FA', edgecolor='black')

        # Get the legend from the axis
        legend = ax1.get_legend()

        # Set the title of the legend
        legend.set_title(hue)
        figure.tight_layout()
        canvas.draw()

    def plot_corr_mat(self, df, figure, canvas):
        figure.clear()
        ax = figure.add_subplot(1,1,1)
        categorical_columns = df.select_dtypes(include=['object', 'category']).columns
        df_removed = df.drop(columns=categorical_columns)
        corrdata = df_removed.corr()

        annot_kws = {"size": 5}

        # Keep only correlations whose absolute value is greater than 0.1
        mask = abs(corrdata) > 0.1
        filtered_corr = corrdata[mask]

        # Drops features that don't meet the threshold
        filtered_corr = filtered_corr.dropna(axis=0, how='all')
        filtered_corr = filtered_corr.dropna(axis=1, how='all')

        sns.heatmap(filtered_corr, ax=ax, lw=1, annot=True, cmap="Greens",
                    annot_kws=annot_kws)
        ax.set_title('Correlation Matrix (Threshold > 0.1)', fontweight="bold", fontsize=10)

        # Set font for x and y labels
        ax.set_xlabel('Features', fontweight="bold", fontsize=12)
        ax.set_ylabel('Features', fontweight="bold", fontsize=12)

        # Set font for tick labels
        ax.tick_params(axis='both', which='major', labelsize=5)
        ax.tick_params(axis='both', which='minor', labelsize=5)

        figure.tight_layout()
        canvas.draw()

    def choose_plot(self, df1, df2, chosen, figure1, canvas1, figure2, canvas2):
        print(chosen)
        if chosen == "Day":
            self.plot_piechart(df2, "Day", figure1, canvas1, "Case Distribution of Day")
        elif chosen == "Month":
            self.plot_piechart(df2, "Month", figure2, canvas2, "Case Distribution of Month")
        elif chosen == "Quarter":
            self.plot_piechart(df2, "Quarter", figure1, canvas1, "Case Distribution of Quarter")
        elif chosen == "Year":
            self.plot_piechart(df2, "Year", figure2, canvas2, "Case Distribution of Year")
        elif chosen == "Missing Values":
            self.plot_missing_values(df1, figure1, canvas1)
        elif chosen == "Correlation Coefficient":
            self.plot_corr_coeffs(df1, figure2,
                                  canvas2)
        elif chosen == "Summary":
            self.plot_piechart(df2, chosen, figure1, canvas1,
                               "Case Distribution of Weather Summary")
        elif chosen == "Daily Summary":
            self.plot_piechart(df2, "Daily_Summary", figure2, canvas2,
                               "Case Distribution of Weather Daily Summary")
        elif chosen == "Precipitation Type":
            self.plot_piechart(df2, "Precip_Type", figure1, canvas1,
                               "Case Distribution of Precipitation Type")
        elif chosen == "Temperature (C) versus Humidity versus Year":
            self.plot_scatter(df2, "Temperature", "Humidity", "Year", figure1, canvas1)
        elif chosen == "Wind Speed (km/h) versus Visibility (km) versus Quarter":
            self.plot_scatter(df2, "Wind_Speed", "Visibility", "Quarter_Cat", figure2, canvas2)
        elif chosen == "Pressure (millibars) versus Apparent Temperature (C) versus Month":
            # The menu entry names Apparent Temperature, so plot that column
            self.plot_scatter(df2, "Pressure", "Apparent_Temperature", "Month_Cat",
                              figure1, canvas1)
        elif chosen == "Temperature (C) versus Visibility (km) versus Day":
            self.plot_scatter(df2, "Temperature", "Visibility", "Day_Cat", figure2, canvas2)
        elif chosen == "Day versus Categorized Temperature":
            self.dist_one_vs_another_plot(df2, "Day", "Temperature_Cat",
                                          figure1, canvas1, chosen)
        elif chosen == "Year versus Categorized Humidity":
            self.dist_one_vs_another_plot(df2, "Year", "Humidity_Cat",
                                          figure2, canvas2, chosen)
        elif chosen == "Week versus Categorized Wind Bearing Speed":
            self.dist_one_vs_another_plot(df2, "Week", "Wind_Bearing_Cat",
                                          figure1, canvas1, chosen)
        elif chosen == "Month versus Categorized Visibility":
            self.dist_one_vs_another_plot(df2, "Month", "Visibility_Cat",
                                          figure2, canvas2, chosen)
        elif chosen == "Quarter versus Categorized Humidity":
            self.dist_one_vs_another_plot(df2, "Quarter", "Humidity_Cat",
                                          figure1, canvas1, chosen)
        elif chosen == "Year versus Temperature Per Categorized Visibility":
            self.box_plot(df2, "Year", "Temperature", "Visibility_Cat",
                          figure2, canvas2, chosen)
        elif chosen == "Month versus Wind Bearing Per Categorized Pressure":
            self.box_plot(df2, "Month_Cat", "Wind_Bearing", "Pressure_Cat",
                          figure1, canvas1, chosen)
        elif chosen == "Quarter versus Humidity Per Categorized Wind Speed":
            self.box_plot(df2, "Quarter_Cat", "Humidity", "Wind_Speed_Cat",
                          figure2, canvas2, chosen)
        elif chosen == "Day versus Temperature Per Categorized Temperature":
            self.box_plot(df2, "Day_Cat", "Temperature", "Temperature_Cat",
                          figure1, canvas1, chosen)
        elif chosen == "Distribution of Categorized Visibility":
            self.plot_piechart(df2, "Visibility_Cat", figure2, canvas2, chosen)
        elif chosen == "Distribution of Categorized Temperature":
            self.plot_piechart(df2, "Temperature_Cat", figure1, canvas1, chosen)
        elif chosen == "Distribution of Categorized Wind Speed":
            self.plot_piechart(df2, "Wind_Speed_Cat", figure2, canvas2, chosen)
        elif chosen == "Distribution of Categorized Wind Bearing":
            self.plot_piechart(df2, "Wind_Bearing_Cat", figure1, canvas1, chosen)
        elif chosen == "Distribution of Categorized Pressure":
            self.plot_piechart(df2, "Pressure_Cat", figure2, canvas2, chosen)
        elif chosen == "Distribution of Categorized Humidity":
            self.plot_piechart(df2, "Humidity_Cat", figure1, canvas1, chosen)
        elif chosen == "Correlation Matrix":
            self.plot_corr_mat(df1, figure2, canvas2)

    def line_plot_year_wise(self, df, feat1, feat2, year1, year2, figure, canvas):
        figure.clear()
        ax1 = figure.add_subplot(2, 1, 1)
        data1 = df[df["Year"] == year1]
        data2 = df[df["Year"] == year2]

        # Convert the column and index to NumPy arrays
        date_index1 = data1.index.to_numpy()
        date_index2 = data2.index.to_numpy()

        # Line plot
        ax1.plot(date_index1, data1[feat1].to_numpy(), color="red", marker='o',
                 linestyle='-', linewidth=1, markersize=1, label=feat1)
        ax1.plot(date_index1, data1[feat2].to_numpy(), color="blue", marker='o',
                 linestyle='-', linewidth=1, markersize=1, label=feat2)
        ax1.set_xlabel('YEAR')
        ax1.set_title(feat1 + " and " + feat2 + ' (YEAR = ' + str(year1) + ')', fontsize=12)
        ax1.legend(facecolor='#E6E6FA', edgecolor='black')
        ax1.set_facecolor('#F0F0F0')
        ax1.grid(True)

        ax2 = figure.add_subplot(2, 1, 2)
        ax2.plot(date_index2, data2[feat1].to_numpy(), color="red", marker='o',
                 linestyle='-', linewidth=1, markersize=1, label=feat1)
        ax2.plot(date_index2, data2[feat2].to_numpy(), color="blue", marker='o',
                 linestyle='-', linewidth=1, markersize=1, label=feat2)
        ax2.set_xlabel('YEAR')
        ax2.set_title(feat1 + " and " + feat2 + ' (YEAR = ' + str(year2) + ')', fontsize=12)
        ax2.legend(facecolor='#E6E6FA', edgecolor='black')
        ax2.set_facecolor('#F0F0F0')
        ax2.grid(True)

        figure.tight_layout()
        canvas.draw()

    def line_plot_norm_data(self, norm_data, figure, canvas, label, title):
        figure.clear()
        ax = figure.add_subplot(1, 1, 1)

        # Convert the column and index to NumPy arrays
        date_index = norm_data.index.to_numpy()

        # Iterate over all columns (excluding 'Date')
        for column in norm_data.columns:
            if column != 'Date':
                values = norm_data[column].to_numpy()
                ax.plot(values, date_index, marker='o', linestyle='-',
                        linewidth=1, markersize=2, label=column)

        ax.set_ylabel(label)
        ax.set_title(title, fontsize=12)
        ax.legend(fontsize=7, facecolor='#E6E6FA', edgecolor='black')
        ax.set_facecolor('#F0F0F0')
        ax.grid(True)
        figure.tight_layout()
        canvas.draw()

    def line_plot_data_mean_ewm(self, data_mean, data_ewm, feat1, feat2,
                                figure, canvas, label, title):
        figure.clear()
        ax1 = figure.add_subplot(2, 1, 1)

        # Convert the column and index to NumPy arrays
        date_index = data_mean.index.to_numpy()

        # Line plot
        ax1.plot(date_index, data_mean[feat1].to_numpy(), color="red", marker='o',
                 linestyle='-', linewidth=2, markersize=1, label="Mean")
        ax1.plot(date_index, data_ewm[feat1].to_numpy(), color="blue", marker='o',
                 linestyle='-', linewidth=2, markersize=1, label="EWM")
        ax1.set_title(title + feat1, fontsize=12)
        ax1.legend(facecolor='#E6E6FA', edgecolor='black')
        ax1.set_facecolor('#F0F0F0')
        ax1.grid(True)

        ax2 = figure.add_subplot(2, 1, 2)
        ax2.plot(date_index, data_mean[feat2].to_numpy(), color="red", marker='o',
                 linestyle='-', linewidth=2, markersize=1, label="Mean")
        ax2.plot(date_index, data_ewm[feat2].to_numpy(),
                 color="blue", marker='o', linestyle='-', linewidth=2, markersize=1,
                 label="EWM")
        ax2.set_xlabel(label)
        ax2.set_title(title + feat2, fontsize=12)
        ax2.legend(facecolor='#E6E6FA', edgecolor='black')
        ax2.grid(True)
        ax2.set_facecolor('#F0F0F0')
        figure.tight_layout()
        canvas.draw()

    def box_violin_strip_heat(self, data, filter, feat1, figure1, canvas1,
                              figure2, canvas2, year=""):
        figure1.clear()
        ax1 = figure1.add_subplot(2, 1, 1)
        sns.boxplot(x=filter, y=feat1, data=data, ax=ax1)
        ax1.set_title("Box Plot of " + feat1 + " by " + filter + " " + year, fontsize=12)
        # Set font for tick labels
        ax1.tick_params(axis='both', which='major', labelsize=6)
        ax1.tick_params(axis='both', which='minor', labelsize=6)
        ax1.grid(True)
        ax1.set_facecolor('#F0F0F0')

        ax2 = figure1.add_subplot(2, 1, 2)
        sns.violinplot(x=filter, y=feat1, data=data, ax=ax2)
        ax2.set_title("Violin Plot of " + feat1 + " by " + filter + " " + year, fontsize=12)
        # Set font for tick labels
        ax2.tick_params(axis='both', which='major', labelsize=6)
        ax2.tick_params(axis='both', which='minor', labelsize=6)
        ax2.grid(True)
        ax2.set_facecolor('#F0F0F0')
        figure1.tight_layout()
        canvas1.draw()

        figure2.clear()
        ax3 = figure2.add_subplot(2, 1, 1)
        sns.stripplot(x=filter, y=feat1, data=data, ax=ax3)
        ax3.set_title("Strip Plot of " + feat1 + " by " + filter + " " + year, fontsize=12)
        # Set font for tick labels
        ax3.tick_params(axis='both', which='major', labelsize=6)
        ax3.tick_params(axis='both', which='minor', labelsize=6)
        ax3.set_facecolor('#F0F0F0')
        ax3.grid(True)

        ax4 = figure2.add_subplot(2, 1, 2)
        sns.swarmplot(x=filter, y=feat1, data=data, ax=ax4)
        ax4.set_title("Swarm Plot of " + feat1 + " by " + filter + " " + year, fontsize=12)
        # Set font for tick labels
        ax4.tick_params(axis='both', which='major', labelsize=6)
        ax4.tick_params(axis='both', which='minor', labelsize=6)
        ax4.grid(True)
        ax4.set_facecolor('#F0F0F0')
        figure2.tight_layout()
        canvas2.draw()

    def choose_year_wise(self, df, data_mean, data_ewm, data_norm, chosen,
                         figure1, canvas1, figure2, canvas2):
        if chosen == "Temperature (C) and Apparent Temperature (C)":
            self.line_plot_year_wise(df, "Temperature", "Apparent_Temperature",
                                     2016, 2016, figure1, canvas1)
        if chosen == "Wind Speed (km/h) and Visibility (km)":
            self.line_plot_year_wise(df, "Wind_Speed", "Visibility",
                                     2016, 2016, figure2, canvas2)
        if chosen == "Wind Bearing (degrees) and Pressure (millibars)":
            self.line_plot_year_wise(df, "Wind_Bearing", "Pressure",
                                     2016, 2016, figure1, canvas1)
        if chosen == "Year-Wise Mean EWM Temperature (C) and Apparent Temperature (C)":
            self.line_plot_data_mean_ewm(data_mean, data_ewm, "Temperature",
                                         "Apparent_Temperature", figure2, canvas2,
                                         "YEAR", "Year-Wise Mean and EWM of ")
        if chosen == "Year-Wise Mean EWM Wind Speed (km/h) and Visibility (km)":
            self.line_plot_data_mean_ewm(data_mean, data_ewm, "Wind_Speed",
                                         "Visibility", figure1, canvas1,
                                         "YEAR", "Year-Wise Mean and EWM of ")
        if chosen == "Normalized Year-Wise Data":
            self.line_plot_norm_data(data_norm, figure1, canvas1, "YEAR",
                                     "Normalized Year-Wise Data")
        if chosen == "Temperature (C) by Year":
            self.box_violin_strip_heat(df[(df['Year'] == 2015) | (df['Year'] == 2016)],
                                       "Year", "Temperature",
                                       figure1, canvas1, figure2, canvas2, chosen)
        if chosen == "Wind Speed (km/h) by Year":
            self.box_violin_strip_heat(df[(df['Year'] == 2010) | (df['Year'] == 2011)],
                                       "Year", "Wind_Speed",
                                       figure1, canvas1, figure2, canvas2, chosen)
        if chosen == "Visibility (km) by Year":
            self.box_violin_strip_heat(df[(df['Year'] == 2012) | (df['Year'] == 2013)],
                                       "Year", "Visibility",
                                       figure1, canvas1, figure2, canvas2, chosen)
        if chosen == "Pressure (millibars) by Year":
            self.box_violin_strip_heat(df[(df['Year'] == 2014) | (df['Year'] == 2015)],
                                       "Year", "Pressure",
                                       figure1, canvas1, figure2, canvas2, chosen)
        if chosen == "Apparent Temperature (C) by Year":
            self.box_violin_strip_heat(df[(df['Year'] == 2007) | (df['Year'] == 2008)],
                                       "Year", "Apparent_Temperature",
                                       figure1, canvas1, figure2, canvas2, chosen)
        if chosen == "Wind Bearing (degrees) by Year":
            self.box_violin_strip_heat(df[(df['Year'] == 2008) | (df['Year'] == 2009)],
                                       "Year", "Wind_Bearing",
                                       figure1, canvas1, figure2, canvas2, chosen)

    def line_plot_month_wise(self, df, feat1, feat2, year, filter, filter1, filter2,
                             figure, canvas):
        figure.clear()
        ax1 = figure.add_subplot(2, 1, 1)
        data1 = df[(df["Year"] == year) & (df[filter] == filter1)]
        data2 = df[(df["Year"] == year) & (df[filter] == filter2)]

        # Convert the column and index to NumPy arrays
        date_index1 = data1.index.to_numpy()
        date_index2 = data2.index.to_numpy()

        # Line plot
        ax1.plot(date_index1, data1[feat1].to_numpy(), color="red", marker='o',
                 linestyle='-', linewidth=2, markersize=1, label=feat1)
        ax1.plot(date_index1, data1[feat2].to_numpy(), color="blue", marker='o',
                 linestyle='-', linewidth=2, markersize=1, label=feat2)
        ax1.set_xlabel('DATE')
        ax1.set_title(feat1 + " and " + feat2 + " " + filter + " = " + filter1 +
                      " " + str(year), fontsize=12)
        ax1.legend(facecolor='#E6E6FA', edgecolor='black')
        ax1.set_facecolor('#F0F0F0')
        ax1.grid(True)
        # Set font for tick labels
        ax1.tick_params(axis='both', which='major', labelsize=7)
        ax1.tick_params(axis='both', which='minor', labelsize=7)

        ax2 = figure.add_subplot(2, 1, 2)
        ax2.plot(date_index2, data2[feat1].to_numpy(), color="red", marker='o',
                 linestyle='-', linewidth=2, markersize=1, label=feat1)
        ax2.plot(date_index2, data2[feat2].to_numpy(), color="blue", marker='o',
                 linestyle='-', linewidth=2, markersize=1, label=feat2)
        ax2.set_xlabel('DATE')
        ax2.set_title(feat1 + " and " + feat2 + " " + filter + " = " + filter2 +
                      " " + str(year), fontsize=12)
        ax2.legend(facecolor='#E6E6FA', edgecolor='black')
        ax2.set_facecolor('#F0F0F0')
        ax2.grid(True)
        # Set font for tick labels
        ax2.tick_params(axis='both', which='major', labelsize=7)
        ax2.tick_params(axis='both', which='minor', labelsize=7)

        figure.tight_layout()
        canvas.draw()

    def color_month(self, month):
        if month == 1:
            return 'January','blue'
        elif month == 2:
            return 'February','green'
        elif month == 3:
            return 'March','orange'
        elif month == 4:
            return 'April','yellow'
        elif month == 5:
            return 'May','red'
        elif month == 6:
            return 'June','violet'
        elif month == 7:
            return 'July','purple'
        elif month == 8:
            return 'August','black'
        elif month == 9:
            return 'September','brown'
        elif month == 10:
            return 'October','darkblue'
        elif month == 11:
            return 'November','grey'
        else:
            return 'December','pink'

    def line_plot_month(self, month, data, ax):
        label, color = self.color_month(month)
        mdata = data[data.index.month == month]
        date_index = mdata.index.to_numpy()
        ax.plot(date_index, mdata.to_numpy(), marker='o', linestyle='-',
                color=color, linewidth=2, markersize=1, label=label)

    def sns_plot_month(self, monthly_data, feat, title, figure, canvas):
        figure.clear()
        ax = figure.add_subplot(1, 1, 1)
        ax.set_title(title, fontsize=12)
        ax.set_xlabel('YEAR', fontsize=10)
        ax.set_ylabel(feat, fontsize=10)

        # One line per calendar month
        for i in range(1,13):
            self.line_plot_month(i, monthly_data[feat], ax)

        ax.legend(facecolor='#E6E6FA', edgecolor='black')
        ax.grid()
        ax.set_facecolor('#F0F0F0')
        figure.tight_layout()
        canvas.draw()

    def choose_month_wise(self, df2, data_mean, data_ewm, data_norm, chosen,
                          figure1, canvas1, figure2, canvas2):
        if chosen == "Quarter-Wise Temperature (C) and Apparent Temperature (C)":
            self.line_plot_month_wise(df2, "Temperature", "Apparent_Temperature", 2016,
                                      "Quarter_Cat", "Jan-March", "April-June",
                                      figure1, canvas1)
        if chosen == "Quarter-Wise Wind Speed (km/h) and Visibility (km)":
            self.line_plot_month_wise(df2, "Wind_Speed", "Visibility", 2016,
                                      "Quarter_Cat", "July-Sept", "Oct-Dec",
                                      figure2, canvas2)
        if chosen == "Month-Wise Temperature (C) and Apparent Temperature (C)":
            self.line_plot_month_wise(df2, "Temperature", "Apparent_Temperature", 2016,
                                      "Month_Cat", "February", "March",
                                      figure1, canvas1)
            self.line_plot_month_wise(df2, "Temperature", "Apparent_Temperature", 2016,
                                      "Month_Cat", "April", "May",
                                      figure2, canvas2)
        if chosen == "Month-Wise Mean EWM Wind Speed (km/h) and Visibility (km)":
            self.line_plot_data_mean_ewm(data_mean, data_ewm, "Wind_Speed",
                                         "Visibility", figure1, canvas1,
                                         "MONTH", "Month-Wise Mean and EWM of ")
        if chosen == "Month-Wise Mean EWM Temperature (C) and Apparent Temperature (C)":
            self.line_plot_data_mean_ewm(data_mean, data_ewm, "Temperature",
                                         "Apparent_Temperature", figure2, canvas2,
                                         "MONTH", "Month-Wise Mean and EWM of ")
        if chosen == "Month-Wise Temperature (C)":
            self.sns_plot_month(data_mean, "Temperature", chosen, figure1, canvas1)
        if chosen == "Month-Wise Apparent Temperature (C)":
            self.sns_plot_month(data_mean, "Apparent_Temperature", chosen, figure2, canvas2)
        if chosen == "Month-Wise Wind Speed (km/h)":
            self.sns_plot_month(data_mean, "Wind_Speed", chosen, figure1, canvas1)
        if chosen == "Month-Wise Wind Bearing (degrees)":
            self.sns_plot_month(data_mean, "Wind_Bearing", chosen, figure2, canvas2)
        if chosen == "Month-Wise Visibility (km)":
            self.sns_plot_month(data_mean, "Visibility", chosen, figure1, canvas1)
        if chosen == "Month-Wise Pressure (millibars)":
            self.sns_plot_month(data_mean, "Pressure", chosen, figure2, canvas2)
        if chosen == "Month-Wise Humidity":
            self.sns_plot_month(data_mean, "Humidity", chosen, figure1, canvas1)
        if chosen == "Normalized Month-Wise Data":
            self.line_plot_norm_data(data_norm, figure1, canvas1, "YEAR",
                                     "Normalized Month-Wise Data")
        if chosen == "Temperature (C) by Month":
            self.box_violin_strip_heat(df2[df2["Year"] == 2016], "Month_Cat",
                                       "Temperature",
                                       figure1, canvas1, figure2, canvas2, "2016")
        if chosen == "Wind Speed (km/h) by Month":
            self.box_violin_strip_heat(df2[df2["Year"] == 2016], "Month_Cat",
                                       "Wind_Speed",
                                       figure1, canvas1, figure2, canvas2, "2016")
        if chosen == "Apparent Temperature (C) by Month":
            self.box_violin_strip_heat(df2[df2["Year"] == 2016], "Month_Cat",
                                       "Apparent_Temperature",
                                       figure1, canvas1, figure2, canvas2, "2016")
        if chosen == "Wind Bearing (degrees) by Month":
            self.box_violin_strip_heat(df2[df2["Year"] == 2016], "Month_Cat",
                                       "Wind_Bearing", figure1, canvas1, figure2,
                                       canvas2, "2016")
        if chosen == "Visibility (km) by Month":
            self.box_violin_strip_heat(df2[df2["Year"] == 2016], "Month_Cat",
                                       "Visibility",
                                       figure1, canvas1, figure2, canvas2, "2016")
        if chosen == "Pressure (millibars) by Month":
            self.box_violin_strip_heat(df2[df2["Year"] == 2016], "Month_Cat",
                                       "Pressure",
                                       figure1, canvas1, figure2, canvas2, "2016")

    def plot_rf_importance(self, X, y, figure, canvas):
        result_rf = self.obj_data.feat_importance_rf(X, y)
        figure.clear()
        plot1 = figure.add_subplot(1,1,1)
        sns.set_color_codes("pastel")
        sns.barplot(x='Values', y='Features', data=result_rf, color="Blue", ax=plot1)
        plot1.set_title('Random Forest Features Importance', fontweight="bold", fontsize=14)
        plot1.set_xlabel('Features Importance', fontsize=10)
        plot1.set_ylabel('Feature Labels', fontsize=10)
        # Set font for tick labels
        plot1.tick_params(axis='both', which='major', labelsize=7)
        plot1.tick_params(axis='both', which='minor', labelsize=7)
        figure.tight_layout()
        canvas.draw()

    def plot_et_importance(self, X, y, figure, canvas):
        result_et = self.obj_data.feat_importance_et(X, y)
        figure.clear()
        plot1 = figure.add_subplot(1,1,1)
        sns.set_color_codes("pastel")
        sns.barplot(x='Values', y='Features', data=result_et, color="Red", ax=plot1)
        plot1.set_title('Extra Trees Features Importance', fontweight="bold", fontsize=14)
        plot1.set_xlabel('Features Importance', fontsize=10)
        plot1.set_ylabel('Feature Labels', fontsize=10)
        # Set font for tick labels
        plot1.tick_params(axis='both', which='major', labelsize=7)
        plot1.tick_params(axis='both', which='minor', labelsize=7)
        figure.tight_layout()
        canvas.draw()

    def plot_rfe_importance(self, X, y, figure, canvas):
        result_lg = self.obj_data.feat_importance_rfe(X, y)
        figure.clear()
        plot1 = figure.add_subplot(1,1,1)
        sns.set_color_codes("pastel")
        sns.barplot(x='Ranking', y='Features', data=result_lg, color="orange", ax=plot1)
        plot1.set_title('RFE Features Importance', fontweight="bold", fontsize=14)
        plot1.set_xlabel('Features Importance',
                         fontsize=10)
        plot1.set_ylabel('Feature Labels', fontsize=10)
        # Set font for tick labels
        plot1.tick_params(axis='both', which='major', labelsize=7)
        plot1.tick_params(axis='both', which='minor', labelsize=7)
        figure.tight_layout()
        canvas.draw()

    def choose_feat_importance(self, chosen, X, y, figure1, canvas1, figure2, canvas2):
        if chosen == "RF Features Importance":
            self.plot_rf_importance(X, y, figure1, canvas1)
        if chosen == "ET Features Importance":
            self.plot_et_importance(X, y, figure2, canvas2)
        if chosen == "RFE Features Importance":
            self.plot_rfe_importance(X, y, figure1, canvas1)

    def scatter_train_test_regression(self, ytrain, ytest, predictions_train,
                                      predictions_test, figure, canvas, label):
        # Visualizes the training set results in a scatter plot
        figure.clear()
        ax1 = figure.add_subplot(2, 1, 1)
        ax1.scatter(x=ytrain, y=predictions_train, s=3, color='red', label='Training Data')
        ax1.set_title('The actual versus predicted (Training set): ' + label,
                      fontweight='bold', fontsize=10)
        ax1.set_xlabel('Actual Train Set', fontsize=8)
        ax1.set_ylabel('Predicted Train Set', fontsize=8)
        # Diagonal reference line: points on it are predicted exactly
        ax1.plot([ytrain.min(), ytrain.max()], [ytrain.min(), ytrain.max()], 'b--',
                 linewidth=2, label='Perfect Prediction')
        ax1.grid(True)
        ax1.set_facecolor('#F0F0F0')
        ax1.legend(facecolor='#E6E6FA', edgecolor='black')

        # Visualizes the test set results in a scatter plot
        ax2 = figure.add_subplot(2, 1, 2)
        ax2.scatter(x=ytest, y=predictions_test, s=3, color='red', label='Test Data')
        ax2.set_title('The actual versus predicted (Test set): ' + label,
                      fontweight='bold', fontsize=10)
        ax2.set_xlabel('Actual Test Set', fontsize=8)
        ax2.set_ylabel('Predicted Test Set', fontsize=8)
        ax2.plot([ytest.min(), ytest.max()], [ytest.min(), ytest.max()], 'b--',
                 linewidth=2, label='Perfect Prediction')
        ax2.grid(True)
        ax2.set_facecolor('#F0F0F0')
        ax2.legend(facecolor='#E6E6FA', edgecolor='black')
        figure.tight_layout()
        canvas.draw()

    def lineplot_train_test_regression(self, ytrain, ytest, yval, yfinal,
                                       predictions_train, predictions_test,
                                       predictions_val, all_pred, figure,
                                       canvas, label):
        figure.clear()

        # Training set
        ax1 = figure.add_subplot(4, 1, 1)
        ax1.plot(ytrain.index.to_numpy(), ytrain.to_numpy(), color="blue",
                 linewidth=1, linestyle="-", label='Actual')
        ax1.plot(ytrain.index.to_numpy(), predictions_train, color="red",
                 linewidth=1, linestyle="-", label='Predicted')
        ax1.set_title('Actual and Predicted Training Set: ' + label, fontsize=10)
        ax1.set_xlabel('Date', fontsize=8)
        ax1.set_ylabel("Temperature", fontsize=8)
        ax1.legend(prop={'size': 8}, facecolor='#E6E6FA', edgecolor='black')
        ax1.grid(True)
        ax1.set_facecolor('#F0F0F0')
        # Set font for tick labels
        ax1.tick_params(axis='both', which='major', labelsize=8)
        ax1.tick_params(axis='both', which='minor', labelsize=8)

        # Test set
        ax2 = figure.add_subplot(4, 1, 2)
        ax2.plot(ytest.index.to_numpy(), ytest.to_numpy(), color="blue",
                 linewidth=1, linestyle="-", label='Actual')
        ax2.plot(ytest.index.to_numpy(), predictions_test, color="red",
                 linewidth=1, linestyle="-", label='Predicted')
        ax2.set_title('Actual and Predicted Test Set: ' + label, fontsize=10)
        ax2.set_xlabel('Date', fontsize=8)
        ax2.set_ylabel("Temperature", fontsize=8)
        ax2.legend(prop={'size': 8}, facecolor='#E6E6FA', edgecolor='black')
        ax2.grid(True)
        ax2.set_facecolor('#F0F0F0')
        # Set font for tick labels
        ax2.tick_params(axis='both', which='major', labelsize=8)
        ax2.tick_params(axis='both', which='minor', labelsize=8)

        # Validation set
        ax3 = figure.add_subplot(4, 1, 3)
        ax3.plot(yval.index.to_numpy(), yval.to_numpy(), color="blue",
                 linewidth=1, linestyle="-", label='Actual')
        ax3.plot(yval.index.to_numpy(), predictions_val, color="red",
                 linewidth=1, linestyle="-", label='Predicted')
        ax3.set_title('Actual and Predicted Validation Set (90 days forecasting) ' + label,
                      fontsize=8)
        ax3.set_xlabel('Date', fontsize=8)
        ax3.set_ylabel("Temperature", fontsize=8)
        ax3.legend(prop={'size': 8}, facecolor='#E6E6FA', edgecolor='black')
        ax3.grid(True)
        ax3.set_facecolor('#F0F0F0')
        # Set font for tick labels
        ax3.tick_params(axis='both', which='major', labelsize=8)
        ax3.tick_params(axis='both',
                        which='minor', labelsize=8)

        # Whole data set
        ax4 = figure.add_subplot(4, 1, 4)
        ax4.plot(yfinal.index.to_numpy(), yfinal.to_numpy(), color="blue",
                 linewidth=1, linestyle="-", label='Actual')
        ax4.plot(yfinal.index.to_numpy(), all_pred, color="red",
                 linewidth=1, linestyle="-", label='Predicted')
        ax4.set_title('Actual and Predicted All Set ' + label, fontsize=8)
        ax4.set_xlabel('Date', fontsize=8)
        ax4.set_ylabel("Temperature", fontsize=8)
        ax4.legend(prop={'size': 8}, facecolor='#E6E6FA', edgecolor='black')
        ax4.grid(True)
        ax4.set_facecolor('#F0F0F0')
        # Set font for tick labels
        ax4.tick_params(axis='both', which='major', labelsize=8)
        ax4.tick_params(axis='both', which='minor', labelsize=8)
        figure.tight_layout()
        canvas.draw()

    def choose_plot_regression(self, chosen, X_final_reg, X_train_reg, X_test_reg,
                               X_val_reg, y_final_reg, y_train_reg, y_test_reg,
                               y_val_reg, figure1, canvas1, figure2, canvas2):
        if chosen == "Linear Regression":
            best_lin_reg = self.obj_reg.linear_regression(X_train_reg, y_train_reg)
            predictions_test, predictions_train, predictions_val, all_pred = self.obj_reg.perform_regression(
                best_lin_reg, X_final_reg, y_final_reg, X_train_reg, y_train_reg,
                X_test_reg, y_test_reg, X_val_reg, y_val_reg, chosen)

            self.scatter_train_test_regression(y_train_reg, y_test_reg,
                predictions_train, predictions_test, figure1, canvas1, chosen)
            self.lineplot_train_test_regression(y_train_reg, y_test_reg, y_val_reg, y_final_reg,
                predictions_train, predictions_test, predictions_val, all_pred,
                figure2, canvas2, chosen)

        if chosen == "RF Regression":
            best_rf_reg = self.obj_reg.rf_regression(X_train_reg, y_train_reg)
            predictions_test, predictions_train, predictions_val, all_pred = self.obj_reg.perform_regression(
                best_rf_reg, X_final_reg, y_final_reg, X_train_reg, y_train_reg,
                X_test_reg, y_test_reg, X_val_reg, y_val_reg, chosen)

            self.scatter_train_test_regression(y_train_reg, y_test_reg,
                predictions_train, predictions_test, figure1, canvas1, chosen)
            self.lineplot_train_test_regression(y_train_reg,
                y_test_reg, y_val_reg, y_final_reg,
                predictions_train, predictions_test, predictions_val, all_pred,
                figure2, canvas2, chosen)

        if chosen == "Decision Trees Regression":
            best_dt_reg = self.obj_reg.dt_regression(X_train_reg, y_train_reg)
            predictions_test, predictions_train, predictions_val, all_pred = self.obj_reg.perform_regression(
                best_dt_reg, X_final_reg, y_final_reg, X_train_reg, y_train_reg,
                X_test_reg, y_test_reg, X_val_reg, y_val_reg, chosen)

            self.scatter_train_test_regression(y_train_reg, y_test_reg,
                predictions_train, predictions_test, figure1, canvas1, chosen)
            self.lineplot_train_test_regression(y_train_reg, y_test_reg, y_val_reg, y_final_reg,
                predictions_train, predictions_test, predictions_val, all_pred,
                figure2, canvas2, chosen)

        if chosen == "Gradient Boosting Regression":
            best_gb_reg = self.obj_reg.gb_regression(X_train_reg, y_train_reg)
            predictions_test, predictions_train, predictions_val, all_pred = self.obj_reg.perform_regression(
                best_gb_reg, X_final_reg, y_final_reg, X_train_reg, y_train_reg,
                X_test_reg, y_test_reg, X_val_reg, y_val_reg, chosen)

            self.scatter_train_test_regression(y_train_reg, y_test_reg,
                predictions_train, predictions_test, figure1, canvas1, chosen)
            self.lineplot_train_test_regression(y_train_reg, y_test_reg, y_val_reg, y_final_reg,
                predictions_train, predictions_test, predictions_val, all_pred,
                figure2, canvas2, chosen)

        if chosen == "XGB Regression":
            best_xgb_reg = self.obj_reg.xgb_regression(X_train_reg, y_train_reg)
            predictions_test, predictions_train, predictions_val, all_pred = self.obj_reg.perform_regression(
                best_xgb_reg, X_final_reg, y_final_reg, X_train_reg, y_train_reg,
                X_test_reg, y_test_reg, X_val_reg, y_val_reg, chosen)

            self.scatter_train_test_regression(y_train_reg, y_test_reg,
                predictions_train, predictions_test, figure1, canvas1, chosen)
            self.lineplot_train_test_regression(y_train_reg, y_test_reg, y_val_reg, y_final_reg,
                predictions_train, predictions_test, predictions_val, all_pred,
                figure2, canvas2, chosen)

        if chosen == "MLP Regression":
            best_mlp_reg = self.obj_reg.mlp_regression(X_train_reg, y_train_reg)
            predictions_test, predictions_train, predictions_val, all_pred = self.obj_reg.perform_regression(
                best_mlp_reg, X_final_reg, y_final_reg, X_train_reg, y_train_reg,
                X_test_reg, y_test_reg, X_val_reg, y_val_reg, chosen)

            self.scatter_train_test_regression(y_train_reg, y_test_reg,
                predictions_train, predictions_test, figure1, canvas1, chosen)
            self.lineplot_train_test_regression(y_train_reg, y_test_reg, y_val_reg, y_final_reg,
                predictions_train, predictions_test, predictions_val, all_pred,
                figure2, canvas2, chosen)

        if chosen == "Lasso Regression":
            best_lasso_reg = self.obj_reg.lasso_regression(X_train_reg, y_train_reg)
            predictions_test, predictions_train, predictions_val, all_pred = self.obj_reg.perform_regression(
                best_lasso_reg, X_final_reg, y_final_reg, X_train_reg, y_train_reg,
                X_test_reg, y_test_reg, X_val_reg, y_val_reg, chosen)

            self.scatter_train_test_regression(y_train_reg, y_test_reg,
                predictions_train, predictions_test, figure1, canvas1, chosen)
            self.lineplot_train_test_regression(y_train_reg, y_test_reg, y_val_reg, y_final_reg,
                predictions_train, predictions_test, predictions_val, all_pred,
                figure2, canvas2, chosen)

        if chosen == "Ridge Regression":
            best_ridge_reg = self.obj_reg.ridge_regression(X_train_reg, y_train_reg)
            predictions_test, predictions_train, predictions_val, all_pred = self.obj_reg.perform_regression(
                best_ridge_reg, X_final_reg, y_final_reg, X_train_reg, y_train_reg,
                X_test_reg, y_test_reg, X_val_reg, y_val_reg, chosen)

            self.scatter_train_test_regression(y_train_reg, y_test_reg,
                predictions_train, predictions_test, figure1, canvas1, chosen)
            self.lineplot_train_test_regression(y_train_reg, y_test_reg, y_val_reg, y_final_reg,
                predictions_train, predictions_test, predictions_val, all_pred,
                figure2, canvas2, chosen)

        if chosen == "AdaBoost Regression":
            best_ada_reg = self.obj_reg.adaboost_regression(X_train_reg, y_train_reg)
            predictions_test, predictions_train, predictions_val, all_pred = self.obj_reg.perform_regression(
                best_ada_reg, X_final_reg, y_final_reg, X_train_reg, y_train_reg,
                X_test_reg, y_test_reg, X_val_reg, y_val_reg, chosen)

            self.scatter_train_test_regression(y_train_reg, y_test_reg,
                predictions_train, predictions_test, figure1, canvas1, chosen)
            self.lineplot_train_test_regression(y_train_reg, y_test_reg, y_val_reg, y_final_reg,
                predictions_train, predictions_test, predictions_val, all_pred,
                figure2, canvas2, chosen)

        if chosen == "KNN Regression":
            best_knn_reg = self.obj_reg.knn_regression(X_train_reg, y_train_reg)
            predictions_test, predictions_train, predictions_val, all_pred = self.obj_reg.perform_regression(
                best_knn_reg, X_final_reg, y_final_reg, X_train_reg, y_train_reg,
                X_test_reg, y_test_reg, X_val_reg, y_val_reg, chosen)

            self.scatter_train_test_regression(y_train_reg, y_test_reg,
                predictions_train, predictions_test, figure1, canvas1, chosen)
            self.lineplot_train_test_regression(y_train_reg, y_test_reg, y_val_reg, y_final_reg,
                predictions_train, predictions_test, predictions_val, all_pred,
                figure2, canvas2, chosen)

    def plot_cm(self, y_test, ypred, name, figure, canvas):
        figure.clear()

        # Plots confusion matrix
        ax1 = figure.add_subplot(1,1,1)
        cm = confusion_matrix(y_test, ypred)
        sns.heatmap(cm, annot=True, linewidth=3, linecolor='red', fmt='g',
                    cmap="viridis", annot_kws={"size": 14}, ax=ax1)
        ax1.set_title('Confusion Matrix' + " of " + name, fontsize=12)
        ax1.set_xlabel('Y predict', fontsize=10)
        ax1.set_ylabel('Y test', fontsize=10)
        ax1.set_facecolor('#F0F0F0')
        figure.tight_layout()
        canvas.draw()

    # Plots true values versus predicted values diagram and learning curve
    def plot_real_pred_val_learning_curve(self, model, X_train, y_train, X_test, y_test,
                                          ypred, name, figure, canvas):
        figure.clear()

        # Plots true values versus predicted values diagram
        ax1 = figure.add_subplot(2,1,1)
        acc = accuracy_score(y_test, ypred)
        ax1.scatter(range(len(ypred)), ypred, color="blue", lw=2, label="Predicted")
        ax1.scatter(range(len(y_test)), y_test, color="red", label="Actual")
        ax1.set_title("Predicted Values vs True Values of " + name, fontsize=12)
        ax1.set_xlabel("Accuracy: " + str(round((acc*100),3)) + "%")
        ax1.legend(facecolor='#E6E6FA', edgecolor='black')
        ax1.grid(True, alpha=0.75, lw=1, ls='-.')
        ax1.set_facecolor('#F0F0F0')

        # Plots learning curve
        train_sizes = np.linspace(.1, 1.0, 5)
        train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(
            model, X_train, y_train, cv=None, n_jobs=None,
            train_sizes=train_sizes, return_times=True)
        train_scores_mean = np.mean(train_scores, axis=1)
        train_scores_std = np.std(train_scores, axis=1)
        test_scores_mean = np.mean(test_scores, axis=1)
        test_scores_std = np.std(test_scores, axis=1)

        ax2 = figure.add_subplot(2,1,2)
        # Shaded bands show one standard deviation around each mean score;
        # each band uses the same color as its line
        ax2.fill_between(train_sizes, train_scores_mean - train_scores_std,
                         train_scores_mean + train_scores_std, alpha=0.1, color="b")
        ax2.fill_between(train_sizes, test_scores_mean - test_scores_std,
                         test_scores_mean + test_scores_std, alpha=0.1, color="r")
        ax2.plot(train_sizes, train_scores_mean, 'o-', color="b", label="Training score")
        ax2.plot(train_sizes, test_scores_mean, 'o-', color="r",
                 label="Cross-validation score")
        ax2.legend(loc="best", facecolor='#E6E6FA', edgecolor='black')
        ax2.set_title("Learning curve of " + name, fontsize=12)
        # The x-axis holds the training-set sizes, not the fit times
        ax2.set_xlabel("Training examples")
        ax2.set_ylabel("Score")
        ax2.grid(True, alpha=0.75, lw=1, ls='-.')
        ax2.set_facecolor('#F0F0F0')
        figure.tight_layout()
        canvas.draw()

    def choose_plot_ML(self, root, chosen, X_train, X_test, y_train, y_test,
                       figure1, canvas1, figure2, canvas2):
        if chosen == "Logistic Regression":
            best_model, y_pred = self.obj_ml.implement_LR(chosen, X_train, X_test, y_train, y_test)

            # Plots confusion matrix
            self.plot_cm(y_test, y_pred, chosen, figure1, canvas1)

            # Plots true values versus predicted values diagram and learning curve
            self.plot_real_pred_val_learning_curve(best_model, X_train, y_train,
                X_test, y_test, y_pred, chosen, figure2, canvas2)

            # Shows table of results
            df_lr = self.obj_data.read_dataset("results_LR.csv")
            self.shows_table(root, df_lr, 450, 750, "Y_test and Y_pred of Logistic Regression")

        if chosen == "Random Forest":
            best_model, y_pred = self.obj_ml.implement_RF(chosen, X_train, X_test, y_train, y_test)

            # Plots confusion matrix
            self.plot_cm(y_test, y_pred, chosen, figure1, canvas1)

            # Plots true values versus predicted values diagram and learning curve
            self.plot_real_pred_val_learning_curve(best_model, X_train, y_train,
                X_test, y_test, y_pred, chosen, figure2, canvas2)

            # Shows table of results
            df_rf = self.obj_data.read_dataset("results_RF.csv")
            self.shows_table(root, df_rf, 450, 750, "Y_test and Y_pred of Random Forest")

        if chosen == "K-Nearest Neighbors":
            best_model, y_pred = self.obj_ml.implement_KNN(chosen, X_train, X_test, y_train, y_test)

            # Plots confusion matrix
            self.plot_cm(y_test, y_pred, chosen, figure1, canvas1)

            # Plots true values versus predicted values diagram and learning curve
            self.plot_real_pred_val_learning_curve(best_model, X_train, y_train,
                X_test, y_test, y_pred, chosen, figure2, canvas2)

            # Shows table of results
            df_knn = self.obj_data.read_dataset("results_KNN.csv")
            self.shows_table(root, df_knn, 450, 750, "Y_test and Y_pred of KNN")

        if chosen == "Decision Trees":
            best_model, y_pred = self.obj_ml.implement_DT(chosen, X_train, X_test, y_train, y_test)

            # Plots confusion matrix
            self.plot_cm(y_test, y_pred, chosen, figure1, canvas1)

            # Plots true values versus predicted values diagram and learning curve
            self.plot_real_pred_val_learning_curve(best_model, X_train, y_train,
                X_test, y_test, y_pred, chosen, figure2, canvas2)

            # Shows table of results
            df_dt = self.obj_data.read_dataset("results_DT.csv")
            self.shows_table(root, df_dt, 450, 750, "Y_test and Y_pred of Decision Trees")

        if chosen == "Gradient Boosting":
            best_model, y_pred = self.obj_ml.implement_GB(chosen, X_train, X_test, y_train, y_test)

            # Plots confusion matrix
            self.plot_cm(y_test, y_pred, chosen, figure1, canvas1)

            # Plots true values versus predicted values diagram and learning curve
            self.plot_real_pred_val_learning_curve(best_model, X_train, y_train,
                X_test, y_test, y_pred, chosen, figure2, canvas2)

            # Shows table of results
            df_gb = self.obj_data.read_dataset("results_GB.csv")
            self.shows_table(root, df_gb, 450, 750, "Y_test and Y_pred of Gradient Boosting")

        if chosen == "Extreme Gradient Boosting":
            best_model, y_pred = self.obj_ml.implement_XGB(chosen, X_train, X_test, y_train, y_test)

            # Plots confusion matrix
            self.plot_cm(y_test, y_pred, chosen, figure1, canvas1)

            # Plots true values versus predicted values diagram and learning curve
            self.plot_real_pred_val_learning_curve(best_model, X_train, y_train,
                X_test, y_test, y_pred, chosen, figure2, canvas2)

            # Shows table of results
            df_xgb = self.obj_data.read_dataset("results_XGB.csv")
            self.shows_table(root, df_xgb, 450, 750,
                             "Y_test and Y_pred of Extreme Gradient Boosting")

        if chosen == "Multi-Layer Perceptron":
            best_model, y_pred = self.obj_ml.implement_MLP(chosen, X_train, X_test, y_train, y_test)

            # Plots confusion matrix
            self.plot_cm(y_test, y_pred, chosen, figure1, canvas1)

            # Plots true values versus predicted values diagram and learning curve
            self.plot_real_pred_val_learning_curve(best_model, X_train, y_train,
                X_test, y_test, y_pred, chosen, figure2, canvas2)

            # Shows table of results
            df_mlp = self.obj_data.read_dataset("results_MLP.csv")
            self.shows_table(root, df_mlp, 450, 750,
                             "Y_test and Y_pred of Multi-Layer Perceptron")

        if chosen == "Support Vector Classifier":
            best_model, y_pred = self.obj_ml.implement_SVC(chosen, X_train, X_test, y_train, y_test)

            # Plots confusion matrix
            self.plot_cm(y_test, y_pred, chosen, figure1, canvas1)

            # Plots true values versus predicted values diagram and learning curve
            self.plot_real_pred_val_learning_curve(best_model, X_train, y_train,
                X_test, y_test, y_pred, chosen, figure2, canvas2)

            # Shows table of results
            df_svc = self.obj_data.read_dataset("results_SVC.csv")
            self.shows_table(root, df_svc, 450, 750, "Y_test and Y_pred of Support Vector
Classifier") if chosen == "AdaBoost": best_model, y_pred = self.obj_ml.implement_ADA(chosen, X_train, X_test, y_train, y_test) #Plots confusion matrix self.plot_cm(y_test, y_pred, chosen, figure1, canvas1) #Plots true values versus predicted values diagram and learning curve self.plot_real_pred_val_learning_curve(best_model, X_train, y_train, X_test, y_test, y_pred, chosen, figure2, canvas2) #Shows table of result df_ada = self.obj_data.read_dataset("results_ADA.csv") self.shows_table(root, df_ada, 450, 750, "Y_test and Y_pred of AdaBoost Classifier") #regression.py import pandas as pd import numpy as np from sklearn.preprocessing import MinMaxScaler import joblib from sklearn.metrics import mean_squared_error, mean_absolute_error from sklearn.metrics import roc_auc_score,roc_curve, r2_score, explained_variance_score from sklearn.linear_model import LinearRegression from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor from sklearn.model_selection import GridSearchCV from sklearn.tree import DecisionTreeRegressor from xgboost import XGBRegressor from sklearn.neural_network import MLPRegressor from sklearn.linear_model import LassoCV from sklearn.linear_model import RidgeCV from sklearn.neighbors import KNeighborsRegressor class Regression: def splitting_data_regression(self, X, y_final): #Normalizes data scaler = MinMaxScaler() X_minmax_data = scaler.fit_transform(X) X_final = pd.DataFrame(columns=X.columns, data=X_minmax_data, index=X.index) print('Shape of features : ', X_final.shape) print('Shape of target : ', y_final.shape) #Shifts target array to predict the n + 1 samples n=90 y_final = y_final.shift(-1) y_val = y_final[-n:-1] y_final = y_final[:-n] #Takes last n rows of data to be validation set X_val = X_final[-n:-1] X_final = X_final[:-n] print("\n -----After process------ \n") print('Shape of features : ', X_final.shape) print('Shape of target : ', y_final.shape) print(y_final.tail().to_string()) 
        y_final=y_final.astype('float64')

        #Splits data into training and test data at 80% and 20% respectively.
        #The index is computed on X_final (after the last n rows were held out),
        #so the test set is never empty on short datasets.
        split_idx=round(0.8*len(X_final))
        print("split_idx=",split_idx)
        X_train_reg = X_final[:split_idx]
        y_train_reg = y_final[:split_idx]
        X_test_reg = X_final[split_idx:]
        y_test_reg = y_final[split_idx:]

        #Saves into pkl files
        joblib.dump(X, 'X_Ori.pkl')
        joblib.dump(X_final, 'X_final_reg.pkl')
        joblib.dump(X_train_reg, 'X_train_reg.pkl')
        joblib.dump(X_test_reg, 'X_test_reg.pkl')
        joblib.dump(X_val, 'X_val_reg.pkl')
        joblib.dump(y_final, 'y_final_reg.pkl')
        joblib.dump(y_train_reg, 'y_train_reg.pkl')
        joblib.dump(y_test_reg, 'y_test_reg.pkl')
        joblib.dump(y_val, 'y_val_reg.pkl')

    def load_regression_files(self):
        X_Ori = joblib.load('X_Ori.pkl')
        X_final_reg = joblib.load('X_final_reg.pkl')
        X_train_reg = joblib.load('X_train_reg.pkl')
        X_test_reg = joblib.load('X_test_reg.pkl')
        X_val_reg = joblib.load('X_val_reg.pkl')
        y_final_reg = joblib.load('y_final_reg.pkl')
        y_train_reg = joblib.load('y_train_reg.pkl')
        y_test_reg = joblib.load('y_test_reg.pkl')
        y_val_reg = joblib.load('y_val_reg.pkl')

        return X_Ori, X_final_reg, X_train_reg, X_test_reg, X_val_reg, \
            y_final_reg, y_train_reg, y_test_reg, y_val_reg

    def perform_regression(self, model, X, y, xtrain, ytrain, xtest, ytest, xval, yval, label):
        model.fit(xtrain, ytrain)
        predictions_test = model.predict(xtest)
        predictions_train = model.predict(xtrain)
        predictions_val = model.predict(xval)

        # Convert ytest and predictions_test to NumPy arrays
        ytest_np = ytest.to_numpy().flatten()
        predictions_test_np = predictions_test.flatten()

        str_label = 'RMSE using ' + label
        print(str_label + f': {np.sqrt(mean_squared_error(ytest_np, predictions_test_np))}')
        print("mean square error: ", mean_squared_error(ytest_np, predictions_test_np))
        # explained_variance_score is not R-squared, so it is labeled separately
        print("explained variance: ", explained_variance_score(ytest_np, predictions_test_np))
        print("mean absolute error (MAE): ", mean_absolute_error(ytest_np, predictions_test_np))
        print("R2 (R-squared): ", r2_score(ytest_np, predictions_test_np))
        print("Adjusted R2: ", 1 - (1-r2_score(ytest_np, predictions_test_np))
            *(len(ytest_np)-1)/(len(ytest_np)-xtest.shape[1]-1))

        # Percentage errors divide by the actual values, so they are
        # undefined wherever ytest_np is zero
        mean_percentage_error = np.mean((ytest_np - predictions_test_np) / ytest_np) * 100
        print("Mean Percentage Error (MPE): ", mean_percentage_error)
        mean_absolute_percentage_error = np.mean(np.abs((ytest_np - predictions_test_np) / ytest_np)) * 100
        print("Mean Absolute Percentage Error (MAPE): ", mean_absolute_percentage_error)

        print('ACTUAL: Avg. ' + f': {ytest_np.mean()}')
        print('ACTUAL: Median ' + f': {np.median(ytest_np)}')
        print('PREDICTED: Avg. ' + f': {predictions_test_np.mean()}')
        print('PREDICTED: Median ' + f': {np.median(predictions_test_np)}')

        # Evaluation of regression on all dataset
        all_pred = model.predict(X)
        print("mean square error (whole dataset): ", mean_squared_error(y, all_pred))
        print("explained variance (whole dataset): ", explained_variance_score(y, all_pred))

        return predictions_test, predictions_train, predictions_val, all_pred

    def linear_regression(self, X_train, y_train):
        #Linear Regression
        #Creates a Linear Regression model
        lin_reg = LinearRegression()

        #Defines the hyperparameter grid to search.
        #'normalize' was removed from LinearRegression in scikit-learn 1.2;
        #the features are already min-max scaled in splitting_data_regression.
        param_grid = {
            'fit_intercept': [True, False]  # Try both True and False for fit_intercept
        }

        #Creates GridSearchCV with the Linear Regression model and the hyperparameter grid
        grid_search = GridSearchCV(lin_reg, param_grid, cv=5, scoring='neg_mean_squared_error')

        #Fits the GridSearchCV to the training data
        grid_search.fit(X_train, y_train)

        #Gets the best Linear Regression model from the grid search
        best_lin_reg = grid_search.best_estimator_

        #Prints the best hyperparameters found
        print("Best Hyperparameters for Linear Regression:")
        print(grid_search.best_params_)

        return best_lin_reg

    def rf_regression(self, X_train, y_train):
        #Random Forest Regression
        # Create a RandomForestRegressor model
        rf_reg = RandomForestRegressor()

        # Define the hyperparameter grid to search
        param_grid = {
            'n_estimators': [50, 100, 150],   # Number of trees in the forest
            'max_depth': [None, 5, 10],       # Maximum depth of the tree
            'min_samples_split': [2, 5, 10],  # Minimum number of samples required to split an internal node
            'min_samples_leaf': [1, 2, 4],    # Minimum number of samples required to be at a leaf node
            'bootstrap': [True, False]        # Whether bootstrap samples are used when building trees
        }

        # Create GridSearchCV with the RandomForestRegressor model and the hyperparameter grid
        grid_search = GridSearchCV(rf_reg, param_grid, cv=5, scoring='neg_mean_squared_error')

        # Fit the GridSearchCV to the training data
        grid_search.fit(X_train, y_train)

        # Get the best RandomForestRegressor model from the grid search
        best_rf_reg = grid_search.best_estimator_

        # Print the best hyperparameters found
        print("Best Hyperparameters for RandomForestRegressor:")
        print(grid_search.best_params_)

        return best_rf_reg

    def dt_regression(self, X_train, y_train):
        #Decision Tree (DT) regression
        # Create a DecisionTreeRegressor model
        dt_reg = DecisionTreeRegressor(random_state=100)

        # Define the hyperparameter grid to search
        param_grid = {
            'max_depth': [None, 5, 10, 15],    # Maximum depth of the tree
            'min_samples_split': [2, 5, 10],   # Minimum number of samples required to split an internal node
            'min_samples_leaf': [1, 2, 4, 6],  # Minimum number of samples required to be at a leaf node
        }

        # Create GridSearchCV with the DecisionTreeRegressor model and the hyperparameter grid
        grid_search = GridSearchCV(dt_reg, param_grid, cv=5, scoring='neg_mean_squared_error')

        # Fit the GridSearchCV to the training data
        grid_search.fit(X_train, y_train)

        # Get the best DecisionTreeRegressor model from the grid search
        best_dt_reg = grid_search.best_estimator_

        # Print the best hyperparameters found
        print("Best Hyperparameters for DecisionTreeRegressor:")
        print(grid_search.best_params_)

        return best_dt_reg

    def gb_regression(self, X_train, y_train):
        #Gradient Boosting regression
        # Create the GradientBoostingRegressor model
        gb_reg = GradientBoostingRegressor()

        # Define the hyperparameter grid to search
        param_grid = {
            'n_estimators': [50, 100, 150],     # Number of boosting stages (trees) to build
            'learning_rate': [0.01, 0.1, 0.5],  # Step size at each boosting iteration
            'max_depth': [3, 5, 7],             # Maximum depth of the individual trees
            'min_samples_split': [2, 5, 10],    # Minimum number of samples required to split an internal node
            'min_samples_leaf': [1, 2, 4],      # Minimum number of samples required to be at a leaf node
        }

        # Create GridSearchCV with the GradientBoostingRegressor model and the hyperparameter grid
        grid_search = GridSearchCV(gb_reg, param_grid, cv=5, scoring='neg_mean_squared_error')

        # Fit the GridSearchCV to the training data
        grid_search.fit(X_train, y_train)

        # Get the best GradientBoostingRegressor model from the grid search
        best_gb_reg = grid_search.best_estimator_

        # Print the best hyperparameters found
        print("Best Hyperparameters for GradientBoostingRegressor:")
        print(grid_search.best_params_)

        return best_gb_reg

    def xgb_regression(self, X_train, y_train):
        #Extreme Gradient Boosting (XGB)
        # Create the XGBRegressor model
        xgb_reg = XGBRegressor()

        # Define the hyperparameter grid to search
        param_grid = {
            'n_estimators': [50, 100, 150],     # Number of boosting stages (trees) to build
            'learning_rate': [0.01, 0.1, 0.5],  # Step size at each boosting iteration
            'max_depth': [3, 5, 7],             # Maximum depth of the individual trees
            'min_child_weight': [1, 2, 4],      # Minimum sum of instance weight (hessian) needed in a child
            'gamma': [0, 0.1, 0.2],             # Minimum loss reduction required to make a further partition on a leaf node
            'subsample': [0.8, 1.0],            # Subsample ratio of the training instances
            'colsample_bytree': [0.8, 1.0]      # Subsample ratio of columns when constructing each tree
        }

        # Create GridSearchCV with the XGBRegressor model and the hyperparameter grid
        grid_search = GridSearchCV(xgb_reg, param_grid, cv=5, scoring='neg_mean_squared_error')

        # Fit the GridSearchCV to the training data
        grid_search.fit(X_train, y_train)

        # Get the best XGBRegressor model from the grid search
        best_xgb_reg = grid_search.best_estimator_

        # Print the best hyperparameters found
        print("Best Hyperparameters for XGBRegressor:")
        print(grid_search.best_params_)

        return best_xgb_reg

    def mlp_regression(self, X_train, y_train):
        #MLP regression
        # Create the MLPRegressor model
        mlp_reg = MLPRegressor()

        # Define the hyperparameter grid to search
        param_grid = {
            'hidden_layer_sizes': [(50,), (100,), (50, 50), (100, 50)],  # Number of neurons in each hidden layer
            'activation': ['relu', 'tanh'],                              # Activation function for the hidden layers
            'solver': ['adam', 'sgd'],                                   # Solver for weight optimization
            'learning_rate': ['constant', 'invscaling', 'adaptive'],     # Learning rate schedule
            'learning_rate_init': [0.01, 0.001],                         # Initial learning rate
            'max_iter': [100, 200, 300],                                 # Maximum number of iterations
        }

        # Create GridSearchCV with the MLPRegressor model and the hyperparameter grid
        grid_search = GridSearchCV(mlp_reg, param_grid, cv=5, scoring='neg_mean_squared_error')

        # Fit the GridSearchCV to the training data
        grid_search.fit(X_train, y_train)

        # Get the best MLPRegressor model from the grid search
        best_mlp_reg = grid_search.best_estimator_

        # Print the best hyperparameters found
        print("Best Hyperparameters for MLPRegressor:")
        print(grid_search.best_params_)

        return best_mlp_reg

    def lasso_regression(self, X_train, y_train):
        # Create the LassoCV model
        lasso_reg = LassoCV(n_alphas=1000, max_iter=3000, random_state=0)

        # Define the hyperparameter grid to search.
        # 'normalize' was removed from LassoCV in scikit-learn 1.2.
        param_grid = {
            'fit_intercept': [True, False]  # Whether to calculate the intercept for this model
        }

        # Create GridSearchCV with the LassoCV model and the hyperparameter grid
        grid_search = GridSearchCV(lasso_reg, param_grid, cv=5, scoring='neg_mean_squared_error')

        # Fit the GridSearchCV to the training data
        grid_search.fit(X_train, y_train)

        # Get the best LassoCV model from the grid search
        best_lasso_reg = grid_search.best_estimator_

        # Print the best hyperparameters found
        print("Best Hyperparameters for Lasso Regression:")
        print(grid_search.best_params_)

        return best_lasso_reg

    def ridge_regression(self, X_train, y_train):
        #Ridge regression
        ridge_reg = RidgeCV(gcv_mode='auto')

        # Define the hyperparameter grid to search.
        # 'normalize' was removed from RidgeCV in scikit-learn 1.2.
        param_grid = {
            'fit_intercept': [True, False]  # Whether to calculate the intercept for this model
        }

        # Create GridSearchCV with the RidgeCV model and the hyperparameter grid
        grid_search = GridSearchCV(ridge_reg, param_grid, cv=5, scoring='neg_mean_squared_error')

        # Fit the GridSearchCV to the training data
        grid_search.fit(X_train, y_train)

        # Get the best RidgeCV model from the grid search
        best_ridge_reg = grid_search.best_estimator_

        # Print the best hyperparameters found
        print("Best Hyperparameters for Ridge Regression:")
        print(grid_search.best_params_)

        return best_ridge_reg

    def adaboost_regression(self, X_train, y_train):
        #Adaboost regression
        # Create the AdaBoostRegressor model
        ada_reg = AdaBoostRegressor()

        # Define the hyperparameter grid to search
        param_grid = {
            'n_estimators': [50, 100, 150],              # Number of boosting stages (trees) to build
            'learning_rate': [0.01, 0.1, 0.5],           # Step size at each boosting iteration
            'loss': ['linear', 'square', 'exponential']  # Loss function to use when updating weights
        }

        # Create GridSearchCV with the AdaBoostRegressor model and the hyperparameter grid
        grid_search = GridSearchCV(ada_reg, param_grid, cv=5, scoring='neg_mean_squared_error')

        # Fit the GridSearchCV to the training data
        grid_search.fit(X_train, y_train)

        # Get the best AdaBoostRegressor model from the grid search
        best_ada_reg = grid_search.best_estimator_

        # Print the best hyperparameters found
        print("Best Hyperparameters for AdaBoostRegressor:")
        print(grid_search.best_params_)

        return best_ada_reg

    def knn_regression(self, X_train, y_train):
        #KNN regression
        # Create a KNeighborsRegressor model
        knn_reg = KNeighborsRegressor()

        # Define the hyperparameter grid to search
        param_grid = {
            'n_neighbors': [3, 5, 7, 9],                            # Number of neighbors to use for regression
            'weights': ['uniform', 'distance'],                     # Weight function used in prediction
            'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']  # Algorithm used to compute the nearest neighbors
        }

        # Create GridSearchCV with the KNeighborsRegressor model and the hyperparameter grid
        grid_search = GridSearchCV(knn_reg, param_grid, cv=5, scoring='neg_mean_squared_error')

        # Fit the GridSearchCV to the training data
        grid_search.fit(X_train, y_train)

        # Get the best KNeighborsRegressor model from the grid search
        best_knn_reg = grid_search.best_estimator_

        # Print the best hyperparameters found
        print("Best Hyperparameters for KNeighborsRegressor:")
        print(grid_search.best_params_)

        return best_knn_reg

#machine_learning.py
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler
import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, precision_score
# plot_confusion_matrix is not imported here: this class does not use it,
# and it was removed in scikit-learn 1.2
from sklearn.metrics import classification_report, f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
import os
import pandas as pd
from process_data import Process_Data

class Machine_Learning:
    def __init__(self):
        self.obj_data = Process_Data()

    def oversampling_splitting(self, df):
        #Sets target column
        y = df["Summary"].values

        #Ensures y is of integer type
        y = y.astype(int)

        #Drops irrelevant columns
        X = df.drop(["Summary"], axis =1)

        # Check and convert data types
        X = X.astype(float)

        sm = SMOTE(random_state=42)
        X,y = sm.fit_resample(X, y.ravel())

        #Splits the data into training and testing
        X_train, X_test, y_train, y_test = train_test_split(X, y,
            test_size = 0.2, random_state = 2023, stratify=y)

        #Use Standard Scaler
        scaler = StandardScaler()
        X_train_stand = scaler.fit_transform(X_train)
        X_test_stand = scaler.transform(X_test)

        #Saves into pkl files
        joblib.dump(X_train_stand, 'X_train.pkl')
        joblib.dump(X_test_stand, 'X_test.pkl')
        joblib.dump(y_train, 'y_train.pkl')
        joblib.dump(y_test, 'y_test.pkl')

    def load_files(self):
        X_train = joblib.load('X_train.pkl')
        X_test = joblib.load('X_test.pkl')
        y_train = joblib.load('y_train.pkl')
        y_test = joblib.load('y_test.pkl')

        return X_train, X_test, y_train, y_test

    def predict_model(self, model, X, proba=False):
        # "not proba" is the logical negation; "~proba" is a bitwise
        # inversion that is truthy for both True and False
        if not proba:
            y_pred = model.predict(X)
        else:
            y_pred_proba = model.predict_proba(X)
            # Maps the winning column back to its class label
            y_pred = model.classes_[np.argmax(y_pred_proba, axis=1)]

        return y_pred

    def run_model(self, name, model, X_train, X_test, y_train, y_test, proba=False):
        y_pred = self.predict_model(model, X_test, proba)

        accuracy = accuracy_score(y_test, y_pred)
        recall = recall_score(y_test, y_pred, average='weighted')
        precision = precision_score(y_test, y_pred, average='weighted')
        f1 = f1_score(y_test, y_pred, average='weighted')

        print(name)
        print('accuracy: ', accuracy)
        print('recall: ', recall)
        print('precision: ', precision)
        print('f1: ', f1)
        print(classification_report(y_test, y_pred))

        return y_pred

    def logistic_regression(self, name, X_train, y_train):
        #Logistic Regression Classifier
        # Define the parameter grid for the grid search.
        # Note: the 'liblinear' solver does not support penalty 'none', so
        # those combinations fail and are scored NaN by GridSearchCV.
        # scikit-learn 1.2 and later spell this option None instead of 'none'.
        param_grid = {
            'C': [0.01, 0.1, 1, 10],
            'penalty': ['none', 'l2'],
            'solver': ['newton-cg', 'lbfgs', 'liblinear', 'saga'],
        }

        # Initialize the Logistic Regression model
        logreg = LogisticRegression(max_iter=100, random_state=2021)

        # Create GridSearchCV with the Logistic Regression model and the parameter grid
        grid_search = GridSearchCV(logreg, param_grid, cv=3, scoring='accuracy', n_jobs=-1)

        # Train and perform grid search
        grid_search.fit(X_train, y_train)

        # Get the best Logistic Regression model from the grid search
        best_model = grid_search.best_estimator_

        #Saves model
        joblib.dump(best_model, 'LR_Model.pkl')

        # Print the best hyperparameters found
        print("Best Hyperparameters for LR:")
        print(grid_search.best_params_)

        return best_model

    def implement_LR(self, chosen, X_train, X_test, y_train, y_test):
        file_path = os.getcwd()+"/LR_Model.pkl"
        if os.path.exists(file_path):
            model = joblib.load('LR_Model.pkl')
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)
        else:
            model = self.logistic_regression(chosen, X_train, y_train)
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)

        #Saves result into CSV file
        self.obj_data.save_result(y_test, y_pred, "results_LR.csv")

        print("Training Logistic Regression done...")
        return model, y_pred

    def random_forest(self, name, X_train, y_train):
        #Random Forest Classifier
        # Define the parameter grid for the grid search
        param_grid = {
            'n_estimators': [10, 20, 30],
            'max_depth': [5, 10, 15, 20, 25],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4]
        }

        # Initialize the RandomForestClassifier model
        rf = RandomForestClassifier(random_state=2021)

        # Create GridSearchCV with the RandomForestClassifier model and the parameter grid
        grid_search = GridSearchCV(rf, param_grid, cv=3, scoring='accuracy', n_jobs=-1)

        # Train and perform grid search
        grid_search.fit(X_train, y_train)

        # Get the best RandomForestClassifier model from the grid search
        best_model = grid_search.best_estimator_

        #Saves model
        joblib.dump(best_model, 'RF_Model.pkl')

        # Print the best hyperparameters found
        print("Best Hyperparameters for RF:")
        print(grid_search.best_params_)

        return best_model

    def implement_RF(self, chosen, X_train, X_test, y_train, y_test):
        file_path = os.getcwd()+"/RF_Model.pkl"
        if os.path.exists(file_path):
            model = joblib.load('RF_Model.pkl')
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)
        else:
            model = self.random_forest(chosen, X_train, y_train)
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)

        #Saves result into CSV file
        self.obj_data.save_result(y_test, y_pred, "results_RF.csv")

        print("Training Random Forest done...")
        return model, y_pred

    def knearest_neigbors(self, name, X_train, y_train):
        #KNN Classifier
        # Define the parameter grid for the grid search
        param_grid = {
            'n_neighbors': list(range(2, 10))
        }

        # Initialize the KNN Classifier
        knn = KNeighborsClassifier()

        # Create GridSearchCV with the KNN model and the parameter grid
        grid_search = GridSearchCV(knn, param_grid, cv=3, scoring='accuracy', n_jobs=-1)

        # Train and perform grid search
        grid_search.fit(X_train, y_train)

        # Get the best KNN model from the grid search
        best_model = grid_search.best_estimator_

        #Saves model
        joblib.dump(best_model, 'KNN_Model.pkl')

        # Print the best hyperparameters found
        print("Best Hyperparameters for KNN:")
        print(grid_search.best_params_)

        return best_model

    def implement_KNN(self, chosen, X_train, X_test, y_train, y_test):
        file_path = os.getcwd()+"/KNN_Model.pkl"
        if os.path.exists(file_path):
            model = joblib.load('KNN_Model.pkl')
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)
        else:
            model = self.knearest_neigbors(chosen, X_train, y_train)
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)

        #Saves result into CSV file
        self.obj_data.save_result(y_test, y_pred, "results_KNN.csv")

        print("Training KNN done...")
        return model, y_pred

    def decision_trees(self, name, X_train, y_train):
        # Initialize the DecisionTreeClassifier model
        dt_clf = DecisionTreeClassifier(random_state=2021)

        # Define the parameter grid for the grid search
        param_grid = {
            'max_depth': np.arange(1, 11, 1),
            'criterion': ['gini', 'entropy'],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4],
        }

        # Create GridSearchCV with the DecisionTreeClassifier model and the parameter grid
        grid_search = GridSearchCV(dt_clf, param_grid, cv=3, scoring='accuracy', n_jobs=-1)

        # Train and perform grid search
        grid_search.fit(X_train, y_train)

        # Get the best DecisionTreeClassifier model from the grid search
        best_model = grid_search.best_estimator_

        #Saves model
        joblib.dump(best_model, 'DT_Model.pkl')

        # Print the best hyperparameters found
        print("Best Hyperparameters for DT:")
        print(grid_search.best_params_)

        return best_model

    def implement_DT(self, chosen, X_train, X_test, y_train, y_test):
        file_path = os.getcwd()+"/DT_Model.pkl"
        if os.path.exists(file_path):
            model = joblib.load('DT_Model.pkl')
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)
        else:
            model = self.decision_trees(chosen, X_train, y_train)
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)

        #Saves result into CSV file
        self.obj_data.save_result(y_test, y_pred, "results_DT.csv")

        print("Training Decision Trees done...")
        return model, y_pred

    def gradient_boosting(self, name, X_train, y_train):
        #Gradient Boosting Classifier
        # Initialize the GradientBoostingClassifier model
        gbt = GradientBoostingClassifier(random_state=2021)

        # Define the parameter grid for the grid search
        param_grid = {
            'n_estimators': [10, 20, 30],
            'max_depth': [5, 10, 15],
            'subsample': [0.6, 0.8, 1.0],
            'max_features': [0.2, 0.4, 0.6, 0.8, 1.0],
        }

        # Create GridSearchCV with the GradientBoostingClassifier model and the parameter grid
        grid_search = GridSearchCV(gbt, param_grid, cv=3, scoring='accuracy', n_jobs=-1)

        # Train and perform grid search
        grid_search.fit(X_train, y_train)

        # Get the best GradientBoostingClassifier model from the grid search
        best_model = grid_search.best_estimator_

        #Saves model
        joblib.dump(best_model, 'GB_Model.pkl')

        # Print the best hyperparameters found
        print("Best Hyperparameters for GB:")
        print(grid_search.best_params_)

        return best_model

    def implement_GB(self, chosen, X_train, X_test, y_train, y_test):
        file_path = os.getcwd()+"/GB_Model.pkl"
        if os.path.exists(file_path):
            model = joblib.load('GB_Model.pkl')
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)
        else:
            model = self.gradient_boosting(chosen, X_train, y_train)
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)

        #Saves result into CSV file
        self.obj_data.save_result(y_test, y_pred, "results_GB.csv")

        print("Training Gradient Boosting done...")
        return model, y_pred

    def extreme_gradient_boosting(self, name, X_train, y_train):
        # Define the parameter grid for the grid search
        param_grid = {
            'n_estimators': [10, 20, 30],
            'max_depth': [5, 10, 15],
            'learning_rate': [0.01, 0.1, 0.2],
            'subsample': [0.6, 0.8, 1.0],
            'colsample_bytree': [0.6, 0.8, 1.0],
        }

        # Initialize the XGBoost classifier
        xgb = XGBClassifier(random_state=2021, use_label_encoder=False, eval_metric='mlogloss')

        # Create GridSearchCV with the XGBoost classifier and the parameter grid
        grid_search = GridSearchCV(xgb, param_grid, cv=3, scoring='accuracy', n_jobs=-1)

        # Train and perform grid search
        grid_search.fit(X_train, y_train)

        # Get the best XGBoost classifier model from the grid search
        best_model = grid_search.best_estimator_

        #Saves model
        joblib.dump(best_model, 'XGB_Model.pkl')

        # Print the best hyperparameters found
        print("Best Hyperparameters for XGB:")
        print(grid_search.best_params_)

        return best_model

    def implement_XGB(self, chosen, X_train, X_test, y_train, y_test):
        file_path = os.getcwd()+"/XGB_Model.pkl"
        if os.path.exists(file_path):
            model = joblib.load('XGB_Model.pkl')
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)
        else:
            model = self.extreme_gradient_boosting(chosen, X_train, y_train)
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)

        #Saves result into CSV file
        self.obj_data.save_result(y_test, y_pred, "results_XGB.csv")

        print("Training Extreme Gradient Boosting done...")
        return model, y_pred

    def multi_layer_perceptron(self, name, X_train, y_train):
        # Define the parameter grid for the grid search
        param_grid = {
            'hidden_layer_sizes': [(50,), (100,), (50, 50), (100, 50), (100, 100)],
            'activation': ['logistic', 'relu'],
            'solver': ['adam', 'sgd'],
            'alpha': [0.0001, 0.001, 0.01],
            'learning_rate': ['constant', 'invscaling', 'adaptive'],
        }

        # Initialize the MLP Classifier
        mlp = MLPClassifier(random_state=2021)

        # Create GridSearchCV with the MLP Classifier and the parameter grid
        grid_search = GridSearchCV(mlp, param_grid, cv=3, scoring='accuracy', n_jobs=-1)

        # Train and perform grid search
        grid_search.fit(X_train, y_train)

        # Get the best MLP Classifier model from the grid search
        best_model = grid_search.best_estimator_

        #Saves model
        joblib.dump(best_model, 'MLP_Model.pkl')

        # Print the best hyperparameters found
        print("Best Hyperparameters for MLP:")
        print(grid_search.best_params_)

        return best_model

    def implement_MLP(self, chosen, X_train, X_test, y_train, y_test):
        file_path = os.getcwd()+"/MLP_Model.pkl"
        if os.path.exists(file_path):
            model = joblib.load('MLP_Model.pkl')
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)
        else:
            model = self.multi_layer_perceptron(chosen, X_train, y_train)
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)

        #Saves result into CSV file
        self.obj_data.save_result(y_test, y_pred, "results_MLP.csv")

        print("Training Multi-Layer Perceptron done...")
        return model, y_pred

    def support_vector(self, name, X_train, y_train):
        #Support Vector Classifier
        # Define the parameter grid for the grid search
        param_grid = {
            'C': [0.1, 1, 10],
            'kernel': ['linear', 'poly', 'rbf'],
            'gamma': ['scale', 'auto', 0.1, 1],
        }

        # Initialize the SVC model
        model_svc = SVC(random_state=2021, probability=True)

        # Create GridSearchCV with the SVC model and the parameter grid
        grid_search = GridSearchCV(model_svc, param_grid, cv=3, scoring='accuracy',
            n_jobs=-1, refit=True)

        # Train and perform grid search
        grid_search.fit(X_train, y_train)

        # Get the best SVC model from the grid search
        best_model = grid_search.best_estimator_

        #Saves model
        joblib.dump(best_model, 'SVC_Model.pkl')

        # Print the best hyperparameters found
        print("Best Hyperparameters for SVC:")
        print(grid_search.best_params_)

        return best_model

    def implement_SVC(self, chosen, X_train, X_test, y_train, y_test):
        file_path = os.getcwd()+"/SVC_Model.pkl"
        if os.path.exists(file_path):
            model = joblib.load('SVC_Model.pkl')
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)
        else:
            model = self.support_vector(chosen, X_train, y_train)
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)

        #Saves result into CSV file
        self.obj_data.save_result(y_test, y_pred, "results_SVC.csv")

        print("Training Support Vector Classifier done...")
        return model, y_pred

    def adaboost_classifier(self, name, X_train, y_train):
        # Define the parameter grid for the grid search
        param_grid = {
            'n_estimators': [10, 20, 30],
            'learning_rate': [0.01, 0.1, 0.2],
        }

        # Initialize the AdaBoost classifier
        adaboost = AdaBoostClassifier(random_state=2021)

        # Create GridSearchCV with the AdaBoost classifier and the parameter grid
        grid_search = GridSearchCV(adaboost, param_grid, cv=3, scoring='accuracy', n_jobs=-1)

        # Train and perform grid search
        grid_search.fit(X_train, y_train)

        # Get the best AdaBoost Classifier model from the grid search
        best_model = grid_search.best_estimator_

        #Saves model
        joblib.dump(best_model, 'ADA_Model.pkl')

        # Print the best hyperparameters found
        print("Best Hyperparameters for AdaBoost:")
        print(grid_search.best_params_)

        return best_model

    def implement_ADA(self, chosen, X_train, X_test, y_train, y_test):
        file_path = os.getcwd()+"/ADA_Model.pkl"
        if os.path.exists(file_path):
            model = joblib.load('ADA_Model.pkl')
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)
        else:
            model = self.adaboost_classifier(chosen, X_train, y_train)
            y_pred = self.run_model(chosen, model, X_train, X_test, y_train, y_test, proba=True)

        #Saves result into CSV file
        self.obj_data.save_result(y_test, y_pred, "results_ADA.csv")

        print("Training AdaBoost done...")
        return model, y_pred
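The nine implement_* methods in the listing above all follow one pattern: load a saved model if its .pkl file exists, otherwise train and save it, then turn predicted probabilities into class labels. The following is a minimal sketch of that pattern, not the project's own code. The names load_or_train and labels_from_proba are hypothetical, and the sketch uses the standard-library pickle module in place of joblib so that it runs without extra packages. It also maps the highest-probability column back through an explicit list of classes, which matters whenever the labels are not exactly 0, 1, 2, and so on.

```python
import os
import pickle
import tempfile


def load_or_train(model_path, train_fn):
    """Return (model, was_cached).

    Hypothetical helper: loads the object stored at model_path if the file
    exists; otherwise calls train_fn(), saves its result, and returns it.
    """
    if os.path.exists(model_path):
        with open(model_path, "rb") as fh:
            return pickle.load(fh), True
    model = train_fn()
    with open(model_path, "wb") as fh:
        pickle.dump(model, fh)
    return model, False


def labels_from_proba(proba_rows, classes):
    """Map each row of class probabilities to the label with the highest value.

    classes lists the labels in the same order as the probability columns
    (the role that a fitted scikit-learn classifier's classes_ attribute plays).
    """
    return [classes[max(range(len(row)), key=row.__getitem__)]
            for row in proba_rows]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "Demo_Model.pkl")
        # First call trains (here: builds a stand-in dict) and writes the file
        first, cached = load_or_train(path, lambda: {"weights": [0.5, 1.5]})
        print("cached on first call:", cached)
        # Second call finds the file and skips training
        second, cached = load_or_train(path, lambda: {"weights": []})
        print("cached on second call:", cached, "| same model:", second == first)
    print(labels_from_proba([[0.2, 0.8], [0.9, 0.1]], classes=[3, 7]))
```

One detail worth keeping in mind when a flag such as proba selects between the two prediction paths: for a Python bool, `not flag` is the logical negation, whereas `~flag` is a bitwise inversion that yields -1 or -2 and is therefore truthy in both cases.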