
After diving deep into clustering and t-SNE in the previous chapter, I initially didn't plan to create a separate section solely for hyperparameter tuning. It was meant to be an integral part of our Global Feature Analysis. As I progressed, however, I realized just how much fine-tuning a machine-learning model matters: it's a critical step that underpins the reliability of our analysis results.
As you journey through this chapter, you have the option to skip ahead to this section, but I highly recommend you join us in this exciting exploration and watch how we fine-tune parameters to optimize our model's performance.
This framework handles multicollinearity, deploys Randomized Search and Grid Search in tandem to fine-tune hyperparameters, and ultimately checks the stability of our results with an OOB Error vs. Number of Estimators curve. It's an essential part of our data analysis adventure!
It's time to supercharge our model's performance by making savvy feature choices. We've pruned away highly correlated features, retaining only the most influential ones. Now, let's calculate the Variance Inflation Factor (VIF) to fine-tune our model.
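As a quick refresher before the code: for each feature, VIF is obtained by regressing that feature on all the others,

VIF_i = 1 / (1 - R_i^2),

where R_i^2 is the coefficient of determination of that regression. A VIF of 1 means no collinearity at all, and values above roughly 10 are a common rule of thumb for problematic multicollinearity.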
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Calculate Variance Inflation Factor (VIF)
def calculate_vif(data_frame):
    """
    Calculate the Variance Inflation Factor (VIF) for each variable in the DataFrame.

    Parameters:
    - data_frame (DataFrame): The DataFrame containing numerical variables.

    Returns:
    - vif_data_ (DataFrame): DataFrame with columns 'Variable' and 'VIF'
      representing the VIF value for each variable.
    """
    vif_data_ = pd.DataFrame()
    vif_data_["Variable"] = data_frame.columns
    vif_data_["VIF"] = [variance_inflation_factor(data_frame.values, i)
                        for i in range(data_frame.shape[1])]
    return vif_data_
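To see the helper in action, here's a tiny self-contained example on synthetic data (purely illustrative, not from our dataset): two nearly collinear columns come back with inflated VIFs, while an independent column stays near 1.

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
demo = pd.DataFrame({
    "x1": x1,
    "x2": 0.9 * x1 + rng.normal(scale=0.1, size=200),  # nearly collinear with x1
    "x3": rng.normal(size=200),                        # independent of the others
})
print(calculate_vif(demo))  # x1 and x2 show large VIFs; x3 stays near 1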
# Iterate through each year to improve the model: handle multicollinearity,
# prepare training and test data, perform hyperparameter tuning using
# Randomized Search and Grid Search, and visualize the OOB Error vs.
# Number of Estimators curve.
for year in unique_years:
    if year < 2022:  # Condition to limit the loop
        # Section 1: Model Improvement and Handling Multicollinearity
        # Remove irrelevant columns and handle multicollinearity
        recycle_material_filtered = recycle_material.drop(
            columns=['Year', 'Cluster_Probabilities', 'Cluster_Labels',
                     'TOTAL Reported and/or Calculated Marketed Tonnes',
                     'Quadrant', 'Program Code'])

        # Handle missing values and calculate VIF
        numerical_data = recycle_material_filtered.select_dtypes(include=[np.number])
        numerical_data = numerical_data.replace([np.inf, -np.inf], np.nan)
        numerical_data.dropna(inplace=True)
        vif_data = calculate_vif(numerical_data)
        print(vif_data)
VIF Calculation Results: Our trusty tool, the Variance Inflation Factor, is here to help us navigate the intricate world of multicollinearity. We've kept the features with VIF values below 10, which protects the stability of the model's estimates. But here's the twist: three key features came back with elevated VIF scores.
1. Total Households Serviced: High VIF (= 147.55), High Impact
2. Operation Cost: High VIF (= 67.84), High Importance
3. Previous_Target: High VIF (= 114.31), Must-Keep
These three features play a pivotal role in our model due to the unique nature of our programs.
Their high VIF values, however, mean they could bias the model during training. But fret not! We have a solution, and a sketch of the selection logic is shown below.
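Here's a minimal sketch of how that selection could look in code. The must_keep list and the threshold of 10 mirror the choices described above, and the sketch assumes the vif_data and numerical_data variables from the loop are in scope.

# Minimal sketch: drop features whose VIF exceeds 10, except the ones we
# deliberately keep despite their high scores.
must_keep = ['Total Households Serviced', 'Operation Cost', 'Previous_Target']
high_vif = vif_data.loc[vif_data['VIF'] > 10, 'Variable']
to_drop = [col for col in high_vif if col not in must_keep]
selected_features = numerical_data.drop(columns=to_drop)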
🌳 Enter Random Forest 🌳
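Random Forest's bagging and per-split feature sampling make it far more tolerant of correlated predictors than a linear model, which is why we can afford to keep those three features. As a preview of the framework from the intro, here's a minimal, self-contained sketch on stand-in data (the parameter ranges are illustrative, not the exact grids used in our pipeline): Randomized Search casts a wide net, Grid Search refines around the best candidate, and an OOB Error vs. Number of Estimators curve checks that the result is stable.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Stand-in data; in our pipeline this would be the selected features and target.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)

# Step 1: Randomized Search casts a wide net over the hyperparameter space.
random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions={
        'n_estimators': randint(100, 500),
        'max_depth': randint(3, 20),
        'min_samples_split': randint(2, 10),
    },
    n_iter=20, cv=3, random_state=42, n_jobs=-1,
)
random_search.fit(X, y)
best = random_search.best_params_

# Step 2: Grid Search refines in a narrow window around the best candidate.
grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={
        'n_estimators': [max(50, best['n_estimators'] - 50),
                         best['n_estimators'],
                         best['n_estimators'] + 50],
        'max_depth': [best['max_depth'] - 1, best['max_depth'],
                      best['max_depth'] + 1],
        'min_samples_split': [best['min_samples_split']],
    },
    cv=3, n_jobs=-1,
)
grid_search.fit(X, y)
best_grid = grid_search.best_params_

# Step 3: OOB Error vs. Number of Estimators, to confirm the fit is stable.
estimator_range = list(range(50, 401, 50))
oob_errors = []
for n in estimator_range:
    rf = RandomForestRegressor(
        n_estimators=n,
        max_depth=best_grid['max_depth'],
        min_samples_split=best_grid['min_samples_split'],
        oob_score=True, random_state=42, n_jobs=-1,
    )
    rf.fit(X, y)
    oob_errors.append(1 - rf.oob_score_)  # oob_score_ is the OOB R^2 here

plt.plot(estimator_range, oob_errors, marker='o')
plt.xlabel('Number of Estimators')
plt.ylabel('OOB Error (1 - OOB R^2)')
plt.title('OOB Error vs. Number of Estimators')
plt.show()

If the curve flattens out well before the chosen number of trees, we can trust that adding more estimators won't meaningfully change the results.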