4. Data Clustering

Visualizing Clustered Programs in 2021: t-SNE with HDBSCAN Clustering and Program Efficiency


✍️What Has Been Done and What Lies Ahead?

Welcome, everyone, to the thrilling fourth stop on our Blue Box Program Data Analysis journey! So far, we've embarked on an exciting adventure of Exploratory Data Analysis (EDA) and unveiled the intriguing correlations between our program features (parameters). If you missed the previous session – the longest and most informative one – don't worry; I've condensed the key insights below.

In this data-driven expedition, we've unearthed three crucial discoveries:

🌎 Uniform Distribution Across Regions: We noticed a uniform distribution of resources across regions, including Toronto, Ottawa, Vancouver, and Montreal. While it might seem equitable, it has led to inefficiencies in these areas.
💰 High Operations, Low Efficiency: These same regions have the highest program operations and waste collection rates but exhibit lower efficiency. This situation translates into a significant financial burden that could potentially threaten the program's sustainability.
📈 Cost vs. Volume Relationship: We explored the relationship between program costs and collected materials. It appears to follow a power function with an exponent between 0 and 1, which points to diminishing returns: beyond a certain point, each additional dollar brings in less additional material. We need to pinpoint the optimal values for each region to avoid this issue.

Building on our previous session's insights, we're now taking a unique approach. We'll dive into feature analysis to decipher what sets regions that collect more material and serve more households with greater efficiency apart from those that struggle.

But before we venture into feature analysis, we're making a pit stop for an exciting and critical phase: Data Clustering. Our goal is to group similar programs together, enhancing the accuracy of our feature analysis and uncovering hidden patterns.

So, fasten your seatbelts, as we explore the fascinating world of Data Clustering! We're utilizing HDBSCAN, a density-based clustering method that doesn't need the number of clusters in advance and can flag outliers as noise, and we'll visualize the clusters in two dimensions using t-SNE. Get ready for an electrifying session that promises to unveil insights like never before! 🚀📊

🧑‍🔧Model Improvement

Before we dive into clustering, let's discuss some valuable enhancements we've made to our model for increased accuracy and effectiveness:

📊 Streamlined Costs: To combat multicollinearity, we've combined various cost-related features into a single "Operation Cost." This consolidation simplifies our model, addressing high correlations among these costs while preserving their impact on the target variable.
🔄 Interaction Power: We've introduced an "Interaction of Households Serviced and Operation Cost" feature. It multiplies the number of households serviced by the operation cost, so a single feature captures both scale and cost at once.
🕰️ Time Travel: Acknowledging the lingering impact of past years, we've added a "Previous Year Target" feature. This feature accounts for the influence of previous investments, costs, program scale, and households covered on the current year's collected materials.
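
The three enhancements above can be sketched in pandas as follows. Everything here is illustrative: the column names and values are invented, and summing the cost columns is only one plausible way to "combine" them, since the exact combination isn't spelled out above.

```python
import pandas as pd

# Toy data with hypothetical column names (not the real Blue Box schema)
df = pd.DataFrame({
    "program": ["A", "A", "A", "B", "B"],
    "year": [2019, 2020, 2021, 2020, 2021],
    "collection_cost": [100.0, 110.0, 120.0, 50.0, 55.0],
    "processing_cost": [40.0, 45.0, 50.0, 20.0, 22.0],
    "households_serviced": [1000, 1050, 1100, 400, 420],
    "collected_tonnes": [300.0, 310.0, 330.0, 90.0, 95.0],
})

# Streamlined Costs: collapse correlated cost columns into one feature
# (assumption: a simple sum)
cost_columns = ["collection_cost", "processing_cost"]
df["operation_cost"] = df[cost_columns].sum(axis=1)

# Interaction Power: households serviced multiplied by operation cost
df["households_x_cost"] = df["households_serviced"] * df["operation_cost"]

# Time Travel: last year's target for the same program
# (NaN for a program's first recorded year)
df = df.sort_values(["program", "year"])
df["previous_year_target"] = df.groupby("program")["collected_tonnes"].shift(1)

print(df[["program", "year", "operation_cost",
          "households_x_cost", "previous_year_target"]])
```

Rows with a missing `previous_year_target` need a decision before modeling, for example dropping each program's first year or imputing a value.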