
Hello again! I'm thrilled to have you here for the next leg of our journey. The previous section might have been a tad too technical 😅, but fear not; we're about to dive into something truly amazing.
In our last encounter, we fine-tuned our model and tackled multicollinearity head-on, adopting a two-step approach to optimize our 🌳Random Forest model. Now the spotlight shifts to putting that tuned model to work for feature analysis.
This is no ordinary feature selection 🦾; it's a two-layer approach. We'll start with Feature Importance + Random Features, which adds a random twin for each feature and keeps only the features that outperform their random counterparts. And that's just the beginning: in the next step, you'll see Boruta Feature Importance take it up a notch by shuffling each real feature into a shadow twin, opening up a world of possibilities in feature selection.
So, without further ado, let's get started! 💫🏃‍♂️
🤚But wait, I understand that you might be eager for more insights and analysis about the program and data. If that's the case, you have the option to skip this section and head straight to the next one, where we delve into local feature analysis in the Blue Box program. The choice is yours! 🚀
In the realm of data science, precision is key and every feature counts. Our approach combines the power of feature importance with an ingenious twist: for every existing feature, it generates a random counterpart that matches the original's data type. 🚀📊
# Generate random features based on the data type of existing columns in the DataFrame X
import numpy as np
import pandas as pd

def generate_random_features(X):
    """
    Generate Random Features for a Given DataFrame.

    This function generates a random "twin" for each column in X, matching
    the data type of the original column.

    Parameters:
    X (DataFrame): Input DataFrame containing the original features.

    Returns:
    DataFrame: A new DataFrame with the random features appended to the originals.
    """
    random_features = pd.DataFrame(index=X.index)  # align twins with X's rows
    for column in X.columns:
        if X[column].dtype == bool:
            random_feature = np.random.choice([0, 1], size=len(X))
        elif X[column].dtype == float:
            random_feature = np.random.uniform(0, 1, size=len(X))
        else:
            random_feature = np.random.randint(0, 100, size=len(X))
        random_features[column + "_random"] = random_feature
    return pd.concat([X, random_features], axis=1)
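To see the twin generator in action, here's a minimal usage sketch; the toy DataFrame and its column names are hypothetical, invented purely for illustration:
# Minimal usage sketch (toy DataFrame and column names are hypothetical)
import numpy as np
import pandas as pd

X_toy = pd.DataFrame({
    "age": np.random.randint(18, 90, size=5),          # integer column
    "score": np.random.uniform(0, 1, size=5),          # float column
    "is_member": np.random.choice([True, False], 5),   # boolean column
})
X_with_twins = generate_random_features(X_toy)
print(X_with_twins.columns.tolist())
# ['age', 'score', 'is_member', 'age_random', 'score_random', 'is_member_random']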
But here's where it gets truly brilliant: we don't rely on a single evaluation. The power of repetition in machine learning is undeniable, and it often holds the key to more reliable results. Just as 20 trials offer more insight than a single one, and 100 trials surpass 20 in reliability, the same principle applies to feature importance analysis. So we assess feature importance many times, accounting for the inherent variability driven by both randomness and statistical noise. There's a neat statistical payoff, too: under the null hypothesis that a feature is pure noise, it beats its random twin with probability 0.5, so its hit count over n iterations follows a Binomial(n, 0.5) distribution, and features whose hit counts land far in the right tail are the ones worth keeping. The outcome? A robust and dependable feature selection method that amplifies the value of your data analysis.
Join us on this journey where data meets innovation, and where every feature is meticulously chosen for its significance. 📈🔍
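One more piece before the main routine: the evaluator below expects a feature_importanceModel callable that performs a single round of selection. Its body isn't spelled out here, so the following is a plausible sketch, assuming a scikit-learn-style model exposing feature_importances_: each real feature survives the round only if its importance beats its own random twin.
# Hypothetical single-round evaluator (a sketch, not the post's exact code):
# fit the model on real + random features, then keep each real feature only
# if its importance exceeds the importance of its "_random" twin.
def feature_importance_with_random_twins(X_feature, y_target, model):
    X_combined = generate_random_features(X_feature)
    model.fit(X_combined, y_target)
    importances = dict(zip(X_combined.columns, model.feature_importances_))
    return [
        column for column in X_feature.columns
        if importances[column] > importances[column + "_random"]
    ]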
# Define a function to evaluate feature importance based on a specified model
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
from scipy.stats import binom

def evaluate_feature_importance_with_frequency(X_feature, y_target, model, feature_importanceModel, num_iterations=20):
    """
    Feature Importance Evaluation with Frequency.

    This function evaluates feature importance multiple times (controlled by `num_iterations`)
    to account for variability caused by randomness and statistics. It counts how often each
    feature passes the importance threshold and treats the frequent winners as important
    features, enhancing the robustness of feature selection.

    Parameters:
    - X_feature (DataFrame): Input DataFrame containing the features.
    - y_target (Series): Target variable.
    - model: Machine learning model used for feature importance evaluation.
    - feature_importanceModel: A function that performs one round of feature importance
      evaluation and returns the features that passed the threshold.
    - num_iterations (int): Number of iterations to evaluate feature importance.

    Returns:
    - tuple: All evaluated features, ordered by how often they passed the threshold.
    """
    selected_features_frequency = []

    # Loop through iterations for feature importance evaluation
    for _ in range(num_iterations):
        selected_features_ = feature_importanceModel(X_feature, y_target, model)
        selected_features_frequency.append(selected_features_)

    # Count how often each feature was selected across all iterations
    feature_frequency = {}
    for iteration in selected_features_frequency:
        for feature in iteration:
            feature_frequency[feature] = feature_frequency.get(feature, 0) + 1

    # Sort the features by frequency (most frequent first)
    sorted_features = sorted(feature_frequency.items(), key=lambda x: x[1], reverse=True)

    # Separate features and their frequencies
    features, frequencies = zip(*sorted_features)

    # Under the null hypothesis, a feature beats its random twin with probability 0.5,
    # so the number of hits over `num_iterations` follows a binomial distribution.
    trials = num_iterations
    pmf = [binom.pmf(x, trials, 0.5) for x in range(trials + 1)]

    # Plot the probability mass function
    plt.plot(list(range(trials + 1)), pmf, color="black")

    # Visualize hits for each feature
    displayed_features = set()  # Track displayed features and their label positions
    vertical_spacing = 0.015    # Vertical spacing between labels
    for feature, frequency in zip(features, frequencies):
        color = 'green' if frequency > num_iterations / 2 else 'red'
        x_position = frequency - 1.5  # Shift the marker slightly left of the hit count
        y_position = 0.002
        # Ensure the labels don't overlap vertically
        while y_position in [y for _, y in displayed_features]:
            y_position += vertical_spacing
        # Add a vertical line and label for each feature
        plt.axvline(x_position, color=color)
        plt.text(x_position, y_position, feature)
        displayed_features.add((feature, y_position))

    # Create a legend
    legend_labels = ['Highly Significant', 'Not Significant']
    legend_colors = ['green', 'red']
    legend_patches = [Line2D([0], [0], color=color, label=label)
                      for color, label in zip(legend_colors, legend_labels)]
    plt.legend(handles=legend_patches, loc='upper right')

    # Show the plot
    plt.show()
    return features
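Putting it all together, a call might look like the sketch below; the dataset, the RandomForestClassifier, and the 20-iteration count are illustrative choices, not prescriptions:
# End-to-end usage sketch (dataset, model, and iteration count are illustrative)
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)
X_demo = pd.DataFrame(X_demo, columns=[f"feat_{i}" for i in range(6)])

rf = RandomForestClassifier(n_estimators=100, random_state=42)
ranked_features = evaluate_feature_importance_with_frequency(
    X_demo, y_demo, rf, feature_importance_with_random_twins, num_iterations=20
)
print(ranked_features)
Features that land in the right tail of the plotted binomial curve (the green lines) are the ones worth keeping.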