Missing data is a common issue in real-world datasets, and handling it effectively is crucial for building accurate machine learning models. One powerful technique for imputing missing values is the K-Nearest Neighbors (KNN) Imputer. This method fills each missing value using the values of the sample's nearest neighbors, which often makes it more effective than simple strategies such as mean or median imputation.
What is KNN Imputer?
The KNN Imputer works by finding the k nearest neighbors of a sample with missing values and imputing the missing values using the average (or weighted average) of the corresponding feature values from the nearest neighbors.
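Under the hood, scikit-learn's KNNImputer finds neighbors with a NaN-aware Euclidean distance (metric='nan_euclidean'): coordinates that are missing in either sample are skipped, and the sum of the remaining squared differences is scaled up in proportion to how many coordinates were skipped. A minimal sketch of that distance, using sklearn.metrics.pairwise.nan_euclidean_distances:

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# Two samples; the first is missing its second feature.
a = np.array([[3.0, np.nan, 5.0]])
b = np.array([[1.0, 0.0, 0.0]])

# Missing coordinates are skipped, and the sum of squared differences
# over the observed coordinates is scaled by (total / observed) = 3/2
# before taking the square root.
d = nan_euclidean_distances(a, b)
print(d[0, 0])  # sqrt(3/2 * ((3-1)**2 + (5-0)**2)) = sqrt(43.5) ~ 6.595
```

Because of this scaling, samples with many missing features are not artificially treated as "close" just because fewer coordinates were compared.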
Why Use KNN Imputer?
- Unlike mean/median imputation, KNN Imputer maintains the relationship between features.
- It provides better estimates than filling missing values with a single constant.
- Unlike dropping rows with missing values, it retains all available information.
Implementing KNN Imputer in Python
Let’s dive into an example where we use KNN Imputer to handle missing values in a dataset.
Step 1: Import Required Libraries
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
Step 2: Create a Sample Dataset
We will create a dataset containing marks in four subjects: Math, Physics, Chemistry, and English. Some values are missing (NaN), which we will impute using KNN Imputer.
data = {
'Math': [61, 43, 36, 85, 55, 20],
'Physics': [56, 49, 50, 80, 55, 25],
'Chemistry': [np.nan, 55, 49, 69, 61, 29],
'English': [65, 80, np.nan, 70, 81, 40]
}
df = pd.DataFrame(data)
print(df)
Output:
   Math  Physics  Chemistry  English
0    61       56        NaN     65.0
1    43       49       55.0     80.0
2    36       50       49.0      NaN
3    85       80       69.0     70.0
4    55       55       61.0     81.0
5    20       25       29.0     40.0
Step 3: Apply KNN Imputer
We now apply KNN Imputer with n_neighbors=2, meaning it will use the two closest neighbors to fill in the missing values.
imputer = KNNImputer(n_neighbors=2)
imputed_df = imputer.fit_transform(df)
# Convert back to a DataFrame
imputed_df = pd.DataFrame(imputed_df, columns=df.columns)
print(imputed_df)
Output:
   Math  Physics  Chemistry  English
0  61.0     56.0       58.0     65.0
1  43.0     49.0       55.0     80.0
2  36.0     50.0       49.0     80.5
3  85.0     80.0       69.0     70.0
4  55.0     55.0       61.0     81.0
5  20.0     25.0       29.0     40.0
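To see where a value like 58.0 comes from, we can recompute row 0's neighbors by hand. This sketch assumes the imputer's default nan_euclidean metric and rebuilds the same df as above:

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import nan_euclidean_distances

df = pd.DataFrame({
    'Math': [61, 43, 36, 85, 55, 20],
    'Physics': [56, 49, 50, 80, 55, 25],
    'Chemistry': [np.nan, 55, 49, 69, 61, 29],
    'English': [65, 80, np.nan, 70, 81, 40],
})

# NaN-aware distances from row 0 (missing Chemistry) to every row.
d = nan_euclidean_distances(df.values[[0]], df.values)[0]

# Skip position 0 (the row itself, distance 0) and take the 2 nearest.
nearest = np.argsort(d)[1:3]
print(nearest)                               # [4 1]
print(df['Chemistry'].iloc[nearest].mean())  # 58.0 -- matches the imputer
```

Rows 4 and 1 are the two closest students to row 0, so their Chemistry marks (61 and 55) are averaged to give 58.0, exactly what KNNImputer produced.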
Key Parameters of KNN Imputer
The KNNImputer class in sklearn.impute provides the following key parameters:
- n_neighbors: Number of nearest neighbors used for imputation (default is 5).
- weights: How neighbors contribute to the imputed value ('uniform' for a plain average, 'distance' to weight closer neighbors more heavily).
- missing_values: The placeholder treated as missing (default is np.nan).
- metric: The distance metric used to find neighbors (default is 'nan_euclidean', which skips coordinates that are missing in either sample).
Example with weighted imputation, where closer neighbors contribute more strongly:
imputer = KNNImputer(n_neighbors=3, weights='distance')
imputed_df = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
When to Use KNN Imputer?
Use KNN Imputer when:
- Your dataset has missing values that depend on other features.
- The dataset is not too large, as KNN can be computationally expensive.
- The missing values are not completely random but follow some patterns.
Avoid using KNN Imputer when:
- Your dataset is very large (computing pairwise distances is expensive).
- The missing values are completely random, in which case simpler methods like mean or median imputation often perform comparably at far lower cost.
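One practical caveat: because KNN imputation is distance-based, features measured on large scales dominate the neighbor search. A common workaround is to scale the data first, impute in the scaled space, and then map the values back to the original units. A sketch using MinMaxScaler, which ignores NaNs when fitting and passes them through transform:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    'Math': [61, 43, 36, 85, 55, 20],
    'Physics': [56, 49, 50, 80, 55, 25],
    'Chemistry': [np.nan, 55, 49, 69, 61, 29],
    'English': [65, 80, np.nan, 70, 81, 40],
})

# Scale each feature to [0, 1] (NaNs survive the transform), impute in
# the scaled space, then invert the scaling to recover the original units.
scaler = MinMaxScaler()
scaled = scaler.fit_transform(df)
imputed_scaled = KNNImputer(n_neighbors=2).fit_transform(scaled)
imputed = pd.DataFrame(scaler.inverse_transform(imputed_scaled),
                       columns=df.columns)
print(imputed.isna().sum().sum())  # 0 -- no missing values remain
```

Here all four subjects happen to share a similar 0-100 range, so scaling changes little, but for mixed-scale data (e.g. age alongside income) it can change which neighbors are selected and noticeably improve the imputed values.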
Conclusion
KNN Imputer is a powerful technique for handling missing data while maintaining the integrity of relationships between features. By leveraging the patterns in your dataset, it provides more reliable imputations than traditional methods like mean or median filling.