Missing data is a common issue in real-world datasets, and handling it effectively is crucial for building accurate machine learning models. One powerful technique for imputing missing values is the K-Nearest Neighbors (KNN) Imputer. This method fills each missing value using the values of the sample's nearest neighbors, which often makes it more effective than simple strategies such as mean or median imputation.
What is KNN Imputer?
The KNN Imputer works by finding the k nearest neighbors of a sample with missing values and imputing the missing values using the average (or weighted average) of the corresponding feature values from the nearest neighbors.
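Under the hood, scikit-learn's KNNImputer finds neighbors with a NaN-aware Euclidean distance (metric='nan_euclidean'): coordinates that are missing in either sample are skipped, and the sum of the remaining squared differences is scaled up in proportion to how many coordinates were skipped. A minimal sketch of that distance, using sklearn.metrics.pairwise.nan_euclidean_distances:

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# Two samples; the first is missing its second feature.
a = np.array([[3.0, np.nan, 5.0]])
b = np.array([[1.0, 0.0, 0.0]])

# Missing coordinates are skipped, and the sum of squared differences
# over the observed coordinates is scaled by (total / observed) = 3/2
# before taking the square root.
d = nan_euclidean_distances(a, b)
print(d[0, 0])  # sqrt(3/2 * ((3-1)**2 + (5-0)**2)) = sqrt(43.5) ~ 6.595
```

Because of this scaling, samples with many missing features are not artificially treated as "close" just because fewer coordinates were compared.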
Why Use KNN Imputer?
- Unlike mean/median imputation, KNN Imputer maintains the relationship between features.
- It provides better estimates than filling missing values with a single constant.
- Unlike dropping rows with missing values, it retains all available information.
Implementing KNN Imputer in Python
Let’s dive into an example where we use KNN Imputer to handle missing values in a dataset.
Step 1: Import Required Libraries
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
Step 2: Create a Sample Dataset
We will create a dataset containing marks in four subjects: Math, Physics, Chemistry, and English. Some values are missing (NaN), which we will impute using KNN Imputer.
data = {
'Math': [61, 43, 36, 85, 55, 20],
'Physics': [56, 49, 50, 80, 55, 25],
'Chemistry': [np.nan, 55, 49, 69, 61, 29],
'English': [65, 80, np.nan, 70, 81, 40]
}
df = pd.DataFrame(data)
print(df)
Output:
   Math  Physics  Chemistry  English
0    61       56        NaN     65.0
1    43       49       55.0     80.0
2    36       50       49.0      NaN
3    85       80       69.0     70.0
4    55       55       61.0     81.0
5    20       25       29.0     40.0
Step 3: Apply KNN Imputer
We now apply KNN Imputer with n_neighbors=2, meaning it will use the two closest neighbors to fill in the missing values.
imputer = KNNImputer(n_neighbors=2)
imputed_df = imputer.fit_transform(df)
# Convert back to a DataFrame
imputed_df = pd.DataFrame(imputed_df, columns=df.columns)
print(imputed_df)
Output:
   Math  Physics  Chemistry  English
0  61.0     56.0       58.0     65.0
1  43.0     49.0       55.0     80.0
2  36.0     50.0       49.0     80.5
3  85.0     80.0       69.0     70.0
4  55.0     55.0       61.0     81.0
5  20.0     25.0       29.0     40.0
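To see where a value like 58.0 comes from, we can recompute row 0's neighbors by hand. This sketch assumes the imputer's default nan_euclidean metric and rebuilds the same df as above:

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import nan_euclidean_distances

df = pd.DataFrame({
    'Math': [61, 43, 36, 85, 55, 20],
    'Physics': [56, 49, 50, 80, 55, 25],
    'Chemistry': [np.nan, 55, 49, 69, 61, 29],
    'English': [65, 80, np.nan, 70, 81, 40],
})

# NaN-aware distances from row 0 (missing Chemistry) to every row.
d = nan_euclidean_distances(df.values[[0]], df.values)[0]

# Skip position 0 (the row itself, distance 0) and take the 2 nearest.
nearest = np.argsort(d)[1:3]
print(nearest)                               # [4 1]
print(df['Chemistry'].iloc[nearest].mean())  # 58.0 -- matches the imputer
```

Rows 4 and 1 are the two closest students to row 0, so their Chemistry marks (61 and 55) are averaged to give 58.0, exactly what KNNImputer produced.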
Key Parameters of KNN Imputer
The KNNImputer class in sklearn.impute provides the following key parameters:
- n_neighbors: Number of nearest neighbors used for imputation (default is 5).
- weights: How neighbors contribute to the imputed value ('uniform' for a plain average, 'distance' to weight closer neighbors more heavily).
- missing_values: The placeholder treated as missing (default is np.nan).
- metric: The distance metric used to find neighbors (default is 'nan_euclidean', which skips coordinates that are missing in either sample).
Example with weighted imputation, where closer neighbors contribute more strongly:
imputer = KNNImputer(n_neighbors=3, weights='distance')
imputed_df = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
When to Use KNN Imputer?
Use KNN Imputer when:
- Your dataset has missing values that depend on other features.
- The dataset is not too large, as KNN can be computationally expensive.
- The missing values are not completely random but follow some patterns.
Avoid using KNN Imputer when:
- Your dataset is very large (computing pairwise distances is expensive).
- The missing values are completely random, in which case simpler methods like mean or median imputation often perform comparably at far lower cost.
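One practical caveat: because KNN imputation is distance-based, features measured on large scales dominate the neighbor search. A common workaround is to scale the data first, impute in the scaled space, and then map the values back to the original units. A sketch using MinMaxScaler, which ignores NaNs when fitting and passes them through transform:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    'Math': [61, 43, 36, 85, 55, 20],
    'Physics': [56, 49, 50, 80, 55, 25],
    'Chemistry': [np.nan, 55, 49, 69, 61, 29],
    'English': [65, 80, np.nan, 70, 81, 40],
})

# Scale each feature to [0, 1] (NaNs survive the transform), impute in
# the scaled space, then invert the scaling to recover the original units.
scaler = MinMaxScaler()
scaled = scaler.fit_transform(df)
imputed_scaled = KNNImputer(n_neighbors=2).fit_transform(scaled)
imputed = pd.DataFrame(scaler.inverse_transform(imputed_scaled),
                       columns=df.columns)
print(imputed.isna().sum().sum())  # 0 -- no missing values remain
```

Here all four subjects happen to share a similar 0-100 range, so scaling changes little, but for mixed-scale data (e.g. age alongside income) it can change which neighbors are selected and noticeably improve the imputed values.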
Conclusion
KNN Imputer is a powerful technique for handling missing data while maintaining the integrity of relationships between features. By leveraging the patterns in your dataset, it provides more reliable imputations than traditional methods like mean or median filling.