Pandas fillna for Categorical Columns
From time to time, our favorite machine learning modeling libraries will throw errors complaining about NaNs when we didn't scrub our data clean enough.
The most instinctive thing to try is
df.fillna('Null', inplace=True)
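However, pandas will throw a ValueError: fill value must be in categories for categorical columns. Depending on the pandas version, the same failure may instead surface as a TypeError about setting a new category.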
Pandas throws this error because internally it associates a fixed list of categories with each categorical column, and the fill value is not in that list.
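To make the category list concrete, here is a minimal sketch on a toy Series (the values and variable names are made up for illustration). It prints the categories and the integer codes pandas stores for each row, then shows the fill failing. The exception class is caught loosely because it differs between pandas versions.

```python
import numpy as np
import pandas as pd

# Toy data, not from the original post
s = pd.Series(["red", "blue", np.nan, "red"], dtype="category")

# The fixed set of values this column is allowed to hold
print(s.cat.categories.tolist())

# Each row is stored as an integer code into that list; NaN is stored as -1
print(s.cat.codes.tolist())

# Filling with a value outside the category list fails
try:
    s.fillna("Null")
    fill_failed = False
except (ValueError, TypeError) as err:
    fill_failed = True
    print(type(err).__name__, err)
```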
There are two ways to solve this error.
Conversion Central
We can convert the column to a generic object type or string type. Then we can fill the NaNs with a value. For example:
df['categorical_column'] = df['categorical_column'].astype(object)
df['categorical_column'] = df['categorical_column'].fillna('Null')
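One side effect worth seeing: after the conversion the column is no longer categorical. The sketch below, on a made-up single-column DataFrame, checks the resulting dtype and shows one possible way to cast back with astype('category') if the categorical dtype is still wanted.

```python
import pandas as pd

# Toy DataFrame; the column name mirrors the snippet above
df = pd.DataFrame({"categorical_column": pd.Categorical(["a", None, "b"])})

# Convert to object, then fill: no category list to satisfy anymore
filled = df["categorical_column"].astype(object).fillna("Null")
print(filled.dtype)

# Optional: cast back, which rebuilds the categories from the filled values
restored = filled.astype("category")
print(restored.cat.categories.tolist())
```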
Works, but not exactly natural. Which brings us to:
Proper way to fill NaNs for categorical columns
The error is thrown because the fill value was not included in the list of categories. Thus, we can:
- Use pd.Series.cat.add_categories to add the fill value to the categories list.
- Fill with the newly added category.
For example, assuming we want to fill np.nan with the string value Null:
df['categorical_column'] = df['categorical_column'].cat.add_categories('Null')
df['categorical_column'] = df['categorical_column'].fillna('Null')
Or, we can chain both in a one-liner:
df['categorical_column'] = df['categorical_column'].cat.add_categories('Null').fillna('Null')
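When a DataFrame has several categorical columns, the same two steps can be applied in a loop. The helper below is a hypothetical sketch, not part of pandas: the name fill_categorical_nans, the default fill value, and the toy data are all assumptions. It skips add_categories when the fill value is already a category, since adding an existing category raises an error.

```python
import numpy as np
import pandas as pd

def fill_categorical_nans(df, fill_value="Null"):
    """Hypothetical helper: return a copy with NaNs in categorical columns filled."""
    out = df.copy()
    for col in out.select_dtypes(include="category").columns:
        column = out[col]
        # add_categories raises if the category already exists, so guard it
        if fill_value not in column.cat.categories:
            column = column.cat.add_categories(fill_value)
        out[col] = column.fillna(fill_value)
    return out

# Toy data: two categorical columns and one numeric column
df = pd.DataFrame({
    "color": pd.Categorical(["red", np.nan, "blue"]),
    "size": pd.Categorical(["S", "M", np.nan]),
    "price": [1.0, np.nan, 3.0],
})

clean = fill_categorical_nans(df)
print(clean)
print(clean.dtypes)
```

Non-categorical columns such as price are left untouched, and the filled columns keep their categorical dtype.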
To cite this content, please use:
@article{leehanchung,
author = {Lee, Hanchung},
title = {Pandas Fill NaN for Categorical Columns},
year = {2021},
howpublished = {\url{https://leehanchung.github.io/}},
url = {https://leehanchung.github.io/2021-08-09-pandas-fillna/}
}