From time to time, our favorite machine learning modeling libraries will throw errors complaining about NaNs when we didn't scrub our data clean enough.

The most instinctive thing to try is

df.fillna('Null', inplace=True)

However, for categorical columns pandas will throw ValueError: fill value must be in categories.

Pandas throws this error because internally it associates a fixed list of categories with each categorical column, and a fill value outside that list is rejected.
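To make this concrete, here is a minimal sketch that inspects the category list and triggers the failure. The series `s` and its values are made up for illustration, and the exact exception type and message depend on the pandas version, so the sketch catches both ValueError and TypeError.

```python
import numpy as np
import pandas as pd

# Hypothetical example column with one missing value.
s = pd.Series(["red", "green", np.nan], dtype="category")

# The fixed set of allowed values; NaN is not one of them.
categories = list(s.cat.categories)
print(categories)

# Filling with a value outside that set is rejected.
try:
    s.fillna("Null")
    fill_failed = False
except (ValueError, TypeError) as exc:
    # The exception type and message vary across pandas versions.
    fill_failed = True
    print(type(exc).__name__, exc)

# Filling with a value that is already a category works.
filled = s.fillna("red")
```

Filling with "red" succeeds because it is already in the category list, while "Null" is not.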

There are two ways to solve this error.

Conversion Central

We can convert the columns to a generic object type or string type. Then we can fill the NaN with a value. For example:

df['categorical_column'] = df['categorical_column'].astype(object)
df['categorical_column'] = df['categorical_column'].fillna('Null')
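As a rough illustration of what the conversion costs, the sketch below runs the same steps on a small made-up DataFrame and checks the resulting dtype. The data is hypothetical; only the column name is borrowed from the example above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one categorical column with a missing value.
df = pd.DataFrame({"categorical_column": pd.Categorical(["a", "b", np.nan])})

# Convert to object, then fill; the result is a new Series.
converted = df["categorical_column"].astype(object).fillna("Null")

print(converted.tolist())
print(converted.dtype)
```

The NaN is filled, but the column is no longer categorical, so it would need another astype('category') call to get the categorical dtype back.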

Works, but not exactly natural. Which brings us to:

Proper way to fill NaNs for categorical columns

The error is thrown because the fill value was not included in the list of categories. Thus, we can

  1. Use pd.Series.cat.add_categories to add the fill value to the categories list.
  2. Fill with the newly added category.

For example, assuming we want to fill np.nan with the string value 'Null':

df['categorical_column'] = df['categorical_column'].cat.add_categories('Null')
df['categorical_column'] = df['categorical_column'].fillna('Null')

Or, we can chain both in a one-liner:

df['categorical_column'] = df['categorical_column'].cat.add_categories('Null').fillna('Null')
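When a DataFrame has several categorical columns, the same two steps can be wrapped in a small helper. The function name fill_categorical_nans, its fill_value parameter, and the sample data are all hypothetical; this is one possible sketch, not a pandas API.

```python
import numpy as np
import pandas as pd


def fill_categorical_nans(df: pd.DataFrame, fill_value: str = "Null") -> pd.DataFrame:
    """Return a copy of df with NaNs in categorical columns replaced by fill_value."""
    out = df.copy()
    for col in out.select_dtypes(include="category").columns:
        column = out[col]
        # add_categories raises if the value is already a category, so check first.
        if fill_value not in column.cat.categories:
            column = column.cat.add_categories(fill_value)
        out[col] = column.fillna(fill_value)
    return out


# Hypothetical data mixing categorical and numeric columns.
df = pd.DataFrame(
    {
        "color": pd.Categorical(["red", np.nan, "green"]),
        "size": pd.Categorical(["S", "M", np.nan]),
        "price": [1.0, np.nan, 3.0],  # non-categorical, left untouched
    }
)
result = fill_categorical_nans(df)
print(result.dtypes)
```

The categorical columns keep their category dtype after filling, and the numeric column still contains its NaN, which has to be handled separately.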

To cite this content, please use:

@article{
    leehanchung,
    author = {Lee, Hanchung},
    title = {Pandas Fill NaN for Categorical Columns},
    year = {2021},
    howpublished = {\url{https://leehanchung.github.io/}},
    url = {https://leehanchung.github.io/2021-08-09-pandas-fillna/}
}