categoricals

Sklearn-compatible transformer to encode non-numeric columns as pandas categorical features.

Classes

SeriesEncoder

Categorical encoding of the values for a pandas Series.

Encoder

Sklearn-compatible transformer to encode non-numeric columns as pandas categorical features.

Functions

infer_categoricals(→ list[str])

Identify columns that should be coded as categorical.

Module Contents

class categoricals.SeriesEncoder[source]

Categorical encoding of the values for a pandas Series.

categories: list[Any][source]

The list of distinct non-null categories.
classmethod fit(series: pandas.Series) SeriesEncoder[source]

Learn categorical codes of a data series.

Parameters:

series – The pandas Series to fit.

Returns:

A SeriesEncoder object with the unique categories and null presence.

__call__(series: pandas.Series) pandas.Series[source]

Encode a series as categorical.

This encoder maintains a distinction between null values and “never-before-seen” values:

  • Pandas maps null values to -1.

  • Pandas also maps any never-before-seen values to -1, which loses information. Instead, we identify such values and append them to the list of categories, so that they are encoded as new categories.

This distinction can matter for certain ML algorithms, such as XGBoost. See also https://github.com/microsoft/LightGBM/issues/6908.
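The difference between the two behaviours can be illustrated with plain pandas. The snippet below is a minimal sketch, not the SeriesEncoder implementation: the example data and the variable names (categories, new, unseen) are assumptions made for illustration.

```python
import pandas as pd

# Categories learned at fit time (hypothetical example data).
categories = ["red", "green"]

# New data containing a null and a value that was not seen during fitting.
new = pd.Series(["green", None, "blue"])

# Plain pandas: both the null and the unseen value "blue" receive code -1.
plain_codes = pd.Categorical(new, categories=categories).codes

# Sketch of the approach described above: append unseen non-null values
# to the category list so that they receive their own codes.
unseen = [v for v in new.dropna().unique() if v not in categories]
extended_codes = pd.Categorical(new, categories=categories + unseen).codes

print(plain_codes.tolist())     # [1, -1, -1]
print(extended_codes.tolist())  # [1, -1, 2]
```

With the extended category list, only the genuine null keeps the code -1, while "blue" is encoded as a new category.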

categoricals.infer_categoricals(df: pandas.DataFrame) list[str][source]

Identify columns that should be coded as categorical.
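The selection rule is not spelled out here. Based on the Encoder description (automatic detection of non-numeric columns), one plausible reading can be sketched as follows. The function name infer_categoricals_sketch and the use of is_numeric_dtype are assumptions for illustration, not the module's actual logic.

```python
import pandas as pd


def infer_categoricals_sketch(df: pd.DataFrame) -> list[str]:
    """Hypothetical stand-in: return the names of all non-numeric columns."""
    return [
        col
        for col in df.columns
        if not pd.api.types.is_numeric_dtype(df[col])
    ]


# Hypothetical example data: one string column between two numeric ones.
df = pd.DataFrame(
    {"age": [31, 45], "city": ["Oslo", "Lima"], "score": [0.2, 0.9]}
)
columns = infer_categoricals_sketch(df)
print(columns)  # ['city']
```

Note that pandas treats boolean columns as numeric under this check, so the real function may classify them differently.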

class categoricals.Encoder(specified_columns: list[str] | None = None)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Sklearn-compatible transformer to encode non-numeric columns as pandas categorical features.

This stores category mappings during fit and applies them during transform, ensuring consistency in the value-to-code mapping between training and scoring.

Initialize the Encoder.

Parameters:

specified_columns – Optional list of column names to be encoded as pandas categorical. If not specified, the fit method will automatically detect non-numeric columns and treat them all as categorical.

specified_columns = None[source]
encoders: dict[str, SeriesEncoder][source]
fit(X: pandas.DataFrame, y: Any = None) Encoder[source]

Fit the encoder to the data.

Learns the unique categories for each specified categorical column and saves the category codes.

Parameters:
  • X – The data to fit the encoder on.

  • y – Ignored, present for API consistency.

Raises:

ValueError – If the encoder has already been fitted.

transform(X: pandas.DataFrame) pandas.DataFrame[source]

Apply the category encodings to new data.

Parameters:

X – The data to transform.

Returns:

The transformed data with consistent categorical encodings.
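The fit/transform contract documented above can be sketched with a small stand-in class. This is an illustrative assumption, not the real Encoder: EncoderSketch omits the sklearn base classes and the handling of never-before-seen values, and its attribute categories_by_column is a hypothetical name.

```python
import pandas as pd


class EncoderSketch:
    """Hypothetical stand-in for the documented Encoder."""

    def __init__(self, specified_columns=None):
        self.specified_columns = specified_columns
        self.categories_by_column = None  # populated by fit

    def fit(self, X, y=None):
        # Mirror the documented behaviour: refitting raises ValueError.
        if self.categories_by_column is not None:
            raise ValueError("encoder has already been fitted")
        columns = self.specified_columns
        if columns is None:
            # Assumption: auto-detection means every non-numeric column.
            columns = [
                c for c in X.columns
                if not pd.api.types.is_numeric_dtype(X[c])
            ]
        # Store the distinct non-null values of each column, in first-seen order.
        self.categories_by_column = {
            c: list(X[c].dropna().unique()) for c in columns
        }
        return self

    def transform(self, X):
        # Apply the stored categories so codes match those from fitting.
        out = X.copy()
        for col, cats in self.categories_by_column.items():
            out[col] = pd.Categorical(out[col], categories=cats)
        return out


train = pd.DataFrame({"city": ["Oslo", "Lima", "Oslo"], "n": [1, 2, 3]})
score = pd.DataFrame({"city": ["Lima", "Oslo"], "n": [4, 5]})

enc = EncoderSketch().fit(train)
train_codes = enc.transform(train)["city"].cat.codes.tolist()
score_codes = enc.transform(score)["city"].cat.codes.tolist()
print(train_codes)  # [0, 1, 0]
print(score_codes)  # [1, 0]
```

Because the categories are stored at fit time, "Oslo" maps to code 0 and "Lima" to code 1 in both frames, regardless of the order in which they appear in the scoring data.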