How to build custom transforms for your scikit-learn pipelines

Bechir Trabelsi
3 min read · Oct 24, 2022


A machine learning pipeline is an end-to-end construct that orchestrates the flow of data into, and out of, a machine learning model (or set of models). It covers raw data input, features, the machine learning model and its parameters, and prediction outputs. The scikit-learn Pipeline is a tool that chains all these workflow steps together into a single, streamlined procedure; you can think of it as a sequence of data processing mechanisms.

The steps of a scikit-learn pipeline can include many predefined modules that transform our data, such as standard scalers, samplers, dimensionality reduction tools like PCA, and many other kinds of data transformation tools.
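For instance, here is a minimal sketch built entirely from predefined components (an illustration, not taken from the original workflow):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Each step transforms the data before the final estimator sees it
pipe = Pipeline(steps=[('scale', StandardScaler()),
                       ('reduce', PCA(n_components=2)),
                       ('classify', LogisticRegression())])
pipe.fit(X, y)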

However, sometimes we find ourselves in situations where no predefined module satisfies the kind of data transformation we would like to apply. In this article I’ll show a few ways we can construct these transforms ourselves and integrate them into our pipeline.

1. Using Function Transformers

Using this method we can turn any function that performs any kind of transformation into a pipeline step transform. Here is how we do it:

1- First, we define our function.

Here I’ll be working with a simple function that transforms the input data into a NumPy array. This is useful, for example, if you’re deploying a model on Google Cloud Platform (GCP) Vertex AI Kubeflow pipelines with a batch prediction job, where the input data comes directly from BigQuery and the model is XGBoost. The problem is that BigQuery passes the input data as a list, row by row, whereas XGBoost won’t accept a list as input data, hence the need to transform it before feeding it to the model.

import numpy as np

def data_to_numpy(input_df):
    # Convert the incoming row to a NumPy array shaped as a single sample
    input_df = np.array(input_df)
    return input_df.reshape(1, -1)
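For example, a single row that arrives as a plain Python list comes back as a 2D array with one sample (the values here are made up):

data_to_numpy([5.1, 3.5, 1.4, 0.2])
# -> array([[5.1, 3.5, 1.4, 0.2]]), shape (1, 4)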

2- Next, we wrap it with the FunctionTransformer class from sklearn:

from sklearn.preprocessing import FunctionTransformer

# Beware: the lambda wrapper is what breaks pickling later (see section 2)
transformer_numpy = FunctionTransformer(lambda X: data_to_numpy(X))

3- Finally, we use ColumnTransformer to turn it into a pipeline transform, where cols is the list of column indices in our dataframe that we want to apply our transform to:

from sklearn.compose import ColumnTransformer

cols = [0, 1, 2]  # example: positions of the columns to transform
to_numpy_step = ColumnTransformer(
    transformers=[('convert_to_numpy', transformer_numpy, cols)], n_jobs=-1)

Now we can integrate this step into our pipeline:

from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline

model = XGBClassifier(n_estimators=1500, max_depth=7, reg_lambda=0.5, alpha=0.5, n_jobs=32)
clf_pipeline = Pipeline(steps=[('convert_to_numpy', to_numpy_step),
                               ('classifier', model)])
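To sanity-check the conversion step on its own with a hypothetical single row (ColumnTransformer expects 2D input, hence the one-row DataFrame):

import pandas as pd

row = pd.DataFrame([[5.1, 3.5, 1.4]])  # stand-in for one BigQuery row
to_numpy_step.fit_transform(row)       # -> array([[5.1, 3.5, 1.4]])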

2. Using a class

Using the first method can sometimes create conflicts when trying to save and load your model with pickle (lambdas, in particular, can’t be pickled), so I recommend the following method instead:
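You can see the conflict directly; this small sketch tries to pickle the lambda-based transformer from section 1:

import pickle

try:
    pickle.dumps(transformer_numpy)  # the lambda inside cannot be serialized
except Exception as err:
    print(err)  # e.g. "Can't pickle <function <lambda> ...>"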

1- Define a class as follows:

class DataframeNumpyTransformer:
    """Wraps a plain function so it can act as a pipeline step."""

    def __init__(self, transformer):
        self.transformer = transformer

    def transform(self, df, **transform_params):
        # Apply the wrapped function to the incoming data
        return self.transformer(df)

    def fit(self, X, y=None, **fit_params):
        # Stateless transform: nothing to learn during fit
        return self

2- We will use the same data_to_numpy function from section 1, but instead of the Pipeline module we’ll use make_pipeline, also available in sklearn:

from sklearn.pipeline import make_pipeline

clf_pipeline = make_pipeline(DataframeNumpyTransformer(data_to_numpy), model)
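Because everything this pipeline references is a module-level function or class, it pickles cleanly; a quick check:

import pickle

saved = pickle.dumps(clf_pipeline)  # succeeds: no lambdas involved
restored = pickle.loads(saved)      # ready to save and redeploy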

And there you go, your pipeline is ready to go.

I hope this short article helps you develop your ML workflows more easily and efficiently.
