Parsing data with Processor

We still need to parse loaded data in order to train and evaluate our SVM classifier.

We can define several plug-and-play Processor components to:

  • Process input data

  • Process classification labels

  • Process data for the classifier

Input data

To process input data, we rely on tf-idf processing since we are dealing with an SVM classifier.

We define a TfIdfProcessor as follows:

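from typing import Any, Optional

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer


# Component is the base class provided by the framework.
class TfIdfProcessor(Component):

    def __init__(
            self,
            **kwargs
    ):
        # Forward hyper-parameters (e.g., ngram_range) to the vectorizer.
        self.vectorizer = TfidfVectorizer(**kwargs)

    def process(
            self,
            data: Optional[pd.DataFrame],
            is_training_data: bool = False,
    ) -> Optional[Any]:
        if data is None:
            return data

        # Fit the vocabulary and IDF weights on training data only.
        if is_training_data:
            self.vectorizer.fit(data.x.values)

        return self.vectorizer.transform(data.x.values)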

The TfIdfProcessor has an internal TfidfVectorizer from sklearn. The vectorizer is used in process() to convert textual input data into numerical format.
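The fit-on-training, transform-everywhere pattern used in process() can be tried in isolation. The snippet below is a minimal sketch that calls TfidfVectorizer directly, without the Component wrapper; the two toy corpora are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora, invented for this illustration.
train_texts = ["the cat sat", "the dog barked"]
test_texts = ["the bird sat"]

vectorizer = TfidfVectorizer(ngram_range=(1, 1))

# Learn the vocabulary and IDF weights from the training split only.
vectorizer.fit(train_texts)

# Reuse the fitted vectorizer for every split.
train_features = vectorizer.transform(train_texts)
test_features = vectorizer.transform(test_texts)

# Terms learned during fit, in alphabetical order.
vocabulary = sorted(vectorizer.vocabulary_)

print(vocabulary)
print(train_features.shape, test_features.shape)
```

Both matrices have one column per vocabulary term learned during fit. Words that appear only in the test split (here, "bird") are not part of the vocabulary and are ignored by transform, which is why the vectorizer should be fitted on training data only.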

We define a corresponding TfIdfProcessorConfig that, for simplicity, exposes only a minimal subset of the vectorizer's hyper-parameters.

class TfIdfProcessorConfig(Configuration):

    @classmethod
    @register_method(name='processor',
                     tags={'tf-idf'},
                     namespace='examples',
                     component_class=TfIdfProcessor)
    def default(
            cls
    ):
        config = super().default()

        config.add(name='ngram_range',
                   value=(1, 1),
                   type_hint=Any,
                   description='Vectorizer ngram_range hyper-parameter')

        return config

We register TfIdfProcessorConfig under the RegistrationKey (name='processor', tags={'tf-idf'}, namespace='examples') and bind it to TfIdfProcessor.
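To build intuition for what a (name, tags, namespace) key buys us, here is a toy sketch of a key-based registry. It is not the library's implementation: ToyKey, ToyRegistry, and ToyProcessor are hypothetical names introduced only for this illustration, and the real Registry may behave differently.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Iterable, Tuple


@dataclass(frozen=True)
class ToyKey:
    """Hypothetical stand-in for a registration key."""
    name: str
    tags: FrozenSet[str]
    namespace: str


class ToyRegistry:
    """Hypothetical registry mapping a key to (config factory, component class)."""

    def __init__(self) -> None:
        self._bindings: Dict[ToyKey, Tuple[Callable[[], dict], type]] = {}

    def register(self, name: str, tags: Iterable[str], namespace: str,
                 config_factory: Callable[[], dict], component_class: type) -> ToyKey:
        key = ToyKey(name, frozenset(tags), namespace)
        if key in self._bindings:
            raise KeyError(f"Key already registered: {key}")
        self._bindings[key] = (config_factory, component_class)
        return key

    def build(self, name: str, tags: Iterable[str], namespace: str):
        # Look up the binding, create the configuration, then the component.
        config_factory, component_class = self._bindings[
            ToyKey(name, frozenset(tags), namespace)
        ]
        return component_class(**config_factory())


class ToyProcessor:
    """Hypothetical component that only stores its hyper-parameter."""

    def __init__(self, ngram_range=(1, 1)):
        self.ngram_range = ngram_range


registry = ToyRegistry()
registry.register(name='processor', tags={'tf-idf'}, namespace='examples',
                  config_factory=lambda: {'ngram_range': (1, 1)},
                  component_class=ToyProcessor)

# The same key later retrieves the configuration and builds the component.
processor = registry.build(name='processor', tags={'tf-idf'}, namespace='examples')
```

The point of the sketch is the decoupling: code that needs a processor only has to know the key, not the concrete configuration or component class behind it.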

Classification labels

To process classification labels, we rely on integer label encoding via LabelEncoder from sklearn, which maps each distinct class to an integer index (it does not produce one-hot vectors).

We define a LabelProcessor as follows:
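As a quick check of what LabelEncoder produces, the following minimal sketch encodes a handful of made-up string labels. The label values are invented for illustration only.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels, invented for this illustration.
train_labels = ["spam", "ham", "spam"]

encoder = LabelEncoder()

# Learn the set of classes from the training labels.
encoder.fit(train_labels)

# Classes are stored in sorted order; each one is mapped to its index.
classes = list(encoder.classes_)
encoded = encoder.transform(["ham", "spam"]).tolist()

# The mapping can be inverted to recover the original labels.
decoded = list(encoder.inverse_transform(encoded))

print(classes, encoded, decoded)
```

Each label becomes a single integer rather than a vector. Note that transform raises a ValueError for labels that were not seen during fit, so the training split should contain every class.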

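from typing import Any, Optional

import pandas as pd
from sklearn.preprocessing import LabelEncoder


# Component is the base class provided by the framework.
class LabelProcessor(Component):

    def __init__(
            self
    ):
        self.label_encoder = LabelEncoder()

    def process(
            self,
            data: Optional[pd.DataFrame],
            is_training_data: bool = False
    ) -> Optional[Any]:
        if data is None:
            return data

        labels = data.y.values
        # Learn the class-to-index mapping on training data only.
        if is_training_data:
            self.label_encoder.fit(labels)

        return self.label_encoder.transform(labels)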

The LabelProcessor doesn’t require any specific configuration since it has no hyper-parameters.

Thus, we can bind it directly to the base Configuration class:

@register
def register_processors():
    Registry.register_configuration(config_class=Configuration,
                                    component_class=LabelProcessor,
                                    name='processor',
                                    tags={'label'},
                                    namespace='examples')

Next!

That’s it! We have defined processors to parse input data so that it can be digested by our SVM classifier.

Next, we define the SVM classifier as a custom Model component.