Parsing data with Processor
We still need to parse loaded data in order to train and evaluate our SVM classifier.
We can define several plug-and-play Processor components to:
Process input data
Process classification labels
Process data for the classifier
Input data
To process input data, we rely on tf-idf processing since we are dealing with an SVM classifier.
We define a TfIdfProcessor as follows:
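from typing import Any, Optional

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer


class TfIdfProcessor(Component):

    def __init__(
            self,
            **kwargs
    ):
        # Keyword arguments are forwarded to sklearn's TfidfVectorizer
        self.vectorizer = TfidfVectorizer(**kwargs)

    def process(
            self,
            data: Optional[pd.DataFrame],
            is_training_data: bool = False,
    ) -> Optional[Any]:
        if data is None:
            return data

        # Fit the vocabulary and idf weights on training data only
        if is_training_data:
            self.vectorizer.fit(data.x.values)

        return self.vectorizer.transform(data.x.values)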
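The TfIdfProcessor has an internal TfidfVectorizer from sklearn. The vectorizer is used in process() to convert textual input data into numerical format. It is fitted only when is_training_data is True; all other data is transformed with the already fitted vectorizer.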
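The fit-on-training, transform-everywhere pattern used in process() can be sketched with sklearn alone. This is a minimal illustration, not part of the tutorial's pipeline: the toy texts and variable names are assumptions made up for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora, made up for illustration.
train_texts = ["the cat sat", "the dog barked", "a cat and a dog"]
test_texts = ["the bird sang", "a dog sat"]

vectorizer = TfidfVectorizer(ngram_range=(1, 1))

# Training split: learn the vocabulary and idf weights, then vectorize.
vectorizer.fit(train_texts)
train_features = vectorizer.transform(train_texts)

# Other splits: reuse the fitted vocabulary; words unseen during fit are ignored.
test_features = vectorizer.transform(test_texts)

vocabulary_size = len(vectorizer.vocabulary_)
```

Both matrices share the same number of columns because the vocabulary is fixed at fit time, which is what lets a classifier trained on the first matrix consume the second.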
We define a corresponding TfIdfProcessorConfig that, for simplicity, exposes only a minimal view of the vectorizer: its ngram_range hyper-parameter.
class TfIdfProcessorConfig(Configuration):

    @classmethod
    @register_method(name='processor',
                     tags={'tf-idf'},
                     namespace='examples',
                     component_class=TfIdfProcessor)
    def default(
            cls
    ):
        config = super().default()

        config.add(name='ngram_range',
                   value=(1, 1),
                   type_hint=Any,
                   description='Vectorizer ngram_range hyper-parameter')

        return config
We register the TfIdfProcessorConfig via RegistrationKey (name=processor, tags={'tf-idf'}, namespace=examples) and bind it to TfIdfProcessor.
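To see what the ngram_range hyper-parameter changes, here is a small sklearn-only sketch. It assumes that configuration values reach TfIdfProcessor as keyword arguments, which is what its **kwargs constructor suggests; the parameter dictionaries and toy texts below are hypothetical stand-ins, not framework objects.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["the cat sat", "the dog sat"]

# Hypothetical parameter dictionaries standing in for configuration values.
unigram_params = {"ngram_range": (1, 1)}
bigram_params = {"ngram_range": (1, 2)}

# Forward the parameters as keyword arguments, as TfIdfProcessor.__init__ does.
unigram_vectorizer = TfidfVectorizer(**unigram_params).fit(texts)
bigram_vectorizer = TfidfVectorizer(**bigram_params).fit(texts)

# Sorted feature names learned by each vectorizer.
unigram_features = sorted(unigram_vectorizer.vocabulary_)
bigram_features = sorted(bigram_vectorizer.vocabulary_)
```

With (1, 1) only single words become features, while (1, 2) also adds two-word sequences such as "the cat", so the feature space grows.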
Classification Labels
To process classification labels, we rely on integer encoding via LabelEncoder from sklearn, which maps each distinct label to an integer class index.
We define a LabelProcessor as follows:
from sklearn.preprocessing import LabelEncoder


class LabelProcessor(Component):

    def __init__(
            self
    ):
        self.label_encoder = LabelEncoder()

    def process(
            self,
            data: Optional[pd.DataFrame],
            is_training_data: bool = False
    ) -> Optional[Any]:
        if data is None:
            return data

        labels = data.y.values

        # Learn the label-to-integer mapping on training data only
        if is_training_data:
            self.label_encoder.fit(labels)

        return self.label_encoder.transform(labels)
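As a quick illustration of what the encoder inside LabelProcessor produces, the sketch below runs LabelEncoder on toy labels. The label values are made up for the example.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels, made up for illustration.
train_labels = ["spam", "ham", "spam", "eggs"]
test_labels = ["ham", "eggs"]

encoder = LabelEncoder()

# Fit on training labels: classes are stored in sorted order.
encoder.fit(train_labels)

# Each label is replaced by the index of its class in encoder.classes_.
train_ids = encoder.transform(train_labels)
test_ids = encoder.transform(test_labels)

# The mapping is reversible, which is handy when reporting predictions.
decoded = encoder.inverse_transform(test_ids)
```

Note that transform() raises an error for a label that was not seen during fit, so the training split should cover every class.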
The LabelProcessor doesn’t require any specific configuration since it has no hyper-parameters.
Thus, we can bind it to the base Configuration.
@register
def register_processors():
    Registry.register_configuration(config_class=Configuration,
                                    component_class=LabelProcessor,
                                    name='processor',
                                    tags={'label'},
                                    namespace='examples')
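Both processors are registered under the same name and namespace and differ only in their tags. The toy model below shows why that is enough to tell them apart. It is a hypothetical stand-in written for illustration, not the framework's Registry or RegistrationKey implementation; every Toy* name is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Set, Tuple


@dataclass(frozen=True)
class ToyKey:
    """Hypothetical stand-in for a registration key."""
    name: str
    tags: FrozenSet[str]
    namespace: str


# Placeholder classes standing in for configurations and components.
class ToyConfig: ...
class ToyTfIdfProcessor: ...
class ToyLabelProcessor: ...


# Toy registry: each key maps to a (configuration class, component class) pair.
toy_registry: Dict[ToyKey, Tuple[type, type]] = {}


def toy_register(name: str, tags: Set[str], namespace: str,
                 config_class: type, component_class: type) -> ToyKey:
    key = ToyKey(name=name, tags=frozenset(tags), namespace=namespace)
    if key in toy_registry:
        raise KeyError(f"{key} is already registered")
    toy_registry[key] = (config_class, component_class)
    return key


tfidf_key = toy_register("processor", {"tf-idf"}, "examples",
                         ToyConfig, ToyTfIdfProcessor)
label_key = toy_register("processor", {"label"}, "examples",
                         ToyConfig, ToyLabelProcessor)

# Rebuilding an equal key retrieves the bound component class.
lookup = ToyKey("processor", frozenset({"label"}), "examples")
_, bound_component = toy_registry[lookup]
```

In this toy model the key is the full (name, tags, namespace) triple, so two entries that share a name and namespace stay distinct as long as their tags differ.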
Next!
That’s it! We have defined processors to parse input data so that it can be digested by our SVM classifier.
Next, we define the SVM classifier as a custom Model component.