Integrating Word2Vec with Scikit-Learn Pipelines
In this post, we'll explore how to integrate a custom Word2Vec transformer with scikit-learn pipelines. This lets us use Word2Vec embeddings as features in a standard machine learning workflow, with training and prediction handled by a single pipeline object.
Introduction
Word2Vec is a popular natural language processing (NLP) technique that maps words to continuous vector representations. These vectors capture semantic relationships between words, which makes them useful for a range of NLP tasks. However, gensim's Word2Vec model does not expose the fit and transform methods that scikit-learn pipelines expect, so using it inside a pipeline requires a custom transformer. Let's walk through the process step by step.
import pandas as pd
from gensim.models import Word2Vec
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np
# Sample data
corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]
labels = [1, 0, 1, 1]
# Custom Word2VecTransformer
class Word2VecTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, vector_size=100, window=5, min_count=1, workers=4):
        # Store hyperparameters unchanged so get_params()/set_params() work
        self.vector_size = vector_size
        self.window = window
        self.min_count = min_count
        self.workers = workers

    def fit(self, X, y=None):
        # Whitespace tokenization: case and punctuation are left as-is
        tokenized_sentences = [sentence.split() for sentence in X]
        self.model = Word2Vec(
            sentences=tokenized_sentences,
            vector_size=self.vector_size,
            window=self.window,
            min_count=self.min_count,
            workers=self.workers,
        )
        return self

    def transform(self, X):
        tokenized_sentences = [sentence.split() for sentence in X]
        # Average the vectors of in-vocabulary words; fall back to a zero
        # vector when none of a document's words are in the vocabulary
        return np.array([
            np.mean(
                [self.model.wv[word] for word in sentence if word in self.model.wv]
                or [np.zeros(self.vector_size)],
                axis=0,
            )
            for sentence in tokenized_sentences
        ])
# Define the pipeline
pipeline = Pipeline([
    ('word2vec', Word2VecTransformer(vector_size=100, window=5, min_count=1, workers=4)),
    ('clf', LogisticRegression()),
])

# Fit the pipeline
pipeline.fit(corpus, labels)

# Make predictions
predictions = pipeline.predict(["This is a new document."])
print(predictions)
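One detail worth noting about the transformer above: sentence.split() splits on whitespace only, so case and trailing punctuation stay attached to the tokens, and "document." and "document?" end up as different vocabulary entries. The sketch below contrasts that with a lowercasing, regex-based tokenizer. The simple_tokenize helper is hypothetical and not part of the original transformer; it is just one possible normalization, assuming plain English text.

```python
import re


def simple_tokenize(text):
    # Hypothetical helper: lowercase the text and keep only runs of
    # ASCII letters and digits, which drops punctuation.
    return re.findall(r"[a-z0-9]+", text.lower())


sentence = "Is this the first document?"

# Whitespace splitting, as used in the transformer above
print(sentence.split())
# ['Is', 'this', 'the', 'first', 'document?']

# Normalized tokens
print(simple_tokenize(sentence))
# ['is', 'this', 'the', 'first', 'document']
```

If a tokenizer like this were used, it would need to replace sentence.split() in both fit and transform so that training and prediction see the same tokens.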
Explanation
- Import Libraries: We import the libraries for data manipulation (pandas, although this small example does not actually use it), the Word2Vec model (gensim), and the pipeline and classifier (sklearn).
- Sample Data: We create a sample corpus and corresponding labels.
- Custom Transformer: We define a custom Word2VecTransformer class. Its fit method trains a Word2Vec model on the whitespace-tokenized documents, and its transform method represents each document as the average of its word vectors, or as a zero vector if none of its words are in the vocabulary.
- Pipeline Definition: We define a pipeline with two steps: our custom Word2VecTransformer, followed by LogisticRegression.
- Training and Prediction: We fit the pipeline on the sample data and make predictions on a new document.
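The averaging step inside transform can be illustrated in isolation. The sketch below uses a small dictionary of made-up 3-dimensional vectors in place of a trained model; toy_vectors and average_vector are hypothetical names for illustration only, and the numbers carry no meaning.

```python
import numpy as np

# Made-up vectors standing in for a trained model's word vectors
toy_vectors = {
    "this": np.array([1.0, 0.0, 0.0]),
    "is": np.array([0.0, 1.0, 0.0]),
    "document": np.array([0.0, 0.0, 1.0]),
}


def average_vector(tokens, vectors, vector_size):
    # Keep only tokens that have a vector, mirroring the
    # `if word in self.model.wv` filter in the transformer.
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        # No known words: fall back to a zero vector
        return np.zeros(vector_size)
    return np.mean(known, axis=0)


# "new" has no vector, so only "this" and "is" are averaged
doc_vec = average_vector(["this", "is", "new"], toy_vectors, 3)
print(doc_vec)  # [0.5 0.5 0. ]

# A document with no known words maps to the zero vector
empty_vec = average_vector(["unseen"], toy_vectors, 3)
print(empty_vec)  # [0. 0. 0.]
```

Every document therefore maps to a vector of the same length regardless of how many words it contains, which is what the downstream classifier needs.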
Conclusion
By creating a custom Word2Vec transformer, we can use Word2Vec embeddings inside scikit-learn pipelines like any other feature extraction step. Embedding training, document vectorization, and classification are then fitted and applied together through a single pipeline object.
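One practical consequence of inheriting from BaseEstimator is that the transformer's constructor arguments become visible to scikit-learn's parameter machinery. The sketch below shows this with a stand-in class: StubEmbeddingTransformer is hypothetical, has the same constructor signature as the Word2VecTransformer above, and returns zero vectors instead of training a model so that the example runs without gensim.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class StubEmbeddingTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in with the same constructor as Word2VecTransformer."""

    def __init__(self, vector_size=100, window=5, min_count=1, workers=4):
        self.vector_size = vector_size
        self.window = window
        self.min_count = min_count
        self.workers = workers

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Placeholder output: one zero vector per document
        return [[0.0] * self.vector_size for _ in X]


pipeline = Pipeline([
    ("word2vec", StubEmbeddingTransformer()),
    ("clf", LogisticRegression()),
])

# Step parameters are exposed as "<step name>__<parameter name>"
params = pipeline.get_params()
print(params["word2vec__vector_size"])  # 100

# The same naming scheme works for updating a parameter in place
pipeline.set_params(word2vec__vector_size=50)
resized = pipeline.named_steps["word2vec"].vector_size
print(resized)  # 50

# clone() builds an unfitted copy with the same parameters
cloned = clone(pipeline)
```

The same step__parameter names are what scikit-learn's model selection tools use when searching over pipeline hyperparameters.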
Last update 2025-02-03