
Text Classification with Embed

This notebook shows how to build a classifier using Cohere's embeddings. You can find the code in the notebook and colab.

First we embed the text in the dataset, then we use those embeddings to train a classifier.

The example classification task here will be sentiment analysis of film reviews. We'll train a simple classifier to detect whether a film review is negative (class 0) or positive (class 1).

We'll go through the following steps:

  1. Get the dataset
  2. Get the embeddings of the reviews (for both the training set and the test set).
  3. Train a classifier using the training set
  4. Evaluate the performance of the classifier on the testing set

1. Get the dataset

import pandas as pd
import cohere
from tqdm import tqdm
# Get the SST2 training and test sets
df_train = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv', delimiter='\t', header=None)
df_test = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/test.tsv', delimiter='\t', header=None)
# Let's glance at the dataset
df_train.head()
print(f"Review #1 text: {df_train.iloc[0, 0]}")
print(f"Review #1 class: {df_train.iloc[0, 1]}")
print(f"Review #2 text: {df_train.iloc[1, 0]}")
print(f"Review #2 class: {df_train.iloc[1, 1]}")
Review #1 text: a stirring , funny and finally transporting re imagining of beauty and the beast and 1930s horror films
Review #1 class: 1
Review #2 text: apparently reassembled from the cutting room floor of any given daytime soap
Review #2 class: 0
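
Before training, it can help to confirm how the two classes are distributed. The sketch below uses a tiny hand-made DataFrame (the `toy` name and its four rows are illustrative, not taken from SST2) laid out like `df_train`, with the text in column 0 and the label in column 1:

```python
import pandas as pd

# Toy stand-in for df_train: column 0 holds the review text, column 1 the label.
# These rows are made up for illustration only.
toy = pd.DataFrame({
    0: ["great film", "dull plot", "loved it", "a mess"],
    1: [1, 0, 1, 0],
})

# Fraction of examples per class; values near 0.5 indicate a balanced dataset.
class_fractions = toy[1].value_counts(normalize=True).sort_index()
print(class_fractions)
```

Running the same `value_counts` call on `df_train[1]` would show the split for the real training set.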

We'll only use a subset of the training and testing datasets in this example: 300 training examples and 100 test examples, since this is a toy example. You'll want to increase both numbers to get better performance and a more reliable evaluation.

n_train_samples = 300 # Increase for better performance (e.g. 500)
n_test_samples = 100 # increase for better evaluation (e.g. 500)
# Sample from the dataset
train = df_train.sample(n_train_samples)
test = df_test.sample(n_test_samples)
sentences_train = list(train.iloc[:,0].values)
sentences_test = list(test.iloc[:,0].values)
labels_train = list(train.iloc[:,1].values)
labels_test = list(test.iloc[:,1].values)
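
Because `sample` draws rows at random, each run picks a different subset and the final accuracy will vary a little. If repeatable results matter, one option is to pass pandas' `random_state` argument. A minimal sketch on a made-up DataFrame (the `toy` data and the seed value 42 are arbitrary choices for illustration):

```python
import pandas as pd

# Made-up data in the same layout as the SST2 frames (text in column 0, label in column 1).
toy = pd.DataFrame({
    0: [f"review {i}" for i in range(20)],
    1: [i % 2 for i in range(20)],
})

# The same seed yields the same rows on every call.
sample_a = toy.sample(5, random_state=42)
sample_b = toy.sample(5, random_state=42)
same_rows = sample_a.index.equals(sample_b.index)
print(same_rows)
```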

2. Get the embeddings of the reviews

We're now ready to retrieve the embeddings from the API.

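# Embed sentences from both the train and test sets.
# NOTE: `embedder` is not defined in this section; it is assumed to be a helper
# that wraps the Cohere client and returns one embedding vector per sentence.
embeddings_train = embedder.batch_embed(list(sentences_train))
embeddings_test = embedder.batch_embed(list(sentences_test))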

We now have two sets of embeddings: embeddings_train contains the embeddings of the training sentences, while embeddings_test contains the embeddings of the testing sentences.

Curious what an embedding looks like? We can print the first 10 values of one:
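
The `embedder.batch_embed` helper used above is not shown in this section. As a rough idea of what such a helper might do, the sketch below splits a list of texts into fixed-size batches and concatenates the results. Everything here is hypothetical: `embed_in_batches`, `fake_embed`, and the batch size are illustrative names and values, and `fake_embed` merely stands in for a real embedding API call.

```python
from typing import Callable, List, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 32,
) -> List[List[float]]:
    """Call embed_fn on consecutive batches and return one vector per text, in order."""
    embeddings: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        embeddings.extend(embed_fn(batch))
    return embeddings


def fake_embed(batch: List[str]) -> List[List[float]]:
    # Stand-in for an API call: a 2-d "embedding" of (character count, word count).
    return [[float(len(text)), float(len(text.split()))] for text in batch]


vectors = embed_in_batches(["a stirring film", "lousy"], fake_embed, batch_size=1)
print(vectors)
```

Batching like this keeps each request small; a real helper would pass a function that calls the embedding service in place of `fake_embed`.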

print(f"Review text: {sentences_train[0]}")
print(f"Embedding vector: {embeddings_train[0][:10]}")
Review text: and it 's a lousy one at that
Embedding vector: [1.8336831, 1.5390223, 0.92042065, 0.23460366, 2.8419993, -0.65512013, 2.6017864, -3.0309973, 1.8228053, -0.57108295]
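
Each embedding is a fixed-length list of floats, so a batch of them stacks into a 2-D array with one row per sentence. That is the layout scikit-learn estimators expect as input. A small sketch with made-up 4-dimensional vectors (real embeddings are much longer, and `toy_embeddings` is an illustrative name):

```python
import numpy as np

# Three made-up "embeddings" of dimension 4, standing in for embeddings_train.
toy_embeddings = [
    [0.1, -0.2, 0.3, 0.0],
    [1.5, 0.4, -0.7, 2.0],
    [-0.3, 0.9, 0.2, -1.1],
]

matrix = np.asarray(toy_embeddings)
n_sentences, embedding_dim = matrix.shape
print(f"{n_sentences} sentences, {embedding_dim} dimensions each")
```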

3. Train a classifier using the training set

Now that we have the embeddings, we can train our classifier. We'll use an SVM from sklearn.

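from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Initialize the support vector machine. class_weight='balanced' weights each
# class inversely to its frequency; our training set has roughly equal numbers
# of positive and negative sentences, so this mainly guards against a skewed sample.
svm_classifier = make_pipeline(StandardScaler(), SVC(class_weight='balanced'))
# Fit the support vector machine
svm_classifier.fit(embeddings_train, labels_train)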
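
To try the classifier step without calling the API, the same pipeline can be fitted on synthetic vectors. The sketch below is only a stand-in: the two Gaussian clusters, the dimension of 16, and the seed are arbitrary assumptions, and the score it produces says nothing about performance on real review embeddings.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
dim = 16  # arbitrary; real embeddings have many more dimensions

# Two well-separated Gaussian clusters play the role of negative and positive reviews.
X_train = np.vstack([rng.normal(-1.0, 1.0, (100, dim)), rng.normal(1.0, 1.0, (100, dim))])
y_train = [0] * 100 + [1] * 100
X_test = np.vstack([rng.normal(-1.0, 1.0, (50, dim)), rng.normal(1.0, 1.0, (50, dim))])
y_test = [0] * 50 + [1] * 50

# Same pipeline as in the tutorial: scale the features, then fit an SVM.
clf = make_pipeline(StandardScaler(), SVC(class_weight="balanced"))
clf.fit(X_train, y_train)
synthetic_score = clf.score(X_test, y_test)
print(f"Accuracy on synthetic data: {100 * synthetic_score:.1f}%")
```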

4. Evaluate the performance of the classifier on the testing set

# get the score from the test set, and print it out to screen!
score = svm_classifier.score(embeddings_test, labels_test)
print(f"Validation accuracy on Small is {100*score}%!")
Validation accuracy on Small is 86.0%!
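
With only 100 test examples, the reported accuracy carries noticeable sampling noise. A rough way to gauge it is the binomial standard error, sqrt(p(1-p)/n). The sketch below applies it to the 86% figure above, assuming the test examples are independent; the 500-example case is hypothetical and simply reuses the same accuracy to show how the uncertainty shrinks.

```python
import math


def accuracy_standard_error(accuracy: float, n_samples: int) -> float:
    """Binomial standard error of an accuracy estimated from n_samples examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


se_100 = accuracy_standard_error(0.86, 100)
se_500 = accuracy_standard_error(0.86, 500)  # hypothetical larger test set
print(f"Standard error with 100 examples: {se_100:.3f}")
print(f"Standard error with 500 examples: {se_500:.3f}")
```

A standard error of about 3.5 percentage points means that differences of a point or two between runs on 100 examples are within the noise.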

This was a small-scale example, meant as a proof of concept. It illustrates how you can quickly build a custom classifier using a small amount of labelled data and Cohere's embeddings. Increase the number of training examples to achieve better performance on this task.