Text Summarization Modeling with ScikitLLM and Comet

Benny Ifeanyi Iheagwara
Published in Heartbeat
5 min read · Mar 8, 2024

Photo by rishi on Unsplash

Large Language Models (LLMs) are pretty remarkable. These powerful machine-learning models can understand and generate human-like text, creating a natural conversation experience. But they’re not just for chatting — they can also answer questions, generate summaries, and offer valuable suggestions.

Today, we will talk about ScikitLLM or Scikit Large Language Model. ScikitLLM is interesting because it seamlessly integrates LLMs into your traditional Scikit-learn (Sklearn) library. If you’re familiar with machine learning and statistical modeling, you know Sklearn is a powerful tool that provides users with various unsupervised and supervised learning algorithms for building robust machine learning models.

In this post, we’ll take a deep dive into ScikitLLM and explore how you can use it to build text summarization ML models and monitor them all in Comet.

What is ScikitLLM?

Scikit-LLM, as the official Scikit-LLM GitHub repository describes it, is “scikit-learn meets large language models.” In other words, Scikit-LLM brings language models like ChatGPT into scikit-learn for enhanced text analysis tasks.

One of the most fascinating aspects of this integration is that you keep access to scikit-learn’s familiar tooling while adding advanced natural language processing on top. Something else I found interesting is that this library maintains scikit-learn’s workflow. This means the process is still basically the same: you import your libraries, load your dataset, split your data, train with the fit method, and make predictions using the predict method.
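As a quick refresher, here is that same fit/predict pattern in plain scikit-learn — no LLM involved, just the workflow shape that Scikit-LLM preserves (the iris dataset and logistic regression are used purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a toy dataset and split it into train and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train with fit, predict with predict -- Scikit-LLM estimators follow this same API
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

Scikit-LLM’s estimators slot into this exact pattern, so anything you already do with scikit-learn models carries over.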

Now, not to bore you with long talk! Let’s get started!

Getting Started with ScikitLLM

To make use of ScikitLLM, we first need to install it with pip.
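The package is published on PyPI as scikit-llm, so the install is typically:

```shell
pip install scikit-llm
```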

Currently, Scikit-LLM only supports OpenAI, GPT4All, Google PaLM 2, and Azure OpenAI.

We will, however, make use of OpenAI. Thus, you will need to set up an OpenAI account. Once done, set up billing and generate an OpenAI API token key for this project.

Then head over to your Colab or Jupyter notebook and run this:

# importing SKLLMConfig
from skllm.config import SKLLMConfig
# Set your OpenAI API key
SKLLMConfig.set_openai_key("*******")
# Set your OpenAI organization
SKLLMConfig.set_openai_org("**ABC**")

Note: ******* represents your API token key, and **ABC** represents your organization ID.

Text Summarization with ScikitLLM

We will use the Starbucks reviews dataset from Kaggle for the text summarization modeling. This dataset contains information about reviews, ratings, and location data from various Starbucks stores.

To summarize the reviews, we will use the GPTSummarizer module of the ScikitLLM library with the GPT-3.5-turbo model from OpenAI. GPTSummarizer’s max_words parameter sets a flexible limit on the number of words each summary produces — flexible because the actual length of a generated summary can exceed the limit. We then feed our reviews to the model with the fit_transform method and print the summarized reviews.

import comet_ml
import pandas as pd
from comet_ml import Artifact, Experiment
from skllm.preprocessing import GPTSummarizer

# Load your CSV file into a DataFrame
df = pd.read_csv('/content/reviews_data.csv')

# Select a subset of reviews (e.g., the first 50)
X = df["Review"].values[:50]

# Initialize the GPTSummarizer
reviews_summarizer = GPTSummarizer(openai_model="gpt-3.5-turbo", max_words=10)

# Generate summaries with your model
generated_reviews = reviews_summarizer.fit_transform(X)
The summarized review (Image by Author)

You should also try running the model on the entire dataset, though this will take noticeably longer (and consume more API credits). To do so, remove the [:50] slice from the code above.

To evaluate how good our model is, we will compute the BLEU score. BLEU compares a candidate sentence to one or more reference sentences, producing a score between 0 and 1; a score of 1 means the candidate exactly matches one of the references. However, since our dataset lacks reference summaries, we evaluate the generated summaries against themselves. Note that this necessarily yields a perfect score, so here it only demonstrates the evaluation and logging workflow rather than measuring real summary quality.

# Join the generated reviews into a single string
reviews_text = "\n".join(generated_reviews)

from nltk.translate.bleu_score import corpus_bleu

# corpus_bleu takes references first, then hypotheses;
# they are identical here, so the score is trivially 1.0
bleu_score = corpus_bleu(
    [[summary.split()] for summary in generated_reviews],  # references
    [summary.split() for summary in generated_reviews],    # hypotheses
)
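To see why self-comparison is uninformative, here is a tiny stdlib-only sketch of clipped unigram precision, one ingredient of BLEU (the real metric combines higher-order n-gram precisions with a brevity penalty). The function and the example sentences are illustrative, not part of the original pipeline:

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision: the fraction of candidate words
    that also appear in the reference, with counts clipped so a
    repeated candidate word cannot be credited more times than it
    occurs in the reference."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    overlap = sum(min(count, ref[word]) for word, count in cand.items())
    return overlap / max(1, sum(cand.values()))

# Comparing a sentence against itself always gives a perfect score,
# which is exactly what happens when summaries are their own references.
print(unigram_precision("great coffee", "great coffee"))  # 1.0
print(unigram_precision("great coffee slow service", "great coffee fast service"))  # 0.75
```

For a meaningful score, you would need human-written reference summaries to pass as the first argument to corpus_bleu.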

Running Our Model In Comet

Once done with the modeling, we can log everything in Comet.

To accomplish that, we must first create a project in Comet; this is where we will record all the relevant information, including metrics and the generated summaries (the auto-logging flags in the code below are set to True). If you don’t already have a Comet account, you will need to create one first. You can find the name of your workspace in your account settings.

The next step will be to log all our generated reviews from our model as text.

To log our artifacts, we create an Artifact instance by giving it a name and an artifact_type, then specify the file path with artifact.add(). In this case, the artifact is our dataset.

# Initialize Comet
experiment = comet_ml.Experiment(
project_name="Text Summarization",
workspace="bennykillua",
api_key="YOUR KEY",
auto_metric_logging=True,
auto_param_logging=True,
auto_histogram_weight_logging=True,
auto_histogram_gradient_logging=True,
auto_histogram_activation_logging=True,
log_code=True
)

# Log the generated reviews as text
experiment.log_text("Generated Reviews", reviews_text)


# Initialize an Artifact
artifact = Artifact(name="Reviews", artifact_type="dataset")

# Specify the path of the artifact (the dataset file)
artifact.add(r"/content/reviews_data.csv")

# Log the artifact to the experiment (Comet platform)
experiment.log_artifact(artifact)

# Log the BLEU score
experiment.log_metric("BLEU Score", bleu_score)

# Log the model (log_model expects a name and a file path, so serialize first)
import joblib
joblib.dump(reviews_summarizer, "summarizer.pkl")
experiment.log_model("summarizer", "summarizer.pkl")

# End the experiment
experiment.end()

You can view the logged model on the Comet platform.

The BLEU score (Image by Author)

The complete code is available as a single script in the GitHub gist linked at the end of this post.

Conclusion

Integrating Large Language Models with scikit-learn through the Scikit-LLM library lets us pair advanced language understanding with scikit-learn’s familiar modeling workflow. Furthermore, by leveraging Comet, you get a user-friendly interface to track experiments, log artifacts and metrics, and collaborate with other data scientists.

P.S. If you prefer to learn by code, check out this GitHub gist, which hosts the code snippets. Also, do check out the logged model on the Comet platform.

Editor’s Note: Heartbeat is a contributor-driven online publication and community dedicated to providing premier educational resources for data science, machine learning, and deep learning practitioners. We’re committed to supporting and inspiring developers and engineers from all walks of life.

Editorially independent, Heartbeat is sponsored and published by Comet, an MLOps platform that enables data scientists & ML teams to track, compare, explain, & optimize their experiments. We pay our contributors, and we don’t sell ads.

If you’d like to contribute, head on over to our call for contributors. You can also sign up to receive our weekly newsletter (Deep Learning Weekly), check out the Comet blog, join us on Slack, and follow Comet on Twitter and LinkedIn for resources, events, and much more that will help you build better ML models, faster.
