GPT-3.5 vs. MPT-30B - A Qualitative Comparison

Working for a company that develops data science and machine learning tools comes with some fantastic perks. One of them is the opportunity to play around with the newest and most advanced large language models (LLMs) all the time.

Recently, during our user research, we had insightful conversations with over 200 data science and ML leaders. One recurring issue in the LLM space caught our attention: testing and comparing LLMs is incredibly challenging. There are several reasons for this. From a scientific standpoint, standard success metrics like F-measure, accuracy, and recall don't apply cleanly to the open-ended output of LLMs.

On a more practical note, the cost of deploying a model (considering infrastructure and engineering time) is prohibitively high. Some models, such as Bloomz, can rack up operating expenses of up to $200K per year.

Deploying LLM-powered data science pipelines for our customers puts us in a unique position to evaluate their performance. Although this article doesn't claim to be a scientific analysis of LLMs, it aims to provide you with some insights into how they perform on specific tasks.

We are curious to know if there's interest in the industry for a platform to evaluate different models side by side. If you're interested, let us know in the comments or shoot us an email at support@seaplane.io. With enough interest, we can justify running more models and even offer you the chance to run your own evaluations.

Now, let's dive into evaluating two models: GPT-3.5 (ChatGPT) and MPT-30B, an open-source model by MosaicML. We compared them across three categories, based on common industry use cases identified during our user research for our data science product:

  1. Text Generation
  2. Logical Reasoning
  3. Classification

As our evaluation metric, we asked our colleagues to vote on the output of the two models for three questions per category. Specifically, we asked them to focus on the general correctness of the answers, whether the answers aligned with the prompts, and whether there were any excessive hallucinations.

Please note that this is a relatively small sample size, both in terms of the number of questions (9) and the number of votes. So, our analysis isn't scientifically rigorous, but it will certainly give you some useful insights for choosing the right model for your needs. We have provided all questions and the model output at the bottom of this article.
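For the curious, here is a minimal sketch of how pairwise preference votes like these can be tallied into per-category results. The ballots below are made-up placeholders, not our actual vote data:

```python
# Minimal sketch of tallying pairwise preference votes per category.
# The ballots below are made-up placeholders, not our actual vote data.
from collections import Counter

ballots = [
    ("text generation", "gpt-3.5"),
    ("text generation", "gpt-3.5"),
    ("text generation", "mpt-30b"),
    ("logical reasoning", "gpt-3.5"),
    ("logical reasoning", "mpt-30b"),
    ("classification", "mpt-30b"),
    ("classification", "mpt-30b"),
]

tallies = Counter(ballots)
for category in sorted({cat for cat, _ in ballots}):
    votes = {model: n for (cat, model), n in tallies.items() if cat == category}
    total = sum(votes.values())
    summary = ", ".join(f"{m}: {n}/{total}" for m, n in sorted(votes.items()))
    print(f"{category}: {summary}")
```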

Text Generation

Both models performed reasonably well in this category. We asked both models to summarize a news article, create an ad for an Airbnb listing, and draft an email to a landlord complaining about a broken AC system.

GPT-3.5 tended to produce longer answers, even though we configured MPT-30B to allow up to 4K tokens of output. That said, GPT-3.5 impressed us with its ability to generate better-formatted answers, including bulleted lists and a more structured approach to the text. This makes it our top choice for text generation.
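For reference, here is a rough sketch of how the same prompt can be run through both models. The checkpoint name, parameters, and client calls are illustrative assumptions rather than a record of our exact setup; note the 4K-token output cap mentioned above:

```python
# Sketch: run one prompt through both models. Assumes the `openai` and
# `transformers` packages; model names and parameters are illustrative.
from openai import OpenAI
from transformers import AutoModelForCausalLM, AutoTokenizer

prompt = "Draft a polite email to my landlord about a broken AC system."

# GPT-3.5 via the hosted OpenAI API (reads OPENAI_API_KEY from the environment).
client = OpenAI()
gpt_reply = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
).choices[0].message.content

# MPT-30B via Hugging Face; MPT's custom model code requires trust_remote_code.
# A 30B model needs roughly 60 GB of GPU memory in fp16, so treat this as a sketch.
tokenizer = AutoTokenizer.from_pretrained("mosaicml/mpt-30b-chat")
model = AutoModelForCausalLM.from_pretrained(
    "mosaicml/mpt-30b-chat", trust_remote_code=True, device_map="auto"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=4096)  # the 4K output cap above
mpt_reply = tokenizer.decode(
    output[0][inputs["input_ids"].shape[1] :], skip_special_tokens=True
)
```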

Logical Reasoning

We posed basic logical reasoning questions to both models, such as completing numerical sequences, ordering disease timelines, and understanding family structures. The answers in this category were, to put it mildly, a bit bonkers, especially from MPT-30B, which struggled with finishing numerical sequences and rambled about numbers instead.

It's safe to say that neither model excelled in this category. However, when asked, our team slightly preferred GPT-3.5's answers over MPT-30B's.

Classification

If we had to pick a category where both models shined, it would be Classification. We asked both models to identify the language of a text, classify visa categories, and categorize shipment types. Impressively, both models provided 3/3 correct answers.

While both models performed well, our team preferred MPT-30B over GPT-3.5, as its answers were more to the point and therefore sounded more human.

Where can LLMs provide business value?

Looking at the performance of both models across the three categories, we can begin to identify their strengths and weaknesses. Generally, LLMs perform better when guided with detailed information.

The more specifics you provide, the better the results. LLMs can write entire articles from a single line of input, but relying that heavily on the model's imagination leads to less reliable outcomes, including some "creative" but inaccurate responses. A better approach is to draft about 80% of the content yourself and let the LLM take care of the final 20%, such as improving structure and grammar. As you might have guessed, that is exactly what we did for this article.
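As a concrete sketch of that 80/20 workflow, here is one way to hand a human-written draft to GPT-3.5 for polishing. The prompt wording is our own illustration, not the exact prompt used for this article:

```python
# Sketch of the 80/20 workflow: a human writes the draft, the LLM polishes it.
# Prompt wording and model choice are illustrative assumptions.
from openai import OpenAI

draft = """We compared GPT-3.5 and MPT-30B on text generation, logical
reasoning and classification. Both did well on classification, ..."""  # your ~80%

client = OpenAI()
polished = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "system",
            "content": (
                "Improve the structure and grammar of the user's draft. "
                "Do not add new facts or claims."
            ),
        },
        {"role": "user", "content": draft},
    ],
).choices[0].message.content
print(polished)  # the final ~20%: cleaner structure and grammar
```

The key design choice here is the system instruction forbidding new facts, which keeps the model in "editor" mode rather than "author" mode and limits the risk of creative but inaccurate additions.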

Both LLMs struggled in the realm of logical reasoning. This aligns with research on BERT, which shows that the model doesn't genuinely comprehend the logical reasoning process; instead, it has learned statistical patterns from its training data (On the Paradox of Learning to Reason from Data).

On the bright side, both LLMs excelled in classification tasks, highlighting their ability to perform well when provided with more information. This opens up exciting opportunities for businesses to leverage LLMs, particularly for tedious classification tasks across large numbers of records: support ticket classification, visa category classification, banking transaction classification, and so on. Think about your business: where are you spending hours upon hours on tedious manual tasks that an LLM could automate?
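To make that concrete, here is a minimal sketch of prompt-based support ticket classification against a constrained label set. The labels, prompt, and fallback logic are hypothetical:

```python
# Sketch: zero-shot support-ticket classification against a fixed label set.
# Labels, prompt, and model choice are illustrative assumptions.
from openai import OpenAI

LABELS = ["billing", "bug report", "feature request", "account access", "other"]
client = OpenAI()

def classify_ticket(ticket_text: str) -> str:
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # keep classification output as deterministic as possible
        messages=[
            {
                "role": "system",
                "content": "Classify the support ticket. Reply with exactly one "
                + f"label from this list: {', '.join(LABELS)}.",
            },
            {"role": "user", "content": ticket_text},
        ],
    ).choices[0].message.content.strip().lower()
    return reply if reply in LABELS else "other"  # guard against off-list answers

print(classify_ticket("I was charged twice for my subscription this month."))
```

Pinning the temperature to 0 and validating the reply against the label list are what make a free-text model usable as a classifier at scale.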

What is Seaplane?

At Seaplane, we specialize in enabling exactly these use cases. Our expertise lies in building pipelines powered by LLMs (and other large models) that can handle thousands of tasks within seconds or minutes, saving significant time and effort.

So, if you're interested in harnessing the power of LLMs for your organization, sign up for our beta today or reach out to us. We'd love to help you unlock the potential of cutting-edge machine learning and AI in production.

The results

Take a look at the various outputs per question for each model here.

