Assessing the performance of a large language model can be challenging. Traditional performance metrics like F1-score, precision, and recall are not directly applicable to free-form text output, which poses a significant problem for anyone running these models in production.
One major issue is the difficulty in effectively comparing different large language models (LLMs). Many companies we've engaged with resort to manual comparisons. They typically select a set of questions and rely on employees or a team to determine their preference for each model's answers. In our earlier blog post titled "GPT-3.5 VS MPT-30B - A Qualitative Comparison," we delved into this approach. While it's a valid method, it remains an essentially qualitative experiment, heavily influenced by individual opinions.
Additionally, recent research indicates a decline in the performance of well-known OpenAI LLMs over time. You can read more about this in the study titled "Exploring the Impact of Temporal Dynamics on Language Model Evaluation." This discovery underscores the increased significance of continuously assessing the performance of your chosen LLM.
So, how can we evaluate LLMs more objectively? This article presents a methodology rooted in mathematical analysis rather than subjective viewpoints. First, we cover the math behind it, followed by a framework for testing LLMs over time.
To compare the performance of two LLMs, or one LLM over time, we are going to use a concept called cosine similarity. Cosine similarity is the cosine of the angle between two vectors; that is, it is the dot product of the vectors divided by the product of their lengths.
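Written out as a formula, the definition reads as follows. The symbols A and B for the two n-dimensional vectors are our own notation, chosen for illustration.

```latex
\[
\cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}
         {\sqrt{\sum_{i=1}^{n} A_i^{2}} \, \sqrt{\sum_{i=1}^{n} B_i^{2}}}
\]
```

As a small worked example, take A = (1, 0) and B = (1, 1). The dot product is 1, the lengths are 1 and √2, so the cosine similarity is 1/√2 ≈ 0.707, which corresponds to an angle of 45 degrees.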
In simpler terms, cosine similarity helps us understand how similar two vectors are. Now, let's see how this concept assists us in the assessment of LLMs. This is where embeddings come into play once more. If you've gone through our previous blog post on in-context learning, you're familiar with the idea that we can represent any text as a vector by utilizing embeddings.
By doing so, we can compare two strings (pieces of text) by first using an embedding model to represent both as vectors and then computing the cosine similarity of those vectors.
The code above uses the seaplane_embeddings functionality to turn two strings, A and B, into vectors and compare them using the cosine similarity score. In general, cosine similarity ranges from -1 to 1, but for text embeddings the score typically falls between zero and one, where zero is not at all similar, and one is exactly the same.
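For readers who want to try the idea without an embedding service, here is a minimal, self-contained sketch. It does not use seaplane_embeddings: the `toy_embed` function is a hypothetical stand-in that counts words, which is far cruder than a real embedding model, and returning 0.0 for empty input is a convention we chose here. Only the cosine similarity step carries over to a real setup.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Dot product of the two vectors divided by the product of their lengths."""
    # Only words present in both texts contribute to the dot product.
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat empty text as "not similar"
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    reference = toy_embed("the president leads the executive branch")
    candidate = toy_embed("the president heads the executive branch")
    print(cosine_similarity(reference, candidate))
```

With a real embedding model, you would replace `toy_embed` with a call that returns a dense vector and compute the same ratio over its components.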
Now that we know what cosine similarity is and how we can use it, we can look at a real-world example. Let’s compare the output of both GPT-3.5 and MPT-30B against a reference prompt and answer. We ask both models who is the president of the USA and what is his job. Or, more specifically:
Who is the united states president, and what is his job? Answer in no more than two sentences.
We can get a reference answer for this question by creating a two-sentence summary of the description of the US president's role on whitehouse.gov.
The generated responses of the prompt are as follows for MPT-30B and GPT-3.5.
We can use the simple Python script we linked above to compare each model's response to our reference answer, which gives us the following two cosine similarity scores.
Cosine Similarity GPT: 0.9111925123456341
Cosine Similarity MPT: 0.9094120847934666
Based on these numbers, we can conclude that both answers are close to the reference answer, with GPT-3.5 scoring marginally higher than MPT-30B (a difference of roughly 0.002).
We now have a quantitative comparison of MPT-30B and GPT-3.5. However, this approach has one major drawback: it requires us to create a set of reference questions and answers, which can be time-consuming and difficult. But as we will see in the framework below, we only need to do this once, and it enables a powerful measurement tool for LLMs over time.
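To make the size of that gap concrete, the short sketch below ranks the two scores reported above and computes the difference between them. The variable names are ours, and the scores are copied from the results in this post.

```python
# Cosine similarity scores reported above for each model.
scores = {
    "GPT-3.5": 0.9111925123456341,
    "MPT-30B": 0.9094120847934666,
}

# Sort from highest to lowest similarity to the reference answer.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
best_model, best_score = ranked[0]
runner_up_model, runner_up_score = ranked[1]
margin = best_score - runner_up_score

print(f"{best_model} leads {runner_up_model} by {margin:.4f}")
```

A gap this small comes from a single prompt, so in practice you would likely want to average scores over many prompts before drawing firm conclusions.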
A Framework For Testing Your LLM Over Time
By now, we know what cosine similarity is and how we can use it to test the performance of LLMs. Finally, let’s look at how we can apply these principles to monitor the performance of our LLMs over time.
To evaluate the performance of your LLM, we first need a ground truth: a set of reference answers and the associated prompts we feed to the LLM to generate a hopefully similar response. How many questions you need, and on which topics, depends on your use case, but we recommend you keep the following in mind.
- Ensure your prompts are specific and limiting. For example, limit the maximum number of output sentences.
- If you are using in-context learning, ensure you always feed it the same information and limit the LLM’s response to only the data you feed it.
- Ensure your reference questions mimic your real use case as closely as possible. For example, there is no value in knowing that an LLM can correctly answer questions about the US president if your use case is answering questions about software engineering. This may seem obvious, but the important part is that the reference questions need to reflect the work you expect the LLM to do for you.
Once you have created your testing prompts and the reference answers, build a pipeline that checks these answers. How often you run this pipeline depends on your specific situation, but we recommend you run it at least once a week and preferably every day. The sooner you catch any issues, the sooner you can address them.
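One way to keep prompts and reference answers organized is a small data structure like the one sketched below. The `EvalCase` class, its field names, and the `validate` helper are hypothetical and only illustrate the recommendations above: a specific prompt, a fixed sentence limit, and a reference answer drawn from your own domain.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EvalCase:
    """One reference question with its ground-truth answer (hypothetical schema)."""
    prompt: str
    reference_answer: str
    max_sentences: int = 2  # keeps responses short and comparable

    def full_prompt(self) -> str:
        """Append the length limit so every run uses the same constraint."""
        return f"{self.prompt} Answer in no more than {self.max_sentences} sentences."


def validate(cases: List[EvalCase]) -> None:
    """Reject empty fields and duplicate prompts before the cases are used."""
    seen = set()
    for case in cases:
        if not case.prompt.strip() or not case.reference_answer.strip():
            raise ValueError("prompt and reference answer must be non-empty")
        if case.prompt in seen:
            raise ValueError(f"duplicate prompt: {case.prompt!r}")
        seen.add(case.prompt)
```

Freezing the dataclass is a deliberate choice: the reference set should stay identical between runs so that score changes reflect the model, not the test data.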
An LLM performance measuring pipeline on Seaplane might look something like this.
Our pipeline is of type cronjob, which triggers on a daily basis and consists of three tasks. We store our prompts, reference answers, and their embeddings in a vector store, which we add as a resource to the LLM inference task and the cosine check task. In the LLM inference task, we run each prompt through the LLM and pass the answer along to the cosine check task. In the cosine check task, we create vector embeddings of the new inference results and compare them to our reference embeddings stored in the vector store. Results are stored in a SQL store at the end of the pipeline.
Over time, you build up insights into the LLM's performance in your specific domain. We can set up triggers to notify us over Slack, email, or any other tool if the performance of the LLM drops below a certain value; e.g., if the cosine similarity is less than 0.6, send a notification.
While this is an excellent approach to evaluating an LLM's performance over time, we should not overlook the downsides.
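The flow described above can be sketched in plain Python. This is not the Seaplane API: `run_llm` and `embed` are hypothetical callables you would supply, a list of tuples stands in for the vector store, and SQLite from the standard library stands in for the SQL store. The table name `llm_scores` is also our own choice.

```python
import math
import sqlite3
from datetime import datetime, timezone
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def run_daily_check(
    cases: Sequence[Tuple[str, Vector]],  # (prompt, reference embedding) pairs
    run_llm: Callable[[str], str],        # hypothetical LLM inference call
    embed: Callable[[str], Vector],       # hypothetical embedding call
    db: sqlite3.Connection,
) -> List[float]:
    """Run every prompt, score the answer against its reference, store the result."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS llm_scores "
        "(run_at TEXT, prompt TEXT, answer TEXT, score REAL)"
    )
    run_at = datetime.now(timezone.utc).isoformat()
    scores = []
    for prompt, reference_vector in cases:
        answer = run_llm(prompt)  # inference step
        score = cosine_similarity(embed(answer), reference_vector)  # cosine check
        db.execute(
            "INSERT INTO llm_scores VALUES (?, ?, ?, ?)",
            (run_at, prompt, answer, score),
        )  # persist the result for later trend analysis
        scores.append(score)
    db.commit()
    return scores
```

Scheduling is left out on purpose; a cron entry or your platform's scheduler would call `run_daily_check` once per day.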
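A threshold trigger of this kind could look roughly like the sketch below. The function names are hypothetical, and `notify` is any callable you provide (for example, a function that posts to Slack or sends an email); the 0.6 default simply mirrors the example value in the text.

```python
from typing import Callable, Dict, List

ALERT_THRESHOLD = 0.6  # example value from the text; tune for your use case


def find_regressions(
    scores: Dict[str, float], threshold: float = ALERT_THRESHOLD
) -> List[str]:
    """Return the prompts whose similarity score fell below the threshold."""
    return [prompt for prompt, score in scores.items() if score < threshold]


def alert_on_regressions(
    scores: Dict[str, float],
    notify: Callable[[str], None],  # hypothetical notification hook
    threshold: float = ALERT_THRESHOLD,
) -> bool:
    """Send one notification listing all failing prompts; return True if any failed."""
    failing = find_regressions(scores, threshold)
    if failing:
        notify(
            f"{len(failing)} prompt(s) scored below {threshold}: "
            + "; ".join(failing)
        )
    return bool(failing)


if __name__ == "__main__":
    alert_on_regressions({"q1": 0.91, "q2": 0.42}, print)
```

Sending a single message per run, rather than one per failing prompt, keeps the notification channel readable when many prompts regress at once.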
- Overlooking Context: Two answers may score high in similarity but differ in contextual appropriateness. This nuance is something cosine similarity might overlook.
- Embedding Biases: The quality and nature of results are influenced by the embedding model used, each coming with its own strengths and weaknesses.
- Not the Sole Metric: While powerful, cosine similarity shouldn't be the only metric used. Incorporating other evaluation methods ensures a comprehensive assessment of an LLM's prowess.
- Initial Setup Requires Manual Work: Drafting the reference questions and answers has to be done by hand. We recommend you spend quite a bit of time on these to ensure they match your use case.
In conclusion, the mathematical comparison of LLMs with each other or reference answers can be a powerful tool to use as a quantitative measure of correctness. It can help take the human out of the loop (once the reference questions and answers are drafted) and enable automated testing of LLM performance over time.