March Model Madness Results


If you’ve been following along with March Model Madness (MMM), you’ve seen surprising results and plenty of tie-breakers. Smaller models defeated larger models, and older versions beat newer iterations. Because MMM was a blind tournament, where users saw the responses without knowing which models produced them, wins and losses were decided without bias toward larger, better-known, or newer models. So, ultimately, who won?

  • Chat: Gemini Pro
  • Code: GPT-3.5
  • Instruct: GPT-4
  • Image: OpenAI DALL-E 3

How did it all happen? Which matchups were outright slam dunks, and which yielded ties that led to overtime?

Let’s get into some exciting matches! 👀 🥁


Before highlighting some of the exciting behind-the-scenes playbacks: why did we create this event, and how did it get started? MMM is the first edition of MLOps Community x Seaplane’s new annual event, where we set the stage for a blind model tournament.

When we say Large Language Models (LLMs), ChatGPT is usually the first model that comes to mind. With the rise of LLMs and Small Language Models (SLMs) that are not yet mainstream household names, Seaplane teamed up with the MLOps Community to host a blind tournament where users judged the responses of 16 models per category (chat, instruct, and code) without knowing which was which.

We removed the preferential biases by giving models aliases; the models’ names were revealed only once they were eliminated.

Tournament Structure

There are over 40 models available on Seaplane; we chose 16 models for each of the Chat, Code, and Instruct categories, and four of the top models for the Text-2-Image (Image) category.

The event was divided into two tournament styles. The Chat, Code, and Instruct categories competed in a bracket system, where sixteen models faced off in one-on-one matches. Each match produced one winner and one loser: the loser was knocked out, and the winner moved on. Round one included 16 models, then eight in round two, four in round three, and so on, down to a single winner (can you guess who?).
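That halving from 16 down to 1 is classic single elimination; here is a minimal sketch in Python (the entrant names and the random winner function are placeholders, not the real voting logic):

```python
import random

def run_bracket(models, pick_winner):
    """Single elimination: pair up models, keep the winners, repeat until one remains."""
    while len(models) > 1:
        # Pair adjacent models: (0 vs 1), (2 vs 3), ... and keep each match's winner
        models = [pick_winner(a, b) for a, b in zip(models[::2], models[1::2])]
    return models[0]

# 16 hypothetical entrants; the winner of each match is chosen at random here
entrants = [f"model-{i}" for i in range(1, 17)]
champion = run_bracket(entrants, lambda a, b: random.choice([a, b]))
print(champion)
```

With 16 entrants this runs exactly four rounds (16 → 8 → 4 → 2 → 1).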

Chat Tournament Results
Code Tournament Results
Instruct Tournament Results
Image Tournament Results

The Image models competed in a round-robin tournament, where each model faced every other model one on one.
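A round-robin over four models is simply every pairwise matchup, six matches in total (the lineup below is illustrative, not the actual entrants):

```python
from itertools import combinations

image_models = ["image-model-a", "image-model-b", "image-model-c", "image-model-d"]
matchups = list(combinations(image_models, 2))  # every unordered pair
print(len(matchups))  # 4 models -> 6 one-on-one matches
```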

Win and Lose Conditions

At the end of each round, we tallied the results. For each face-off, the model that won the most prompts moved forward. In case of a tie, we compared the total number of votes across all prompts rather than the number of prompts won.
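In code, that win-and-tie-break rule might look like the following (the vote data shape is our assumption; this is a sketch, not the actual tally script):

```python
def advance(model_a, model_b, prompt_votes):
    """Decide which model moves forward.

    prompt_votes: list of (votes_for_a, votes_for_b) tuples, one per prompt.
    Primary rule: most prompts won. Tie-break: most total votes across prompts.
    """
    wins_a = sum(a > b for a, b in prompt_votes)
    wins_b = sum(b > a for a, b in prompt_votes)
    if wins_a != wins_b:
        return model_a if wins_a > wins_b else model_b
    # Tied on prompts won: fall back to total votes across all prompts
    total_a = sum(a for a, _ in prompt_votes)
    total_b = sum(b for _, b in prompt_votes)
    return model_a if total_a >= total_b else model_b
```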

Voting Page Example

Interesting Results

Seeing the winners for each category, we aren’t surprised to find GPT-4 among them. However, this mini tournament showed that new iterations and larger models are not always the crowd favorites.

In the Chat category, the smaller 7B-parameter models Zephyr-7b and Starling-lm-7b beat competitors with more parameters: Llama-2-13b and GPT-3.5 (rumored to have about 175 billion parameters), respectively.

In the Code category, a smaller sibling beat a larger one in Round 1, with Codellama-34b-python winning against Codellama-70b-python. Then, in Round 2, Codellama-7b-instruct won against Codellama-34b-python.

Code Category - Round 1 And 2

These results show that when choosing which model to incorporate into your applications, bigger is not always better.

We hope the tournament was as much fun for everyone as it was for us to host! The results are a surprise and an incentive to explore the many available models before choosing which to incorporate into your next application. You can access over 40 models through Seaplane’s platform. For more information and to sign up, please visit us here.

Look out for our second blog, where we dive deeper into the specifics of the matches and how some models’ results compare to the published benchmarks in their papers. We will let you in on a little secret: benchmarks aren’t everything!


Bring Your Apps to the World

Join the Seaplane Beta