March Model Madness - A scientific and fun approach to crowning an annual LLM winner

Seaplane

One of our favorite times of the year is NCAA Basketball Tournament season: March Madness. The beauty of this series of events is the excitement of competition, the scientific approach to game analysis, and, of course, the upsets. This got us thinking about a new type of competition for the AI community, but we need your help.

We’ve also noticed the constant social media buzz in 2023 and so far in 2024 as one Large Language Model (LLM) after another is proclaimed to be “the next ChatGPT.” In many cases, as you know, they were not. Arguments ensue over performance on this metric versus that metric, or over which model is good for chat and which is good for coding. We have that debate ourselves all the time, since on the Seaplane Platform we have the luxury of access to all the latest and greatest production models, including those from OpenAI as well as Gemini, Llama, Mixtral, and more, with a single line of Python. We and our customers get to test and play with any or all of these LLMs every day. However, we recognize that many people don’t have the same access to every major Foundational Model (FM) and open-source LLM of interest and are forced to do their own research on, say, Hugging Face or the Chatbot Arena by LMSYS.org. Both are great organizations doing great things, but each seemed to lack one or more of the exciting, scientific, competitive, and fun elements we were looking for.

Introducing March Model Madness (MMM for short): a new annual competition in March that will pit 16 of the top FMs and LLMs against each other in a competitive battle measured on Chat, Instruct, and Code. Additionally, we will have some fun by running a set of Text-to-Image models in a smaller fourth bracket, in a slightly less scientific way. To keep it exciting, MMM will be a blind competition: the winner of each matchup stays anonymous and proceeds to the next round, while only the loser is revealed. Think “The Masked Singer,” but for genAI instead of singing battles.
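
For readers who like to see the mechanics, here is a minimal Python sketch of how a blind, single-elimination bracket like the one described above could work. It is an illustration only, not the actual MMM implementation: the function name `run_blind_bracket`, the "Contender N" labels, and the `pick_winner` callback (a stand-in for community voting) are all assumptions made for this example.

```python
import random
from typing import Callable, Dict, List, Tuple


def run_blind_bracket(
    models: List[str],
    pick_winner: Callable[[str, str], str],
    seed: int = 0,
) -> Tuple[str, List[Tuple[int, str]]]:
    """Run a blind single-elimination bracket (hypothetical sketch).

    `pick_winner` only ever sees anonymous labels, never model names.
    Returns the champion's real name and a list of (round, model name)
    pairs for each eliminated model, in the order they were revealed.
    """
    n = len(models)
    if n < 2 or n & (n - 1):
        raise ValueError("bracket size must be a power of two")

    # Shuffle, then hide each model behind an anonymous label.
    rng = random.Random(seed)
    shuffled = list(models)
    rng.shuffle(shuffled)
    labels: Dict[str, str] = {
        f"Contender {i + 1}": name for i, name in enumerate(shuffled)
    }

    alive = list(labels)
    revealed: List[Tuple[int, str]] = []
    round_no = 1
    while len(alive) > 1:
        next_round = []
        # Pair neighbours: (0, 1), (2, 3), ...
        for a, b in zip(alive[0::2], alive[1::2]):
            winner = pick_winner(a, b)
            if winner not in (a, b):
                raise ValueError("pick_winner must return one of its inputs")
            loser = b if winner == a else a
            # Only the loser is unmasked; the winner stays anonymous.
            revealed.append((round_no, labels[loser]))
            next_round.append(winner)
        alive = next_round
        round_no += 1

    return labels[alive[0]], revealed


# Example: a coin flip stands in for the community vote.
coin = random.Random(42)
demo_models = [f"model-{i}" for i in range(1, 17)]
champion, eliminated = run_blind_bracket(
    demo_models, lambda a, b: coin.choice([a, b])
)
```

With 16 entrants this plays four rounds and reveals 15 models along the way, so the champion is the only name that stays hidden until the end.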

To keep it scientific, we will do a few things. First, a set of competitive questions, differing by round, will be applied to all models in a consistent and rigorous fashion. Second, all questions, answers, and vote summaries for all models in this competition will be published so that the community of participants can leverage these results for their own testing, tuning, and general AI research. Last but not least, we will be able to apply a handful of scientific scoring methods to the input and output datasets.

The CHAT bracket will consist of the following models:

GPT4, GPT3.5, Gemini Pro, Claude 2.1, Claude Instant, Jurassic2 Ultra, LLama2-70b-chat, LLama2-13b-chat, LLama2-7b-chat, mistral-7b-v0.1, starling-lm-7b-alpha, zephyr-7b-beta, yi-34b-chat, falcon-40b-instruct, mixtral-8x7b-instruct-v0.1 and yi-6b

The INSTRUCT bracket will consist of the following models:

GPT4, GPT3.5, Gemini Pro, Claude 2.1, Claude Instant, Jurassic2 Ultra, LLama2-70b, LLama2-13b, LLama2-7b, mistral-7b-v0.1, vicuna-13b, zephyr-7b-beta, starling-lm-7b-alpha, falcon-40b-instruct, mixtral-8x7b-instruct-v0.1 and Claude 3

The CODE bracket will consist of the following models:

GPT4, GPT3.5, Gemini Pro, Codey, Claude 2.1, Jurassic2 Ultra, codellama-70b-instruct, codellama-34b-instruct, Claude 3-instruct, codellama-70b-python, codellama-34b-python, codellama-7b-python, wizardcoder-34b-v1.0, mixtral-8x7b-instruct-v0.1, mistral-7b-v0.1 and olmo-7b

The smaller IMAGE CREATION bracket will consist of the following models:

gemini-1.0-pro-vision-001, SDXL-1.0, Midjourney v6, and DALL-E 3

We hope you will help us by submitting questions, up-voting proposed questions so the best ones make it into the contest, and sharing with your friends and colleagues so they can vote for the best results during the contest. To encourage everyone to help, we will add some prizes to make the contest, and your time spent, even more worthwhile.

Check out www.marchmodelmadness.com and sign up to participate today! Submit your best prompt ideas now; tournament voting will start on March 25th.

Get Access Now!

Seaplane is where AI-infused applications take flight in 2024, so join us for the ride by signing up below. We are currently in Beta, so make sure to sign up to secure your place in line and we’ll get to you as fast as we can.

Sign up for the beta here!
