Average of BaFC A, BaFC B, BaFC C, and BaFC D.
Back-and-forth conversations: One piece of information is provided, followed by a question about it.
Back-and-forth conversations: One piece of information is provided, followed by a question about it. The information involves neither personal data nor common knowledge, which forces the model to reason from the conversation itself and removes privacy concerns.
Back-and-forth conversations: The first message provides several details; the next two messages change the subject, which may "confuse" the model, and a question about the first message is then asked.
Back-and-forth conversations: The first message provides several details; the next two messages change the subject, which may "confuse" the model, and a question about the first message is then asked. The information involves neither personal data nor common knowledge, which forces the model to reason from the conversation itself and removes privacy concerns.
Average of Tool A and Tool B.
Straightforward math questions that could be solved with a calculator.
Implied math questions: the possible need for a calculator must be inferred.
Provides a conversation starter in a given language; the model must respond in that language.
A weighted average in which RTTFC carries 10 times the weight of ARTTTC and ACRTFO.
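As a minimal sketch, assuming the three sub-scores (defined below) are already expressed as 0-100 percentages; the 10/1/1 weights come from the description above, while the function name is illustrative:

```python
def streaming_score(rttfc: float, artttc: float, acrtfo: float) -> float:
    """Weighted average of the three streaming sub-scores (0-100 each).
    RTTFC carries 10x the weight of ARTTTC and ACRTFO."""
    return (10 * rttfc + 1 * artttc + 1 * acrtfo) / (10 + 1 + 1)

# A model that streams its first character quickly dominates the score:
print(streaming_score(rttfc=95, artttc=40, acrtfo=50))  # -> ~86.67
```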
How fast was the first character received compared to the total request time? If the first character arrives rapidly relative to the request's entire duration, streaming is efficient. Conversely, a very long time to first character suggests an absence of streaming.
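One plausible formulation (an assumption; the exact formula is not stated) scores the share of the request that elapses after the first character arrives:

```python
def rttfc_score(time_to_first_char: float, total_request_time: float) -> float:
    """Hypothetical RTTFC: 100% if the first character arrives instantly,
    near 0% if it arrives only at the very end (no streaming)."""
    if total_request_time <= 0:
        return 0.0
    return 100.0 * (1.0 - time_to_first_char / total_request_time)

print(rttfc_score(0.2, 8.0))  # first character at 0.2s of an 8s request -> 97.5
```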
Average speed of the slowest 10% of streaming events compared to the total request duration. A high number means the model delivers partial characters at high speed even at its slowest, creating the feeling of "live typing." A low number can make the model appear to "freeze."
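A sketch of one way to compute this, assuming each stream event's arrival time and character count are logged (the per-event speed definition here is an assumption):

```python
def slowest_decile_ratio(event_times: list[float], event_chars: list[int]) -> float:
    """Hypothetical sketch: per-event speed (characters divided by the gap
    since the previous event), averaged over the slowest 10% of events,
    relative to the whole request's average speed. Low values correspond
    to visible "freezes" in the stream."""
    speeds, prev = [], 0.0
    for t, n in zip(event_times, event_chars):
        gap = t - prev
        if gap > 0:
            speeds.append(n / gap)
        prev = t
    if not speeds:
        return 0.0
    speeds.sort()
    slowest = speeds[: max(1, len(speeds) // 10)]  # slowest 10% of events
    overall = sum(event_chars) / event_times[-1]   # whole-request chars/sec
    return (sum(slowest) / len(slowest)) / overall
```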
Average characters per stream event compared to the final output's character count. Models with a higher number produce text more quickly, simulating "live typing." A high ACRTFO score is irrelevant if RTTFC and ARTTTC are low: receiving 10 stream events after a 5-second delay doesn't create the feeling of streaming.
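Read literally, the ratio could be sketched like this (heavily hedged; the scoring built around it is not published):

```python
def acrtfo(event_chars: list[int]) -> float:
    """Hypothetical ACRTFO: average characters received per stream event,
    as a fraction of the final output's character count."""
    total = sum(event_chars)
    if not event_chars or total == 0:
        return 0.0
    return (total / len(event_chars)) / total
```

Under this literal reading the ratio reduces to 1 / (number of stream events); the published scoring may transform it differently.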
The highest CPS sets the perfect score at 100%. Scores for each model are then compared to this number.
Character count / completion time in seconds for the request.
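A sketch of the two steps described above: raw CPS per request, then relative scoring against the fastest model:

```python
def cps(char_count: int, completion_seconds: float) -> float:
    """Raw characters per second for a single request."""
    return char_count / completion_seconds

def cps_scores(model_cps: dict[str, float]) -> dict[str, float]:
    """The fastest model sets the perfect score at 100%; every other
    model is scored relative to it."""
    best = max(model_cps.values())
    return {name: 100.0 * value / best for name, value in model_cps.items()}

print(cps_scores({"model_a": 180.0, "model_b": 90.0}))
# -> {'model_a': 100.0, 'model_b': 50.0}
```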
Weighted average of Input Price (1x), Output Price (2x), Average Number of Output Tokens (1x), and Average Cost per 10,000 Prompts (3x).
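As a minimal sketch, assuming each of the four components is already a 0-100 percentage (the function name is illustrative; the 1x/2x/1x/3x weights come from the description above):

```python
def cost_efficiency(input_price_score: float, output_price_score: float,
                    output_tokens_score: float, cost_per_10k_score: float) -> float:
    """Weighted average with the 1x/2x/1x/3x weights described above."""
    weighted = (1 * input_price_score + 2 * output_price_score
                + 1 * output_tokens_score + 3 * cost_per_10k_score)
    return weighted / (1 + 2 + 1 + 3)
```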
Input pricing is normalized to USD per token. The lower the price, the higher the model's percentage score in Cost Efficiency.
Output pricing is normalized to USD per token. The lower the price, the higher the model's percentage score in Cost Efficiency.
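One common way to turn "lower price = higher score" into a percentage (an assumption; the exact normalization is not stated) is to score each model against the cheapest one:

```python
def price_scores(usd_per_token: dict[str, float]) -> dict[str, float]:
    """Hypothetical normalization: the cheapest model scores 100%;
    each other model scores cheapest / price."""
    cheapest = min(usd_per_token.values())
    return {name: 100.0 * cheapest / price for name, price in usd_per_token.items()}

print(price_scores({"model_a": 2e-6, "model_b": 8e-6}))
# -> {'model_a': 100.0, 'model_b': 25.0}
```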
The number of tokens a model generates matters; a seemingly cheaper model might cost more if it produces too many tokens. For instance, a model half as expensive as another might not save you money if it generates twice as many tokens for the same prompt.
Average Output Tokens × 10,000 × USD per Output Token
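A worked example with assumed numbers, which also illustrates the caveat above about output-token counts: the model with the lower per-token price can still cost more overall.

```python
def cost_per_10k_prompts(avg_output_tokens: float, usd_per_output_token: float) -> float:
    """Average Output Tokens x 10,000 x USD per Output Token."""
    return avg_output_tokens * 10_000 * usd_per_output_token

# Model B charges half as much per token but generates far more tokens:
print(cost_per_10k_prompts(400, 4e-6))  # model A: $4/M tokens -> 16.0 USD
print(cost_per_10k_prompts(900, 2e-6))  # model B: $2/M tokens -> 18.0 USD
```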
A reproduction of 1,760 questions from the MMLU (Massive Multitask Language Understanding) benchmark.