Average of BaFC A, BaFC B, BaFC C, and BaFC D.
Back-and-forth conversations: One piece of information is provided, followed by a question about it.
Back-and-forth conversations: One piece of information is provided, followed by a question about it. The information involves neither personal data nor common knowledge, which forces the model to reason from the conversation itself and removes privacy concerns.
Back-and-forth conversations: The first message provides several details; the next two messages change the subject, which may "confuse" the model, and a question about the first message is then asked.
Back-and-forth conversations: The first message provides several details; the next two messages change the subject, which may "confuse" the model, and a question about the first message is then asked. The information involves neither personal data nor common knowledge, which forces the model to reason from the conversation itself and removes privacy concerns.
Average of Tool A and Tool B.
Straightforward math questions that could be solved with a calculator.
Implied math questions: the possible need for a calculator must be inferred.
Provides a conversation starter in a given language; the model must respond in that language.
A weighted average in which RTTFC carries 10 times the weight of ARTTTC and ACRTFO.
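As a minimal sketch, assuming the three sub-scores (defined below) are already expressed as 0-100 percentages; the 10/1/1 weights come from the description above, while the function name is illustrative:

```python
def streaming_score(rttfc: float, artttc: float, acrtfo: float) -> float:
    """Weighted average of the three streaming sub-scores (0-100 each).
    RTTFC carries 10x the weight of ARTTTC and ACRTFO."""
    return (10 * rttfc + 1 * artttc + 1 * acrtfo) / (10 + 1 + 1)

# A model that streams its first character quickly dominates the score:
print(streaming_score(rttfc=95, artttc=40, acrtfo=50))  # -> ~86.67
```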
How fast was the first character received compared to the total request time? If the first character arrives rapidly relative to the request's entire duration, streaming is efficient. Conversely, a very long time to first character suggests an absence of streaming.
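One plausible formulation (an assumption; the exact formula is not stated) scores the share of the request that elapses after the first character arrives:

```python
def rttfc_score(time_to_first_char: float, total_request_time: float) -> float:
    """Hypothetical RTTFC: 100% if the first character arrives instantly,
    near 0% if it arrives only at the very end (no streaming)."""
    if total_request_time <= 0:
        return 0.0
    return 100.0 * (1.0 - time_to_first_char / total_request_time)

print(rttfc_score(0.2, 8.0))  # first character at 0.2s of an 8s request -> 97.5
```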
Average speed of the slowest 10% of streaming events compared to the total request duration. A high number means the model delivers partial characters at high speed even at its slowest, creating the feeling of "live typing." A low number can make the model appear to "freeze."
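A sketch of one way to compute this, assuming each stream event's arrival time and character count are logged (the per-event speed definition here is an assumption):

```python
def slowest_decile_ratio(event_times: list[float], event_chars: list[int]) -> float:
    """Hypothetical sketch: per-event speed (characters divided by the gap
    since the previous event), averaged over the slowest 10% of events,
    relative to the whole request's average speed. Low values correspond
    to visible "freezes" in the stream."""
    speeds, prev = [], 0.0
    for t, n in zip(event_times, event_chars):
        gap = t - prev
        if gap > 0:
            speeds.append(n / gap)
        prev = t
    if not speeds:
        return 0.0
    speeds.sort()
    slowest = speeds[: max(1, len(speeds) // 10)]  # slowest 10% of events
    overall = sum(event_chars) / event_times[-1]   # whole-request chars/sec
    return (sum(slowest) / len(slowest)) / overall
```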
Average characters per stream event compared to the final output's character count. Models with a higher number produce text more quickly, simulating "live typing." A high ACRTFO score is irrelevant if RTTFC and ARTTTC are low: receiving 10 stream events after a 5-second delay doesn't create the feeling of streaming.
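Read literally, the ratio could be sketched like this (heavily hedged; the scoring built around it is not published):

```python
def acrtfo(event_chars: list[int]) -> float:
    """Hypothetical ACRTFO: average characters received per stream event,
    as a fraction of the final output's character count."""
    total = sum(event_chars)
    if not event_chars or total == 0:
        return 0.0
    return (total / len(event_chars)) / total
```

Under this literal reading the ratio reduces to 1 / (number of stream events); the published scoring may transform it differently.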
The highest CPS sets the perfect score at 100%. Scores for each model are then compared to this number.
Character count / completion time in seconds for the request.
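A sketch of the two steps described above: raw CPS per request, then relative scoring against the fastest model:

```python
def cps(char_count: int, completion_seconds: float) -> float:
    """Raw characters per second for a single request."""
    return char_count / completion_seconds

def cps_scores(model_cps: dict[str, float]) -> dict[str, float]:
    """The fastest model sets the perfect score at 100%; every other
    model is scored relative to it."""
    best = max(model_cps.values())
    return {name: 100.0 * value / best for name, value in model_cps.items()}

print(cps_scores({"model_a": 180.0, "model_b": 90.0}))
# -> {'model_a': 100.0, 'model_b': 50.0}
```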
Weighted average of Input Price (1x), Output Price (2x), Average Number of Output Tokens (1x), and Average Cost per 10,000 Prompts (3x).
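As a minimal sketch, assuming each of the four components is already a 0-100 percentage (the function name is illustrative; the 1x/2x/1x/3x weights come from the description above):

```python
def cost_efficiency(input_price_score: float, output_price_score: float,
                    output_tokens_score: float, cost_per_10k_score: float) -> float:
    """Weighted average with the 1x/2x/1x/3x weights described above."""
    weighted = (1 * input_price_score + 2 * output_price_score
                + 1 * output_tokens_score + 3 * cost_per_10k_score)
    return weighted / (1 + 2 + 1 + 3)
```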
Input pricing is normalized to USD per token. The lower the price, the higher the model's percentage score in Cost Efficiency.
Output pricing is normalized to USD per token. The lower the price, the higher the model's percentage score in Cost Efficiency.
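One common way to turn "lower price = higher score" into a percentage (an assumption; the exact normalization is not stated) is to score each model against the cheapest one:

```python
def price_scores(usd_per_token: dict[str, float]) -> dict[str, float]:
    """Hypothetical normalization: the cheapest model scores 100%;
    each other model scores cheapest / price."""
    cheapest = min(usd_per_token.values())
    return {name: 100.0 * cheapest / price for name, price in usd_per_token.items()}

print(price_scores({"model_a": 2e-6, "model_b": 8e-6}))
# -> {'model_a': 100.0, 'model_b': 25.0}
```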
The number of tokens a model generates matters; a seemingly cheaper model might cost more if it produces too many tokens. For instance, a model half as expensive as another might not save you money if it generates twice as many tokens for the same prompt.
Average Output Tokens × 10,000 × USD per Output Token
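A worked example with assumed numbers, which also illustrates the caveat above about output-token counts: the model with the lower per-token price can still cost more overall.

```python
def cost_per_10k_prompts(avg_output_tokens: float, usd_per_output_token: float) -> float:
    """Average Output Tokens x 10,000 x USD per Output Token."""
    return avg_output_tokens * 10_000 * usd_per_output_token

# Model B charges half as much per token but generates far more tokens:
print(cost_per_10k_prompts(400, 4e-6))  # model A: $4/M tokens -> 16.0 USD
print(cost_per_10k_prompts(900, 2e-6))  # model B: $2/M tokens -> 18.0 USD
```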
A reproduction of 1,760 questions from the MMLU (Massive Multitask Language Understanding) benchmark.