Ayumi Benchmark Version 3

For more information about the various scores and how this benchmark works please look at the original page: Ayumi's LLM Role Play & ERP Ranking - https://rentry.co/ayumi_erp_rating. If you find the benchmark data useful and want to contribute: You can donate via Ko-fi - https://ko-fi.com/weicon

Not satisfied with this benchmark? Check out this site with community-driven ratings and reviews of LLMs: BestERP - https://besterp.ai/

Filter Guide

Syntax:
filter-expr ::= or-expr
or-expr ::= (and-expr)+ | (and-expr)+ "|" or-expr
and-expr ::= negative-match | match
# a match word with a "-" in front:
negative-match ::= "-" match
# every character except " " or "|"
match ::= [^| ]+

Example LLaMA-2 Models: "L2 | llama 2"

Example Airoboros GPT4 models, excluding version 1.4: "airoboros GPT4 -1.4"
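To make the filter syntax more concrete, here is a minimal Python sketch of how such an expression could be evaluated against a model name. It is not the site's actual implementation. It assumes that each match word is a case-insensitive substring test, that words within one group must all hold, and that groups separated by "|" are alternatives. The function names `parse_filter` and `matches` and the model names in the usage note are made up for illustration.

```python
def parse_filter(expr: str) -> list[list[tuple[bool, str]]]:
    """Parse a filter expression into OR-groups of (negated, word) terms."""
    groups = []
    # "|" separates alternatives (or-expr).
    for part in expr.split("|"):
        terms = []
        # Whitespace separates the terms of one and-expr.
        for token in part.split():
            # A leading "-" marks a negative match; a lone "-" is kept as a literal word.
            negated = token.startswith("-") and len(token) > 1
            word = token[1:] if negated else token
            terms.append((negated, word.lower()))
        if terms:
            groups.append(terms)
    return groups


def matches(expr: str, name: str) -> bool:
    """Return True if the model name satisfies the filter expression."""
    groups = parse_filter(expr)
    if not groups:
        return True  # assumption: an empty filter hides nothing
    haystack = name.lower()  # assumption: matching ignores case
    return any(
        # A positive term must occur in the name, a negative term must not.
        all((word in haystack) != negated for negated, word in terms)
        for terms in groups
    )
```

Under these assumptions, `matches("L2 | llama 2", "MythoMax-L2-13B")` holds because of the first alternative, while `matches("airoboros GPT4 -1.4", "airoboros-13b-gpt4-1.4")` fails because the name contains the excluded word "1.4".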

Column Description

Column	Description
ALC-IQ3	The third version of the ALC-IQ. It tries to determine how well a model understands a character card. Higher is better; the best possible score is 100.
IQ Entropy	Not part of the ranking. It is the (normalized) entropy of the Yes/No answer probabilities, which makes it a slightly different measure of the ALC-IQ3.
ERP3 Score	The average ratio of lewd words to total words in a response. Higher is better.
Var Score	The lewd word variety score. It counts how many different lewd words occur across all ERP responses.
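The column descriptions can be read as simple formulas. The Python sketch below is one possible interpretation, not the benchmark's actual code. It assumes that the entropy is taken in bits over renormalized Yes/No probabilities (so it already lies between 0 and 1), that responses are split into lowercase words with a simple regular expression, and that the ERP3 Score is a plain mean of per-response ratios without further scaling. The word list passed in as `lewd_words` is a placeholder, since the real list is not given here.

```python
import math
import re


def normalized_binary_entropy(p_yes: float, p_no: float) -> float:
    """Entropy of a Yes/No distribution in bits; the result lies in [0, 1]."""
    total = p_yes + p_no
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    entropy = 0.0
    # Renormalize so the two probabilities sum to 1 (assumption).
    for p in (p_yes / total, p_no / total):
        if p > 0:
            entropy -= p * math.log2(p)
    return entropy


def tokenize(text: str) -> list[str]:
    """Split text into lowercase words (hypothetical tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def erp_score(responses: list[str], lewd_words: set[str]) -> float:
    """Mean over responses of (listed words / total words)."""
    ratios = []
    for response in responses:
        words = tokenize(response)
        if words:  # skip empty responses to avoid dividing by zero
            ratios.append(sum(w in lewd_words for w in words) / len(words))
    return sum(ratios) / len(ratios) if ratios else 0.0


def variety_score(responses: list[str], lewd_words: set[str]) -> int:
    """Number of distinct listed words seen across all responses."""
    seen = set()
    for response in responses:
        seen.update(w for w in tokenize(response) if w in lewd_words)
    return len(seen)
```

With a placeholder list such as `{"alpha", "beta"}`, the responses `"alpha beta gamma delta"` and `"alpha alpha"` have ratios 0.5 and 1.0, so `erp_score` is their mean and `variety_score` counts two distinct listed words. A model that is certain of its Yes/No answer has an entropy near 0, and one that is undecided has an entropy near 1.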

Rank ERP3 Response Link Size Q ALC-IQ3 IQ3 Entropy ERP3 Score ERP3 Variety
