
GPTQ LLM Leaderboard Report #1

With tons of open-source large language models coming out every single day, it's hard to keep track.

hgloow

Jul 6, 2023 • 1 min read

Image generated by Stable Diffusion

In this blog series, I'll be sharing performance results for a bunch of LLMs that were assessed using EleutherAI's Language Model Evaluation Harness. The models were mainly sourced from HuggingFace. I also record all of the results in an Excel spreadsheet. Let me know which GPTQ or 4-bit models I should test for the next LLM blog post.

Tasks

I'm using the standard GPT4All set of tasks (zero-shot), which includes:

BoolQ
PIQA
HellaSwag
WinoGrande
ARC-e and ARC-c (AI2 Reasoning Challenge, Easy and Challenge sets)
OBQA (OpenBook QA)

Arguments

python main.py \
    --model hf-causal-experimental \
    --model_args pretrained=<model_path>,quantized=<model_name>,gptq_use_triton=True,trust_remote_code=True \
    --tasks boolq,piqa,winogrande,arc_easy,arc_challenge,openbookqa,hellaswag \
    --device cuda:0 --no_cache
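
When testing several models in a row, it can help to generate this command from a small script instead of editing it by hand. Below is a minimal sketch that only builds and prints the command using the same flags as above. The function name `build_command`, the model path, and the quantized file name are hypothetical placeholders of mine, not part of the harness.

```python
import shlex

# Task names exactly as passed to --tasks in the command above.
TASKS = [
    "boolq", "piqa", "winogrande", "arc_easy",
    "arc_challenge", "openbookqa", "hellaswag",
]


def build_command(model_path, quantized_file, device="cuda:0"):
    """Return the harness command as an argument list (hypothetical helper)."""
    model_args = ",".join([
        f"pretrained={model_path}",
        f"quantized={quantized_file}",
        "gptq_use_triton=True",
        "trust_remote_code=True",
    ])
    return [
        "python", "main.py",
        "--model", "hf-causal-experimental",
        "--model_args", model_args,
        "--tasks", ",".join(TASKS),
        "--device", device,
        "--no_cache",
    ]


# Placeholder entries: replace with real local paths and file names.
models = [("models/example-13B-GPTQ", "example-13B-GPTQ.safetensors")]

for path, quantized in models:
    # Print the command so it can be reviewed before running it.
    print(shlex.join(build_command(path, quantized)))
```

To actually launch a run, the returned list could be passed to `subprocess.run(cmd, check=True)` from inside the harness checkout.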

Models tested

Results

Model	Average	BoolQ	PIQA	HellaSwag	WinoGrande	ARC-e	ARC-c	OBQA
TheBloke/Manticore-13B-GPTQ	64.1	80	79.8	59.6	72.4	77.9	45.7	33
TheBloke/Nous-Hermes-13B-GPTQ	64.8	79.9	78.9	60.9	71.2	77.8	48.4	36.6
TheBloke/tulu-13B-GPTQ	64.9	84.04	79	60.27	72.77	77.1	47.35	34
TheBloke/guanaco-13B-GPTQ	65	80.2	79.1	61.8	72.6	76.3	46.9	37.8
digitous/13B-HyperMantis_GPTQ_4bit-128g	66	82.6	80	61.5	73.3	79.1	49.1	36.2
TheBloke/guanaco-33B-GPTQ	66.9	81.7	80.6	63.3	74.2	80	51.3	37
MetaIX/GPT4-X-Alpaca-30B-4bit	67.2	83.24	81.12	63.49	74.35	80.3	52.47	35.4
TheBloke/tulu-30B-GPTQ	67.8	85.9	80.41	62.78	75.3	80.98	52.22	37.2
TheBloke/SuperPlatty-30B-GPTQ	68.8	87	80.96	62.56	74.11	83.29	56.31	37.6
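
The Average column is consistent with an unweighted mean of the seven task scores, rounded to one decimal place. Here is a minimal sketch to reproduce it for one row. The function name `average_score` is my own, and the scores are copied from the Manticore row above.

```python
# Task order follows the table header.
TASKS = ["BoolQ", "PIQA", "HellaSwag", "WinoGrande", "ARC-e", "ARC-c", "OBQA"]


def average_score(scores):
    """Unweighted mean of the per-task scores, rounded to one decimal."""
    if len(scores) != len(TASKS):
        raise ValueError(f"expected {len(TASKS)} scores, got {len(scores)}")
    return round(sum(scores) / len(scores), 1)


# Scores for TheBloke/Manticore-13B-GPTQ, in header order.
manticore = [80, 79.8, 59.6, 72.4, 77.9, 45.7, 33]
print(average_score(manticore))  # prints 64.1, matching the table
```

Because every task gets equal weight, a low-scoring task like OBQA pulls the average down as much as any other, regardless of how many examples each benchmark contains.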