GPTQ LLM Leaderboard Report #1

With new open-source large language models being released every single day, it's hard to keep track of them all.

Image generated by Stable Diffusion

In this blog series, I'll be sharing benchmark results for a range of LLMs evaluated with EleutherAI's Language Model Evaluation Harness. The models are mainly sourced from Hugging Face. I also record all of the results in my Excel sheet. Let me know which GPTQ or 4-bit models I should test for the next LLM post.

Tasks

I'm using the GPT4All standard set of tasks (zero-shot), which includes:

  • BoolQ
  • PIQA
  • HellaSwag
  • WinoGrande
  • ARC-e and ARC-c (AI2's Reasoning Challenge)
  • OBQA (OpenBookQA)

Arguments

python main.py \
    --model hf-causal-experimental \
    --model_args pretrained=<model_path>,quantized=<model_name>,gptq_use_triton=True,trust_remote_code=True \
    --tasks boolq,piqa,winogrande,arc_easy,arc_challenge,openbookqa,hellaswag \
    --device cuda:0 --no_cache
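
For concreteness, here is how that invocation looks for one of the models in the table below. The quantized= argument should point to the quantized checkpoint file inside the repo; the filename shown here is only an illustrative assumption, so check the model card for the actual name.

python main.py \
    --model hf-causal-experimental \
    --model_args pretrained=TheBloke/guanaco-13B-GPTQ,quantized=gptq_model-4bit-128g.safetensors,gptq_use_triton=True,trust_remote_code=True \
    --tasks boolq,piqa,winogrande,arc_easy,arc_challenge,openbookqa,hellaswag \
    --device cuda:0 --no_cache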

Models tested

The models evaluated are listed in the Results table below; all are GPTQ or 4-bit quantized checkpoints from the Hugging Face Hub.

Results

| Model | Average | BoolQ | PIQA | HellaSwag | WinoGrande | ARC-e | ARC-c | OBQA |
|---|---|---|---|---|---|---|---|---|
| TheBloke/Manticore-13B-GPTQ | 64.1 | 80 | 79.8 | 59.6 | 72.4 | 77.9 | 45.7 | 33 |
| TheBloke/Nous-Hermes-13B-GPTQ | 64.8 | 79.9 | 78.9 | 60.9 | 71.2 | 77.8 | 48.4 | 36.6 |
| TheBloke/tulu-13B-GPTQ | 64.9 | 84.04 | 79 | 60.27 | 72.77 | 77.1 | 47.35 | 34 |
| TheBloke/guanaco-13B-GPTQ | 65 | 80.2 | 79.1 | 61.8 | 72.6 | 76.3 | 46.9 | 37.8 |
| digitous/13B-HyperMantis_GPTQ_4bit-128g | 66 | 82.6 | 80 | 61.5 | 73.3 | 79.1 | 49.1 | 36.2 |
| TheBloke/guanaco-33B-GPTQ | 66.9 | 81.7 | 80.6 | 63.3 | 74.2 | 80 | 51.3 | 37 |
| MetaIX/GPT4-X-Alpaca-30B-4bit | 67.2 | 83.24 | 81.12 | 63.49 | 74.35 | 80.3 | 52.47 | 35.4 |
| TheBloke/tulu-30B-GPTQ | 67.8 | 85.9 | 80.41 | 62.78 | 75.3 | 80.98 | 52.22 | 37.2 |
| TheBloke/SuperPlatty-30B-GPTQ | 68.8 | 87 | 80.96 | 62.56 | 74.11 | 83.29 | 56.31 | 37.6 |
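
For reference, the Average column is the unweighted mean of the seven zero-shot task scores (each reported as accuracy, %). A minimal sketch in Python, using the TheBloke/guanaco-13B-GPTQ row from the table as input:

# Per-task accuracy scores for TheBloke/guanaco-13B-GPTQ (from the table above)
scores = {
    "BoolQ": 80.2,
    "PIQA": 79.1,
    "HellaSwag": 61.8,
    "WinoGrande": 72.6,
    "ARC-e": 76.3,
    "ARC-c": 46.9,
    "OBQA": 37.8,
}

# Unweighted mean across the seven tasks
average = sum(scores.values()) / len(scores)
print(f"Average: {average:.1f}")  # Average: 65.0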