Omni-Judge

Introduction

Omni-Judge is an open-source mathematical evaluation model designed to assess whether a solution generated by a model is correct given a problem and a standard answer. Due to the complexity of high-level mathematical problems and their solutions, designing rule-based evaluation methods can be challenging. Omni-Judge, similar to GPT-4-as-a-judge, offers automated assessment with greater efficiency and lower cost. For utilization details, please refer to this section.

Omni-Judge can be applied to various mathematical reasoning benchmarks, such as our proposed Omni-MATH.

Model Details

Omni-Judge builds on the meta-llama/Meta-Llama-3-8B-Instruct, incorporating GPT-4o evaluation data for instruction tuning. The training dataset comprises 21,451 examples, with a total of 2 epochs. Omni-Judge's performance is closely aligned with GPT-4o. We created an internal evaluation set using queries not previously seen by the model, consisting of 2,690 test samples. The agreement rate between Omni-Judge and GPT-4o evaluations is approximately 95%.

Quickstart

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "KbsdJames/Omni-Judge", 
    device_map="auto", 
    torch_dtype=torch.bfloat16, 
)
tokenizer = AutoTokenizer.from_pretrained(
    "KbsdJames/Omni-Judge", 
    trust_remote_code=True
)

# set terminators for decoding
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

question = "Compute the smallest positive integer $n$ for which $\\sqrt{100+\\sqrt{n}}+\\sqrt{100-\\sqrt{n}}$ is an integer."
reference_answer = "6156"
student_solution = "Let's approach this step-by-step:\n\n1) Let's denote $\\sqrt{100+\\sqrt{n}}+\\sqrt{100-\\sqrt{n}} = k$, where k is an integer.\n\n2) Square both sides:\n   $(\\sqrt{100+\\sqrt{n}}+\\sqrt{100-\\sqrt{n}})^2 = k^2$\n\n3) Expand the left side:\n   $100+\\sqrt{n} + 100-\\sqrt{n} + 2\\sqrt{(100+\\sqrt{n})(100-\\sqrt{n})} = k^2$\n\n4) Simplify:\n   $200 + 2\\sqrt{10000-n} = k^2$\n\n5) Rearrange:\n   $\\sqrt{10000-n} = \\frac{k^2-200}{2}$\n\n6) For this to be true, $\\frac{k^2-200}{2}$ must be a perfect square. Let's call it $m^2$:\n   $\\frac{k^2-200}{2} = m^2$\n\n7) Multiply both sides by 2:\n   $k^2-200 = 2m^2$\n\n8) Rearrange:\n   $k^2 = 2m^2 + 200$\n\n9) For the smallest positive n, we need the smallest k > 14 (since $\\sqrt{100+\\sqrt{n}}+\\sqrt{100-\\sqrt{n}} > 2\\sqrt{100} = 20$) that makes this equation true.\n\n10) Testing values:\n    For k = 15: $15^2 = 225 = 2m^2 + 200$\n                $25 = 2m^2$\n                $m^2 = 12.5$ (not an integer)\n\n    For k = 16: $16^2 = 256 = 2m^2 + 200$\n                $56 = 2m^2$\n                $m^2 = 28$ (integer)\n\n11) So, the smallest valid k is 16, and m = $\\sqrt{28}$\n\n12) Now, going back to step 5:\n    $\\sqrt{10000-n} = \\frac{16^2-200}{2} = 28$\n\n13) Solve for n:\n    $10000-n = 28^2 = 784$\n    $n = 10000 - 784 = 9216$\n\nTherefore, the smallest positive integer n for which $\\sqrt{100+\\sqrt{n}}+\\sqrt{100-\\sqrt{n}}$ is an integer is 9216."

# pre-process
formatted_context = tokenizer.get_context(
    question,
    reference_answer,
    student_solution,
)
model_inputs = tokenizer(formatted_context, return_tensors="pt")
input_ids = model_inputs["input_ids"]
attention_mask = model_inputs["attention_mask"]

# do inference
pred = model.generate(
    input_ids=input_ids.to(model.device),
    attention_mask=attention_mask.to(model.device),
    do_sample = False,
    num_return_sequences = 1,
    max_new_tokens = 300,
)[0].cpu().tolist()

# post-process
pred = pred[len(input_ids[0].cpu().tolist()):]
for terminator in terminators:
    if terminator in pred:
        pred = pred[:pred.index(terminator)]
response = tokenizer.decode(pred, skip_special_tokens=True)
pred_truth = tokenizer.parse_response(response)

# if response parsing fails, the answer/judgement/justification will be None,
# which we consider as errors in prediction. 
# in this case, using multiple sampling may help.

print("answer:", pred_truth["answer"])
# >>> answer: 9216
print("judgement:", pred_truth["judgement"])
# >>> judgement: FALSE
print("justification:", pred_truth["justification"])
# >>> justification: The student's answer of 9216 is incorrect in the context of the problem, which asks for the smallest positive integer $\\(n\\)$ for which $\\(\\sqrt{100+\\sqrt{n}}+\\sqrt{100-\\sqrt{n}}\\)$ is an integer. The reference answer is 6156. The student's solution incorrectly calculates the value of  $n$  by incorrectly identifying the smallest integer value of  $k$  and then incorrectly solving for  $n$ . The student's approach does not accurately capture the correct value of  $n$ , which is 6156, as indicated by the reference answer. Therefore, the student's answer does not share the same meaning as the reference answer.

Evaluation

Given GPT-4o judgement as the golden results, we report the performance of Omni-Judge.

For a fair comparison, the questions for train and test are different.

The results are shown below:

Source	Success of Parsing	Consistency
deepseek-coder-v2-lite-instruct	100	95.08
deepseek-math-7b-RL	99.55	94.20
mathqwen-7b-Instruct	100	95.32
mathqwen-72b-Instruct	99.78	94.65
GPT-4o	100	94.87
claude_sonnet-3-5	100	93.54
All	99.89	94.61

Citation

If you find our work helpful, feel free to give a star to our repo.