Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: CLEVER: A Benchmark for Verified Code Generation in Lean

Amitayush Thakur (May 23 2025 at 21:07):

CLEVER: A Benchmark for Verified Code Generation in Lean

While Lean has a rich tradition in formal mathematics, its use for verified program synthesis is still emerging. As interest grows in connecting language models with formal reasoning, we believe there's value in having a benchmark that tests the ability to generate not just Lean code, but fully verified programs—starting from natural language.

We’re introducing CLEVER, a benchmark of 161 programming tasks designed to study end-to-end, machine-verified code generation in Lean.

Each task follows a two-phase process:

Specification Certification: From a natural language prompt, generate a Lean specification and prove it equivalent to a hidden reference specification.
Implementation Certification: Given the verified specification, generate a Lean implementation and prove that it satisfies the spec.

All verification is performed by Lean’s type checker, with no external evaluation. A key benchmark design choice is that all specifications are written as non-computable propositions. This prevents direct execution or leakage of implementation logic into the spec, and ensures models must engage in genuine symbolic reasoning rather than mirroring or approximating behavior.

We release baseline results using large language models and a neuro-symbolic proof agent to help frame the current landscape.

Resources:
• Paper: https://arxiv.org/abs/2505.13938
• Code & Benchmark: https://github.com/trishullab/clever
• Dataset: https://huggingface.co/datasets/amitayusht/clever
• Proof Agent (COPRA): https://github.com/trishullab/clever-prover

Thanks to co-authors Jasper Lee, George Tsoukalas, Meghana Sistla, Matthew Zhao, Stefan Zetzsche, Greg Durrett, Yisong Yue, and Swarat Chaudhuri. We hope CLEVER is a helpful step toward richer verified synthesis in Lean. We’d love to hear your thoughts, ideas, or suggestions.

Last updated: Feb 28 2026 at 14:05 UTC