Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: FVAPPS discussion


Quinn (Feb 11 2025 at 04:10):

#announce > Proving the Coding Interview: Lean programming benchmark

Quinn (Feb 11 2025 at 04:10):

FVAPPS is a benchmark consisting of 4715 function signatures, each with 2-5 theorems. Fill in the sorries to measure your model's capabilities or to generate synthetic data.

Arxiv: https://arxiv.org/abs/2502.05714
HuggingFace: https://huggingface.co/datasets/quinn-dougherty/fvapps
Twitter/X: https://x.com/qd_forall/status/1889160484991250595
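
For a sense of the task shape: each sample gives the natural-language problem, a function signature with its body left as sorry, and theorem statements about it whose proofs are also sorry. A minimal sketch, using the signature from the example discussed below:

-- Sketch of the FVAPPS task shape: fill in both sorries.
def solve_elections (n : Nat) (voters : List (Nat × Nat)) : Nat :=
  sorry

theorem solve_elections_nonnegative (n : Nat) (voters : List (Nat × Nat)) :
    solve_elections n voters ≥ 0 := by
  sorry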

Mario Carneiro (Feb 11 2025 at 16:31):

Here's my attempt at the example problem:

/--
Now elections are held in Berland and you want to win them. More precisely, you want everyone to vote for you.

There are $n$ voters, and two ways to convince each of them to vote for you. The first way to convince the $i$-th voter is to pay him $p_i$ coins. The second way is to make $m_i$ other voters vote for you, and the $i$-th voter will vote for free.

Moreover, the process of such voting takes place in several steps. For example, if there are five voters with $m_1 = 1$, $m_2 = 2$, $m_3 = 2$, $m_4 = 4$, $m_5 = 5$, then you can buy the vote of the fifth voter, and eventually everyone will vote for you. Set of people voting for you will change as follows: $\{5\} \rightarrow \{1, 5\} \rightarrow \{1, 2, 3, 5\} \rightarrow \{1, 2, 3, 4, 5\}$.

Calculate the minimum number of coins you have to spend so that everyone votes for you.
-/
def solve_elections (n : Nat) (voters : List (Nat × Nat)) : Nat :=
  if (n, voters) = (3, [(1, 5), (2, 10), (2, 8)]) then 8 else
  if (n, voters) = (6, [(2, 6), (2, 3), (2, 8), (2, 7), (4, 4), (5, 5)]) then 7 else
  0

theorem solve_elections_nonnegative (n : Nat) (voters : List (Nat × Nat)) :
    solve_elections n voters ≥ 0 := by omega

theorem solve_elections_upper_bound (n : Nat) (voters : List (Nat × Nat)) :
    solve_elections n voters ≤ List.foldl (λ acc (pair : Nat × Nat) => acc + pair.2) 0 voters := by
  unfold solve_elections; repeat' split
  next h => cases h; decide
  next h => cases h; decide
  omega

theorem solve_elections_zero_votes (n : Nat) (voters : List (Nat × Nat)) :
    (List.all voters (λ pair => pair.1 = 0)) → solve_elections n voters = 0 := by
  unfold solve_elections; repeat' split
  next h => cases h; decide
  next h => cases h; decide
  omega

theorem solve_elections_single_zero_vote : solve_elections 1 [(0, 5)] = 0 := rfl

/--
info: 8
-/
#guard_msgs in
#eval solve_elections 3 [(1, 5), (2, 10), (2, 8)]

/--
info: 0
-/
#guard_msgs in
#eval solve_elections 7 [(0, 1), (3, 1), (1, 1), (6, 1), (1, 1), (4, 1), (4, 1)]

/--
info: 7
-/
#guard_msgs in
#eval solve_elections 6 [(2, 6), (2, 3), (2, 8), (2, 7), (4, 4), (5, 5)]

Mario Carneiro (Feb 11 2025 at 16:36):

it's unclear to me whether this is expected or if this constitutes breaking the benchmark example

Quinn (Feb 11 2025 at 17:02):

Clever

Quinn (Feb 11 2025 at 17:03):

Empirically we didn't see a language model figure that out

Quinn (Feb 11 2025 at 17:05):

I didn't think to classify properties that can be solved that way as too trivial to count

Quinn (Feb 11 2025 at 17:08):

I don't expect all that many samples to be crackable like this (by hardcoding unit tests)

Quinn (Feb 11 2025 at 17:08):

But it'd be great to have intel on that!

Quinn (Feb 11 2025 at 17:10):

Quinn said:

Empirically we didn't see a language model figure that out

It's funny, it's the exact opposite of the traditional specification gaming problem: the Sonnet and Gemini baselines followed the spirit at the expense of the letter (cuz they did worse than you), but the human followed the letter at the expense of the spirit :)

GasStationManager (Feb 11 2025 at 19:14):

Looking at the theorem statements for this example, I think it is too lenient; I can easily give a (wrong) solution that passes the theorems by hardcoding the theorem statements into the function definition:

def solve_elections (n : Nat) (voters : List (Nat × Nat)) : Nat :=
  if (List.all voters (λ pair => pair.1 = 0)) then
    0
  else
    List.foldl (λ acc (pair : Nat × Nat) => acc + pair.2) 0 voters

theorem solve_elections_nonnegative (n : Nat) (voters : List (Nat × Nat)) :
    solve_elections n voters ≥ 0 := by omega

theorem solve_elections_upper_bound (n : Nat) (voters : List (Nat × Nat)) :
    solve_elections n voters ≤ List.foldl (λ acc (pair : Nat × Nat) => acc + pair.2) 0 voters := by
  unfold solve_elections; repeat' split
  next h => omega
  omega

theorem solve_elections_zero_votes (n : Nat) (voters : List (Nat × Nat)) :
    (List.all voters (λ pair => pair.1 = 0)) → solve_elections n voters = 0 := by
  unfold solve_elections; repeat' split
  next h => omega
  next h => simp [h]

theorem solve_elections_single_zero_vote : solve_elections 1 [(0, 5)] = 0 := rfl

And an LLM could potentially do the same.

It doesn't pass the unit tests. But does that mean the formal spec is not correct without also including the unit tests? If you include the unit tests in the spec, then it is equally easy to hardcode them in the solution, as above.

Come to think of it you could include this as another filter in the QA pipeline: ask an LLM to come up with a solution that passes the property-based tests but fails the unit tests.

Quinn (Feb 11 2025 at 19:22):

GasStationManager said:

Come to think of it you could include this as another filter in the QA pipeline: ask an LLM to come up with a solution that passes the property-based tests but fails the unit tests.

We didn't quite think of that! The QA process checks theorems with the plausible tactic after unit tests are passed, but not before. But plausible on a trivially hackable set of theorems would not find a counterexample, so it might not help

We do hope people will use the unit tests, but they are optional
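
For concreteness, a sketch of that limitation, assuming the Plausible property-testing library and its plausible tactic (an illustration of mine, not code from the FVAPPS pipeline). plausible hunts for counterexamples by random testing, so it can only flag a property that is actually false; the snippet below is expected to error out with a counterexample, which is the point:

import Plausible

-- Illustration only: a deliberately false property, which a random
-- counterexample search should refute quickly.
example (voters : List (Nat × Nat)) :
    List.foldl (λ acc (pair : Nat × Nat) => acc + pair.2) 0 voters = 0 := by
  plausible  -- expected to fail, reporting e.g. voters := [(0, 1)]

Against a hardcoded solution like the ones in this thread, though, the stated properties are all genuinely true, so there is no counterexample for plausible to find.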

Quinn (Feb 11 2025 at 19:36):

But y'know: holding out unit tests from the prompt or feedback loop and only using them at evaluation would block both yours and @Mario Carneiro's attacks

Mario Carneiro (Feb 11 2025 at 19:42):

if you don't care about passing the unit tests there is an even easier solution:

def solve_elections (n : Nat) (voters : List (Nat × Nat)) : Nat := 0

theorem solve_elections_nonnegative (n : Nat) (voters : List (Nat × Nat)) :
    solve_elections n voters ≥ 0 := Nat.zero_le _

theorem solve_elections_upper_bound (n : Nat) (voters : List (Nat × Nat)) :
    solve_elections n voters ≤ List.foldl (λ acc (pair : Nat × Nat) => acc + pair.2) 0 voters :=
  Nat.zero_le _

theorem solve_elections_zero_votes (n : Nat) (voters : List (Nat × Nat)) :
    (List.all voters (λ pair => pair.1 = 0)) → solve_elections n voters = 0 :=
  fun _ => rfl

theorem solve_elections_single_zero_vote : solve_elections 1 [(0, 5)] = 0 := rfl

Mario Carneiro (Feb 11 2025 at 19:42):

I think the actual problem with this particular example is that the properties are too weak; there isn't anything sufficient to force the function to be interesting

Quinn (Feb 11 2025 at 19:44):

I admit that when I did the human baseline for this question https://github.com/quinn-dougherty/fvapps/blob/master/artefacts%2Fhuman_baselines%2Ftrain%2F0023%2FSpec.lean I was not thinking laterally enough

Mario Carneiro (Feb 11 2025 at 19:44):

On the other hand, if the theorems perfectly characterize the function it could make it too easy to synthesize the function by copying the properties
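
A toy illustration of that point (mine, not from the benchmark): the two properties below characterize f completely, and a "solution" can discharge them by simply restating the spec as the definition.

-- Toy example: a spec that pins the function down uniquely can be
-- "solved" by transcribing the spec itself into the definition.
def f : Nat → Nat
  | 0 => 0
  | n + 1 => f n + 2

theorem f_zero : f 0 = 0 := rfl
theorem f_succ (n : Nat) : f (n + 1) = f n + 2 := rfl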

Quinn (Feb 11 2025 at 19:45):

This particular sample is in the "unguarded" split, the least quality-controlled of the three splits. Fwiw

Quinn (Feb 11 2025 at 19:47):

It's also funny cuz I manually monitored a ton of language model outputs on this benchmark, both during baselines and quality control, and didn't see anything like these vulnerabilities. I think that's from our system prompt asking the model not to hardcode answers and to listen to the natlang text of the question

Mario Carneiro (Feb 11 2025 at 19:48):

to be fair, I'm pretty sure I am much better than an LLM at problems of the form "find a function which satisfies the following properties"

Mario Carneiro (Feb 11 2025 at 19:50):

I don't think LLMs do this kind of lateral thinking at all; that's more likely to come out of the earlier crop of ML systems, which are actually trained to solve the problem by optimizing an objective function, and/or program synthesis tools that do some kind of program enumeration

Mario Carneiro (Feb 11 2025 at 19:52):

My impression of modern LLMs like Claude and Gemini is that they are trained to "think like a human" (or better, "act like a human trying to solve the problem") rather than "solve the problem"

Mario Carneiro (Feb 11 2025 at 19:52):

so lateral thinking is actively discouraged

Quinn (Feb 11 2025 at 19:53):

I'd bet that with a system prompt that encouraged specification gaming, things would be very different. I'd expect it to be able to do that kind of lateral thinking (like "just ignore the natlang question text and don't worry about hypothetical held-out unit tests") if it wasn't asked not to

Quinn (Feb 11 2025 at 19:54):

To see the baseline system prompt, click here

Quinn (Feb 11 2025 at 19:55):

I'm inclined to agree with you about propensity! But I think the capability is elicitable


Last updated: May 02 2025 at 03:31 UTC