Zulip Chat Archive

Stream: Program verification

Topic: Bitvector


Xi Wang (Aug 16 2020 at 23:30):

Hi,

We have been working on a formal verification tool for the BPF just-in-time (JIT) compilers in the Linux kernel [1]. The tool is based on SMT solving, but axiomatizes a set of expensive bitvector operations (mul/udiv/urem) to scale verification. We prove the axiomatization separately in Lean. For example, the tool decomposes 64-bit multiplication using 32-bit operations (e.g., for JIT compilers on 32-bit architectures), and we have a theorem mul_decomp for that axiomatization [2].
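
Roughly, the arithmetic fact behind mul_decomp is that if the 64-bit operands are split as a = 2^32 * ah + al and b = 2^32 * bh + bl (with al, bl < 2^32), then

  (2^32 * ah + al) * (2^32 * bh + bl) ≡ al * bl + 2^32 * (al * bh + ah * bl)  (mod 2^64),

since the 2^64 * (ah * bh) term vanishes modulo 2^64.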

Just wanted to share the bitvector definitions and lemmas we have so far, and see if there's interest in providing a general bitvector library in mathlib.

The current implementation is:

https://github.com/uw-unsat/serval-bpf/tree/master/lean/src/bv

It largely follows the SMT-LIB standard.

We have also tried three other implementations before switching to the current one.

[1] https://github.com/uw-unsat/serval-bpf
[2] https://github.com/uw-unsat/serval-bpf/blob/master/lean/src/bvaxioms.lean

Simon Hudon (Aug 17 2020 at 01:21):

What would you say were the main pain points with data.bitvec?

Xi Wang (Aug 17 2020 at 03:59):

Simon Hudon said:

What would you say were the main pain points with data.bitvec?

The main reason is specification: data.bitvec defines operations using circuits, while we prefer the definitions to match those in the SMT-LIB standard:

http://smtlib.cs.uiowa.edu/theories-FixedSizeBitVectors.shtml

For example, add in data.bitvec is defined as an adder:

https://github.com/leanprover-community/lean/blob/master/library/data/bitvec.lean#L90

SMT-LIB defines it as: [[(bvadd s t)]] := nat2bv[m](bv2nat([[s]]) + bv2nat([[t]]))

https://github.com/uw-unsat/serval-bpf/blob/master/lean/src/bv/basic.lean#L164
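
In Lean terms that style is roughly the following (a sketch, assuming of_nat/to_nat that convert to and from nat modulo 2^n; the actual definition is at the link above):

  def add (v₁ v₂ : bv n) : bv n :=
    bv.of_nat n (v₁.to_nat + v₂.to_nat)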

For proofs, the current implementation represents a bitvector as fin n → bool, which makes induction a bit simpler than with data.bitvec, but not fundamentally different (e.g., we prove ring for bitvector neg, add, mul for both representations). In particular, data.bitvec stores the most significant bit at position 0 (rather than the least significant bit), and some proofs need to append a bit to the end of a vector for induction. Maybe reverse induction could simplify that.

Mario Carneiro (Aug 17 2020 at 04:03):

I have an implementation of bitvectors as well, which I needed for a project. I took the definition to be vector bool n, but with the bits in the opposite order from data.bitvec, and with operations defined via nat like you indicated

Xi Wang (Aug 17 2020 at 04:08):

Nice! We did exactly that in an earlier version (vector bool with the opposite order, operations defined over nat).

To further simplify induction (esp. to do induction over both n and the list of bits at the same time), we tweaked that into this inductive type (similar to vector in Coq):

https://github.com/uw-unsat/serval-bpf/blob/master/lean/attic/bitvec-ind/bv/basic.lean#L32
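
Roughly (illustrative only; the real definition is at the link):

  inductive bv : ℕ → Type
  | nil : bv 0
  | cons {n : ℕ} : bool → bv n → bv (n + 1)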

Mario Carneiro (Aug 17 2020 at 04:09):

https://github.com/digama0/mm0/blob/master/mm0-lean/x86/bits.lean

Mario Carneiro (Aug 17 2020 at 04:10):

lemmas in https://github.com/digama0/mm0/blob/master/mm0-lean/x86/lemmas.lean

Mario Carneiro (Aug 17 2020 at 04:14):

I think you can do better if you avoid doing induction at all, and instead use zmod to get all the properties. The dependent type approach means that you have to do induction for everything
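
For example, a minimal sketch of what I mean, using mathlib's zmod:

  import data.zmod.basic

  def bv (n : ℕ) := zmod (2 ^ n)

  -- the ring structure comes for free from zmod
  instance (n : ℕ) : comm_ring (bv n) := zmod.comm_ring _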

Mario Carneiro (Aug 17 2020 at 04:16):

But I think this indicates that it is worthwhile to get a bitvector library into mathlib, and data.bitvec is not an appropriate basis

Xi Wang (Aug 17 2020 at 04:16):

One thing we are still not sure about is whether to use nat or num for the underlying representation.

For example, if we use num for definitions:

  def add (v₁ v₂ : bv n) : bv n :=
    bv.of_num n (v₁.to_num + v₂.to_num)

Then we could do computational proofs:

  example : (0x1234 : bv 16) * (0x5678 : bv 16) = (0x60 : bv 16) := rfl

Using nat feels simpler for definitions, but computational proofs using rfl/dec_trivial won't work anymore. Ideally one could extend norm_num to support it, I guess? I vaguely remember someone asking about zmod a while back.

Mario Carneiro (Aug 17 2020 at 04:19):

You shouldn't rely on rfl for this. I actually do a decent amount of computation on concrete bitvectors in that file I linked, but I use typeclasses to transfer the problems to a more easily evaluated setup (see the reify function in x86/lemmas.lean)

Mario Carneiro (Aug 17 2020 at 04:20):

Basically, you can take a number like 0x1234 : bv 16 and reflect on the syntax to turn it into a list of bools, which can then be evaluated in the kernel
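
For illustration only (this is not the actual reify code), the evaluation side can be as simple as peeling bits off the underlying nat:

  -- the w low bits of a, least significant first
  def to_bits : ℕ → ℕ → list bool
  | 0       _ := []
  | (w + 1) a := a.bodd :: to_bits w a.div2

  #eval to_bits 16 0x1234   -- the 16 bits of 0x1234, LSB first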

Xi Wang (Aug 17 2020 at 04:21):

Mario Carneiro said:

I think you can do better if you avoid doing induction at all, and instead use zmod to get all the properties. The dependent type approach means that you have to do induction for everything

Right, we thought about using zmod. The issue is that there are two kinds of bitvector operations: the arithmetic ones (add, sub, mul, etc.) could be replaced by existing operations over zmod, but that's less clear for the bit operations (concat, extract), which manipulate bits directly. So we are stuck with fin n → bool for now.

Xi Wang (Aug 17 2020 at 04:21):

Mario Carneiro said:

But I think this indicates that it is worthwhile to get a bitvector library into mathlib, and data.bitvec is not an appropriate basis

Agreed! That would be great.

Mario Carneiro (Aug 17 2020 at 04:22):

Even better would be to extend norm_num, but that's tricky to do in a third party project

Mario Carneiro (Aug 17 2020 at 04:22):

Note that nat already has bit operations on it

Mario Carneiro (Aug 17 2020 at 04:23):

concat is pretty natural: it is the function 2^m * a + b, and extract is test_bit i a using the nat.test_bit function
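
i.e., at the nat level (illustrative names only):

  -- b is an m-bit value; a ends up in the high bits
  def concat (m a b : ℕ) : ℕ := 2 ^ m * a + b

  -- bit i of a, via core's nat.test_bit
  def extract_bit (a i : ℕ) : bool := nat.test_bit a i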

Xi Wang (Aug 17 2020 at 04:24):

Mario Carneiro said:

You shouldn't rely on rfl for this. I actually do a decent amount of computation on concrete bitvectors in that file I linked, but I use typeclasses to transfer the problems to a more easily evaluated setup (see the reify function in x86/lemmas.lean)

Thanks for the pointer! Will take a look. We don't really need to evaluate concrete bitvectors in this project, but we might in the next one. That's why it's not urgent for us now, but I'm definitely interested.

Mario Carneiro (Aug 17 2020 at 04:25):

actually I didn't need to compute with bitvectors exactly; I needed to prove that e.g. 0x67 does not match the pattern [0,1,x,y,0,z,0,0]

Mario Carneiro (Aug 17 2020 at 04:26):

doing that naively in the kernel is a complete non-starter, at least with the representation I used

Xi Wang (Aug 17 2020 at 04:26):

Right, nat has bit operations. But they are quite different from the SMT-LIB definitions (and may have subtle semantic differences). Since we are modeling SMT operations in Lean for this project, it's easier to keep our definitions close to the SMT-LIB ones.

Mario Carneiro (Aug 17 2020 at 04:27):

The usual mathlib way is to prove that the singular definition matches the expected semantics. Do you actually have reason to believe that they are different?

Xi Wang (Aug 17 2020 at 04:32):

Aha, I've been curious about this high-level question. I notice some style differences between Coq and Lean/mathlib code. For example, Coq's Z is defined using a representation that is efficient for kernel computation, whereas Lean prefers a simpler definition (based on nat) and relies on tools like norm_num when kernel computation is needed.

For bitvectors, one approach is to take circuit-based definitions (tweaking data.bitvec) or num-based definitions and prove that they match the SMT-LIB semantics. Another approach is to define the operations over nat following SMT-LIB and extend norm_num to support kernel computation, etc. Any thoughts?

Mario Carneiro (Aug 17 2020 at 04:35):

My inclination would be to make all the definitions mathematically as simple as possible. So e.g. addition is bitvec.add x y := of_nat (to_nat x + to_nat y), there are maps to nat and int for unsigned and signed values, and slt, ult are defined by pulling back < on these types
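
i.e., roughly (just a sketch; to_nat/to_int here are the maps giving the unsigned and two's-complement values):

  def ult (x y : bitvec n) : Prop := x.to_nat < y.to_nat
  def slt (x y : bitvec n) : Prop := x.to_int < y.to_int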

Mario Carneiro (Aug 17 2020 at 04:37):

this is the kind of definition that is best situated to answer the correctness question, and it's also the easiest to prove properties about

Mario Carneiro (Aug 17 2020 at 04:38):

Then the existence of an algorithm to compute the values (the circuit definition) becomes a real theorem

Xi Wang (Aug 17 2020 at 04:44):

The current bitvec defs seem to match your description? https://github.com/uw-unsat/serval-bpf/blob/master/lean/src/bv/basic.lean

Mario Carneiro (Aug 17 2020 at 04:49):

Yes, pretty much. I would argue that the definition of slt isn't great though

Mario Carneiro (Aug 17 2020 at 04:49):

also ashr

Mario Carneiro (Aug 17 2020 at 04:50):

udiv and urem look a little weird but I'm sure you got the special cases from smt-lib

Xi Wang (Aug 17 2020 at 04:50):

lol agreed - slt and ashr are directly copied from the SMT-LIB standard.

Xi Wang (Aug 17 2020 at 04:51):

We do prove something like this:

lemma slt_to_int (v₁ v₂ : bv (n + 1)) :
  v₁.to_int < v₂.to_int = v₁.slt v₂

https://github.com/uw-unsat/serval-bpf/blob/master/lean/src/bv/lemmas.lean#L651

Xi Wang (Aug 17 2020 at 04:53):

I meant this part from SMT-LIB http://smtlib.cs.uiowa.edu/Logics/QF_BV.smt2

Mario Carneiro (Aug 17 2020 at 04:55):

well in a file like that, I would guess that the version v₁.to_int < v₂.to_int = v₁.slt v₂ isn't available in the lexicon

Mario Carneiro (Aug 17 2020 at 04:57):

If I were being really rigorous about the translation, I would take that file and put it directly in lean, using an automated translation (with different possible levels of rigor here) to turn it into a lean theorem, that can then be proven from the lean definitions

Mario Carneiro (Aug 17 2020 at 04:58):

that way it decouples the lean definitions from the smt-lib definitions without losing on correctness

Xi Wang (Aug 17 2020 at 04:59):

Yeah that would be cool. :)

Mario Carneiro (Aug 17 2020 at 05:00):

I think @Daniel Fabian is working on an SMT-lib parser, which could make it possible to even read that file directly into a lean data structure

Daniel Fabian (Aug 17 2020 at 12:15):

That's right. As of now, I have an s-expression parser and serializer and a z3 interface, and I am working on adding higher-level constructs of smtlib, like commands.

One could use the current thing as is and even call z3 with it (I have test files doing that); the communication would then have to be in terms of s-expressions for now. Once I get further along in the development, I certainly envision an interface in terms of "commands" and "responses", which would be lean inductive types.
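
For a rough idea of the shape (the actual definitions may differ), the s-expression type is essentially:

  inductive sexp
  | atom : string → sexp
  | list : list sexp → sexp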


Last updated: Dec 20 2023 at 11:08 UTC