Zulip Chat Archive
Stream: Program verification
Topic: Bitvector
Xi Wang (Aug 16 2020 at 23:30):
Hi,
We have been working on a formal verification tool for the BPF just-in-time (JIT) compilers in the Linux kernel [1]. The tool is based on SMT solving, but axiomatizes a set of expensive bitvector operations (mul/udiv/urem) to scale verification. We prove the axiomatization separately in Lean. For example, the tool decomposes 64-bit multiplication using 32-bit operations (e.g., for JIT compilers on 32-bit architectures), and we have a theorem mul_decomp for that axiomatization [2].
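A minimal nat-level sketch of the identity behind such a decomposition (illustrative only; the names and the exact statement of mul_decomp in [2] differ):

import tactic.ring

-- Split each 64-bit operand into 32-bit halves, a = w * ah + al and b = w * bh + bl,
-- with w = 2^32. The product then expands into four partial products:
example (w ah al bh bl : ℕ) :
  (w * ah + al) * (w * bh + bl)
    = w * w * (ah * bh) + w * (al * bh + ah * bl) + al * bl :=
by ring

-- With w = 2^32, the term w * w * (ah * bh) is a multiple of 2^64 and vanishes
-- modulo 2^64, so the low 64 bits of the product are determined by the three
-- "low" partial products, each computable with 32-bit multiplications.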
Just wanted to share the bitvector definitions and lemmas we have so far, and see if there's interest in providing a general bitvector library in mathlib.
The current implementation is:
https://github.com/uw-unsat/serval-bpf/tree/master/lean/src/bv
It largely follows the SMT-LIB standard.
We have also tried three other implementations before switching to the current one:
- representing a bitvector as nat plus a range proof: https://github.com/uw-unsat/serval-bpf/blob/master/lean/attic/bitvec-nat.lean
- using data.bitvec: https://github.com/uw-unsat/serval-bpf/blob/master/lean/attic/bitvec-vector.lean
- representing a bitvector as an inductive type: https://github.com/uw-unsat/serval-bpf/tree/master/lean/attic/bitvec-ind/bv
[1] https://github.com/uw-unsat/serval-bpf
[2] https://github.com/uw-unsat/serval-bpf/blob/master/lean/src/bvaxioms.lean
Simon Hudon (Aug 17 2020 at 01:21):
What would you say were the main pain points with data.bitvec?
Xi Wang (Aug 17 2020 at 03:59):
Simon Hudon said:
What would you say were the main pain points with data.bitvec?
The main reason is specification: data.bitvec defines operations using circuits, while we prefer the definitions to match those in the SMT-LIB standard:
http://smtlib.cs.uiowa.edu/theories-FixedSizeBitVectors.shtml
For example, add in data.bitvec is defined as an adder:
https://github.com/leanprover-community/lean/blob/master/library/data/bitvec.lean#L90
The SMT-LIB standard defines it as: [[(bvadd s t)]] := nat2bv[m](bv2nat([[s]]) + bv2nat([[t]]))
https://github.com/uw-unsat/serval-bpf/blob/master/lean/src/bv/basic.lean#L164
For proofs, the current implementation represents a bitvector as fin n → bool, which makes induction a bit simpler than using data.bitvec, but not fundamentally different (e.g., we prove ring for bitvector neg, add, mul for both representations). For example, data.bitvec stores the most significant bit at position 0 (rather than the least significant bit), and some proofs need to append a bit to the end of a vector for induction. Maybe reverse induction could simplify that.
Mario Carneiro (Aug 17 2020 at 04:03):
I have an implementation of bitvectors as well, which I needed for a project. I took the definition to be vector bool n, but with the bits in the opposite order from data.bitvec, and with operations defined via nat like you indicated.
Xi Wang (Aug 17 2020 at 04:08):
Nice! We did exactly that in one version first (vector bool with the opposite order, defined over nat).
To further simplify induction (esp. to do induction over both n and the list of bits at the same time), we tweaked that into this inductive type (similar to vector in Coq):
https://github.com/uw-unsat/serval-bpf/blob/master/lean/attic/bitvec-ind/bv/basic.lean#L32
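The core of such an inductive definition looks roughly like the following (a simplified sketch; the constructor names and extra structure in the linked file may differ):

-- A length-indexed vector of bits, in the style of Coq's Vector.t.
inductive bv : ℕ → Type
| nil  : bv 0
| cons : Π {n : ℕ}, bool → bv n → bv (n + 1)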
Mario Carneiro (Aug 17 2020 at 04:09):
https://github.com/digama0/mm0/blob/master/mm0-lean/x86/bits.lean
Mario Carneiro (Aug 17 2020 at 04:10):
lemmas in https://github.com/digama0/mm0/blob/master/mm0-lean/x86/lemmas.lean
Mario Carneiro (Aug 17 2020 at 04:14):
I think you can do better if you avoid doing induction at all, and instead use zmod to get all the properties. The dependent type approach means that you have to do induction for everything.
Mario Carneiro (Aug 17 2020 at 04:16):
But I think this indicates that it is worthwhile to get a bitvector library into mathlib, and data.bitvec is not an appropriate basis.
Xi Wang (Aug 17 2020 at 04:16):
One thing we are still not sure about is whether to use nat or num for the underlying representation.
For example, if we use num for the definitions:
def add (v₁ v₂ : bv n) : bv n :=
  bv.of_num n (v₁.to_num + v₂.to_num)
Then we could do computational proofs:
example : (0x1234 : bv 16) * (0x5678 : bv 16) = (0x60 : bv 16) := rfl
Using nat feels simpler for definitions, but computational proofs using rfl/dec_trivial won't work anymore. Ideally one could extend norm_num to support it, I guess? I vaguely remember someone was asking about zmod a while back.
Mario Carneiro (Aug 17 2020 at 04:19):
You shouldn't rely on rfl for this. I actually do a decent amount of computation on concrete bitvectors in that file I linked, but I use typeclasses to transfer the problems to a more easily evaluated setup (see the reify function in x86/lemmas.lean).
Mario Carneiro (Aug 17 2020 at 04:20):
Basically, you can take a number like 0x1234 : bv 16 and reflect on the syntax to turn it into a list of bools, which can then be evaluated in the kernel.
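A toy version of the underlying idea, peeling a nat numeral into its low n bits so the kernel can evaluate them cheaply (illustrative only; the actual reify machinery in x86/lemmas.lean is typeclass-based):

-- Peel off the low n bits of a natural number, least significant bit first.
def to_bits : ℕ → ℕ → list bool
| 0       _ := []
| (m + 1) a := a.bodd :: to_bits m a.div2

#eval to_bits 16 0x1234   -- little-endian bits of 0x1234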
Xi Wang (Aug 17 2020 at 04:21):
Mario Carneiro said:
I think you can do better if you avoid doing induction at all, and instead use zmod to get all the properties. The dependent type approach means that you have to do induction for everything
Right, we thought about using zmod. The issue here is that there are two kinds of bitvector operations: the arithmetic ones (add, sub, mul, etc.) could be replaced with existing operations over zmod, but that's less clear for bit operations (concat, extract), as they manipulate bits directly. So we are stuck with fin n → bool for now.
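For the arithmetic fragment, the zmod route does give the algebraic laws for free; a minimal sketch (the bit-level view for concat/extract would still need a separate representation):

import data.zmod.basic

-- zmod (2^n) already carries a commutative ring structure, so add/sub/mul and
-- their laws come for free; only the bit-level operations need extra work.
example (n : ℕ) (x y : zmod (2^n)) : x * y = y * x := mul_comm x y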
Xi Wang (Aug 17 2020 at 04:21):
Mario Carneiro said:
But I think this indicates that it is worthwhile to get a bitvector library into mathlib, and data.bitvec is not an appropriate basis
Agreed! That would be great.
Mario Carneiro (Aug 17 2020 at 04:22):
Even better would be to extend norm_num, but that's tricky to do in a third-party project.
Mario Carneiro (Aug 17 2020 at 04:22):
Note that nat already has bit operations on it.
Mario Carneiro (Aug 17 2020 at 04:23):
concat is pretty natural: it is the function 2^m * a + b, and extract is test_bit i a using the nat.test_bit function.
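Concretely, the nat-level counterparts look like this (nat_concat is an illustrative name, not an existing mathlib definition; nat.test_bit is the core library function, taking the number first and the bit index second):

-- Concatenate an m-bit value b below a (a becomes the high part), and test one bit.
def nat_concat (m a b : ℕ) : ℕ := 2^m * a + b

#eval nat_concat 8 0x12 0x34      -- 4660 = 0x1234
#eval nat.test_bit 0x1234 4       -- tt (bit 4 of 0x1234 is set)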
Xi Wang (Aug 17 2020 at 04:24):
Mario Carneiro said:
You shouldn't rely on rfl for this. I actually do a decent amount of computation on concrete bitvectors in that file I linked, but I use typeclasses to transfer the problems to a more easily evaluated setup (see the reify function in x86/lemmas.lean)
Thanks for the pointer! Will take a look. We don't really need to evaluate concrete bitvectors in this project, but we might in the next one. That's why it's not urgent for us now, but I'm definitely interested.
Mario Carneiro (Aug 17 2020 at 04:25):
actually I didn't need to compute with bitvectors exactly, I needed to prove that e.g. 0x67 does not match the pattern [0,1,x,y,0,z,0,0]
Mario Carneiro (Aug 17 2020 at 04:26):
doing that naively in the kernel is a complete non-starter, at least with the representation I used
Xi Wang (Aug 17 2020 at 04:26):
Right, nat has bit operations. But they are quite different from the SMT-LIB definitions (and may have subtle differences in semantics). Since we are modeling SMT operations in Lean in this project, it's easier to keep our definitions closer to the SMT-LIB ones.
Mario Carneiro (Aug 17 2020 at 04:27):
The usual mathlib way is to prove that the singular definition matches the expected semantics. Do you actually have reason to believe that they are different?
Xi Wang (Aug 17 2020 at 04:32):
Aha, I've been curious about this high-level question. I notice some style differences between Coq and Lean/mathlib code. For example, Coq's Z is defined using a representation that is efficient for kernel computation, whereas Lean prefers a simpler definition (based on nat), and one can use things like norm_num if kernel computation is needed.
For bitvectors, one approach is to take circuit-based definitions (tweaking data.bitvec) or num-based definitions, and prove that these definitions match the SMT-LIB semantics. Another approach is to define operations using nat following SMT-LIB, and extend norm_num to support kernel computation, etc. Any thoughts?
Mario Carneiro (Aug 17 2020 at 04:35):
My inclination would be to make all the definitions mathematically as simple as possible. So e.g. addition is bitvec.add x y := of_nat (to_nat x + to_nat y), there are maps to nat and int for unsigned and signed values, and slt, ult are defined by pulling back < on these types.
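A minimal sketch of that style of definition (the carrier and the conversion functions are declared as constants purely to keep the sketch short; they stand in for the actual library definitions, and only the shape of the pullbacks is the point):

constant bv : ℕ → Type
constant to_nat : Π {n : ℕ}, bv n → ℕ
constant of_nat : Π (n : ℕ), ℕ → bv n
constant to_int : Π {n : ℕ}, bv n → ℤ

-- Arithmetic and comparisons are pulled back through the conversion maps.
def bv.add {n : ℕ} (x y : bv n) : bv n := of_nat n (to_nat x + to_nat y)
def bv.ult {n : ℕ} (x y : bv n) : Prop := to_nat x < to_nat y
def bv.slt {n : ℕ} (x y : bv n) : Prop := to_int x < to_int y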
Mario Carneiro (Aug 17 2020 at 04:37):
this is the kind of definition that is best situated to answer the correctness question, and it's also the easiest to prove properties about
Mario Carneiro (Aug 17 2020 at 04:38):
Then the existence of an algorithm to compute the values (the circuit definition) becomes a real theorem
Xi Wang (Aug 17 2020 at 04:44):
The current bitvec defs seem to match your description? https://github.com/uw-unsat/serval-bpf/blob/master/lean/src/bv/basic.lean
Mario Carneiro (Aug 17 2020 at 04:49):
Yes, pretty much. I would argue that the definition of slt isn't great though
Mario Carneiro (Aug 17 2020 at 04:49):
also ashr
Mario Carneiro (Aug 17 2020 at 04:50):
udiv and urem look a little weird, but I'm sure you got the special cases from SMT-LIB
Xi Wang (Aug 17 2020 at 04:50):
lol agreed - slt and ashr are directly copied from the SMT-LIB standard.
Xi Wang (Aug 17 2020 at 04:51):
We do prove something like this:
lemma slt_to_int (v₁ v₂ : bv (n + 1)) :
  v₁.to_int < v₂.to_int = v₁.slt v₂
https://github.com/uw-unsat/serval-bpf/blob/master/lean/src/bv/lemmas.lean#L651
Xi Wang (Aug 17 2020 at 04:53):
I meant this part from SMT-LIB http://smtlib.cs.uiowa.edu/Logics/QF_BV.smt2
Mario Carneiro (Aug 17 2020 at 04:55):
well in a file like that, I would guess that the version v₁.to_int < v₂.to_int = v₁.slt v₂ isn't available in the lexicon
Mario Carneiro (Aug 17 2020 at 04:57):
If I were being really rigorous about the translation, I would take that file and put it directly into Lean, using an automated translation (with different possible levels of rigor here) to turn it into a Lean theorem, which can then be proven from the Lean definitions
Mario Carneiro (Aug 17 2020 at 04:58):
that way it decouples the Lean definitions from the SMT-LIB definitions without losing correctness
Xi Wang (Aug 17 2020 at 04:59):
Yeah that would be cool. :)
Mario Carneiro (Aug 17 2020 at 05:00):
I think @Daniel Fabian is working on an SMT-LIB parser, which could make it possible to even read that file directly into a Lean data structure
Daniel Fabian (Aug 17 2020 at 12:15):
That's right. As of now, I have an s-expression parser and serializer and a Z3 interface, and I am working on adding higher-level constructs of SMT-LIB, like commands.
One could use the current thing as is and even call Z3 with it (I have test files doing that); the communication would then have to be in terms of s-expressions for now. Once I get further in the development, I certainly envision an interface in terms of "commands" and "responses", which would be Lean inductive types.