Zulip Chat Archive

Stream: std4

Topic: SAT core defs in Std?


James Gallicchio (Jan 03 2023 at 10:14):

Thoughts on putting the core definitions for SAT (CNF formulas, satisfiability, etc.) in Std?

It's a bit outside the scope. I personally would prefer to put it in a separate library, but I'd like us to settle on one set of core defs for all SAT-related projects, for easy interop.

A non-exhaustive list of (not nec. equivalent) core SAT definitions from different Lean4 projects:
My encoding library: https://github.com/JamesGallicchio/eternity2/blob/main/lean/Eternity2/SATSolve/EncCNF.lean
Mario's LRAT checker: https://github.com/leanprover-community/mathlib4/blob/master/Mathlib/Tactic/Sat/FromLRAT.lean
Joe's LRAT checker: https://github.com/joehendrix/lean-sat-checker/blob/main/SatChecker/Core.lean
Siddhartha's pure Lean SAT solver: https://github.com/siddhartha-gadgil/Saturn/blob/main/Saturn/Core.lean
Wojciech's (abandoned?) SAT core library: https://github.com/Vtec234/lean-sat/blob/main/LeanSat/Data/Cnf.lean

Also some lean3 projects
Cayden's encoding library: https://github.com/ccodel/verified-encodings/blob/master/src/cnf/clause.lean
Ryan Greenblatt's verified DPLL solver: https://github.com/rgreenblatt/verified_sat/blob/master/src/types.lean

James Gallicchio (Jan 03 2023 at 10:19):

The main outlier I see is Cayden's, where the CNF formulas are parametric on the underlying variable type (which is Nat in all other instances). I think it's justified in the encoding library, but unnecessarily general for the other projects.

James Gallicchio (Jan 03 2023 at 10:23):

I also don't want to duplicate work with https://github.com/ufmg-smite/lean-smt, but I don't see any CNF-specific stuff there.

James Gallicchio (Jan 03 2023 at 10:24):

Anyways, if these defs don't fit in Std, I'll put together a separate repo for it.

Wojciech Nawrocki (Jan 03 2023 at 16:20):

I think that an obstacle to standardization is that depending on what you want to use the CNF for, there are different requirements on the representation. For example List Clause is okay if you only ever need to fold the whole thing but would be absurdly inefficient for a SAT solver. So realistically the library would need to include something like PersistentCnf and Cnf, like List vs Array. But even then, what if you need a clause-database type representation which isn't an Array but some more specific data structure? I think the one thing we could all agree on is the mathematical model which would just be some inductive datatype of propositional formulas :)

Wojciech Nawrocki (Jan 03 2023 at 16:25):

James Gallicchio said:

The main outlier I see is Cayden's, where the CNF formulas are parametric on the underlying variable type (which is Nat in all other instances). I think it's justified in the encoding library, but unnecessarily general for the other projects.

If you want to emit efficient code, it seems the thing to do is parameterize over the literal type e.g. so that it can be UInt64 and the first bit denotes the polarity. Parameterizing over the variable type forces a representation Literal ν which afaik Lean will represent as boxed which is a lot of pointer dereferencing when these are accessed often.

James Gallicchio (Jan 03 2023 at 18:07):

I would mainly want to agree on a mathematical model, but preferably one that is also reasonably efficient for any sort of program interoperation between projects.

James Gallicchio (Jan 03 2023 at 18:08):

Or, maybe if we get a good collections interface in Std, we can parameterize the CNF type by the collection types...

Mario Carneiro (Jan 03 2023 at 21:25):

I agree with Wojciech, this is an area where you can't really standardize the data representation because that depends a lot on what the solver is doing or how it works. The only thing I would expect could be shared are things like List.all for defining the model

James Gallicchio (Jan 04 2023 at 07:38):

I don't think Lean would be viable for writing high-performance SAT solvers at the moment, so I wasn't really thinking about that as an application. I'm moreso looking for us to maintain something akin to PySAT which is a library for encoding to CNF and interfacing with external SAT solvers.

Maybe for now I just work on building out the PySAT equivalent library, and then we can see if it's a good place for mathlib's LRAT verifier and Cayden's verified encodings to be housed.

James Gallicchio (Jan 04 2023 at 07:39):

(And any other SAT-adjacent projects that we want to maintain longterm)

Kevin Buzzard (Jan 04 2023 at 08:50):

I know nothing about programming but out of interest why do you not think lean would be viable for writing high-performance SAT solvers?

Reid Barton (Jan 04 2023 at 08:52):

Maybe what we want is a type class that explains how to decode some type X into a SAT instance (with variables of type a)

Mario Carneiro (Jan 04 2023 at 08:55):

that does sound more promising, although I still wonder what kind of common interface we can give to the different SAT-like types

Mario Carneiro (Jan 04 2023 at 08:59):

Kevin Buzzard said:

I know nothing about programming but out of interest why do you not think lean would be viable for writing high-performance SAT solvers?

We can probably get decently close (i.e. within an order of magnitude) to high performance SAT solvers with lean code if the compiler doesn't fail us, but to get the last bit of performance requires more control over data layout than we have any intention of providing in pure lean code. In this regime the next best thing is to link to a C function or library and axiomatize its behavior using implemented_by; you could still get pretty close to full verification that way if the model is fine-grained enough.

Tobias Grosser (Jan 04 2023 at 11:21):

I actually would be super interested in getting lean faster than C on such codes and have thought about this quite a bit. A lot of the vectorization/performance tricks that we apply in https://dl.acm.org/doi/10.1145/3428263 and https://dl.acm.org/doi/10.1145/3485539 could potentially work better in lean.

Henrik Böving (Jan 04 2023 at 12:33):

Tobias Grosser said:

I actually would be super interested in getting lean faster than C on such codes and have thought about this quite a bit. A lot of the vectorization/performance tricks that we apply in https://dl.acm.org/doi/10.1145/3428263 and https://dl.acm.org/doi/10.1145/3485539 could potentially work better in lean.

Why do you believe they could work better in Lean?

Tobias Grosser (Jan 04 2023 at 15:09):

In C++ we face issues with (a) pointer aliasing, (b) partial evaluating the type-specific variants - the templates become crazy, (c) it is hard to supply proofs that certain low-precision types are sufficient, (d) and we cannot synthesis vector code effectively from C++. Lean's pure nature makes pointer aliasing simple to work with and lean is more modular and flexible such that b&d can be addressed, and (c) is what lean is anyway good at.

Tobias Grosser (Jan 04 2023 at 15:10):

I feel we could build e..g something like https://terralang.org/ in lean with an LLVM/MLIR backend.

Tobias Grosser (Jan 04 2023 at 15:11):

Also, the macro system in lean will make many of these things a lot easier, and the interactive proof mode could probably be used to make this process more approachable from a human perspective.

Tobias Grosser (Jan 04 2023 at 15:13):

Also, if you look at how the simplex algorithm is implemented, it is really rewriting matrices in-place. Hence, the initial algorithm can be nicely written in a purely functional style yet the generated code could be fully in place. If we then find a good way to (a) specialize the code according to types and (b) switch between those variants with little overhead we are good.

Tobias Grosser (Jan 04 2023 at 15:14):

Given that we now have an LLVM backend we could implement sth like the clang/gcc vector extensions: https://clang.llvm.org/docs/LanguageExtensions.html#vectors-and-extended-vectors

Tobias Grosser (Jan 04 2023 at 15:14):

Which expose fast SIMD code to the programmers.

Tobias Grosser (Jan 04 2023 at 15:15):

Compilers such as GHC do not have a good SIMD backend, so they have a hard time implementing these.

Tobias Grosser (Jan 04 2023 at 15:16):

LLVM also has excellent support for Checked arithmetic builtins and even SIMD versions of those, which again is sth that we could use.

Tobias Grosser (Jan 04 2023 at 15:18):

Finally, and now we go crazy (but I like that) we could even encode rationals as floats/doubles. ideas_fast-ap-integers.md-at-main-opencompl_ideas.pdf

Tobias Grosser (Jan 04 2023 at 15:19):

And again, to decide when a float is precise and when we need to fall-back to arbitrary precision rationals we certainly would benefit from lean as a language.

Andrés Goens (Jan 04 2023 at 16:55):

Tobias Grosser said:

I feel we could build e..g something like https://terralang.org/ in lean with an LLVM/MLIR backend.

that's pretty interesting :thinking: @Tobias Grosser do you know this well? I'm curious how this differs from Low*, or is it very similar?

Tobias Grosser (Jan 04 2023 at 16:58):

I guess it's similar. What's nice about Terra is that it integrates with llvm which means you get well optimized code that is JIT compiled without going through C and you can also have access of llvms vector abstractions.

Tobias Grosser (Jan 04 2023 at 16:59):

What makes Terra special is that you can use Lua variables from terras low level code and Lua can access Terra variables. It's very dynamic yet fast.

Tobias Grosser (Jan 04 2023 at 17:00):

I feel in lean we could do the same, but maybe even nicer with dynamic syntax extensions, eventually exposing mlir abstractions,...

Tobias Grosser (Jan 04 2023 at 17:01):

If this goes well we could give the lean users access to very low level abstractions, yet allowing for fast code to be generated. This will also help scilean and such.

Andrés Goens (Jan 04 2023 at 17:02):

It probably would require some kind of compiler support to treat these low-level abstractions specially in code generation, right?

Tobias Grosser (Jan 04 2023 at 17:02):

Lean already allows for annotations for using c.

Tobias Grosser (Jan 04 2023 at 17:03):

We could add annotations for using custom llvm and MLIR as implementation of a lean function.

Tobias Grosser (Jan 04 2023 at 17:03):

And then some macro magic to provide a lean api to end-users.

Tobias Grosser (Jan 04 2023 at 17:04):

Coincidentally we already have an llvm backend, so what's missing is mostly macro magic.

Andrés Goens (Jan 04 2023 at 17:06):

I guess also the right abstractions if we'd want to reason about that code

Tobias Grosser (Jan 04 2023 at 17:06):

Do you know anybody who could be interested in collaborating on the macro aspects of this?

Henrik Böving (Jan 04 2023 at 17:08):

Tobias Grosser said:

We could add annotations for using custom llvm and MLIR as implementation of a lean function.

I mean this would just be an attribute with special support in the LLVMEmitter in the Lean code generator no?
Tobias Grosser said:

And then some macro magic to provide a lean api to end-users.

If we have Lean functions tagged as implement by LLVM/MLIR what additional macro magic do we need?

Tobias Grosser (Jan 04 2023 at 17:09):

Andrés Goens said:

I guess also the right abstractions if we'd want to reason about that code

Maybe. I feel that this could be orthogonal. It would be nice if this is verified, but I am unsure finding canonical semantics for everything will be easy. The current c annotations also do not come with semantics and are already very useful. Offering a way to add verification seems to be beneficial as an opt in.

Tobias Grosser (Jan 04 2023 at 17:12):

Henrik Böving said:

Tobias Grosser said:

We could add annotations for using custom llvm and MLIR as implementation of a lean function.

I mean this would just be an attribute with special support in the LLVMEmitter in the Lean code generator no?
Tobias Grosser said:

And then some macro magic to provide a lean api to end-users.

If we have Lean functions tagged as implement by LLVM/MLIR what additional macro magic do we need?

In theory we could supply a string of llvm-it that the llvm emitter than translates. This is easy and maybe just a minor change. I feel it would be nice if we would actually parse the LLVM-IR into lean data structure to be able to check certain properties early on and also to be able to eventually attach semantics. But maybe that is indeed step two?

Henrik Böving (Jan 04 2023 at 17:17):

I mean the C stuff is just parsed as strings right now as well :D there was a time where we even had full C expressions instead of just function support.

What kind of semantics are you thinking about here?

Andrés Goens (Jan 04 2023 at 17:19):

I guess in the end it boils down to how superficial the Lean data structure is that it represents it. A data structure that represents the LLVM/MLIR syntax is surely an improvement over a string, but I'd argue not a terribly big one. If we'd want some sort of semantic representation of the LLVM/MLIR code to be able to e.g. typecheck some of the code, which I think we'd need for your points (a) and (b) perhaps above, although I don't fully understand those. For your point (c) a semantic embedding sounds indispensable, whereas for (d) this probably would already be very useful just to be able to code generate something opaque.

Tobias Grosser (Jan 04 2023 at 17:23):

https://github.com/opencompl/lean-mlir shows our experiments on a lean-based parser for MLIR as well as ideas to build modular semantics. It's very much work-in-progress, so many of these questions are unanswered. However, I am sure @Siddharth Bhat and @Sébastien Michelland and others have ideas to share. I am happy to just start with a string-to-llvm annotation and feel there could be some very low-hanging fruits, e.g. a (type-generic) SIMD implementation for lean?

Tobias Grosser (Jan 04 2023 at 17:23):

That allow for manual SIMDization.

Tobias Grosser (Jan 04 2023 at 17:24):

I feel adding a string-only LLVM annotation as a first step could be a nice start and we could improve on this incrementally. I would guess such a patch is < 100 LOC.

Reid Barton (Jan 04 2023 at 17:26):

(maybe this conversation could be split off into a new topic?)

Tobias Grosser (Jan 04 2023 at 17:27):

(yes, I don't know how to. This belongs into lean4 dev or at the least into a separate thread. I don't want to experiment to not break this long thread. Anybody knows how to split threads?)

James Gallicchio (Jan 04 2023 at 18:31):

I think you need admin permissions... @Sebastian Ullrich @Gabriel Ebner ? (saw you online; this discussion might also be of interest to the dev team)

Notification Bot (Jan 04 2023 at 20:34):

35 messages were moved from this topic to #lean4 dev > Lean LLVM/MLIR by Sebastian Ullrich.

James Gallicchio (Jan 15 2023 at 12:41):

Okay, first commit of LeanSAT has dropped :) it's currently just what I had already written for another project, but it is already pretty nice to build encodings with

James Gallicchio (Jan 15 2023 at 12:42):

I currently plan to add docs, more examples, and an IPASIR interface with (hopefully) more solvers.

James Gallicchio (Jan 15 2023 at 12:45):

And then I'll probably go in the direction of Reid's suggestion and add a SAT encodable/decodable typeclass

James Gallicchio (Jan 15 2023 at 12:46):

LMK if there's something else you'd like to see in it!

Mario Carneiro (Jan 15 2023 at 12:47):

a test suite / CI would be nice

James Gallicchio (Jun 26 2023 at 00:50):

James Gallicchio said:

The main outlier I see is Cayden's, where the CNF formulas are parametric on the underlying variable type (which is Nat in all other instances). I think it's justified in the encoding library, but unnecessarily general for the other projects.

I've been steadily improving the library and building more encodings, and I am becoming convinced that Cayden is totally correct that CNFs should be parametrized on the variable type, in particular so that you can differentiate temporary/autogenerated variables from meaningful variables. As it turns out I've encountered many of the same issues as I did when I was trying to formalize register machines :) we're composing graphs, essentially, and having the types reflect that is helpful.

James Gallicchio (Jun 26 2023 at 00:53):

I also now thoroughly believe I want the CNF type to just be parametric on collection type-- it is very helpful to have a single shared type for shuffling CNFs around various tools, but I am starting to hit the limitations of using List, and I never do any operations other than cons/append and fold...

Mac Malone (Jun 28 2023 at 02:15):

Mario Carneiro said:

to get the last bit of performance requires more control over data layout than we have any intention of providing in pure lean code

What makes you think this? I know that in the compiler discussions there was a lot of consideration of features that would give users much more fine grain control of data layout. So, I definitely could see pure Lean reaching this point eventually.

Mario Carneiro (Jun 28 2023 at 02:34):

For instance, I don't see lean losing the refcounts any time in the foreseeable future

Mario Carneiro (Jun 28 2023 at 02:34):

and that adds a bunch of boxing overhead to every abstraction

James Gallicchio (Jun 28 2023 at 03:13):

Mario Carneiro said:

For instance, I don't see lean losing the refcounts any time in the foreseeable future

I thought this was one of the things on the long term roadmap for linear types...

James Gallicchio (Jun 28 2023 at 03:17):

Regardless, I don't currently have and don't plan to have concrete implementations of special data structures in LeanSAT, just interfaces to common formats and such

Tobias Grosser (Jun 29 2023 at 07:53):

I certainly would be interested in learning if/how we can get more performance out of lean.

Tobias Grosser (Jun 29 2023 at 07:54):

Probably not 100% related, but @Siddharth Bhat and @Henrik Böving work on an LLVM backend for Lean plus a codegen_by extension that allows LLVM-IR to be used to implement lean function.

Tobias Grosser (Jun 29 2023 at 07:54):

I would expect that this could offer fast SIMD code for certain data structures.


Last updated: Dec 20 2023 at 11:08 UTC