Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Concurrency and batch processing in pantograph


gambpang (Apr 24 2025 at 00:19):

Here is a brief conversation over email with @Leni Aniva that I'm moving here:

@gambpang I am a fan of your Pantograph software. Have you considered either a concurrent or batch-oriented interface? I would like to make use of something like this and am happy to do some development. If there is already an effort in progress, I can pitch in with development or code review. If not, I could submit a patch to your repository.

@Leni Aniva Thanks for the interest! Python seems to favour multiprocessing over multithreading, so the current support for parallelism in PyPantograph is that you can pickle a goal and send it to another process as a file. Unifying goals after a proof finishes, however, is a difficult algorithmic problem, and I plan to add that feature in the summer.
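For concreteness, the pickle-to-file handoff described above might look like the minimal sketch below. The goal here is just a placeholder string; the actual PyPantograph goal class (and whether it pickles directly) isn't shown in this thread.

```python
import multiprocessing as mp
import os
import pickle
import tempfile

def prove_from_file(path: str) -> None:
    # Reconstruct the goal in a separate process and run search on it.
    with open(path, "rb") as f:
        goal = pickle.load(f)
    print(f"worker {os.getpid()} resumed goal: {goal!r}")

if __name__ == "__main__":
    goal = "n : Nat ⊢ n + 0 = n"  # placeholder for a real PyPantograph goal
    fd, path = tempfile.mkstemp(suffix=".pkl")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(goal, f)       # serialize the goal to a file ...
    worker = mp.Process(target=prove_from_file, args=(path,))
    worker.start()                 # ... and hand it to another process
    worker.join()
    os.unlink(path)
```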

... continuing:

I had in mind to add concurrency to the Lean executable by exposing a TCP socket that handles commands concurrently across a tree of shared states. The specific functionality I would like to support is sharing a complex environment in memory among many threads to avoid start-up costs. For a first cut, there would be branching but not merging (at least not as an explicit feature).
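A toy sketch of that design, with placeholder states standing in for the shared Lean environment: a single long-lived process holds a tree of states, and each TCP connection can branch from any node (branching without merging, as described). The line protocol is invented for illustration.

```python
import itertools
import socketserver
import threading

states = {0: "shared root environment"}  # state id -> placeholder state
state_lock = threading.Lock()
next_id = itertools.count(1)

class Handler(socketserver.StreamRequestHandler):
    def handle(self):
        # Protocol sketch: each line is "branch <parent-id>"; the reply
        # names the newly created child state.
        for line in self.rfile:
            parts = line.decode().split()
            if len(parts) == 2 and parts[0] == "branch":
                parent = int(parts[1])
                with state_lock:
                    child = next(next_id)
                    # Alias the parent's in-memory state instead of copying;
                    # a real version would record a branch point in the tree.
                    states[child] = states[parent]
                self.wfile.write(f"state {child}\n".encode())

if __name__ == "__main__":
    with socketserver.ThreadingTCPServer(("127.0.0.1", 4000), Handler) as srv:
        srv.serve_forever()  # one thread per connection, shared state tree
```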

Leni Aniva (Apr 24 2025 at 00:25):

so you want the compiled Lean executable (the Pantograph REPL or something similar) itself to act like a server?

Leni Aniva (Apr 24 2025 at 00:25):

One of Dr. Avigad's students at CMU tried that. It was really hard.

Leni Aniva (Apr 24 2025 at 00:27):

Also, I don't see what benefit this could provide over the current solution of having one instance per process. If a buggy tactic crashes Pantograph, it would be hard to recover.

Leni Aniva (Apr 24 2025 at 00:30):

My solution is to have multiple instances of whatever proof-finding program you're running. If one instance crashes, use industry-standard methods of error recovery and generate plenty of checkpoints. This is also the solution used by Amazon and Harmonic, AFAIK.
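In outline, that recovery pattern could look like the following; the worker body and checkpoint payload are placeholders for a real proof-search loop.

```python
import multiprocessing as mp
import pickle
import time

def worker(task_id: int) -> None:
    # Resume past the last checkpoint if one exists.
    try:
        with open(f"ckpt-{task_id}.pkl", "rb") as f:
            start = pickle.load(f)["step"] + 1
    except FileNotFoundError:
        start = 0
    for step in range(start, 5):
        time.sleep(0.1)  # stand-in for expanding the search tree
        with open(f"ckpt-{task_id}.pkl", "wb") as f:
            pickle.dump({"task": task_id, "step": step}, f)

def run_supervised(task_id: int, max_restarts: int = 3) -> None:
    for _ in range(max_restarts + 1):
        p = mp.Process(target=worker, args=(task_id,))
        p.start()
        p.join()
        if p.exitcode == 0:
            return  # finished cleanly
        print(f"task {task_id} died (exit {p.exitcode}); restarting from checkpoint")
    print(f"task {task_id} exceeded its restart budget")

if __name__ == "__main__":
    run_supervised(0)
```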

Justin Asher (Apr 24 2025 at 02:14):

Leni Aniva said:

Also, I don't see what benefit this could provide over the current solution of having one instance per process. If a buggy tactic crashes Pantograph, it would be hard to recover.

Do you think it would be worth trying a load-balanced cluster where each node is a Lean server instance (using the TCP approach internally for efficiency within that node), coupled with frequent checkpointing? This would provide instance-level fault isolation, while still gaining some resource efficiency. I am mainly concerned about starting many server instances, like @gambpang mentioned, and the resulting high RAM and CPU usage.

gambpang (Apr 24 2025 at 02:37):

@Leni Aniva Thanks for the feedback! I'm curious what sort of workloads you've run in the one-instance-per-process setting. I'm thinking of trying to saturate my hardware by batching examples as much as possible. Also, I'm fairly new to this scene and still building intuition. How common is it for a buggy tactic to crash Pantograph? Are there some especially buggy tactics to treat with more care?

Leni Aniva (Apr 24 2025 at 04:50):

Justin Asher said:

Leni Aniva said:

Also, I don't see what benefit this could provide over the current solution of having one instance per process. If a buggy tactic crashes Pantograph, it would be hard to recover.

Do you think it would be worth trying a load-balanced cluster where each node is a Lean server instance (using the TCP approach internally for efficiency within that node), coupled with frequent checkpointing? This would provide instance-level fault isolation, while still gaining some resource efficiency. I am mainly concerned about starting many server instances, like gambpang mentioned, and the resulting high RAM and CPU usage.

By server do you mean the LSP server or just the Pantograph REPL? The REPL should have a much smaller footprint.

We can have load balancing, but it should be done outside of Lean.
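Done outside of Lean, that load balancing could be as simple as a round-robin dispatcher over a pool of REPL subprocesses. The `pantograph-repl` command name and the line-oriented stdin/stdout protocol below are assumptions for illustration, not Pantograph's actual interface.

```python
import itertools
import subprocess

POOL_SIZE = 4

def start_pool() -> list[subprocess.Popen]:
    # Start the REPL instances once, up front.
    return [
        subprocess.Popen(
            ["pantograph-repl"],          # assumed command name
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )
        for _ in range(POOL_SIZE)
    ]

def dispatch(pool: list[subprocess.Popen], requests: list[str]):
    # Round-robin: each request goes to the next worker in the cycle.
    workers = itertools.cycle(pool)
    for req in requests:
        w = next(workers)
        w.stdin.write(req + "\n")
        w.stdin.flush()
        yield w.stdout.readline().strip()
```

Because the pool lives outside Lean, a crashed worker can simply be replaced without disturbing its siblings.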

Leni Aniva (Apr 24 2025 at 04:52):

gambpang said:

Leni Aniva Thanks for the feedback! I'm curious what sort of workloads you've run in the one-instance-per-process setting. I'm thinking of trying to saturate my hardware by batching examples as much as possible. Also, I'm fairly new to this scene and still building intuition. How common is it for a buggy tactic to crash Pantograph? Are there some especially buggy tactics to treat with more care?

aesop used to be pretty buggy. I have the timeout feature as a safeguard in Pantograph, but a tactic can just ignore that. I've had aesop crash my machine and max out the RAM. simp could go into an infinite loop. My go-to is to avoid these tactics for now.
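Since an in-process timeout can be ignored by a runaway tactic, one generic safeguard is an OS-level kill from outside the process. A sketch, with the command and request format as placeholders:

```python
import subprocess

def run_with_hard_timeout(cmd: list[str], request: str, seconds: float):
    # Run one request in a fresh subprocess with a wall-clock budget.
    proc = subprocess.Popen(
        cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True
    )
    try:
        out, _ = proc.communicate(input=request, timeout=seconds)
        return out
    except subprocess.TimeoutExpired:
        proc.kill()   # SIGKILL cannot be ignored by a stuck tactic
        proc.wait()
        return None   # the caller treats this branch as failed
```

On POSIX, adding a memory ceiling via `resource.setrlimit` in a `preexec_fn` would similarly guard against the RAM blow-ups mentioned above.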

Justin Asher (Apr 24 2025 at 17:35):

Leni Aniva said:

Justin Asher said:

Leni Aniva said:

Also, I don't see what benefit this could provide over the current solution of having one instance per process. If a buggy tactic crashes Pantograph, it would be hard to recover.

Do you think it would be worth trying a load-balanced cluster where each node is a Lean server instance (using the TCP approach internally for efficiency within that node), coupled with frequent checkpointing? This would provide instance-level fault isolation, while still gaining some resource efficiency. I am mainly concerned about starting many server instances, like gambpang mentioned, and the resulting high RAM and CPU usage.

By server do you mean the LSP server or just the Pantograph REPL? The REPL should have a much smaller footprint.

We can have load balancing, but it should be done outside of Lean.

That makes sense. I will have to play around with the REPL more. The Kimina Lean Server seems to implement this idea well (using pre-started Lean REPL workers). Thanks for the information.
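The pre-started-workers idea comes down to paying the Lean startup cost once at boot and then feeding already-warm processes. A rough sketch, again with an assumed `pantograph-repl` command and line protocol; here workers pull from a shared queue rather than being assigned round-robin, so a slow request never blocks an idle worker.

```python
import queue
import subprocess
import threading
import time

requests: queue.Queue = queue.Queue()

def warm_worker() -> None:
    # Startup cost is paid here, once, before any request arrives.
    proc = subprocess.Popen(
        ["pantograph-repl"],  # assumed command name
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    while True:
        req = requests.get()  # pull work as soon as this worker is free
        proc.stdin.write(req + "\n")
        proc.stdin.flush()
        print(proc.stdout.readline().strip())

if __name__ == "__main__":
    for _ in range(4):
        threading.Thread(target=warm_worker, daemon=True).start()
    requests.put('{"cmd": "example"}')  # placeholder request
    time.sleep(5)                       # let workers drain the queue in this sketch
```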

Jesse Michael Han (Apr 26 2025 at 01:53):

Hey @Justin Asher, you might find this helpful: #Machine Learning for Theorem Proving > Serverless Pantograph on Morph Cloud @ 💬 - using the open-source template in that thread, it's quite easy to put a load balancer / orchestrator in front of our Pantograph wrapper and start from snapshots to handle incoming requests.

Justin Asher (Apr 26 2025 at 02:39):

@Jesse Michael Han This is super helpful! Thanks for letting me know. Using snapshots seems like the right approach here, given that you can quickly start VMs.


Last updated: May 02 2025 at 03:31 UTC