Zulip Chat Archive

Stream: Lean for Scientists and Engineers 2024

Topic: Project ideas

Rick de Wolf (Aug 15 2024 at 16:48):

Hi all,
This week there was some discussion at the end of the lecture about people wanting to work on some projects using their new Lean skills :big_smile:. This thread is a good place to discuss these topics together, to see what other people are interested in and maybe team up as well. Hopefully we'll have lots of interesting ideas!

Rick de Wolf (Aug 15 2024 at 16:51):

Personally I am very interested in topics related to probability theory. More specifically I'm interested in applications to information theory and complex networks. I haven't had a chance to look into what is available in Lean at the moment, but I wouldn't mind working on implementing more basic theorems or tactics.

If you're interested in this, I'd love to chat about it :big_smile:

Martin Dvořák (Aug 16 2024 at 08:33):

What about formalizing the definition if AIXI?
https://en.wikipedia.org/wiki/AIXI
Would it be a meaningful project?

Rick de Wolf (Aug 16 2024 at 09:51):

That's an ambitious project :big_smile: I have experience in reinforcement learning so I would be interested in seeing where this goes! Do you have any concrete ideas for a starting point?

Rick de Wolf (Aug 16 2024 at 09:52):

I'm not sure if it's meaningful to you :) I'm personally less interested in machine learning oriented approaches these days, but if you want to see ML constructs formalized in Lean I'd say go for it!

Tyler Josephson ⚛️ (Aug 16 2024 at 11:24):

We’re interested in Markov chain Monte Carlo and the Metropolis Hastings algorithm. Related is the field of statistical mechanics, which connects with probability and statistics.

Not sure how much these intersect with the above interests.

Rick de Wolf (Aug 16 2024 at 12:37):

I have some limited experience with MCMC algorithms, but I would potentially be interested in joining the effort :). What kind of results are you interested in formalizing?

Arshak Parsa (Aug 16 2024 at 20:19):

I'm a statistics student, and I'm formalizing an open source probability book. I have formalized the first chapter in Lean, and I'm currently working on the second chapter. I try not to use measure theory because the book is not based on measure theory, so I want to keep the theorems as similar as their latex version (that's why I don't use mathlib definitions on probability and instead I try to define them according to the book).
You can formalize any parts of the book if you are interested (You can fork the project and do it in your own way or pull request, I don't know:sweat_smile:). I'm not planning to add my theorems to mathlib, but if you found any interesting results, feel free to modify and add to mathlib.

Arshak Parsa (Aug 16 2024 at 20:21):

I believe there must be a separate library for statistics theorems (called statlib, for instance:sweat_smile:)

Arshak Parsa (Aug 16 2024 at 20:23):

I'm also interested in formalizing some concepts of mathematical statistics in lean like sufficiency, unbiasedness, Fisher–Neyman factorization theorem, etc.
It would be great to have a separate channel for statistics and probability here on zulip.

Arshak Parsa (Aug 16 2024 at 20:28):

And maybe a separate channel for lean4 related memes. I've made some memes already :sweat_smile:

Urja Kuppa (Aug 17 2024 at 18:41):

I'm interested in topics related to Data science. From what we have learned, I am thinking of using Lists and/or Structures as data structures for storing some data and to do something with it. But I have yet to figure how I'm going to use proofs for such a topic.
What ideas do you have for possible proofs or approaching the topic (Data science) in general in Lean?

Tyler Josephson ⚛️ (Aug 18 2024 at 01:01):

Rick de Wolf said:

I have some limited experience with MCMC algorithms, but I would potentially be interested in joining the effort :). What kind of results are you interested in formalizing?

We use MCMC (not in Lean) to simulate properties of molecules. On Tuesday, I'll be sharing about LeanMD; LeanMC would be related.

The original Monte Carlo simulation was actually for a chemistry application!
Metropolis53-EOS-with-fast-computing-machines.pdf

Tyler Josephson ⚛️ (Aug 18 2024 at 01:11):

Arshak Parsa said:

I believe there must be a separate library for statistics theorems (called statlib, for instance:sweat_smile:)

You can certainly build your own libraries, but I suspect that integrating the work with Mathlib would make your contribution much easier for others to use and build on.

That said, others have worked on formalizing whole textbooks - especially for a well-known textbook, having a parallel Lean version can be neat!

For example "How to Prove It" has a companion "How to Prove it in Lean."

Zulip discussion on that here.

Tyler Josephson ⚛️ (Aug 18 2024 at 01:15):

Arshak Parsa said:

I'm a statistics student, and I'm formalizing an open source probability book. I have formalized the first chapter in Lean, and I'm currently working on the second chapter. I try not to use measure theory because the book is not based on measure theory, so I want to keep the theorems as similar as their latex version (that's why I don't use mathlib definitions on probability and instead I try to define them according to the book).
You can formalize any parts of the book if you are interested (You can fork the project and do it in your own way or pull request, I don't know:sweat_smile:). I'm not planning to add my theorems to mathlib, but if you found any interesting results, feel free to modify and add to mathlib.

I don't know much about measure theory, but my impression is that this is the way (maybe the only way?) to properly ground statistics and probability in the axioms of mathematics. My guess is that it should be possible to express probability and statistics at a high level that is nonetheless grounded in the measure-theory in Mathlib, but in a way that abstracts away the tedious details so the user only interacts with familiar notions.

We really struggled with calculus proofs over the last couple years, because, of course, differentiation is built on limits, but these need to be filters to make them rigorous, and so you need to understand sets... For an engineer, that's a lot to learn that's not part of one's training - it should be possible to just prove things about derivatives! But then folks like @Tomas Skrivan build things like fun_prop which automate away the tedium, and it gets a lot easier to use.

I hope we see progress along these lines in probability and statistics, which in my view suffers from a similar situation with measure theory.

Tyler Josephson ⚛️ (Aug 18 2024 at 01:24):

Urja Kuppa said:

I'm interested in topics related to Data science. From what we have learned, I am thinking of using Lists and/or Structures as data structures for storing some data and to do something with it. But I have yet to figure how I'm going to use proofs for such a topic.
What ideas do you have for possible proofs or approaching the topic (Data science) in general in Lean?

This is a really cool idea! Could be a lot of possibilities. Hopefully the Tuesday lecture will give you some inspiration - we formalized linear regression.

Keep in mind the distinction between "proofs" and "calculations" that I made early on (Lecture 2, I think?). Lean proofs don't "solve for x" - if you want to do that, you most likely need to write the code for the algorithm that does (but Lean can help you prove that the algorithm is correct). Also, I think the more promising projects are those that can be successful without needing efficient code execution / processing large amounts of data. You'll see some of our preliminary results around that on Tuesday.

Tyler Josephson ⚛️ (Aug 18 2024 at 01:33):

Martin Dvořák said:

What about formalizing the definition if AIXI?
https://en.wikipedia.org/wiki/AIXI
Would it be a meaningful project?

Interesting! Just learning about this now. Looks connected to reinforcement learning, but with lots of theoretical guarantees - neat project! I wonder if Lean could both simulate the behavior of such an agent, as well as have proofs accompanying its definition.

mars0i (Aug 18 2024 at 04:42):

For the record there is at least one alternative proposed foundation for probability theory, Shafer and Vovk's game-theoretic foundations: http://www.probabilityandfinance.com/2019_book/index.html . (One can ignore the world "finance" in the title, which only describes some later chapters.) This book and its predecessor have had an influence in some circles. But measure theory is the standard foundation for probability theory. It's pervasive and imo, beautiful. Shafer and Vovk's way is very interesting, though, allowing some generalizations of probability that are perhaps less natural with a measure-theoretic foundation.

I wonder whether game-theoretic probability might more easily allow constructive definitions for probability. This is pure speculation on my part--I don't know enough to know or even to start to think it through. But I would guess that you have to do a lot of work in constructive math to get to basic theorems of measure theory before they can be applied to probability. Maybe game-theoretic probability would provide a more direct way of proving theorems in probability theory. However, measure theoretic foundations of probability are well-known to lots of people. Game-theoretic probability is not, and it's very different.

Rick de Wolf (Aug 18 2024 at 09:42):

@Tyler Josephson ⚛️ Formalizing linear regression sounds like a very interesting project. I'll be out of town during Tuesday's lecture but I'm looking forward to watching the recording!

Rick de Wolf (Aug 18 2024 at 09:55):

@mars0i That's an amazing find! I'll be adding this book to my reading list :smiley:

Rick de Wolf (Aug 18 2024 at 10:15):

It's great to see all of you being interested in topics related to probability theory, statistics, and data science. It seems like we each want to go in different directions, but I think it would be great to share progress with the others. We might be able to help each other out and learn from each other's projects since a lot of it has probability theory at its core.

Rick de Wolf (Aug 18 2024 at 10:21):

The project I am interested in is to formalize information theory based on the book Elements of Information Theory by Cover and Thomas. Information theory involves a lot of non-measure theoretic probability theory, so I hope to improve some of Lean's probability tools along the way :). Getting some feedback on existing probability and measure theory tools would be very nice :)

Eric Taucher (Aug 18 2024 at 17:53):

I am mostly replying because being new to a forum and not getting any reply is a bit of a let down.

Urja Kuppa said:

I'm interested in topics related to Data science. From what we have learned, I am thinking of using Lists and/or Structures as data structures for storing some data and to do something with it. But I have yet to figure how I'm going to use proofs for such a topic.

My advice would be to just try with a simple problem and then when Lean complains try asking an LLM such as ChatGPT or Claude for help and/or asking here. You might actually be able to do some simple practical code without creating new proofs because you did not create any new structure, etc. Remember that much of the existing functions and structures already have the needed proofs.

Urja Kuppa said:

What ideas do you have for possible proofs or approaching the topic (Data science) in general in Lean?

If that were my problem I would get some algorithms and translate them into Lean.

Dictionary of Algorithms and Data Structures
NIST Digital Library of Mathematical Functions
The Stony Brook Algorithm Repository
StackOverflow tag Language Agnostic - Contains Free Language Agnostic Programming Books

mars0i (Aug 19 2024 at 00:40):

I'm not trying to formalize anything at the moment, but a collaborator and I are exploring using dependently typed programming for writing simulations of some processes in biological evolution. The project is at an early stage, and I'm still learning dependently typed methods. At the moment what we've done is in Agda, but we have been planning to rewrite it in Lean.

Tyler Josephson ⚛️ (Aug 19 2024 at 01:59):

@Urja Kuppa I’ve been thinking about your question over the weekend:
Are you familiar with pandas in Python? What do you think about creating something like that in Lean? I don’t know if anyone has made a “DataFrame” data structure in Lean for storing tables of data along with strings for column names, but something like that could facilitate a lot of follow-on data science applications.

Tyler Josephson ⚛️ (Aug 19 2024 at 02:01):

The formalized linear regression we’ve implemented is also quite rudimentary - only for the y=m*x+b case, no generalization to multiple linear regression. That could be a project for someone that would be quite useful.

Tyler Josephson ⚛️ (Aug 19 2024 at 02:03):

Data visualization is a major part of data science. I don’t foresee that being a priority for doing in Lean since I don’t see much opportunities for proofs / formalization in that area, and there are plenty of external libraries for making aesthetic figures.

Tyler Josephson ⚛️ (Aug 19 2024 at 02:07):

But - it might be neat to have a figure appear in the infoview. And those making Lean for game-related applications implement various kinds of visualizations for Sudoku, Mancala, etc. - might be some prior work and people to talk to if that were a direction to go.

Last updated: May 02 2025 at 03:31 UTC