Zulip Chat Archive
Stream: new members
Topic: generic shannon entropy formalization mathlib4
Semar Augusto (Jan 27 2026 at 23:48):
Hello! I’ve been learning Lean for a few months and I’m starting to formalize some basic concepts from information theory. As a first step, I’m defining a generic Shannon entropy for PMF, since the existing file Mathlib/Analysis/SpecialFunctions/BinaryEntropy.lean focuses on Bernoulli distributions.
My current definition is:
public import Mathlib.Topology.Algebra.InfiniteSum.Basic
public import Mathlib.Topology.Algebra.InfiniteSum.ENNReal
public import Mathlib.Probability.ProbabilityMassFunction.Basic
public import Mathlib.Probability.ProbabilityMassFunction.Constructions
public import Mathlib.Probability.ProbabilityMassFunction.Monad
public import Mathlib.Analysis.SpecialFunctions.Log.NegMulLog
public import Mathlib.Analysis.SpecialFunctions.BinaryEntropy
public import Mathlib.Probability.Distributions.Uniform
public section
open scoped Finset MeasureTheory NNReal ENNReal
namespace InformationTheory
variable {α : Type*}
noncomputable def entropy (p : PMF α) : ℝ :=
  ∑' a : α, Real.negMulLog (ENNReal.toReal (p a))
I can prove that the entropy of a PMF is nonnegative:
lemma entropy_nonneg (p : PMF α) : 0 ≤ entropy p := by
  simp only [entropy]
  refine tsum_nonneg fun a => ?_
  refine Real.negMulLog_nonneg ?_ ?_
  · exact ENNReal.toReal_nonneg
  · refine ENNReal.toReal_le_of_le_ofReal ?_ ?_
    · exact zero_le_one' ℝ
    · simp_all only [ENNReal.ofReal_one]
      exact PMF.coe_le_one p a
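On paper, this is just the pointwise bound that the proof threads through Real.negMulLog_nonneg, applied termwise under tsum_nonneg:

```latex
0 \le x \le 1 \;\implies\; \log x \le 0 \;\implies\; -x \log x \ge 0,
```

with the convention that the summand vanishes at x = 0.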
Now I'm trying to write a sanity check for the uniform distribution on {0, 1} as an example:
example : entropy (PMF.uniformOfFinset ({0, 1} : Finset ℕ) (by simp) : PMF ℕ) = Real.log 2 := by
  classical
  simp only [entropy, PMF.uniformOfFinset]
  -- We compute: - (1/2) * log(1/2) - (1/2) * log(1/2) = - log(1/2) = log 2
  have h1 : Real.negMulLog ((2⁻¹ : ENNReal).toReal) = Real.log 2 / 2 := by
    simp [ENNReal.toReal_inv, ENNReal.toReal_ofNat]
    simp [Real.negMulLog]
    ring_nf
  have h2 : (fun a : ℕ => Real.negMulLog (ENNReal.toReal (if a = 0 ∨ a = 1 then (2⁻¹ : ENNReal) else 0)))
      = (fun a : ℕ => (if a = 0 then Real.log 2 / 2 else 0) + (if a = 1 then Real.log 2 / 2 else 0)) := by
    funext a
    aesop
  simp [h2]
  sorry
At the sorry, I end up with the following goal:
⊢ ∑' (a : ℕ), ((if a = 0 then Real.log 2 / 2 else 0) + if a = 1 then Real.log 2 / 2 else 0) = Real.log 2
At this point, I expected to be able to split the sum using tsum_add, but tsum_add is an unknown identifier in my file. I can see and use:
#check ENNReal.tsum_add
-- ENNReal.tsum_add.{u_1} {α : Type u_1} {f g : α → ℝ≥0∞} : ∑' (a : α), (f a + g a) = ∑' (a : α), f a + ∑' (a : α), g a
but this only applies to functions valued in ℝ≥0∞, whereas my goal is in ℝ.
My questions are:
- Design question: since Shannon entropy is nonnegative, would it be more idiomatic in mathlib to define entropy : PMF α → ℝ≥0∞ instead of ℝ, and only coerce to ℝ when needed? This would allow direct use of lemmas like ENNReal.tsum_add and might avoid some of the friction I'm seeing. On the other hand, the usual analytic definition involves Real.log and Real.negMulLog, so I'm not sure what the preferred approach is in mathlib.
- What is the correct lemma (and import) for splitting a tsum of ℝ-valued functions? Is there an ℝ analogue of ENNReal.tsum_add that I should be using, possibly under a different name or requiring extra hypotheses?
- If such a lemma is not readily available in this context, what is the idiomatic mathlib way to handle a finitely-supported tsum like this?
Aaron Liu (Jan 27 2026 at 23:50):
@loogle tsum, _ + _
loogle (Jan 27 2026 at 23:50):
:search: tsum_bool, Summable.tsum_add, and 117 more
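(For reference, the ℝ-valued analogue that loogle found is the same statement as the ENNReal version, but with summability side conditions; a sketch of the two signatures as they appear in current Mathlib:)

```lean
#check @ENNReal.tsum_add
-- ∑' a, (f a + g a) = ∑' a, f a + ∑' a, g a
-- (no hypotheses: sums in ℝ≥0∞ always exist)

#check @Summable.tsum_add
-- Summable f → Summable g →
--   ∑' a, (f a + g a) = ∑' a, f a + ∑' a, g a
-- (in ℝ, each summand must be summable, or tsum defaults to 0)
```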
Aaron Liu (Jan 27 2026 at 23:51):
Semar Augusto (Jan 27 2026 at 23:58):
Ok, I'll try it out! thank you!
Peter Kosenko (Jan 28 2026 at 19:26):
Here's my take on your proof, inspired by my attempts to work with PMF measures in my project:
example : entropy (PMF.uniformOfFinset ({0, 1} : Finset ℕ) (by simp) : PMF ℕ) = Real.log 2 := by
  classical
  simp only [entropy, PMF.uniformOfFinset]
  -- We compute: - (1/2) * log(1/2) - (1/2) * log(1/2) = - log(1/2) = log 2
  have h1 : Real.negMulLog ((2⁻¹ : ENNReal).toReal) = Real.log 2 / 2 := by
    simp [ENNReal.toReal_inv, ENNReal.toReal_ofNat]
    simp [Real.negMulLog]
    ring_nf
  have h2 : (fun a : ℕ => Real.negMulLog (ENNReal.toReal (if a = 0 ∨ a = 1 then (2⁻¹ : ENNReal) else 0)))
      = (fun a : ℕ => (if a = 0 then Real.log 2 / 2 else 0) + (if a = 1 then Real.log 2 / 2 else 0)) := by
    funext a
    aesop
  simp [h2]
  have hsumm1 : HasSum (fun (n : ℕ) => (if n = 0 then Real.log 2 / 2 else 0)) (Real.log 2 / 2) := by
    apply hasSum_ite_eq 0
  have hsumm2 : HasSum (fun (n : ℕ) => (if n = 1 then Real.log 2 / 2 else 0)) (Real.log 2 / 2) := by
    apply hasSum_ite_eq 1
  rw [Summable.tsum_add (HasSum.summable hsumm1) (HasSum.summable hsumm2)]
  simp only [tsum_ite_eq, add_halves]
Peter Kosenko (Jan 28 2026 at 19:27):
I've found that there are somehow more lemmas available for HasSum than for Summable.
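(A few of the HasSum lemmas that make this style work, as named in current Mathlib; a HasSum hypothesis pins down the value of the sum, so both the summability side condition and the evaluation come for free:)

```lean
#check @HasSum.summable  -- HasSum f a → Summable f
#check @HasSum.tsum_eq   -- HasSum f a → ∑' i, f i = a
#check @HasSum.add       -- HasSum f a → HasSum g b →
                         --   HasSum (fun i => f i + g i) (a + b)
```

In the proof above, one could plausibly also finish with (hsumm1.add hsumm2).tsum_eq followed by add_halves, skipping the Summable.tsum_add rewrite entirely.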
Semar Augusto (Jan 28 2026 at 21:44):
I was able to close the proof myself, but this is a LOT cleaner. I really like the HasSum approach and I'll try to replicate it in my other proofs as well. Very helpful, thanks!
Adomas Baliuka (Feb 14 2026 at 00:08):
Since this is called a "generic formalization", I'd like to continue the discussion here on how we want to formalize generalizations. I'm very happy to see https://github.com/leanprover-community/mathlib4/pull/34773 and am personally in favor of merging the definition using PMF.
Looking ahead, I see two directions for generalization:
- Application beyond PMF: One could define differential entropy as KL-divergence to Lebesgue measure. This is a generalization of Shannon entropy, though apparently a flawed one. One should maybe instead go for the limiting density of discrete points? Though I must admit that I don't understand some of the subtleties there and how to best formalize it (I'm eager to help with proving/writing some lemmas though when we get there).
Another question is whether one should view the various quantum information theory notions as generalizations/extensions, or whether only physicists care about those and they have no place in Mathlib at all?
- Rényi entropy, conditional entropy, mutual information: How do we want to define these, and at what level of generality? If we define all of them for PMFs now, will it be easy to generalize without breaking the API?
Yongxi Lin (Aaron) (Feb 14 2026 at 02:42):
I don't know a lot of things about entropy, but I think whenever one encounters a definition in probability one should define it measure theoretically in Mathlib. After you have a definition for a general (probability) measure, you can then show some nice things happen in the discrete case.
Rémy Degenne (Feb 14 2026 at 07:57):
For the more general Measure case we already have InformationTheory.klDiv (the Kullback-Leibler divergence). In the finite case, on a type of cardinality N, the entropy is H(P) = log N - KL(P, uniform), and that equality could be used to reuse proofs about KL.
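(Spelled out, writing U for the uniform distribution on the N-element type, the identity follows from a two-line computation:)

```latex
\mathrm{KL}(P \,\|\, U)
  = \sum_{a} p(a) \log \frac{p(a)}{1/N}
  = \sum_{a} p(a) \log p(a) + \log N
  = \log N - H(P),
```

so H(P) = log N - KL(P, U).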
It also makes sense to define Rényi divergence and mutual information on Measure (and they could both be defined very easily using the KL). I don't know about the discrete case for those, I never work with that.
Rémy Degenne (Feb 14 2026 at 08:38):
But although we have klDiv, we don't yet have two of the most important properties of the KL divergence: the chain rule and the data processing inequality. The chain rule is in #34841 (depends on #35014 and #35015), and the proof I have in mind for the data processing inequality will need Jensen's inequality for the conditional expectation (#27953) as well as the results of #35089.
Adomas Baliuka (Feb 14 2026 at 13:16):
@Rémy Degenne Is there something there someone like me might be able to help with? Proving some lemmas, etc.?
Adomas Baliuka (Feb 14 2026 at 13:20):
Yongxi Lin (Aaron) said:
I don't know a lot of things about entropy, but I think whenever one encounters a definition in probability one should define it measure theoretically in Mathlib. After you have a definition for a general (probability) measure, you can then show some nice things happen in the discrete case.
I agree but the implications for Shannon entropy aren't very clear. Do we use differential entropy as KL-divergence to Lebesgue measure, or something like Limiting density of discrete points, which I personally don't know how to define for general measures, (but would likely work in terms of KL-divergence)?
Frédéric Dupuis (Feb 14 2026 at 14:44):
Rémy Degenne said:
But although we have klDiv, we don't yet have two of the most important properties of the KL divergence: the chain rule and the data processing inequality. The chain rule is in #34841 (depends on #35014 and #35015), and the proof I have in mind for the data processing inequality will need Jensen's inequality for the conditional expectation (#27953) as well as the results of #35089.
Wasn't the data processing inequality in your project with @Lorenzo Luccioli ?
Rémy Degenne (Feb 14 2026 at 15:08):
Yes, it was (link to the project: https://github.com/RemyDegenne/testing-lower-bounds/); three proofs of it, in fact.
The only one we had fully sorry-free was the one you can find for example in the book by Polyanskiy and Wu, which had the downside of working only in spaces such that Radon-Nikodym derivatives of kernels exist (standard Borel spaces for example).
The second one, which we had up to one annoying sorry in an integration by part lemma, was the proof that can be found for example in this paper of Liese and Vajda which writes the KL as an integral of risks in Bayesian testing experiments. This second proof works in any measurable space, but it was not fully sorry-free. Also the definition of KL in Mathlib is not the one we had in that project (we changed some type choices about Real vs ENNReal vs EReal when moving to Mathlib).
The third proof uses conditional Jensen (which is a sorry in our project, but is now in a Mathlib PR by Yongxi Lin (Aaron) ) and is more direct, and follows the arguments from Csiszar's 1963 paper Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Many of the things needed for that proof are already proved in our project and the PRs I opened recently that I mentioned above are mostly just me adapting code from it (but I also changed the chain rule proof a bit to avoid a standard Borel space assumption).
So the data processing inequality is mostly just waiting for Jensen for conditional expectations to reach Mathlib. I just meant that it was not yet in Mathlib. I was not suggesting that somebody works on it, sorry if that was confusing.
Alex Meiburg (Feb 19 2026 at 05:55):
Adomas Baliuka said:
whether one should view the various quantum information theory notions as generalizations/extensions, or whether only physicists care about those and they have no place in Mathlib at all?
Rényi entropy, conditional entropy, mutual information
We have several of these in progress at http://github.com/Timeroot/Lean-QuantumInfo. If you're interested in that kind of stuff, come drop in #Quantum information > channel events :)
Last updated: Feb 28 2026 at 14:05 UTC