Chain rule for the Kullback-Leibler divergence #
Suppose that we have two finite joint measures on a product 𝓧 × 𝓨, which can be decomposed as
μ ⊗ₘ κ and ν ⊗ₘ η, where μ and ν are measures on 𝓧 and κ and η are Markov kernels
from 𝓧 to 𝓨. Then we can express the Kullback-Leibler divergence between these two joint
measures as a sum of klDiv μ ν and the conditional Kullback-Leibler divergence between the kernels
κ and η, averaged over μ. The resulting equality is most often written as
klDiv (μ ⊗ₘ κ) (ν ⊗ₘ η) = klDiv μ ν + μ[fun x ↦ klDiv (κ x) (η x)].
Here we first prove the following version:
klDiv (μ ⊗ₘ κ) (ν ⊗ₘ η) = klDiv μ ν + klDiv (μ ⊗ₘ κ) (μ ⊗ₘ η).
This version avoids the issue of measurability of the function x ↦ klDiv (κ x) (η x), which is not
always guaranteed, and thus holds for all measurable spaces 𝓧 and 𝓨, without any further
assumptions on those spaces.
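As a hedged Lean sketch (the typeclass assumptions below are illustrative guesses, not necessarily those of the actual statement), the version proved here reads:

```lean
-- Sketch only: the instance arguments are illustrative assumptions.
example {𝓧 𝓨 : Type*} [MeasurableSpace 𝓧] [MeasurableSpace 𝓨]
    (μ ν : Measure 𝓧) (κ η : Kernel 𝓧 𝓨)
    [IsFiniteMeasure μ] [IsFiniteMeasure ν] [IsFiniteKernel κ] [IsFiniteKernel η] :
    klDiv (μ ⊗ₘ κ) (ν ⊗ₘ η) = klDiv μ ν + klDiv (μ ⊗ₘ κ) (μ ⊗ₘ η) := by
  sorry
```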
Main statements #
* klDiv_compProd_eq_add: klDiv (μ ⊗ₘ κ) (ν ⊗ₘ η) = klDiv μ ν + klDiv (μ ⊗ₘ κ) (μ ⊗ₘ η)
* klDiv_compProd_left: klDiv (μ ⊗ₘ κ) (ν ⊗ₘ κ) = klDiv μ ν
Proof #
The main ingredient is the chain rule for Radon-Nikodym derivatives:
∂(μ ⊗ₘ κ)/∂(ν ⊗ₘ η) = ∂μ/∂ν * ∂(μ ⊗ₘ κ)/∂(μ ⊗ₘ η).
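Stated as a hedged Lean sketch (the choice of a.e. filter and the finiteness hypotheses are our assumptions about what is needed):

```lean
-- Sketch only: the equality is stated ν ⊗ₘ η-almost everywhere; the
-- hypotheses are illustrative assumptions.
example {𝓧 𝓨 : Type*} [MeasurableSpace 𝓧] [MeasurableSpace 𝓨]
    (μ ν : Measure 𝓧) (κ η : Kernel 𝓧 𝓨)
    [IsFiniteMeasure μ] [IsFiniteMeasure ν] [IsFiniteKernel κ] [IsFiniteKernel η] :
    (∂(μ ⊗ₘ κ)/∂(ν ⊗ₘ η)) =ᵐ[ν ⊗ₘ η]
      fun p ↦ (∂μ/∂ν) p.1 * (∂(μ ⊗ₘ κ)/∂(μ ⊗ₘ η)) p := by
  sorry
```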
Then, omitting edge cases, the Kullback-Leibler divergence is an integral of a logarithm of the
derivative on the left, which decomposes into a sum of two integrals of logarithms.
We now give a more detailed outline of the proof.
The Kullback-Leibler divergence klDiv μ ν is defined with an if-then-else statement:
if the measures are absolutely continuous (μ ≪ ν) and the log-likelihood ratio llr μ ν is
integrable, then it is defined as ∫ x, llr μ ν x ∂μ + ν.real univ - μ.real univ, otherwise
it is defined to be ∞.
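The shape of that definition can be sketched as follows; klDiv' is a hypothetical stand-in name, and the codomain EReal is our assumption, chosen only so that ∞ is available:

```lean
open scoped Classical in
-- Sketch only: `klDiv'` and its codomain are illustrative, not the actual
-- definition.
noncomputable def klDiv' {𝓧 : Type*} [MeasurableSpace 𝓧] (μ ν : Measure 𝓧) : EReal :=
  if μ ≪ ν ∧ Integrable (llr μ ν) μ
  then ((∫ x, llr μ ν x ∂μ + ν.real Set.univ - μ.real Set.univ : ℝ) : EReal)
  else ⊤
```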
We first deal with the case in which absolute continuity does not hold. The main observation is
that μ ⊗ₘ κ ≪ ν ⊗ₘ η ↔ μ ≪ ν ∧ μ ⊗ₘ κ ≪ μ ⊗ₘ η, which means that if one of the two sides of the
KL equality is infinite because of lack of absolute continuity, then the other side is also infinite
for the same reason.
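The observation can be sketched in Lean as follows (hypotheses illustrative):

```lean
-- Sketch only: absolute continuity of composition-products decomposes into
-- absolute continuity on the first space plus one between composition-products
-- with the same first component.
example {𝓧 𝓨 : Type*} [MeasurableSpace 𝓧] [MeasurableSpace 𝓨]
    (μ ν : Measure 𝓧) (κ η : Kernel 𝓧 𝓨)
    [IsFiniteMeasure μ] [IsFiniteMeasure ν] [IsFiniteKernel κ] [IsFiniteKernel η] :
    μ ⊗ₘ κ ≪ ν ⊗ₘ η ↔ μ ≪ ν ∧ μ ⊗ₘ κ ≪ μ ⊗ₘ η := by
  sorry
```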
Then, we deal with the case in which absolute continuity holds but integrability does not. Again,
we can show a similar equivalence for integrability, which allows us to conclude that both sides
are infinite: Integrable (llr (μ ⊗ₘ κ) (ν ⊗ₘ η)) (μ ⊗ₘ κ) is equivalent to
Integrable (llr μ ν) μ ∧ Integrable (llr (μ ⊗ₘ κ) (μ ⊗ₘ η)) (μ ⊗ₘ κ).
The proof of this equivalence relies on the convexity of the function x ↦ x * log x.
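The integrability equivalence can be sketched as follows (the absolute-continuity hypothesis and the instance arguments are our assumptions about what is needed):

```lean
-- Sketch only: hypotheses are illustrative assumptions.
example {𝓧 𝓨 : Type*} [MeasurableSpace 𝓧] [MeasurableSpace 𝓨]
    (μ ν : Measure 𝓧) (κ η : Kernel 𝓧 𝓨)
    [IsFiniteMeasure μ] [IsFiniteMeasure ν] [IsFiniteKernel κ] [IsFiniteKernel η]
    (h : μ ⊗ₘ κ ≪ ν ⊗ₘ η) :
    Integrable (llr (μ ⊗ₘ κ) (ν ⊗ₘ η)) (μ ⊗ₘ κ) ↔
      Integrable (llr μ ν) μ ∧ Integrable (llr (μ ⊗ₘ κ) (μ ⊗ₘ η)) (μ ⊗ₘ κ) := by
  sorry
```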
Finally, we prove the equality in the case in which both absolute continuity and integrability hold.
In that case, klDiv μ ν = ∫ x, llr μ ν x ∂μ + ν.real univ - μ.real univ and similarly for
the other terms. It is easy to see that it suffices to prove the equality of the integral parts.
The computation for the integral of the log-likelihood ratio then proceeds as follows:
∫ p, llr (μ ⊗ₘ κ) (ν ⊗ₘ η) p ∂(μ ⊗ₘ κ)
_ = ∫ p, ((∂μ ⊗ₘ κ/∂ν ⊗ₘ η) p).toReal * log ((∂μ ⊗ₘ κ/∂ν ⊗ₘ η) p).toReal ∂(ν ⊗ₘ η)
_ = ∫ p, ((∂μ ⊗ₘ κ/∂ν ⊗ₘ η) p).toReal *
(log ((∂μ/∂ν) p.1).toReal + log ((∂μ ⊗ₘ κ/∂μ ⊗ₘ η) p).toReal) ∂(ν ⊗ₘ η)
_ = ∫ p, (log ((∂μ/∂ν) p.1).toReal + log ((∂μ ⊗ₘ κ/∂μ ⊗ₘ η) p).toReal) ∂(μ ⊗ₘ κ)
_ = ∫ p, log ((∂μ/∂ν) p.1).toReal ∂(μ ⊗ₘ κ) + ∫ p, log ((∂μ ⊗ₘ κ/∂μ ⊗ₘ η) p).toReal ∂(μ ⊗ₘ κ)
_ = ∫ a, llr μ ν a ∂μ + ∫ p, llr (μ ⊗ₘ κ) (μ ⊗ₘ η) p ∂(μ ⊗ₘ κ)
TODO #
Add a version of the chain rule for the integral form of the conditional KL divergence, i.e.
μ[fun x ↦ klDiv (κ x) (η x)].
If the log-likelihood ratio between two composition-products is integrable, then so is the log-likelihood ratio between the two measures on the first space.
Chain rule for the integral of the log-likelihood ratio, under absolute continuity and integrability assumptions.
Chain rule for the Kullback-Leibler divergence, with conditional KL expressed using composition-products. This version holds without any assumption on the measurable spaces.