Zulip Chat Archive

Stream: mathlib4

Topic: thoughts on gradient


Tomas Skrivan (Sep 25 2023 at 21:04):

I'm working on automation for computing derivatives and I saw that there is ongoing work on gradients. It might be a good idea to start a discussion on what I have learned, how my work can be aligned with mathlib and how to reduce duplication of work.

I have found mathlib's definition of derivative fderiv and of adjoint/inner product spaces hard to work with. At some point, I ran into the issue of proving fderiv ℝ f = fderiv ℝ f and after a long time I realized that the function f : E → ι → F is getting two different norms, one from E → ι → F and one from E → PiLp 2 (fun _ : ι => F). Figuring this out was quite difficult as PiLp 2 (fun _ : ι => F) is defeq to ι → F and it completely disappeared from the types just the norm instance was different.

I find the situation with PiLp and ProdLp(which does not even exists) really bad from usability perspective. Imagine you write bunch of code with normal arrows and products and then realize that you want to compute its derivative. Suddenly you have to rewrite everything in terms of PiLp and ProdLp in order to be able to compute the gradient.

I propose two things

  • use Gateaux derivative as the canonical derivative
  • define adjoint on a wider class of spaces

The aim is to make gradient not depend on the norm and allow computations with normal function and product types.


Gateaux derivative

Gateaux derivative is complete generalization of Fréchet derivative. If Fréchet derivative exists then it is equal to Gateaux derivative. The canonical notion of differentiability can still be that Fréchet derivative exists. Using Gateaux derivative has the advantage that it does not depend on the norm as it can be defined on arbitrary locally convex topological vector space. This allow us to introduce relaxed notions of differentiability, such as smooth maps are those map that preserve smooth curves, for which chain rule holds and thus we can still do computations.


Generalization of adjoint

Here I will sketch how can we generalize the definition of adjoint to spaces that do not have proper inner product. I'm taking inspiration from theory of distributions, relevant wiki: https://en.wikipedia.org/wiki/Distribution_(mathematics)#Preliminaries:_Transpose_of_a_linear_operator.

If we have a continuous linear map between test functions A:D(U)D(U)A : \mathcal{D}(U) \rightarrow \mathcal{D}(U) then there exists its transpose At:D(U)D(U)A^t : \mathcal{D}'(U) \rightarrow \mathcal{D}'(U) which maps distributions to distributions. We can ask if there is a continuous linear map A:C(U)C(Y)A^\dagger : C^\infty(U) \rightarrow C^\infty(Y) such that

fC(U),At(Df)=DA(f)\forall f \in C^\infty(U) , A^t(D_f) = D_{A^\dagger(f)}

where DfD_f is distribution corresponding to the function ff.

I'm proposing to define adjoint of A:C(U)C(U)A : C^\infty(U) \rightarrow C^\infty(U) as follows: if AA maps test functions to test functions and there exists map A:C(U)C(U)A^\dagger : C^\infty(U) \rightarrow C^\infty(U) such that the above holds then AA^\dagger is adjoint of AA. Under this definition, the adjoint of derivative is negative derivative (ff)=(ff)\left( f \mapsto f' \right)^\dagger = \left( f \mapsto -f' \right). On the other hand not every continous linear map has adjoint, the adjoint of constant map would be ((x:R)((y:R)x))=(ff(x)dx)\left( (x : \mathbb{R}) \mapsto \left( (y : \mathbb{R}) \mapsto x \right) \right)^\dagger = \left( f \mapsto \int_{-\infty}^{\infty} f(x) \,dx\right) but that is not well defined map as the integral does not have to exists for arbitrary fC(R)f \in C^\infty(\mathbb{R}).

In general, this requires to equip a generic topological vector space X with a linear subspace 𝒟 X of "test functions" and duality/inner product ⟨·,·⟩ : X → 𝒟 X → ℝ which should be bilinear. Furthermore, it should satisfy analog of fundamental lemma of calculus of variation ∀ (x : X), (∀ φ : 𝒟 X, ⟨x,φ⟩ = 0) → x = 0. Continuous linear map A : X → Y has adjoint A† : Y → X if it preserves test functions, x ∈ 𝒟 X → A x ∈ 𝒟 Y, and satisfies ∀ (y : Y) (x ∈ 𝒟 X), ⟨A† y, x⟩ = ⟨y, A x⟩.

For example, we can define adjoint for maps between (ℤ → ℝ)

(fun (a :   ) i => a (i+1) - a i)
=
fun (a :   ) i => - (a i) - a (i-1)

which was a calculation I encountered when doing some signal processing.


With these two changes we can define gradient that does not depend on the norm and allows us to do gradient computations without using PiLp or ProdLp which I find problematic from usability and automation perspective. I would be curious to hear opinion if this would be too disruptive change to mathlib or not. I'm coming mainly from the point of automatizing derivative computations and don't have much clue about intricate proofs involving derivatives and adjoints.

Heather Macbeth (Sep 25 2023 at 23:38):

Tomas Skrivan said:

I have found mathlib's definition of derivative fderiv and of adjoint/inner product spaces hard to work with. At some point, I ran into the issue of proving fderiv ℝ f = fderiv ℝ f and after a long time I realized that the function f : E → ι → F is getting two different norms, one from E → ι → F and one from E → PiLp 2 (fun _ : ι => F). Figuring this out was quite difficult as PiLp 2 (fun _ : ι => F) is defeq to ι → F and it completely disappeared from the types just the norm instance was different.

@Tomas Skrivan It'd be interesting to look at some examples, if you have them. Is it possible that you are doing proofs in co-ordinates that should be done more abstractly?

Tomas Skrivan (Sep 25 2023 at 23:44):

My main motivation is to do code transformation, so I'm trying to differentiate any kind of code I might encounter. This means that I do calculations in coordinates and there is not much I can do about it. I have a bunch of tests here https://github.com/lecopivo/SciLean/blob/master/test/basic_gradients.lean (the notation ⊞ i => dx i is like lambda function but it creates an array)

Heather Macbeth (Sep 25 2023 at 23:44):

Tomas Skrivan said:

Imagine you write bunch of code with normal arrows and products and then realize that you want to compute its derivative. Suddenly you have to rewrite everything in terms of PiLp and ProdLp in order to be able to compute the gradient.

I propose two things

  • use Gateaux derivative as the canonical derivative
  • define adjoint on a wider class of spaces

Another solution I've wondered about here is to entirely dispense with a canonical metric/norm on ι → F and instead let the user put it in with local instance where needed.

Heather Macbeth (Sep 25 2023 at 23:47):

Tomas Skrivan said:

I propose two things

  • use Gateaux derivative as the canonical derivative
  • define adjoint on a wider class of spaces

The aim is to make gradient not depend on the norm and allow computations with normal function and product types.

There's a part I don't follow, sorry -- how would this make the gradient independent of the norm structure? (What is your definition of gradient in a non-Hilbert setting?)

Tomas Skrivan (Sep 25 2023 at 23:51):

Gradient is adjoint of differential as usual. I propose a generalized notion of adjoint to spaces that have almost inner product but not quite e.g. smooth functions on reals or arbitrary sequences. The price to pay is that not every continuous linear map has adjoint.

Tomas Skrivan (Sep 25 2023 at 23:55):

(deleted)

Tomas Skrivan (Sep 25 2023 at 23:56):

Tomas Skrivan said:

Heather Macbeth said:
Another solution I've wondered about here is to entirely dispense with a canonical metric/norm on ι → F and instead let the user put it in with local instance where needed.

I think that would solve my issues with automation too.

Eric Wieser (Sep 25 2023 at 23:57):

Isn't there a proposal somewhere to generalize fderiv to a TVS instead of a normed space?

Eric Wieser (Sep 25 2023 at 23:57):

I think this would make the PiLp problem vanish

Eric Wieser (Sep 25 2023 at 23:58):

(and it's necessary to make it reasonable to do anything about derivatives of matrices)

Tomas Skrivan (Sep 26 2023 at 00:02):

I don't think that would solve the issue with gradients though.

Heather Macbeth (Sep 26 2023 at 00:03):

To set some notation, consider a function ERE\to \mathbb{R} and a point x:Ex:E, so that the Frechet derivative is a continuous linear map ERE\to \mathbb{R}. The proposal in
https://leanprover.zulipchat.com/#narrow/stream/287929-mathlib4/topic/define.20gradient.20of.20scalar-valued.20function.20on.20a.20Hilbert.20space
is not to define
Tomas Skrivan said:

Gradient is adjoint of differential

(that would make the gradient a linear map RE\mathbb{R}\to E), but rather to define the gradient to be an element of EE directly (by the Frechet-Riesz isomorphism with the dual).

Tomas Skrivan (Sep 26 2023 at 00:08):

I'm sorry, I was too slopy. Gradient is adjoint of differential and apply 1 to it. f(x)=(vf(x;v))1\nabla f (x) = \left(v \mapsto f'(x;v) \right)^\dagger 1. That is the same as Frechet-Riesz isomorphism, no?

Tomas Skrivan (Sep 26 2023 at 00:10):

(In my Lean code I just detect during elaboration whether the codomain is scalar or not and plug in one or not, so in my head I treat both objects as "gradient")

Heather Macbeth (Sep 26 2023 at 00:10):

But if you want to do this in general, you need to fix an identification of EE^* (i.e., linear maps ERE\to\mathbb{R}) with EE -- which is the same as a nondegenerate bilinear form on EE, which is getting pretty close to an inner product.

Tomas Skrivan (Sep 26 2023 at 00:16):

Tomas Skrivan said:

In general, this requires to equip a generic topological vector space X with a linear subspace 𝒟 X of "test functions" and duality/inner product ⟨·,·⟩ : X → 𝒟 X → ℝ which should be bilinear. Furthermore, it should satisfy analog of fundamental lemma of calculus of variation ∀ (x : X), (∀ φ : 𝒟 X, ⟨x,φ⟩ = 0) → x = 0. Continuous linear map A : X → Y has adjoint A† : Y → X if it preserves test functions, x ∈ 𝒟 X → A x ∈ 𝒟 Y, and satisfies ∀ (y : Y) (x ∈ 𝒟 X), ⟨A† y, x⟩ = ⟨y, A x⟩.

Well this is the part that outlines how to do it in general, the diversion is that you do not have non-degenerate bilinear form EERE \rightarrow E \rightarrow \mathbb{R} but only bilinear form EDERE \rightarrow \mathcal{D}_E \rightarrow \mathbb{R} where DE\mathcal{D}_E is generalized space of test functions. For example DC(U)=Cc(U)\mathcal{D}_{C^\infty(U)} = C^\infty_c(U) or DNR=N0R\mathcal{D}_{\mathbb{N} \rightarrow \mathbb{R}}=\mathbb{N} \rightarrow_0 \mathbb{R}

Heather Macbeth (Sep 26 2023 at 00:17):

I think our notation is different, my EE is e.g. Rn\mathbb{R}^n.

Tomas Skrivan (Sep 26 2023 at 00:22):

I don't think so :) where am I unclear?

Heather Macbeth (Sep 26 2023 at 00:26):

It seems to me that you are taking e.g. E=C(U,R)E = C^\infty(U,\mathbb{R}) for URnU\subseteq \mathbb{R}^n?

Heather Macbeth (Sep 26 2023 at 00:28):

Tomas Skrivan said:

For example, we can define adjoint for maps between (ℤ → ℝ)

(fun (a :   ) i => a (i+1) - a i)
=
fun (a :   ) i => - (a i) - a (i-1)

which was a calculation I encountered when doing some signal processing.

On a slightly different note (maybe this should be split into a separate conversation), have you tried writing this as a calculation in 2(Z,R)\ell^2(\mathbb{Z}, \mathbb{R})? (docs#lp)

Heather Macbeth (Sep 26 2023 at 00:30):

(Because 2\ell^2 is a Hilbert space so this is perfectly well-formed according to current mathlib definitions.)

Tomas Skrivan (Sep 26 2023 at 00:30):

Ohh sorry, maybe I should have written E=C(R,R)E=C^\infty(\mathbb{R}, \mathbb{R}) which is a space that does not really have a sensible notion of inner product but I want to claim that it still makes sense to define adjoint for certain linear maps f:EEf : E \rightarrow E

Tomas Skrivan (Sep 26 2023 at 00:32):

Heather Macbeth said:

Tomas Skrivan said:

For example, we can define adjoint for maps between (ℤ → ℝ)

(fun (a :   ) i => a (i+1) - a i)
=
fun (a :   ) i => - (a i) - a (i-1)

which was a calculation I encountered when doing some signal processing.

On a slightly different note (maybe this should be split into a separate conversation), have you tried writing this as a calculation in 2(Z,R)\ell^2(\mathbb{Z}, \mathbb{R})? (docs#lp)

What if I want to do such calculation/give a meaning to such calculation to sequences that do not have finite sum of squares?

Heather Macbeth (Sep 26 2023 at 00:33):

Sorry, you mentioned N0R\mathbb{N}\rightarrow_0 \mathbb{R} before
Tomas Skrivan said:

For example DC(U)=Cc(U)\mathcal{D}_{C^\infty(U)} = C^\infty_c(U) or DNR=N0R\mathcal{D}_{\mathbb{N} \rightarrow \mathbb{R}}=\mathbb{N} \rightarrow_0 \mathbb{R}

so I assumed you were already restricting to finite-support sequences.

Tomas Skrivan (Sep 26 2023 at 00:34):

In particular reasoning about all these integrability conditions is really tricky/hard to automate but completely irrelevant to calculations.

Tomas Skrivan (Sep 26 2023 at 00:41):

For sequences, I want to define adjoint for function fun (a : ℤ → ℝ) i => a (i+1) - a i of type (ℤ → ℝ)→ (ℤ → ℝ) there is no sensible inner product on the whole (ℤ → ℝ) but there is the obvious bilinear function B : (ℤ → ℝ)→ (ℤ →₀ ℝ) → ℝ, B := fun x y => ∑ i, x i * y i, and I'm claiming that this is sufficient to define adjoint.

Heather Macbeth (Sep 26 2023 at 01:44):

So I think I'm still a bit puzzled by your slogan
Tomas Skrivan said:

Gradient is adjoint of differential and apply 1 to it.

Isn't the adjoint of the Frechet derivative the divergence? (Speaking strictly formally, and making enough Frechet-Riesz identifications of a Hilbert space with its dual.). And, as we were discussing above, the current proposal
https://leanprover.zulipchat.com/#narrow/stream/287929-mathlib4/topic/define.20gradient.20of.20scalar-valued.20function.20on.20a.20Hilbert.20space
for defining the gradient does not involve adjoints.

Heather Macbeth (Sep 26 2023 at 02:02):

Maybe what you're trying to say is that you want the gradient to be defined in a setting where, instead of an inner product on the domain, you have a slightly weaker thing -- a pairing that works only on a dense subset. But in the case of Rn\mathbb{R}^n a dense subset is still all of it.

Tomas Skrivan (Sep 26 2023 at 02:08):

For any Hilbert space the dense subset is all of it.

Tomas Skrivan (Sep 26 2023 at 02:11):

But the proposed gradient depends on (toDual 𝕜 F).symm which is the Frechet-Riesz identification which can be equivalently written using adjoint, no? i.e. (toDual 𝕜 F).symm = fun f => f† 1

Tomas Skrivan (Sep 26 2023 at 02:25):

Heather Macbeth said:

So I think I'm still a bit puzzled by your slogan
Tomas Skrivan said:

Gradient is adjoint of differential and apply 1 to it.

Isn't the adjoint of the Frechet derivative the divergence?

That would be talking about different arrows.

fderiv : (X  Y)  (X  X  Y)

grad : (X  )  (X  X)
grad f x := (fderiv f x) 1

div : (X  X)  (X  )
div f := grad f

Heather Macbeth (Sep 26 2023 at 02:42):

Tomas Skrivan said:

But the proposed gradient depends on (toDual 𝕜 F).symm which is the Frechet-Riesz identification which can be equivalently written using adjoint, no? i.e. (toDual 𝕜 F).symm = fun f => f† 1

In the mathlib construction it goes the other way: the adjoint depends on (is defined in terms of) the Frechet-Riesz identification!

Heather Macbeth (Sep 26 2023 at 02:51):

So, to rephrase my question: you are proposing a generalization of the gradient to a setting when the domain has slightly-less-than-an-inner-product, but the case where you want to use it is domain Rn\mathbb{R}^n which already has an inner product ... why do you want this generalization?

Tomas Skrivan (Sep 26 2023 at 02:58):

This way I can define this slightly-less-then-inner-product on arrow and product type and keep the infinity norm on these spaces as it is. No need to use PiLp or ProdLp to compute gradients. The point is that slightly-less-then-inner-product does not induce norm and this won't cause type class instance clash with already existing norms. As a bonus this generalization allows to formally state div = grad† which is quite useful in some physics calculations.

Heather Macbeth (Sep 26 2023 at 03:11):

So if I understand correctly, in the case of Rn\mathbb{R}^n this is a point about typeclasses rather than a point about mathematics: you could get the same effect (in that case) by having the (existing, non-generalized) [InnerProductSpace] typeclass not imply a norm?

Tomas Skrivan (Sep 26 2023 at 03:16):

Yes, now I'm realizing that I have done a very bad job at separating these two issues from each other. Sorry about that

Kevin Buzzard (Sep 26 2023 at 08:55):

Talking to other people about your ideas is a great way of organising them!

Eric Wieser (Sep 26 2023 at 09:12):

Heather Macbeth said:

You could get the same effect (in that case) by having the (existing, non-generalized) [InnerProductSpace] typeclass not imply a norm?

Is the point here that there's no reason to consider the L2 norm as canonical because there are other norms, but there's only one sensible inner product?

Tomas Skrivan (Sep 26 2023 at 13:01):

I think that is fair reasoning but I came to the conclusion more from the practical side when dealing with PiLp and ProdLp got really painful.


Last updated: Dec 20 2023 at 11:08 UTC