Zulip Chat Archive

Stream: mathlib4

Topic: License choice for mathlib 4


Adam Kurkiewicz (Aug 16 2023 at 06:58):

Mathlib 4 is released under Apache 2.0:

https://github.com/leanprover-community/mathlib4/blob/master/LICENSE

Is this a conscious choice and is the community developing mathlib comfortable with the consequences?

In a world, where formalisation of pure mathematics is of purely academic interest I suppose the license choice doesn't matter, but it does in a world where there is interest from the industry (e.g. competing in the IMO Grand Challenge).

I did a little bit of digging, and it doesn't seem like this was discussed before, and the original license choice for the previous version of Mathlib seems to have been made 6 years ago -- when the community was much smaller than what it is today.

Ruben Van de Velde (Aug 16 2023 at 07:03):

It's hardly likely that we'd be able to change it at this point anyway

Eric Wieser (Aug 16 2023 at 07:26):

I don't think we've ever had any complaints that Apache was not permissive enough

Eric Wieser (Aug 16 2023 at 07:26):

I think the driving reason for the choice was to match the licensing of Lean itself, which I assume was a decision made by MSR

Eric Wieser (Aug 16 2023 at 07:27):

(or possibly even that our hand was forced because mathlib grew out of the Apache-licensed core)

Scott Morrison (Aug 16 2023 at 07:48):

Note some previous discussion: https://leanprover.zulipchat.com/#narrow/stream/113488-general/topic/Can.20part.20of.20mathlib.20be.20public.20domain.3F/near/321101940

Mario Carneiro (Aug 16 2023 at 08:13):

Is this a conscious choice and is the community developing mathlib comfortable with the consequences?

Could you be more specific about what kind of consequences you are thinking about? Are you concerned with the license being too permissive or not permissive enough?

Adam Kurkiewicz (Aug 16 2023 at 08:32):

Personally I'm not concerned -- I haven't made any contributions to Mathlib -- so in a sense I don't have a horse in this race. But I think Apache being too permissive is a plausible concern here. I would say a scenario where a big company (e.g. Google, Facebook, OpenAI, etc...) creates a large corpus of "hidden" theorems as benchmark for their LLMs is likely. These theorems would link to Mathlib (which I believe would be permissible under Apache).

I suppose the question is, if contributors to Mathlib would be comfortable with this hypothetical scenario.

@Ruben Van de Velde re changing the license. You can usually fairly easily change from a permissive license. It's the other direction that's difficult (although not impossible -- I've been involved in a project where we contacted a bunch of maintainers of multiple projects to get a permission to include their GPLv2-licensed work in a GPLv3-licensed work).

It's fundamentally a question about "Freedom zero". I don't think there are easy answers here...

Scott Morrison (Aug 16 2023 at 08:33):

I for one very much hope that some big companies make large corpora of theorems that refer to Mathlib. :-)

Johan Commelin (Aug 16 2023 at 08:38):

Well, if that big company will then create closed source software, and start bribing universities to pay for licences, and charge students for using the software, etc...

Adam Kurkiewicz (Aug 16 2023 at 08:38):

I suppose the point is that they don't have to share -- they can just maintain them internally under proprietary license -- as a "competitive advantage" vs other companies.

Johan Commelin (Aug 16 2023 at 08:38):

I'm not a big fan of the state of the art of CAS at universities.

Ruben Van de Velde (Aug 16 2023 at 08:44):

Thinking in another direction: there seems to be some interest in using lean for software verification - would this make life difficult if you wanted to use mathlib in verifying properties of more permissively licensed code?

Also, what would be the impact on std4?

Adam Kurkiewicz (Aug 16 2023 at 08:47):

@Johan Commelin -- I think this is actually unlikely -- when DeepMind created AlphaZero (Chess and Go) and AlphaStar (Starcraft II), they didn't commercialise the technology. It was just a proprietary research project (you never get to see the code, you get to see a paper with description of the approach they took/ architecture of the neural network).

There are good reasons for this. If you're Google/Facebook you need your project to have a TAM (total addressable market) to be in excess of $10b USD to "move the needle" for the parent co.

This is why they only started commercialising AlphaFold -- because drugging a previously undruggable target can easily give you a TAM of $10b USD and more (you can read about "KRAS race", which a recent example here).

I don't think that selling CAS to universities is big enough market for the really big companies to care. But it could be market for smaller companies like WolframAlpha, etc... But these smaller companies don't typically have access to a really big capital, so what they can achieve is typically much less than the big guys.

Adam Kurkiewicz (Aug 16 2023 at 08:53):

@Ruben Van de Velde open source licensing is actually a surprisingly difficult field. I didn't realise before I started working with open source lawyers.

Typically some of the "bright lines" are as follows:

  • static or dynamic linking to a restrictively licensed project means you have to release under a restrictive license
  • other forms of communication (e.g. via network) are sufficiently "separated" that you don't have to release under restrictive license (exception are e.g. Affero GPL, which was designed specifically to counter this point)
  • using output of e.g. a restrictively licensed compiler is OK in a permissively licensed project

But there's lots of exceptions.

In your case there shouldn't be any impact on std4, as I believe std4 doesn't depend on Mathlib (dependency goes the other way?)

Yaël Dillies (Aug 16 2023 at 08:55):

... except that std4 is partially growing out of both mathlib and mathlib4 via upstreaming of definitions and lemmas

Adam Kurkiewicz (Aug 16 2023 at 08:58):

@Yaël Dillies well in this case changing Mathlib to a more restrictive license would indeed cause problems with permissive licensing of std4.

Adam Kurkiewicz (Aug 16 2023 at 09:02):

@Johan Commelin , but I think fundamentally you're right, "medium-sized" companies targeting academia as a market and using Mathlib under the hood is also a valid potential concern.

I just think it's distinct from the "big company research project" potential concern I shared earlier.

Eric Wieser (Aug 16 2023 at 09:04):

Johan Commelin said:

Well, if that big company will then create closed source software, and start bribing universities to pay for licences, and charge students for using the software, etc...

If mathlib is part of the interface to that software (rather than hidden internals) I don't see this working out while mathlib is still actively developed, because it's super important to be on a recent version.

Eric Wieser (Aug 16 2023 at 09:05):

And in the unlikely event this starts looking like a serious issue, we can re-license to trap the company on an old version.

Adam Kurkiewicz (Aug 16 2023 at 09:10):

@Scott Morrison -- thanks for linking to the previous thread. I might be making an assumption here, but I suppose licensing of Lean/Mathlib (e.g. coordinating the community, hiring open source lawyers, etc...) will become the responsibility of Lean FRO?

Eric Wieser (Aug 16 2023 at 09:11):

I would guess the Lean FRO has little interest in the cost of relicensing given Leo was probably involved in the original choice of license; of course I don't speak for the FRO!

Scott Morrison (Aug 16 2023 at 09:23):

As far as I'm aware the FRO is happy with current licensing. Certainly the FRO wants industrial users to be able to use Lean!

Yakov Pechersky (Aug 16 2023 at 12:57):

Re AlphaFold2, I am at the ACS meeting right now, and John Jumper gave one of the plenary talks. I don't see the direct connection between releasing AF2 and commercialization. The connection to the KRAS race is quite tenuous here - we've had structures of RAS proteins with bound ligands for ages. RAS is one the most well studied proteins out there.

And to the point of open source, AF2 participated in CASP, which is the protein structure science equivalent of the IMO Grand Challenge. Based on the ideas of AF2, Baker released RosettaFold in response, others released their own open source implementations of the same ideas... And DeepMind has collaborated with central non profits in the field to release more and more of their predictions such that 200 millions structures are now accessible through EMBL EBI.

Its not the code that's the magic sauce, it's the high quality data that the models are trained on (requiring in house curation of data) and the capital and resource access to be able to actually train the model. Note, they release the code, not the parameters (even that's changed). Similarly, LLaMa code is available, not the parameters.

Junyan Xu (Aug 16 2023 at 15:57):

Isomorphic Labs is a London-based drug discovery company, which uses artificial intelligence for drug discovery. Isomorphic Labs was incorporated in February 2021 and announced on November 5, 2021. It was established under Alphabet Inc. as a spin-off from its AI research lab Deepmind. It is led by Demis Hassabis.

The company draws upon Deepmind's AlphaFold 2 technology, which can be used to predict protein structures in the human body with high accuracy, allowing its researchers to find new target pathways for drug delivery.

Mario Carneiro (Aug 16 2023 at 15:57):

Adam Kurkiewicz said:

Personally I'm not concerned -- I haven't made any contributions to Mathlib -- so in a sense I don't have a horse in this race. But I think Apache being too permissive is a plausible concern here. I would say a scenario where a big company (e.g. Google, Facebook, OpenAI, etc...) creates a large corpus of "hidden" theorems as benchmark for their LLMs is likely. These theorems would link to Mathlib (which I believe would be permissible under Apache).

I suppose the question is, if contributors to Mathlib would be comfortable with this hypothetical scenario.

Okay, I understand better now what your concern is. Yes, this is a deliberate choice, Mathlib is intended to be usable in closed source scenarios, this was kind of a required part for MS and now Amazon to support this project through Leo and related funding. And for my own part, I think this is important - if I were to make a choice unconstrained by anything else I would choose PD or CC0, I think mathematics should be free in the most literal sense I can make it, and that includes being free to be remixed in arbitrary ways and incorporated in things that are not free.

Eric Wieser (Aug 16 2023 at 16:34):

Mario, I'm curious what makes CC0 or PD more desirable to you than MIT / Apache / BSD, as the later seem to be the typical choice for open source.

Mario Carneiro (Aug 16 2023 at 16:35):

I don't want there to be any requirements on users of the code, and BSD still has e.g. attribution-required

Mario Carneiro (Aug 16 2023 at 16:39):

By the way, the metamath set.mm library (which I have also been involved with for a long time) is public domain

Eric Wieser (Aug 16 2023 at 16:39):

"Attribution required" (in the form of citing papers) is pretty par for the course in academic mathematics, right?

Mario Carneiro (Aug 16 2023 at 16:40):

Sure, but academics does not use the copyright courts for enforcement

Mario Carneiro (Aug 16 2023 at 16:40):

I certainly want people to cite me when they use my work, but I don't need to use a legal document to enforce that

Mario Carneiro (Aug 16 2023 at 16:42):

By the way, even if something is public domain if you treat it as your own work without attribution this is plagiarism

Mario Carneiro (Aug 16 2023 at 16:46):

Mario Carneiro said:

Sure, but academics does not use the copyright courts for enforcement

And thinking about it now, that actually sounds kind of terrifying, like what if you could copyright a theorem, and anyone who wants to use the theorem has to check the license to make sure they don't have to pay a royalty, write the theorem reference in a way that affects the document under construction, or other things like that

Yakov Pechersky (Aug 16 2023 at 16:49):

Here's a relevant issue about licensing changes in an open source project. https://opentf.org/

Junyan Xu (Aug 16 2023 at 17:37):

A recent example is Hugging Face's text generation inference license change, which happened however after only a few months of development with dozens of contributors.

HFOIL is not a true open source license because we added a restriction: to sell a hosted or managed service built on top of TGI, we now require a separate agreement.

Yannic Kilcher (a contributor) has expressed his worries:
https://twitter.com/ykilcher/status/1685303009629069312

Anne Baanen (Aug 24 2023 at 10:04):

Some more previous discussions about licensing:


Last updated: Dec 20 2023 at 11:08 UTC