Zulip Chat Archive

Stream: Equational

Topic: Authorship, Security, and Red Teaming Project vulnerabil...

Shreyas Srinivas (Oct 05 2024 at 19:27):

Terence Tao said:

As long as the CI time is still on the order of minutes instead of hours, I'd say it's worth it to have the extra guarantee. (Note to future self reading this thread: when writing the final paper, make a note of the extra layers of checking we have beyond the core Lean compiler, for instance we also have a separate check that no extra axioms are being inserted (equational#242).)

It might be good to "red team" and brainstorm for any other potential ways in which this project could be compromised, either by a glitchy AI or by some mischevious human.

I think this message still belongs in this thread.

Shreyas Srinivas (Oct 05 2024 at 20:37):

It might help to set the baseline expectation : Assuming that Lean has a bg-free sound kernel, in theory the build + external checkers are all we need.

In practice any Cybersecurity issue anywhere in the GitHub infrastructure can become a source of bugs. Further human errors in setting up of the CI, lack of proper PR reviews, lack of checks to look for suspicious modifications, especially to the CI, can upset the apple cart.

Shreyas Srinivas (Oct 05 2024 at 20:38):

Further, one relatively underexplored aspect of large scale collaborations is publication authorship

Shreyas Srinivas (Oct 05 2024 at 20:39):

If a 100 people collaborate on a project, at least in my CS brain, that's 100 authors on the paper. Now traditionally authorship means making technical contributions, writing something, and taking responsibility for what is written in the paper, but this makes less sense for larger projects.

Shreyas Srinivas (Oct 05 2024 at 20:41):

It just doesn't scale.

Shreyas Srinivas (Oct 05 2024 at 20:42):

No one author can possibly be sure that every contribution is 100% kosher.

Shreyas Srinivas (Oct 05 2024 at 20:45):

In this project, there are PRs that add upwards of 100,000 lines of code. I can only be sure that they have been properly included and build. But scale it up to 500 people and an army of maintainers and suddenly even that is impossible for any individual.

Shreyas Srinivas (Oct 05 2024 at 20:46):

For example, what if one maintainer lets in lean code which has not been imported in the top level file and so never gets checked by CI

Shreyas Srinivas (Oct 05 2024 at 20:49):

This is just a long winded way of saying that even simple problems become extremely challenging at scale

Terence Tao (Oct 05 2024 at 21:32):

I guess it is about time to start discussing authorship. In the Polymath projects, we ended up selecting a pseudonym "D.H.J. Polymath" to author the published papers, with a link to a page listing the participants (which was collected by voluntary self-selection from the contributors' part) and their affiliations and grant support, although in one or two cases the journals did not permit pseudonymous authorship and we had to revert back to selecting some list of primary authors.

I was leaning towards a similar model for this project, having the report authored by the "Equational Theories Project" as a whole, and have some listing somewhere of participants. I guess there could be various levels of participants though - perhaps not everyone who contributed would wish to vouch for the accuracy of the report (which I presume we would collaboratively write, either on the same github repository, or on some other platform such as Overleaf or Dropbox). So perhaps only a volunteer subset of the participants would end up authoring the report. There is some precedent for this, for instance Neil Sloane has a number of articles (e.g., https://www.ams.org/journals/notices/201809/rnoti-p1062.pdf http://neilsloane.com/doc/HIS50.pdf ) about the OEIS, without listing every single OEIS contributor as co-author.

Terence Tao (Oct 05 2024 at 21:37):

Certainly I would imagine that a discussion of scaling issues would be an important part of any report on this project, and some of it would be presenting the issue as an unsolved problem rather than asserting that our platform is the perfect solution for all scales and all use cases.

Shreyas Srinivas (Oct 05 2024 at 21:47):

Much of the software infrastructure we are using is already known. What I think needs to be conveyed are "how are we approaching the contribution process in a way that enhances trust, and respects the effort of contributors. How the automation works. How we designed the project to work as automatically as possible, how we adapted to automated provers, etc. although people on this Zulip already know this, the broader math community might be interested in an explanation of how "trust" works.

Mario Carneiro (Oct 05 2024 at 21:47):

I recall a talk about how the "D.H.J Polymath" thing was a bad idea, as it ended up in practice centralizing the attention on the big names leading the project

Mario Carneiro (Oct 05 2024 at 21:48):

I don't see a huge difference between putting an author like that and just not crediting people

Mario Carneiro (Oct 05 2024 at 21:48):

the listing was apparently also not a very permanent record, not everything from the blog is still available

Shreyas Srinivas (Oct 05 2024 at 21:49):

Also, in computer science, we give authorships a bit broadly. Any technical contribution that goes beyond reviewing the paper and making broad suggestions generally merits authorship. A lot of algorithms theory students get their first papers by proving a few lemmas. An implementation is also a contribution.

Shreyas Srinivas (Oct 05 2024 at 21:51):

About trust: This is not just about trusting Lean's kernel or adding an environment checker. It is also about how the contribution process is managed. How PRs ought to be reviewed, what sort of human errors can mess things up etc.

Terence Tao (Oct 05 2024 at 21:51):

Yeah that is a problem, we don't have a permanent hosting for the polymath wiki. One could go down the route of the sciences with lead authors, secondary authors, etc.; I am wholly unfamiliar with that culture though, and am vaguely uncomfortable about it, though perhaps I can get used to it.

Mario Carneiro (Oct 05 2024 at 21:51):

The mathlib paper had a similar dilemma, but at least the real author list was given in an appendix rather than an external source that can bitrot

Shreyas Srinivas (Oct 05 2024 at 21:52):

In electrical engineering papers, we have to describe each author's contributions briefly these days

Mario Carneiro (Oct 05 2024 at 21:52):

I think if mathematics is actually doing big projects with many people, then it should have papers with many people too, that's just reflecting the reality of the situation

Mario Carneiro (Oct 05 2024 at 21:53):

certainly it's a change from the status quo, but so is interactive theorem proving

Terence Tao (Oct 05 2024 at 21:54):

Okay, so the proposed model would be a paper authored by a large number of people (listed under their real names), with an appendix describing each author's affiliation, grant support, and contribution to the project (including, for instance, whether they actively helped author the report)? That sounds reasonable to me.

Pietro Monticone (Oct 05 2024 at 21:55):

Another best practice for big, collaborative, open-source projects is including a code of conduct.

We can easily generate one from here, but the project manager / repository owner should decide on the best contact method.

The link to the #Equational channel could be enough. Maybe with an additional note "Feel free to contact the repository owner in direct messages" or something like that.

Terence Tao (Oct 05 2024 at 21:56):

One drawback by the way from the Polymath model was that many contributors (including myself) were not able to formally include the Polymath papers in their list of papers for internal university review purposes due to authorship rules at their home institution.

Mario Carneiro (Oct 05 2024 at 21:57):

and of course this issue is even more devastating for early-career researchers

Terence Tao (Oct 05 2024 at 22:00):

The standard code of conduct linked to sounds fine to me, and to have both myself (as repository owner) and the #Equational channel as points of contact.

Shreyas Srinivas (Oct 05 2024 at 22:01):

The institutional challenges in recognising contributions to such projects is another challenge.

Shreyas Srinivas (Oct 05 2024 at 22:02):

Currently academic hiring doesn't seem badly affected by the trend to use AI filters on resumes (and google scholar profiles) at least in theoretical fields.

Shreyas Srinivas (Oct 05 2024 at 22:03):

But the fact that the best record of a person's publications are their orcid or google scholar means that contributions will have to be recognised more broadly, not only on the project side, but also on the institutional side.

Shreyas Srinivas (Oct 05 2024 at 22:04):

and any hurdles that archaic publication criterion add to this is detrimental to collaborations on a large scale.

Pietro Monticone (Oct 05 2024 at 22:08):

Terence Tao said:

The standard code of conduct linked to sounds fine to me, and to have both myself (as repository owner) and the #Equational channel as points of contact.

Done https://github.com/teorth/equational_theories/pull/338

Shreyas Srinivas (Oct 05 2024 at 22:12):

The code of conduct sets up a puzzle, one that I hope never arises, but one that we should have a way of handling. Let us say an individual is permanently banned from the project. What happens to their contributions? Are the contributions expunged? Is that even tractable? Or are the contributions kept and the individual still gets authorship which requires them to interact with other authors, which might put off other authors from wanting to join in on the publication.

Terence Tao (Oct 05 2024 at 22:17):

If we permit papers authored by a subset of the participants, with an appendix listing the contributions of the participants, then I could imagine the uncomfortable, but perhaps viable, solution of having the banned contributor not listed as one of the paper authors, and not permitted to participate in the paper writing process, but be listed in the appendix as a "former" contributor. This doesn't solve the problem of what to do with the intellectual property though. I selected an Apache 2.0 licence for this project, but am not expert in the finer details of this licence, and am not sure to what extent contributors retain any claim to their contributions in such a scenario.

Terence Tao (Oct 05 2024 at 22:36):

For want of a better idea I proposed this scenario to GPT. It said that legally, contributions to an Apache 2.0 licence were irrevocable (which agrees with my untrained reading of the licence) and so there were no legal issues with continuing to use data from a former contributor, but that from an ethical or reputational standpoint there would be concerns, and recommended that in such situations the project make some announcement clarifying their policy on handling such contributions going forward. Hopefully we never have to deal with such a hypothetical scenario in the first place, but I guess I did ask for "red teaming" so this is a valid point to discuss.

Terence Tao (Oct 05 2024 at 22:41):

My initial reaction to this hypothetical would be: if the contributions were severable from the rest of the project with little or medium effort, then one should probably do so, but if the contributions became integral to large portions of the project, to the point where severability is infeasible, then we would have to continue using the contributions, but announce that we are doing so and explain our rationale. But I think each incident would have to be discussed on this forum on a case by case basis (and I sincerely hope and expect that the number of incidents that arise in this manner is 0).

Terence Tao (Oct 05 2024 at 22:47):

While on the topic of hopefully completely hypothetical scenarios: if for some reason I become incapacitated and am no longer able to work on this project, I explicitly grant permission for the project to be forked and continued under different ownership.

Shreyas Srinivas (Oct 05 2024 at 23:08):

I'll create a separate thread for these messages and keep this thread clean. I will also ensure that your reminder message stays in this thread in some form

Notification Bot (Oct 05 2024 at 23:10):

38 messages were moved here from #Equational > Thoughts and impressions thread by Shreyas Srinivas.

Amir Livne Bar-on (Oct 06 2024 at 05:07):

There was an interesting article about authorship models a while ago
https://elifesciences.org/labs/b86daa1d/author-contributions-recognising-researchers-for-the-work-they-do

Especially the film-credits one feels like something we could draw inspiration from.

Terence Tao (Oct 06 2024 at 05:35):

That does sound appealing to me. Many of the 14 standard CRediT categories listed in https://docs.google.com/document/d/1aJxrQXYHW5U6By3KEAHrx1Iho6ioeh3ohNsRMwsoGPM/edit#heading=h.bkwyjpjjzcq0 seem to be reasonably applicable for this project.

Daniel Weber (Oct 06 2024 at 07:11):

Shreyas Srinivas said:

For example, what if one maintainer lets in lean code which has not been imported in the top level file and so never gets checked by CI

We should have CI making sure all files are imported in equational_theories, like in Mathlib (perhaps slightly more sophisticated, allowing files in Generated to be only imported transitively through Generated.lean)

Eric Wieser (Oct 06 2024 at 08:18):

Or use .submodules in the lakefile so that no such top-level file is required and everything is always built

Shreyas Srinivas (Oct 06 2024 at 09:28):

Terence Tao said:

That does sound appealing to me. Many of the 14 standard CRediT categories listed in https://docs.google.com/document/d/1aJxrQXYHW5U6By3KEAHrx1Iho6ioeh3ohNsRMwsoGPM/edit#heading=h.bkwyjpjjzcq0 seem to be reasonably applicable for this project.

I would suggest that any authorship system that deviates too far from the standard form is a bit risky for young researchers.

Shreyas Srinivas (Oct 06 2024 at 09:30):

Eric Wieser said:

Or use .submodules in the lakefile so that no such top-level file is required and everything is always built

Upto the top level this is fine, but does it help with similar problems deeper in the hierarchy?

Anyway, the reason I brought this up is to illustrate how human error can complicate matters

Shreyas Srinivas (Oct 06 2024 at 11:14):

Shreyas Srinivas said:

Terence Tao said:

That does sound appealing to me. Many of the 14 standard CRediT categories listed in https://docs.google.com/document/d/1aJxrQXYHW5U6By3KEAHrx1Iho6ioeh3ohNsRMwsoGPM/edit#heading=h.bkwyjpjjzcq0 seem to be reasonably applicable for this project.

I would suggest that any authorship system that deviates too far from the standard form is a bit risky for young researchers.

One issue here is, many people's names will appear in multiple places. This might clutter up the name list and result in the same situation as polymath, where non famous authors are ignored.

Terence Tao (Oct 06 2024 at 15:04):

What about having a standard alphabetic listing for the "official" author list, but a film credits-like listing of multiple subsets of the authors in an appendix?

Shreyas Srinivas (Oct 06 2024 at 15:19):

Yeah that sounds better

Terence Tao (Oct 06 2024 at 17:40):

I just started a Wiki page at https://github.com/teorth/equational_theories/wiki/Plan-of-paper to begin planning out the paper arising from this project - this is still several weeks away at least from being a reality, but I guess it is worth starting a discussion on it now. I threw together some possible section headings and possible topics, but perhaps there is more to put in there (or some stuff to take out or move to a separate location). There's actually quite a lot to talk about, I think! Anyway, we can use this thread to discuss the structure of the paper, and I will try to keep this wiki document updated with the latest consensus.

Shreyas Srinivas (Oct 06 2024 at 17:55):

I can volunteer to (co)write the project management section. I see it as analogous to an experience report in some verification conferences, but we also need to explain our contribution workflow and the design considerations behind it. The other natural co-author for this section is @Pietro Monticone

Shreyas Srinivas (Oct 06 2024 at 17:57):

Although these tools are already used in software engineering, a lot of the subtleties need to be explained to the math audience. Further human factors of trust and use of external checkers will also have to be explained.

Terence Tao (Oct 06 2024 at 17:59):

OK, I've made a note on this. By the way, what interactivity is there on this wiki - am I the only one that can make or propose edits? Is there some permissions I can grant to give wiki edit privileges to potential contributors?

Pietro Monticone (Oct 06 2024 at 18:03):

@Terence Tao, at the moment it's restricted to collaborators, but you can go to Settings > Features > Wikis and remove the restriction.

Screenshot 2024-10-06 at 20.01.41.png

As far as I know GitHub doesn't allows to restrict Wikis to contributors defined as users with at least a merged PR.

Shreyas Srinivas (Oct 06 2024 at 18:10):

Removing that restriction might allow third parties to spam right?

Shreyas Srinivas (Oct 06 2024 at 18:10):

Meaning people not connected to the project in any way or bot accounts?

Pietro Monticone (Oct 06 2024 at 18:11):

Yes, that's exactly the problem unfortunately.

Daniel Weber (Oct 06 2024 at 18:17):

Would something like Mathlib, where there are many collaborators but pushing to main is restricted, work?

Terence Tao (Oct 06 2024 at 18:22):

Hmm, it seems that my free version of Github might not even have the Wiki management options in the first place. That's fine, we can use some other platform such as Overleaf to coordinate the actual writing, and have the wiki page just for the planned outline.

Pietro Monticone (Oct 06 2024 at 18:25):

Daniel Weber said:

Would something like Mathlib, where there are many collaborators but pushing to main is restricted, work?

For we would need the Admin, Maintain, Write and Read roles that are available for organisation accounts and enterprise individual accounts as far as I know.

Andrés Goens (Oct 06 2024 at 18:28):

@Terence Tao FWIW, you can probably get a pro version of github for free for having an academic affiliation https://github.com/education/teachers

Andrés Goens (Oct 06 2024 at 18:28):

(could still be useful even if not just for the wiki)

Terence Tao (Oct 06 2024 at 18:30):

I'll keep in mind. I think this project is close to the upper bound of what can be reasonably managed with a non-enterprise individual Github account; if it significantly expands, or if I (perhaps foolishly) attempt to launch an even larger project, I will definitely look into enterprise options (at this point I think I would seek funding to hire an administrator to handle all the backend issues).

Terence Tao (Oct 06 2024 at 18:34):

I am running this project with $0 funding currently. My thought was to have this be the "small" pilot project that one could then use as a model for myself or others to propose larger projects that may require actual paid labor and services (and compute). It became larger and faster than I had imagined, but I think still manageable for now with an all-volunteer system. But probably not if it scales up another 2x or so.

Bryan Gin-ge Chen (Oct 06 2024 at 18:41):

It should be possible to create a free organization on GitHub and then transfer the Equational Project repository to it to get the specialized roles.

This of course comes with some wrangling of organization membership and roles and such, but if you want to organize code / papers into separate repos at some point in the future it may be worth the effort. Note that even the leanprover-community organization has been able to get by on the free plan so far; so far as I can tell the difference between paid and free is that private repos are somewhat limited in the free version.

Terence Tao (Oct 06 2024 at 18:50):

Would this lead to any disruption on the contributor end (e.g., a new URL for the repository)? If it is seamless to transition, I could try this, but if it would mean everybody has to download a new repo or something then I'm inclined to wait until something is actually close to breaking to execute this.

Terence Tao (Oct 06 2024 at 19:00):

Looks like there could be some non-trivial disruption and/or technical issues in transferring. I think we will just muddle through with the current setup, but I've added a note in the wiki that for future projects, planning an organizational structure (and setting up the repository appropriately) in advance of the launch may be advisable.

Shreyas Srinivas (Oct 06 2024 at 19:13):

GitHub projects can be used to manage multiple repositories

Terence Tao (Oct 06 2024 at 19:14):

Ah, so we could potentially do the writing in a separate repo managed by the same project, with many collaborators. That could work

Bryan Gin-ge Chen (Oct 06 2024 at 19:18):

Regarding disruption: when you transfer a repo, github sets up redirects so that pushing / pulling will work with the old URLs, but unfortunately it seems that getting Github pages to redirect properly requires some more work (see e.g. these instructions).

Terence Tao (Oct 06 2024 at 19:20):

Yeah I think the option of setting up a second repo for writing if necessary, but leaving the (currently perfectly functional) repo for the scientific portion of the project untouched seems like the best solution.

Last updated: May 02 2025 at 03:31 UTC