Zulip Chat Archive

Stream: Zulip meta

Topic: Many archive links are broken


Malcolm Langfield (Aug 22 2023 at 13:42):

A great many of the links to specific topics on the Zulip chat archive are broken. In particular, all of the ones whose topic starts with the little checkmark seem to 404, I believe because of how the leading checkmark is escaped in the URL. A possibly related GitHub issue exists here.

Do these pages exist on the archive site anywhere? If so, does anyone know how to patch the URL so it goes through? Alternatively, is it possible for an ordinary user to regenerate the archive with a fixed version of the script in the GitHub link?

Example: https://leanprover-community.github.io/archive/stream/113488-general/topic/.E2.9C.94.20.60rw.20.5Bset.2Epreimage_comp.5D.60.20sees.20.60id.60.20everywhere.html
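
To see what the escaped topic name in that URL stands for, here is a minimal Python sketch. It assumes the slug is an ordinary percent-encoding with every "%" swapped for "." (and literal dots written as ".2E"), which is what the example URL appears to use. The function names are made up for illustration, and the trailing ".html" has to be stripped before decoding.

```python
from urllib.parse import quote, unquote


def decode_topic_slug(slug: str) -> str:
    """Undo the assumed dot-for-percent escaping seen in the archive URL."""
    # Under this assumption literal dots appear as ".2E", so every "."
    # in the slug marks the start of an escaped byte.
    return unquote(slug.replace(".", "%"))


def encode_topic_slug(topic: str) -> str:
    """Inverse of decode_topic_slug, under the same assumption."""
    escaped = quote(topic, safe="")        # percent-encode reserved characters
    escaped = escaped.replace(".", "%2E")  # quote() leaves "." alone, so escape it here
    return escaped.replace("%", ".")


# Last path component of the example URL, with ".html" removed.
slug = ".E2.9C.94.20.60rw.20.5Bset.2Epreimage_comp.5D.60.20sees.20.60id.60.20everywhere"
topic = decode_topic_slug(slug)
print(topic)
```

The leading ".E2.9C.94" is the UTF-8 encoding of the checkmark (U+2714), so any step that treats the slug's first characters specially would affect exactly the resolved topics.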

Scott Morrison (Aug 22 2023 at 23:46):

My limited understanding of this archive is that since Zulip allowed us to make streams web-public, the archive has been abandoned.

Eric Wieser (Aug 23 2023 at 05:44):

The archive cron job has been abandoned, but it's still potentially useful for Google indexing

Eric Wieser (Aug 23 2023 at 05:44):

It's easy to run on request through the GitHub UI

Malcolm Langfield (Aug 23 2023 at 17:06):

It is extremely useful for Google indexing, and also for searching Zulip more precisely offline (the html pages are easily converted to greppable text). Here's one vote for bringing the cron job back, hah.

As an example:

site:leanprover-community.github.io "refine"
site:leanprover.zulipchat.com "refine"

These two google queries give vastly different results (and in fact, the second one gives none).

Who's in charge of that?
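
The HTML-to-greppable-text conversion mentioned above can be sketched with the Python standard library alone. This is only an illustration: the class name, the helper, and the sample markup are invented, and the real archive pages may call for different handling.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    """Return one text fragment per line, suitable for grep."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Invented sample markup, not taken from the archive.
sample = (
    "<html><head><style>p{}</style></head>"
    "<body><h4>Eric</h4><p>use <code>refine</code> here</p></body></html>"
)
text = html_to_text(sample)
print(text)
```

Running this over each downloaded page and writing the result to a .txt file would give a plain-text tree that ordinary grep can search.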

Eric Wieser (Aug 24 2023 at 15:48):

It's not quite as simple as "bringing the cronjob back"; I rewrote the archive repo, and never wrote a cronjob for the new version.

Eric Wieser (Aug 24 2023 at 15:49):

If you want to search Zulip offline (a pretty niche use-case), you can just run the script yourself locally just before you go offline

Joachim Breitner (Aug 24 2023 at 20:00):

Isn’t it a GitHub Actions workflow? In that case, just adding something like

on:
  schedule:
    # * is a special character in YAML so you have to quote this string
    - cron:  '30 5 * * *'

would work for a nightly run, wouldn’t it?
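
For reference, the five fields of that cron string are minute, hour, day of month, month, and day of week, and GitHub Actions evaluates schedules in UTC, so '30 5 * * *' means 05:30 UTC every day. A small Python sketch that labels the fields (the helper name is invented for illustration):

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expr: str) -> dict:
    """Label the five whitespace-separated fields of a cron expression."""
    parts = expr.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


schedule = describe_cron("30 5 * * *")
print(schedule)
```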

Rob Lewis (Aug 24 2023 at 22:15):

The archive is really an abuse of GitHub Pages. It's a massive number of pages, each one of which gets updated regularly. (Not changing the "last updated" timestamp would help somewhat here.) When we were running it regularly I think we didn't save the history each time to avoid massive diffs, but this was still really bad on GitHub's side, to the point where they asked us nicely to stop.

Rob Lewis (Aug 24 2023 at 22:17):

If we ever consider bringing back regular runs of the script, we should look into a proper hosting solution.

Rob Lewis (Aug 24 2023 at 22:22):

From January: "The current version of the repository is only 122 MB, but the logical size has grown to over 50 GB, which is far larger than our recommended 5 GB maximum."

Rob Lewis (Aug 24 2023 at 22:24):

That said, the proper way to handle this isn't to re-host the archive, it's to tackle https://github.com/zulip/zulip/issues/21881

Joachim Breitner (Aug 25 2023 at 02:14):

Heh, fair enough. Was it the JSON files or the HTML output that was overwhelming GitHub pages?

Mario Carneiro (Aug 25 2023 at 02:19):

the HTML pages, but more generally our usage of a git repo for not-version-control

Mario Carneiro (Aug 25 2023 at 02:19):

IIRC the JSON is a lot smaller than the HTML

Joachim Breitner (Aug 25 2023 at 02:32):

If the abuse is ok for the JSON (I have abused GitHub for such data storage before, so I sympathize with that approach), then maybe using netlify for the JSON-to-HTML step is a good option. I have moved GitHub pages sites to netlify with good results before, and their deploy-from-git model is a good fit. Happy to lend a hand if there is appetite for that. (Even if Zulip becomes web searchable, I still find the other arguments for the static archive laid down by Rob in the readme convincing :-)).

Joachim Breitner (Aug 25 2023 at 02:47):

But maybe the problem has already been solved by Eric's fork, where the action (https://github.com/eric-wieser/zulip-archive/blob/master/compound/action.yml) is deploying the HTML to GitHub Pages directly, without storing it in a git repo? (This is a relatively recent GitHub feature.)

Eric Wieser (Aug 25 2023 at 08:21):

Yes, the updated archive fork was created to address the complaints from github

Eric Wieser (Aug 25 2023 at 08:21):

It stores the json in the git repo, but not the html pages

Eric Wieser (Aug 25 2023 at 08:25):

Right now there are two possible cron jobs:

  • Locally generate the json files (incrementally), and regenerate the HTML
  • Locally generate the json files (incrementally), commit them, then regenerate the HTML

Eric Wieser (Aug 25 2023 at 08:25):

One loads the Zulip servers and github actions more, the other inflates the git repo more


Last updated: Dec 20 2023 at 11:08 UTC