Zulip Chat Archive

Stream: general

Topic: zuilp archive bug


ZHAO Jiecheng (Dec 20 2023 at 03:08):

The bug: Topics that start with non-ASCII characters like "✔ ∑' n, n * r ^ n (11 messages, latest: Jul 28 2023 at 13:58)" have inaccessible links in the archive (resulting in a 404 error)

I've found it helpful to discuss Zulip posts with ChatGPT when I'm confused. However, ChatGPT can't directly read Zulip, but it can read the Zulip archive. Unfortunately, there's a bug in the archive, and it's also slightly outdated. I attempted to troubleshoot by checking https://github.com/leanprover-community/leanprover-community.github.io but couldn't resolve it. (I can not find the code do the achieve.) Can anyone assist?

Scott Morrison (Dec 20 2023 at 04:04):

Unfortunately I think now that zulip has web-accessible streams, no one is maintaining the archive anymore.

ZHAO Jiecheng (Dec 20 2023 at 06:15):

Scott Morrison said:

Unfortunately I think now that zulip has web-accessible streams, no one is maintaining the archive anymore.

Unfortunately, even with public access, ChatGPT still can not render the page and read the content...

ZHAO Jiecheng (Dec 20 2023 at 07:50):

@Eric Wieser could you please initiate the archive procedure? And maybe give me some clue to fix the bug. "Topics that start with non-ASCII characters like "✔ ∑' n, n * r ^ n (11 messages, latest: Jul 28 2023 at 13:58)" have inaccessible links in the archive (resulting in a 404 error)"

ZHAO Jiecheng (Dec 20 2023 at 08:23):

Another question: Why the last commit of https://github.com/leanprover-community/archive is in Feb but the last achieve time is in Aug? Shouldn't the contents in github.io be the contents in the repo?

Eric Wieser (Dec 20 2023 at 08:58):

ZHAO Jiecheng said:

Another question: Why the last commit of https://github.com/leanprover-community/archive is in Feb but the last achieve time is in Aug? Shouldn't the contents in github.io be the contents in the repo?

No, this is deliberate as it avoids using more git storage

Eric Wieser (Dec 20 2023 at 08:59):

The CI job uses the repo as a starting point and downloads messages that are later than it

Eric Wieser (Dec 20 2023 at 08:59):

There is a checkbox when running CI for whether to update the source repo afterwards

Eric Wieser (Dec 20 2023 at 09:00):

I think there is a known bug with url encoding in the archive, but changing the encoding scheme would break any existing links to the archive

ZHAO Jiecheng (Dec 20 2023 at 09:42):

Eric Wieser said:

I think there is a known bug with url encoding in the archive, but changing the encoding scheme would break any existing links to the archive

For pages with broken links, it's typically safe to update their URL encoding. The main concern is to ensure that the functional parts of the links remain the same while correcting the encoding problems.

Eric Wieser (Dec 20 2023 at 09:43):

Right, but this rules out updating to a new version of the archive software upstream, which both fixes the bug and changes the scheme

ZHAO Jiecheng (Dec 20 2023 at 09:46):

Eric Wieser said:

Right, but this rules out updating to a new version of the archive software upstream, which both fixes the bug and changes the scheme

Do you mean the update of https://github.com/zulip/zulip-archive ?

Eric Wieser (Dec 20 2023 at 09:47):

Yes, we are using a much older version hosted at my fork

ZHAO Jiecheng (Dec 20 2023 at 09:49):

Eric Wieser said:

Yes, we are using a much older version hosted at my fork

Do you mean this one https://github.com/eric-wieser/zulip-archive ? Maybe I can upgrade it.

Eric Wieser (Dec 20 2023 at 09:56):

Upgrading it breaks all the links

Utensil Song (Dec 20 2023 at 10:53):

Maybe it's viable to change the existing links in the archive to the new scheme?

Eric Wieser (Dec 20 2023 at 11:07):

I've kicked off a CI job to update the archive data

ZHAO Jiecheng (Dec 20 2023 at 11:35):

Eric Wieser said:

Upgrading it (and pulling in https://github.com/zulip/zulip-archive/pull/85) breaks all the links

Can we leave redirect pages from the current url to new one?

Eric Wieser (Dec 20 2023 at 11:53):

What if the a topic url under the old scheme collides with a different topic under the new scheme?

Eric Wieser (Dec 20 2023 at 11:54):

This is all a iot of annoying work to sort out, for increasingly little gain

Kim Morrison (Dec 20 2023 at 12:01):

Good time to declare no further maintenance. :-)

Eric Wieser (Dec 20 2023 at 12:05):

I'm very happy to click the "regenerate the archive" button when people ask, but actually changing the scripts is probably not worth it

Utensil Song (Dec 20 2023 at 12:29):

Maybe Morph Labs would be more motivated to maintain the archieve? As they train on Zulip data.

Eric Wieser (Dec 20 2023 at 12:35):

That doesn't give them any motivation to preserve existing links

Eric Wieser (Dec 20 2023 at 12:36):

I suppose we could host a newarchive that runs on the latest version of the zulip archive software, which avoids any URL collisions

Eric Wieser (Dec 20 2023 at 12:36):

But we'd need to upstream our github action changes in order to avoid github usage limits

ZHAO Jiecheng (Dec 21 2023 at 00:48):

Utensil Song said:

Maybe Morph Labs would be more motivated to maintain the archieve? As they train on Zulip data.

Sounds like I may not need ChatGPT + Zuilp Achive in the future any more.

Soonho Kong (Mar 13 2024 at 18:44):

Hi @Eric Wieser , I asked a similar question in today's Lean FRO office hour and @Joachim Breitner shared this topic with me.

Originally, I wanted to know if this public archive can be updated regularly (e.g. monthly). But I'd like to know if I can run a script to get the archive locally. Does it require special permission from Zulip?

Eric Wieser (Mar 13 2024 at 19:13):

You need to register a zulip bot on this instance, but I believe any user has access to that in their settings

Soonho Kong (Mar 13 2024 at 20:13):

I see. I'll try it then.


Last updated: May 02 2025 at 03:31 UTC