Zulip Chat Archive
Stream: general
Topic: Indexing Zulip
Rob Lewis (Jan 20 2019 at 21:36):
At the meeting in Amsterdam, someone raised the question of archiving the Zulip chat and making it accessible to search engines. It's hard to find answers to Lean questions online right now. You have to register and search here specifically.
I wrote a little script that will scrape the public streams here and post them to an openly accessible site. You can see a sample here: https://leanprover-community.github.io/archive
Before I spend more time on this, I wanted to check with people -- are there any objections on privacy grounds? I'm displaying strictly less information than is available openly to anyone who registers for Zulip -- no email addresses, no pictures. To me it seems comparable to the archive of an email list, which is often public and indexed.
If there are no objections, I'll finish the script and put up a whole archive. New posts will be added every time the script runs. I don't have a reliable server to automate this; my plan was just to run the script when I think about it, since there's nothing time sensitive, but if anyone has a better plan I'm listening. I also don't have the time to waste on web design. So if anyone wants to mess with the css, Jekyll setup, etc, let's talk.
Kevin Buzzard (Jan 20 2019 at 22:58):
Given that many of us were saying pretty much the same things on gitter a few months back you'd think people will be ok with this. I certainly am.
Neil Strickland (Jan 20 2019 at 23:14):
This is a good interim step, but I'll repeat what I said in Amsterdam: I think that stackexchange is an extremely good framework for recording questions and answers, and I would like to see these kind of things move either to stackoverflow (where there are currently about 3000 questions on coq and/or isabelle, and 48 on lean) or to a new site focused on proof assistants.
Christopher Sumnicht (Jan 21 2019 at 02:24):
Maybe someone should propose a computer verification/lean/proof assistant site on Area51?
Kenny Lau (Jan 21 2019 at 02:25):
I don't think there's enough interest for that...
Scott Morrison (Jan 21 2019 at 02:45):
@Rob Lewis I'm happy to run the script as a cron job on a machine that is (nearly) always online. (Also, thanks, this is great.)
Rob Lewis (Jan 21 2019 at 08:33):
I think this chat serves a more general purpose than Stack Overflow. There's lots of general discussion here that doesn't really happen in that format. Trying to direct questions that way and discussions this way would be counterproductive. Of course, I don't want to discourage SO use either, I've answered a few questions there when they've shown up.
Rob Lewis (Jan 21 2019 at 08:33):
@Scott Morrison That would be perfect, let's talk.
Scott Morrison (Jan 21 2019 at 21:44):
@Neil Strickland, @Rob Lewis one potential compromise is to ask post-hoc questions on SE! SE doesn't particularly mind if the same person asks and answers a question, so it can be a cheap way to preserve a discussion here in more searchable form.
Scott Morrison (Jan 21 2019 at 21:44):
This is also something one can potentially push students to do.
Rob Lewis (Jan 22 2019 at 23:16):
I've made some progress here. There's a full archive of this chat, as of some time earlier today: https://leanprover-community.github.io/archive/
You guys like to use Zulip in ways that don't play well with Jekyll. There are a few outstanding issues:
- Lean syntax highlighting. This is a pipeline issue: there's a PR to Rouge that needs to be merged, and then Github needs to update.
- Parsing urls. Zulip's markdown parser is different from the options in Jekyll, which will only convert urls to links if they're enclosed in < >. Oh well, I'm not going to fix this.
- Malformed posts. Some of you like to write posts that end with code blocks and leave off the closing backticks (https://leanprover.zulipchat.com/#narrow/stream/113488-general/topic/invalid%20%27calc%27%3A%20transitivity%20rule/near/125130159), which leads to this: https://leanprover-community.github.io/archive/113488general/90773invalidcalctransitivityrule.html I'm not sure what to do here.
It's possible to let Zulip process the markdown instead, but then I need to write css to match it's output, and I hate writing css.
Rob Lewis (Jan 22 2019 at 23:17):
Click through a few links and see if you notice anything that looks wrong.
Rob Lewis (Jan 22 2019 at 23:17):
Once it seems stable, I'll get automatic updating set up on Scott's server.
Scott Morrison (Jan 22 2019 at 23:36):
There's no way to click back to the main table of contents. "Lean Prover Zulip Chat Archive" could be a hyperlink.
Just saw the "Zulip archive" link on the right.
Scott Morrison (Jan 22 2019 at 23:38):
On each stream page, there should be a divider between the stream name and the list of topics. Otherwise the stream name itself is indistinguishable from the topics, and clicking on it mysteriously does nothing. (Different styling could solve this.)
Scott Morrison (Jan 22 2019 at 23:39):
Next to each topic link, it might be nice to write, e.g "17 messages, most recent 2019-01-23" or similar, but this isn't essential.
Kenny Lau (Jan 22 2019 at 23:39):
@Scott Morrison I like how you don't just say what shouldn't be done, but also say what should be done. I think this is a very effective way to suggest problems.
Scott Morrison (Jan 22 2019 at 23:40):
It might be nice, on each topic, to include a link directly to that topic on zulip. (Just in case people land on the searchable archive page from a search, and want to continue the discussion.)
I see now that the heading of each message is actually a link to that message on zulip. This is probably sufficient. Perhaps ideal would be to instead display the little zulip "Z" icon next to the message heading, and just link that? Then it's clear without reading the URL what clicking will do.
Scott Morrison (Jan 22 2019 at 23:41):
Is it plausible to identify hyperlinks in the body of messages, and linkify them?
Scott Morrison (Jan 22 2019 at 23:43):
The link to the archive on https://leanprover-community.github.io/ 404s.
(It should link to https://leanprover-community.github.io/archive/, not https://leanprover-community.github.io/archive.html)
Rob Lewis (Jan 22 2019 at 23:44):
The page title "Lean Prover Zulip Chat Archive" is inserted differently than the others, and it wasn't straightforward to make it a link. I agree it should be one though.
Scott Morrison (Jan 22 2019 at 23:44):
On the main page https://leanprover-community.github.io/archive/, how about we write a title "Streams" above the list of streams? At first it isn't obvious what this list is doing.
Rob Lewis (Jan 22 2019 at 23:45):
I like the idea for counting messages next to the topic links, and I think that's doable.
Scott Morrison (Jan 22 2019 at 23:45):
Similarly on each stream page, we might write the heading "Topics:" above the list.
Scott Morrison (Jan 22 2019 at 23:46):
Halfway down https://leanprover-community.github.io/archive/113538travis/26562buildstatus.html something gets borked in the formatting. Not exactly sure what, some unclosed markdown environment?
Rob Lewis (Jan 22 2019 at 23:47):
I couldn't make Jekyll linkify urls, and the alternative is writing something in Python to identify them and enclose them in <> (but not when they're already part of markdown links). It's possible, I guess, but not a priority.
Rob Lewis (Jan 22 2019 at 23:47):
Your other suggestions are all good and easy!
Rob Lewis (Jan 22 2019 at 23:48):
Yeah, the formatting in the build status thread is because someone forgot the closing backticks.
Scott Morrison (Jan 22 2019 at 23:49):
The way you display quotations puts everything into fixed-width font. It's slightly confusing (because it looks like a code block), but not that important. See https://leanprover-community.github.io/archive/113489newmembers/10902lagrangetheorem.html for an example.
Scott Morrison (Jan 22 2019 at 23:49):
Otherwise, this is great! Having this all google-able is important.
Rob Lewis (Jan 22 2019 at 23:51):
Quotes prefixed with >
get parsed differently (and displayed better) than ones in backticks. Probably fixable once I have the time and will to dive into the css.
Patrick Massot (Jan 23 2019 at 10:05):
Thank you very much Rob! I'm not sure it's a good idea to include all streams in this archive, because it clutters the main list. I think only general, maths and new members should be there. Or do you think people will only get there through Google, hence it doesn't matter?
Patrick Massot (Jan 23 2019 at 10:06):
Is this website intended to also host other mathlib related information? I wanted to try to have a mathlib documentation website as well
Patrick Massot (Jan 23 2019 at 10:15):
Rob, is the crawling script somewhere on GitHub?
Patrick Massot (Jan 23 2019 at 10:16):
Is there any reason not to use the same Jekyll theme as on the main Lean website? It would be more consistent
Rob Lewis (Jan 23 2019 at 10:22):
I think there's interesting discussion that goes on in PR reviews, Lean Together 2019, etc. And there could be new channels in the future worth indexing. Maybe we should blacklist rss and travis instead of whitelisting general, math, and new members.
Rob Lewis (Jan 23 2019 at 10:22):
I wrote the home page in about 20 second, but yeah, it's a good place to put other mathlib info.
Patrick Massot (Jan 23 2019 at 10:23):
Would you mind giving me write access?
Rob Lewis (Jan 23 2019 at 10:23):
The crawling script is messy, unfinished, and only works with my Zulip bot api key which probably shouldn't be public.
Patrick Massot (Jan 23 2019 at 10:23):
Your api key clearly shouldn't be public. But the script could read it from somewhere
Rob Lewis (Jan 23 2019 at 10:24):
I have a ton of local changes right now, let me finish these before you start editing anything. But yeah, you can have access.
Rob Lewis (Jan 23 2019 at 10:24):
It does read it from somewhere but it's kind of useless without it.
Patrick Massot (Jan 23 2019 at 10:24):
There is no hurry at all.
Patrick Massot (Jan 23 2019 at 10:25):
People who want to play with it can get their own key
Rob Lewis (Jan 23 2019 at 10:25):
And for the theme, I picked the first "basic" theme I found, it shouldn't be too hard to reskin.
Bryan Gin-ge Chen (Jan 23 2019 at 12:53):
Another thing we might think about putting on the leanprover-community website is a version of the lean web editor that uses a current build of mathlib / lean. I'd want to figure out why the emscripten builds of lean don't work in Firefox is before really pursuing this though.
Rob Lewis (Jan 23 2019 at 13:22):
I've made all the suggested changes that were easy. The latest addition added little Zulip icons next to the links to posts, but the Jekyll build on GitHub is timing out now. Not sure if that's a result of this change or what. I'll restart it in a bit.
Rob Lewis (Jan 23 2019 at 18:40):
For some reason, adding little Zulip icons in the markdown makes the Github build time out. It only adds about 5% locally. I added them with css, which is way better anyway.
Johan Commelin (Jan 23 2019 at 19:15):
@Rob Lewis How are the streams sorted on the home page of the archive?
Johan Commelin (Jan 23 2019 at 19:15):
I would expect kbb
to be down at the bottom, since it is no longer active.
Rob Lewis (Jan 23 2019 at 19:20):
Good question. I assumed it was by time of creation, newest first, since Lean Together is at the top. But that doesn't look right. The Zulip API docs imply it's by stream id, lowest first, but that's also wrong.
Rob Lewis (Jan 23 2019 at 19:20):
The answer: "whatever order Zulip gives them in."
Johan Commelin (Jan 23 2019 at 19:21):
How hard would it be to sort them by activity?
Rob Lewis (Jan 23 2019 at 19:21):
Date of creation, newest first and stream id, lowest first should be the same, and stream id is easy to sort by.
Rob Lewis (Jan 23 2019 at 19:23):
Slightly harder than by stream id. That also means the index page will potentially shuffle every time the archive is rebuilt, which seems a little wrong.
Johan Commelin (Jan 23 2019 at 19:23):
Hmmm, maybe it's a little wrong. It isn't very important anyway...
Mario Carneiro (Jan 24 2019 at 01:03):
How about number of messages? That should be pretty stable
Rob Lewis (Feb 03 2019 at 17:38):
I spent a bit more time today playing with this. It turns out, letting Zulip process the markdown solves some of the problems from before. No more formatting problems when people forget closing backticks, automatic linkification, quote formatting is more uniform.
Rob Lewis (Feb 03 2019 at 17:39):
One problem that gets introduced is that links to other Zulip posts are broken.
Rob Lewis (Feb 03 2019 at 17:39):
It returns them as relative links.
Rob Lewis (Feb 03 2019 at 17:41):
My impression is it's a net gain, but I don't have a good idea for solving the link issue. Check it out again and see if anything else looks funny: https://leanprover-community.github.io/archive/
Rob Lewis (Feb 03 2019 at 17:41):
(Also, I sorted the stream index page by number of topics, which should be about as stable and easier than number of messages.)
Rob Lewis (Feb 03 2019 at 17:53):
Also, I should check how Zulip sanitizes html. What happens if I write <img src="this.png"> ? (Exactly what should happen, good.)
Sebastian Ullrich (Feb 03 2019 at 17:58):
@Rob Lewis Does <base>
help?
Rob Lewis (Feb 03 2019 at 18:00):
Oh, yes, that should help a lot!
Rob Lewis (Feb 03 2019 at 18:40):
Alright, I think it's looking good.
Rob Lewis (Feb 03 2019 at 20:24):
Ignore this, I need a new post for testing purposes.
Johan Commelin (Feb 03 2019 at 20:26):
Ignore this, I need a new post for testing purposes.
Done
Johan Commelin (Feb 03 2019 at 20:28):
Ahw, we can't see all the cute emoji's below posts.
Rob Lewis (Feb 03 2019 at 22:01):
I actually do have access to reactions to posts, but decided it wasn't worth my time to figure out how to display them. :shrug:
Rob Lewis (Feb 04 2019 at 11:26):
Okay, I think the archive is ready for production. I'll talk to Scott about getting things set up on his server.
Sebastian Ullrich (Feb 04 2019 at 11:34):
Minor thing, but providing a link/anchor to a specific message in the archive may be nice
Rob Lewis (Feb 04 2019 at 11:37):
Good idea. I could make the Zulip logo link to Zulip, and the name/time an anchor link. Does that sound good?
Mario Carneiro (Feb 04 2019 at 11:40):
what about direct cross references? Like if I say https://leanprover.zulipchat.com/#narrow/stream/116395-maths/topic/floor_ring/near/157515775
Rob Lewis (Feb 04 2019 at 11:43):
That will always link to Zulip. When I noticed yesterday those links were always local, I thought maybe I could mimic the link structure in the archive and make them point to individual posts there. But I don't know how to interpret leanprover-community.github.io/#narrow/stream/116395-maths/topic/floor_ring/near/157515775 as a link to an anchor on a static page.
Mario Carneiro (Feb 04 2019 at 11:45):
You could change the links via JS or whatever your server code is
Rob Lewis (Feb 04 2019 at 12:17):
It's a static site built with Jekyll and hosted using GitHub Pages. As far as I can tell you don't have much control over how to direct arbitrary urls like that. Maybe there's a way, but my Google skills failed to find it.
Mario Carneiro (Feb 04 2019 at 12:26):
Looks like you can use custom JS in your site style, which can include some link modification on the spot
Rob Lewis (Feb 04 2019 at 12:31):
Oh, you mean modify the links as they're displayed. Uh, yeah, I guess I could do that with JS or in the Python script. It kind of seems like a huge pain. Right now I don't process the content of messages in any way.
Rob Lewis (Feb 04 2019 at 13:31):
https://leanprover-community.github.io/archive/113488general/45931IndexingZulip.html#157516605
Rob Lewis (Feb 04 2019 at 13:32):
I'll keep thinking about internal links, but it's more effort than I have time for at the moment.
Rein Zustand⚙ (Apr 03 2019 at 17:08):
!! I stumbled upon the indexer from a HN thread (https://news.ycombinator.com/item?id=19473134). The rust-lang community (https://rust-lang.zulipchat.com) has been looking for exactly this public archive, so I was trying to reuse the tool. Yesterday I tried it on chat.zulip.org for the initial full-mirroring, which took ~45 min, but had to subsequently mkdir
several directories and included several of the asset files. I think it would be super helpful to have the mirror script put in a separate repository that can be readily used by other projects.
Rob Lewis (Apr 03 2019 at 17:20):
The mirror script was mostly hacked together over the course of a couple train rides. As you noticed, it's not very polished! It's been on my to-do list for a while to clean up the script and make it independent of the rest of the repository. But more immediate things keep getting in the way. I haven't had a long train ride in a couple months, heh.
Rob Lewis (Apr 03 2019 at 17:20):
I'm guessing you had to mkdir
the _json
and archive
directories and create the file where it tracks the time of the last update?
Rein Zustand⚙ (Apr 03 2019 at 17:44):
Yeah, for more context, there was a prior art (https://github.com/zulip/zulip/pull/8135) but it was very complex and in the end, it didn't come through, and it was thought to be a tough project.
I'd say the remaining things to do should take about an hour or two.
Indeed, the empty dirs that have to exist prior to the start of the script are: archive
, _json
, _includes
(the last one is for archive_update.html, i.e. the time tracking). I haven't checked about the assets (svgs etc) that have to be included though.
Rob Lewis (Apr 03 2019 at 20:18):
That PR (or at least the linked proposal) is trying to do a lot more than my little script. I definitely do want to clean it when I have time. You're right that the directories issue needs to be fixed, and it can be factored out from the rest of the repo somewhat.
Rob Lewis (Apr 03 2019 at 20:19):
I'd hesitate to use it for an archive that expects a lot of traffic though. The formatting isn't particularly reliable, e.g. "continuity of 1/x" at https://leanprover-community.github.io/archive/116395maths/index.html
Rob Lewis (Apr 03 2019 at 20:20):
I'm sure a malicious user could come up with a stream name or user name that would cause trouble.
Rein Zustand⚙ (Apr 03 2019 at 22:41):
I'm sure a malicious user could come up with a stream name or user name that would cause trouble.
IIRC gh-page does a linting process that sanitizes the html files.
Rein Zustand⚙ (Apr 03 2019 at 22:44):
I'd hesitate to use it for an archive that expects a lot of traffic though. The formatting isn't particularly reliable, e.g. "continuity of 1/x" at https://leanprover-community.github.io/archive/116395maths/index.html
I can write a patch to do the clean up via html escape later (actually, the topic url is already escaped). The >90% majority of the topics still work, regardless.
Rob Lewis (Apr 04 2019 at 11:45):
@Rein Zustand:gear: I'll factor out the mirroring script when I have time, but just to warn you, I probably won't have time for a while. Until then, you're very welcome to use and change the script in any way you want!
Rein Zustand⚙ (Apr 04 2019 at 12:25):
OK, thank you for the heads up.
Rein Zustand⚙ (Aug 18 2019 at 21:01):
@Rob Lewis do you have the time to refactor the mirroring script? This time it is needed by the FHIR community (chat.fhir.org) for downloading all the messages for the purpose of traffic analysis.
Rob Lewis (Aug 18 2019 at 21:09):
It's been done: https://github.com/zulip/zulip_archive
Rob Lewis (Aug 18 2019 at 21:10):
The Rust people have been using it here: https://zulip-archive.rust-lang.org
Rein Zustand⚙ (Aug 22 2019 at 13:18):
Ah, thanks for the update that I missed.
James King (Nov 25 2019 at 17:19):
Asking for the folks on funprog.zulipchat.com on how https://leanprover-community.github.io/archive/ works so we could do something similar.
Rob Lewis (Nov 25 2019 at 17:22):
https://github.com/leanprover-community/leanprover-community.github.io/blob/master/import.py for our site specifically. It runs every hour. I factored out a standalone version that the Zulip people have adopted and tweaked a bit: https://github.com/zulip/zulip-archive
Last updated: Dec 20 2023 at 11:08 UTC