Zulip Chat Archive

Stream: lean4

Topic: lake cache on Windows with Chinese locale


Floris van Doorn (Apr 12 2024 at 10:17):

I'm trying to get lake exe cache get working for @Kunhong Du, who uses Windows with a Chinese locale, and with a Chinese username.

We ran into some issues trying to run this command from the terminal (git bash):

  • At first we tried getting the Mathlib cache via the VSCode extension. This downloaded the latest Lean, build the cache executable and then gave the error "unknown file or directory". It was unclear what caused this error exactly.
  • A wrong (old) version of curl was used (without the --parallel option), even though the default curl on git bash is recent enough
  • lake exe cache unpack tries to unpack the downloaded ltar files from the wrong folder

Some relevant terminal output:
The first code snippet shows that there are two versions of curl, the default one being recent enough for the --parallel option.
The second code snippet shows that the old version of curl was used, and that the cache looks into the wrong directory for the .ltar files (with something that looks like ascii soup for unicode points)
Note: the .ltar files were correctly downloaded, and they existed in the directory C:\<chinese translation of user>\<username with chinese characters>\.cache\mathlib\54577063a996ea49.ltar.

杜坤鸿@LAPTOP-NMJJR92D MINGW64 ~
$ curl --version
curl 7.85.0 (x86_64-w64-mingw32) libcurl/7.85.0 OpenSSL/1.1.1q (Schannel) zlib/1.2.12 brotli/1.0.9 zstd/1.5.2 libidn2/2.3.3 libssh2/1.10.0 nghttp2/1.48.0
Release-Date: 2022-08-31
Protocols: dict file ftp ftps gopher gophers http https imap imaps ldap ldaps mqtt pop3 pop3s rtsp scp sftp smb smbs smtp smtps telnet tftp
Features: alt-svc AsynchDNS brotli HSTS HTTP2 HTTPS-proxy IDN IPv6 Kerberos Largefile libz MultiSSL NTLM SPNEGO SSL SSPI threadsafe TLS-SRP zstd

杜坤鸿@LAPTOP-NMJJR92D MINGW64 ~
$ where curl
D:\Program Files\Git\mingw64\bin\curl.exe
C:\Windows\System32\curl.exe

杜坤鸿@LAPTOP-NMJJR92D MINGW64 ~
$ which curl
/mingw64/bin/curl

杜坤鸿@LAPTOP-NMJJR92D MINGW64 ~
$ /c/Windows/System32/curl.exe --version
curl 7.55.1 (Windows) libcurl/7.55.1 WinSSL
Release-Date: [unreleased]
Protocols: dict file ftp ftps http https imap imaps pop3 pop3s smtp smtps telnet tftp
Features: AsynchDNS IPv6 Largefile SSPI Kerberos SPNEGO NTLM SSL

杜坤鸿@LAPTOP-NMJJR92D MINGW64 ~
$

2:

C:\Users\AŀAAcache\mathlib\c020fbf33e698555.ltar not found
C:\Users\AŀAAcache\mathlib\aae289492a74f633.ltar not found
C:\Users\AŀAAcache\mathlib\5fe50c0c632aa6a5.ltar not found
C:\Users\AŀAAcache\mathlib\0e66957958bc5f08.ltar not found
C:\Users\AŀAAcache\mathlib\709bbe2e441cd1df.ltar not found
C:\Users\AŀAAcache\mathlib\fd251426d3c0d198.ltar not found
C:\Users\AŀAAcache\mathlib\68fd3972ac1d5620.ltar not found
C:\Users\AŀAAcache\mathlib\15182fbd6b1ad3e6.ltar not found
C:\Users\AŀAAcache\mathlib\1ab4a87cc94d2f95.ltar not found
C:\Users\AŀAAcache\mathlib\54577063a996ea49.ltar not found
Warning: recommended `curl` version 7.70. Found 7.55.1. Can't use `--parallel`.
Attempting to download 4380 file(s)
Decompressing 4380 file(s)
uncaught exception: leantar failed with error code 1
···

Floris van Doorn (Apr 12 2024 at 10:18):

Is there a way to run lake exe cache unpack and manually specify the folder with the cache files?

Eric Wieser (Apr 12 2024 at 13:37):

Shouldn't that be \.cache not \cache?

Floris van Doorn (Apr 12 2024 at 13:41):

You're right, fixed

Mario Carneiro (Apr 12 2024 at 13:43):

the path C:\Users\AŀAA蜮cache\mathlib\54577063a996ea49.ltar looks corrupted. (my guess is that those are not the right chinese characters) What exactly does "chinese locale" entail here? Is it even UTF-8?

Mario Carneiro (Apr 12 2024 at 13:45):

wait, why does it even say Users? Didn't you say that Users is translated to chinese on this system?

Floris van Doorn (Apr 12 2024 at 13:50):

Yes, the home directory is not in C:\Users\, I expect that this folder doesn't even exist. So I'm not sure how that got there...
And indeed, this path is corrupted. The correct username is 杜坤鸿, I believe (shown in the git bash prompt).

Eric Wieser (Apr 12 2024 at 13:50):

I worry that mingw might be to blame here

Eric Wieser (Apr 12 2024 at 13:50):

And you might do better to stay in the system shell rather than git bash

Mario Carneiro (Apr 12 2024 at 13:51):

what is the contents of %HOMEPATH%?

Floris van Doorn (Apr 12 2024 at 13:51):

With Chinese locale I mean that the Users folder is written in Chinese and that the username contains Chinese characters.

Mario Carneiro (Apr 12 2024 at 13:52):

My worry is that it's using one of the old pre-UTF-8 chinese encodings

Floris van Doorn (Apr 12 2024 at 13:53):

Ah, ok. I generally recommend git bash to new users, since I have had problems with powershell for things that worked correctly in git bash. It sucks if they mess up unicode encoding.

Mario Carneiro (Apr 12 2024 at 13:54):

AFAICT the string Users exists nowhere in cache

Mario Carneiro (Apr 12 2024 at 13:54):

so it must be coming from the environment in some way

Mario Carneiro (Apr 12 2024 at 13:58):

here's another test, try replacing the main function of cache in Cache/Main.lean with

def main (args : List String) : IO Unit := do
  println! s!"HOMEDRIVE = '{← IO.getEnv "HOMEDRIVE"}'"
  println! s!"HOMEPATH = '{← IO.getEnv "HOMEPATH"}'"
  println! s!"CACHEDIR = '{CACHEDIR}'"

Floris van Doorn (Apr 12 2024 at 14:01):

What is interesting to me is that lake exe cache get did download the files to the correct place (but it is possible that was from an earlier try, within VSCode). One random hash indeed matched a file in their cache directory.

Mauricio Collares (Apr 12 2024 at 14:11):

Can you print the value of $USERPROFILE as well as $HOMEDRIVE and $HOMEPATH?

Eric Wieser (Apr 12 2024 at 14:17):

Floris van Doorn said:

What is interesting to me is that lake exe cache get did download the files to the correct place (but it is possible that was from an earlier try, within VSCode). One random hash indeed matched a file in their cache directory.

This would suggest that curl is fine but leantar is unhappy

Eric Wieser (Apr 12 2024 at 14:18):

Probably because you have an mingw curl binary running within a mingw emulation layer of some kind, but then leantar expects to be run as a standard windows binary

Sebastian Ullrich (Apr 12 2024 at 14:30):

I believe Windows translates some system path names in the UI while the physical path is still the English one. So Users may exist even if you can't see it in some views.

Sébastien Gouëzel (Apr 12 2024 at 14:39):

Yes, that's true (on my system, Users is shown as Utilisateurs in the file explorer, but when I open it in a terminal I can see that the real name is Users).

Sébastien Gouëzel (Apr 12 2024 at 14:41):

And powershell works perfectly fine for me, so maybe give it a try instead of git bash and see if things are really broken there?

Kevin Buzzard (Apr 12 2024 at 16:26):

Youch -- I encourage students away from powershell and towards git bash by default (and IIRC our docs do this too). I look forward to a time when none of us ever type anything on the command line -- there still seem to be many dangers out there.

Kunhong Du (Apr 12 2024 at 17:05):

Thanks for all the replies. Here are the outputs in GitBash

杜坤鸿@LAPTOP-NMJJR92D MINGW64 ~
$ echo $HOMEPATH
\Users\杜坤鸿

杜坤鸿@LAPTOP-NMJJR92D MINGW64 ~
$ echo $HOMEDRIVE
C:

And this one from Powershell

PS E:\Lean\BonnAnalysis> echo $HOME
C:\Users\杜坤鸿

Eric Wieser (Apr 12 2024 at 17:06):

What does curl --version give in powershell?

Mario Carneiro (Apr 12 2024 at 17:09):

Does echo $HOMEPATH also work in powershell?

Mario Carneiro (Apr 12 2024 at 17:10):

Mario Carneiro said:

here's another test, try replacing the main function of cache in Cache/Main.lean with

def main (args : List String) : IO Unit := do
  println! s!"HOMEDRIVE = '{← IO.getEnv "HOMEDRIVE"}'"
  println! s!"HOMEPATH = '{← IO.getEnv "HOMEPATH"}'"
  println! s!"CACHEDIR = '{CACHEDIR}'"

I also still would like to see this test :point_up: because it will tell us if lean has a different opinion of these env vars than git bash / powershell

Kunhong Du (Apr 12 2024 at 17:14):

Eric Wieser said:

What does curl --version give in powershell?

It gives

PS C:\Users\杜坤鸿> curl --version
curl : 未能解析此远程名称: '--version'
所在位置 :1 字符: 1
+ curl --version
+ ~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (System.Net.HttpWebRequest:HttpWebRequest) [Invoke-WebRequest]WebExce
    ption
    + FullyQualifiedErrorId : WebCmdletWebResponseException,Microsoft.PowerShell.Commands.InvokeWebRequestCommand

It says "Unable to resolve this remote name". Is the command correct?

Actually, I just updated curl. In CMD, I got

C:\Users\杜坤鸿>curl --version
curl 8.7.1 (x86_64-w64-mingw32) libcurl/8.7.1 LibreSSL/3.9.1 zlib/1.3.1 brotli/1.1.0 zstd/1.5.6 WinIDN libpsl/0.21.5 libssh2/1.11.0 nghttp2/1.61.0 ngtcp2/1.4.0 nghttp3/1.2.0
Release-Date: 2024-03-27
Protocols: dict file ftp ftps gopher gophers http https imap imaps ipfs ipns ldap ldaps mqtt pop3 pop3s rtsp scp sftp smb smbs smtp smtps telnet tftp ws wss
Features: alt-svc AsynchDNS brotli HSTS HTTP2 HTTP3 HTTPS-proxy IDN IPv6 Kerberos Largefile libz NTLM PSL SPNEGO SSL SSPI threadsafe UnixSockets zstd

Eric Wieser (Apr 12 2024 at 17:15):

I think powershell has a builtin curl that is doing something else

Eric Wieser (Apr 12 2024 at 17:15):

But lake exe cache won't see that so that's fine

Kunhong Du (Apr 12 2024 at 17:16):

Mario Carneiro said:

Does echo $HOMEPATH also work in powershell?

It gives neither reply nor error.

Mario Carneiro (Apr 12 2024 at 17:17):

I think that means it's empty

Eric Wieser (Apr 12 2024 at 17:18):

What does chcp give in cmd?

Kunhong Du (Apr 12 2024 at 17:22):

Eric Wieser said:

What does chcp give in cmd?

活动代码页: 936

That's "active code page".

Mario Carneiro (Apr 12 2024 at 17:24):

https://en.wikipedia.org/wiki/Code_page_936_(Microsoft_Windows)

Mario Carneiro (Apr 12 2024 at 17:25):

which sounds like
Mario Carneiro said:

My worry is that it's using one of the old pre-UTF-8 chinese encodings

Floris van Doorn (Apr 12 2024 at 19:19):

For me echo $HOMEPATH also doesn't give any output, however echo $HOME does give C:\Users\Floris.

Mario Carneiro (Apr 12 2024 at 19:24):

does it also do this from inside lean as in the test above?

Floris van Doorn (Apr 12 2024 at 19:25):

In a PM Kunhong mentioned that lake exe cache unpack on Powershell gives the following output:

PS E:\Lean\BonnAnalysis> lake exe cache unpack
installing leantar 0.1.11
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  669k  100  669k    0     0   669k      0  0:00:01 --:--:--  0:00:01  669k

No cache files to decompress

This was after building Mathlib manually this afternoon. But it seems that lake exe cache get will probably work well in Powershell, and that git bash is the culprit here.

Floris van Doorn (Apr 12 2024 at 19:27):

On my own laptop I get this when running Mario's test:

Floris@Dell-E MINGW64 ~/projects/mathlib ((b222407684...))
$ lake exe cache
info: [5/9] Building Cache.Main
info: stdout:
.\.\.\Cache\Main.lean:67:10: warning: unused variable `args` [linter.unusedVariables]
info: [8/9] Compiling Cache.Main
info: [9/9] Linking cache.exe
HOMEDRIVE = '(some C:)'
HOMEPATH = '(some \Users\Floris)'
CACHEDIR = 'C:\Users\Floris\.cache\mathlib'

(and the same results in Powershell)

Floris van Doorn (Apr 12 2024 at 19:28):

Floris van Doorn said:

For me echo $HOMEPATH also doesn't give any output, however echo $HOME does give C:\Users\Floris.

I should have mentioned, this in on Powershell.

Floris van Doorn (Apr 12 2024 at 19:32):

I don't know if this is useful:

echo git bash Powershell
$HOME /c/Users/Floris C:\Users\Floris
$HOMEPATH \Users\Floris -
$HOMEDRIVE C: -
$USERPROFILE C:\Users\Floris -

Floris van Doorn (Apr 12 2024 at 19:40):

Floris van Doorn said:

In a PM Kunhong mentioned that lake exe cache unpack on Powershell gives the following output:

Actually, this might not say anything, if all the Mathlib olean files are up-to-date lake exe cache unpack might not unpack... @Kunhong Du maybe you can try lake exe cache unpack! in powershell?
And for Mario's test, you can go to .lake\packages\mathlib\Cache\Main.lean and edit the def main with Mario's version above, save, and then run lake exe cache on the command line (maybe both of them?) to see what output we get?

Kunhong Du (Apr 12 2024 at 21:26):

I cloned a new repository since I don't wanna ruin my built Mathlib lol.

This is the result in Powershell by changing nothing in the files.

PS E:\Lean\Analysis\BonnAnalysis> lake exe cache unpack!
info: mathlib: cloning https://github.com/leanprover-community/mathlib4.git to '.\.lake/packages\mathlib'
info: std: cloning https://github.com/leanprover/std4 to '.\.lake/packages\std'
info: Qq: cloning https://github.com/leanprover-community/quote4 to '.\.lake/packages\Qq'
info: aesop: cloning https://github.com/leanprover-community/aesop to '.\.lake/packages\aesop'
info: proofwidgets: cloning https://github.com/leanprover-community/ProofWidgets4 to '.\.lake/packages\proofwidgets'
info: Cli: cloning https://github.com/leanprover/lean4-cli to '.\.lake/packages\Cli'
info: importGraph: cloning https://github.com/leanprover-community/import-graph.git to '.\.lake/packages\importGraph'
info: [0/9] Downloading proofwidgets cloud release
info: [0/9] Unpacking proofwidgets cloud release
info: [1/9] Building Cache.IO
info: [2/9] Compiling Cache.IO
info: [2/9] Building Cache.Hashing
info: [3/9] Compiling Cache.Hashing
info: [3/9] Building Cache.Requests
info: [6/9] Compiling Cache.Requests
info: [6/9] Building Cache.Main
info: [7/9] Compiling Cache.Main
info: [9/9] Linking cache.exe
No cache files to decompress

So does it mean it's working? Then I tried to build the library :

PS E:\Lean\Analysis\BonnAnalysis> lake build +BonnAnalysis.Common
error: 'BonnAnalysis.Common': no such file or directory (error code: 2)
  file: .\.\.\BonnAnalysis\Common.lean

Is the file missing?

Floris van Doorn (Apr 12 2024 at 21:35):

Yeah, there is no file Common.lean in that repository (you can do lake build to build everything). But I expect that it didn't work since it didn't decompress any files.

Eric Wieser (Apr 12 2024 at 21:39):

It's possible that docs#IO.getEnv is not Unicode aware

Eric Wieser (Apr 12 2024 at 21:40):

Presumably the expectation is that it returns utf8 bytes irrespective of the system encoding?

Kunhong Du (Apr 12 2024 at 21:43):

This is the result in Powershell after changing the MAIN.lean file.

PS E:\Lean\Analysis\BonnAnalysis> lake exe cache
warning: mathlib: repository '.\.lake/packages\mathlib' has local changes
info: [7/9] Building Cache.Main
info: stdout:
.\.lake/packages\mathlib\.\.\Cache\Main.lean:67:10: warning: unused variable `args` [linter.unusedVariables]
info: [8/9] Compiling Cache.Main
info: [9/9] Linking cache.exe
HOMEDRIVE = '(some C:)'
HOMEPATH = '(some \Users\杜坤鸿)'
CACHEDIR = 'C:\Users\杜坤鸿\.cache\mathlib'

This is the result in GitBash

杜坤鸿@LAPTOP-NMJJR92D MINGW64 /e/Lean/Analysis/BonnAnalysis (master)
$ lake exe cache
warning: mathlib: repository '.\.lake/packages\mathlib' has local changes
HOMEDRIVE = '(some C:)'
HOMEPATH = '(some \Users\▒▒▒▒▒▒)'
CACHEDIR = 'C:\Users\▒▒▒▒▒▒\.cache\mathlib'

Mac Malone (Apr 12 2024 at 22:14):

@Eric Wieser String assumes utf8 so I would presume so.

Eric Wieser (Apr 12 2024 at 22:24):

I would guess that std::getenv just returns whatever bytes the system is using, and you need to do some translation to get the utf8 version

Eric Wieser (Apr 12 2024 at 22:24):

Probably Python's windows unicode code is a good template for what to do here

Eric Wieser (Apr 12 2024 at 22:24):

Of course its still very possible that the issue is in msys

Eric Wieser (Apr 12 2024 at 22:26):

Is it possible to get the raw bytes of a string in Lean? Do we have a pretty-printer that uses \xAB escape sequences? The test above wouldn't be able to detect if Lean strings somehow ended up containing codepage 936 bytestring

Mario Carneiro (Apr 13 2024 at 01:50):

Eric Wieser said:

Is it possible to get the raw bytes of a string in Lean?

No, and I think it's a hole in the API. From the lean side you can only view a string as a sequence of unicode codepoints, and UTF-8-ness is required at all times (so I don't think it is an option to put malformed data in it and unpack it later). This is why Rust has the OsString type, it holds strings from the OS that are not necessarily UTF-8 encoded (it's internally just a byte buffer), and there are conversion functions and convenience methods to convert to UTF-8 (which can fail)

Mario Carneiro (Apr 13 2024 at 01:53):

But Kunhong Du 's output suggests this is not the case: the CACHEDIR seems to have been constructed correctly, at least when run from powershell. The git bash output is corrupted, although I can't tell if it's corrupted in lean or there is just a printing issue. (EDIT: I'm pretty sure it's a lean issue, there are 6 replacement characters corresponding to the 6 bytes of cp936 data)

Mario Carneiro (Apr 13 2024 at 01:59):

See https://github.com/nim-lang/Nim/issues/20083 , looks like the solution was to use _wgetenv (which returns a wide string) and manually perform the conversion to UTF-8

Mario Carneiro (Apr 13 2024 at 02:14):

looks like rust does basically the same thing: call GetEnvironmentVariableW and then the result is converted from UTF-16 to UTF-8

Mario Carneiro (Apr 13 2024 at 03:16):

I started working on this but it's way too annoying to write this much C code when it's ifdef'd out locally. Reported as lean4#3895

Kunhong Du (Apr 18 2024 at 13:53):

Update: Just tried lake exe cache unpack in Powershell and got the same output as in Git Bash.

C:\Users\AŀAAcache\mathlib\be81d2378127cfae.ltar not found
C:\Users\AŀAAcache\mathlib\38f9f2381afa3ea6.ltar not found
C:\Users\AŀAAcache\mathlib\caddbb5a7d17b77f.ltar not found
C:\Users\AŀAAcache\mathlib\3e258a1e34d1e6a0.ltar not found
C:\Users\AŀAAcache\mathlib\b6cfb78fd842b756.ltar not found
uncaught exception: leantar failed with error code 1

Eric Wieser (Apr 18 2024 at 14:25):

Try chcp 65001

Kunhong Du (Apr 18 2024 at 18:17):

Eric Wieser said:

Try chcp 65001

It works! Thanks a lot!!!

For those who meet the same problem as I did: changing code page is the right solution. But for me the command seems to do nothing. So I changed the system locale following this. Now those lake exe cache commands can run normaly in Git Bash.

Mac Malone (Apr 20 2024 at 08:51):

Mario Carneiro said:

Eric Wieser said:

Is it possible to get the raw bytes of a string in Lean?

No, and I think it's a hole in the API. From the lean side you can only view a string as a sequence of unicode codepoints, and UTF-8-ness is required at all times (so I don't think it is an option to put malformed data in it and unpack it later).

Good news! You can actually do both. String.toUTF8 returns the raw bytes of the String (it does not actually decode them as UTF8) and String.fromUTF8Unchecked packs raw (possibly malformed) bytes into a String. The UTF8 decoding is only done on demand when fetching Chars from the String. The actually bytes of the String may not be valid UTF8 under the hood.

Eric Wieser (Apr 20 2024 at 08:51):

(docs#String.toUTF8)

Mac Malone (Apr 20 2024 at 09:15):

This fact is probably not desirable long-term, though, as it does give us another native_decide proof of False:

@[simp] theorem String.map_id : String.map id s = s := by
  sorry -- left as an exercise to the reader ;)

def badBytes := ByteArray.mk #[35, 32, 195, 10]

theorem badBytes_eq :
  (String.fromUTF8Unchecked badBytes).map id = String.fromUTF8Unchecked badBytes
:= by simp

theorem badBytes_ne :
  (String.fromUTF8Unchecked badBytes).map id  String.fromUTF8Unchecked badBytes
:= by native_decide

example : False := badBytes_ne badBytes_eq

(Disclaimer: This is not a soundness issue in Lean as the use of native_decide requires additional axioms.)

Eric Wieser (Apr 20 2024 at 09:16):

Is marking the docs#String.fromUTF8Unchecked unsafe sensible here?

Eric Wieser (Apr 20 2024 at 09:17):

Oh, I guess it's already opaque

Mac Malone (Apr 20 2024 at 09:18):

Its not the only way to pack bad bytes into a string, though. Many String input IO operations can also do this.

Mac Malone (Apr 20 2024 at 09:21):

There are multiple String interface design choices that would need to to be changed to remove this hole. For example, the String equality here should likely be performing a char comparison rather than a byte comparison.

Mac Malone (Apr 20 2024 at 09:28):

(It is also just a general concern that map id is not actually semantically the identity funciton on all strings.)

Mario Carneiro (Apr 20 2024 at 09:48):

these should not be opaque functions. They should do actual decoding, and fromUTF8Unchecked should take a proof that the array is valid

Mario Carneiro (Apr 20 2024 at 09:51):

I don't think it's valid to have invalid UTF8 in a string because the definition of String says it's a List Char. You can prove consequences of this that imply that strings of invalid UTF8 have impossible properties

Mario Carneiro (Apr 20 2024 at 09:54):

Mac Malone said:

Its not the only way to pack bad bytes into a string, though. Many String input IO operations can also do this.

This is also a bug

Mario Carneiro (Apr 20 2024 at 09:55):

This is very bad if untrusted input can cause safe lean code to cause UB

Mario Carneiro (Apr 20 2024 at 12:53):

Here are some other functions that can be used to detect inconsistency in bad utf8 strings:

def badStr := String.fromUTF8Unchecked (ByteArray.mk #[0x80, 97, 0xe0])
def fixedStr := String.mk badStr.data

example : badStr = fixedStr := rfl

#eval badStr.utf8ByteSize -- 3
#eval fixedStr.utf8ByteSize -- 5

#eval badStr.get 1 -- 'a'
#eval fixedStr.get 1 -- 'A'

#eval badStr.next 2 -- 5
#eval fixedStr.next 2 -- 3

#eval badStr.toUTF8 -- [128, 97, 224]
#eval fixedStr.toUTF8 -- [194, 128, 97, 195, 160]

In other words it's not sufficient to fix string equality, many functions leak this detail

Mac Malone (Apr 20 2024 at 18:31):

Mario Carneiro said:

This is very bad if untrusted input can cause safe lean code to cause UB

While I agree that there are problems with this approach, I am confused as to how it causes undefined behavior. Everything behaves here in a very well-defined way, just not in the way one might initially assume (which is unfortunate).

Mario Carneiro (Apr 20 2024 at 18:36):

"compiler unsoundness" aka native_decide proofs of false can be used to cause the lean compiler to skip bounds checks and access things out of bounds because facts that are proved to hold do not hold at runtime

Mario Carneiro (Apr 20 2024 at 18:41):

def badStr := String.fromUTF8Unchecked (ByteArray.mk #[0x80, 97, 0xe0])
def fixedStr := String.mk badStr.data

def liar : Bool := badStr = fixedStr

theorem liar_eq_true : liar := decide_eq_true rfl

def foo := #[1][if liar then 0 else 1000]'(if_pos liar_eq_true  by decide)

#eval foo -- 29863130885, or crash, or other random events

Mac Malone (Apr 20 2024 at 18:42):

On a practical level, its is also worth noting that changing this would require a full overhaul/review of all C/C++ code passing strings to/from Lean, as none of them do any validation currently (as all validaiton is done on accesses in the String API). The C code (and much FFI code) makes heavy use of the fact that you can just pack and unpack a C string or std::string into a Lean String. Futhermore, performing validation could be a significant performance penalty. Hence, why, as you mentioned, Rust has OsString.

However, I do agree that, ideally, this would be solved. (Maybe once work restarts on the compiler and other low-level code it will be?)

Mario Carneiro (Apr 20 2024 at 18:43):

This is already fixed in my PR lean4#3958

Mac Malone (Apr 20 2024 at 18:45):

That's nice! :tada: But, that only addresses fromUTF8Unchecked, not the many other ways to pack bad bytes in.

Mario Carneiro (Apr 20 2024 at 18:46):

that's true, I see this

      emitLns ["in = lean_box(0);",
               "int i = argc;",
               "while (i > 1) {",
               " lean_object* n;",
               " i--;",
               " n = lean_alloc_ctor(1,2,0); lean_ctor_set(n, 0, lean_mk_string(argv[i])); lean_ctor_set(n, 1, in);",
               " in = n;",
              "}"]

which is very :face_with_peeking_eye:

Mario Carneiro (Apr 20 2024 at 20:12):

Mac Malone said:

On a practical level, its is also worth noting that changing this would require a full overhaul/review of all C/C++ code passing strings to/from Lean, as none of them do any validation currently

Done in lean4#3963

Mario Carneiro (Apr 20 2024 at 20:15):

Futhermore, performing validation could be a significant performance penalty.

It would be nice if we could use some SIMD accelerated UTF-8 validator. But note that validation runs when we're allocating a new string, so the overhead is to some extent hidden by the cost of allocating and copying the text into the new array.

Mac Malone (Apr 20 2024 at 20:44):

@Mario Carneiro If all string data is now valid UTF-8, then we can also optimize many of the indexing primitives to no longer do validation. :big_smile:

Mario Carneiro (Apr 20 2024 at 20:45):

Not completely, you can still use String.get with a misaligned String.Pos for example

Mario Carneiro (Apr 20 2024 at 20:46):

but I did add String.Pos.isValid so we could have a version of String.get which assumes validity

Mario Carneiro (Apr 20 2024 at 20:49):

Ah, the PR is failing because there are some wonky tests like

def (x:=fun 0=>0)n:= 

Mario Carneiro (Apr 20 2024 at 20:50):

I don't even know what we expect to happen here

Jeremy Avigad (Sep 03 2024 at 23:20):

I have a Chinese student in my interactive theorem proving class and she hit on exactly the problem that @Kunhong Du describes above. The solution Kunhong described, this one, almost worked: after checking the locale setting in the control panel, we could run lake exe cache get in a terminal successfully and everything looks good. But on opening VS Code, she still gets an error, saying IIRC that there are dependencies missing, (curl, I think). Has anyone come across this and does anyone have any ideas for a workaround? We are using leanprover/lean4:v4.11.0-rc2.

Marc Huisinga (Sep 04 2024 at 06:54):

Jeremy Avigad said:

I have a Chinese student in my interactive theorem proving class and she hit on exactly the problem that Kunhong Du describes above. The solution Kunhong described, this one, almost worked: after checking the locale setting in the control panel, we could run lake exe cache get in a terminal successfully and everything looks good. But on opening VS Code, she still gets an error, saying IIRC that there are dependencies missing, (curl, I think). Has anyone come across this and does anyone have any ideas for a workaround? We are using leanprover/lean4:v4.11.0-rc2.

After this error, what do the Troubleshooting: Show Setup Information and Troubleshooting: Show Output commands display?

Jz Pan (Sep 04 2024 at 18:20):

Is it possible to specify Lean install path? In my opinion the easiest solution is allow user to specify an install path without Chinese characters (and without spaces, which frequently causes ancient programs failed to run).


Last updated: May 02 2025 at 03:31 UTC