Zulip Chat Archive
Stream: lean4
Topic: lake cache on Windows with Chinese locale
Floris van Doorn (Apr 12 2024 at 10:17):
I'm trying to get lake exe cache get
working for @Kunhong Du, who uses Windows with a Chinese locale, and with a Chinese username.
We ran into some issues trying to run this command from the terminal (git bash):
- At first we tried getting the Mathlib cache via the VSCode extension. This downloaded the latest Lean, build the cache executable and then gave the error "unknown file or directory". It was unclear what caused this error exactly.
- A wrong (old) version of curl was used (without the
--parallel
option), even though the default curl on git bash is recent enough lake exe cache unpack
tries to unpack the downloaded ltar files from the wrong folder
Some relevant terminal output:
The first code snippet shows that there are two versions of curl, the default one being recent enough for the --parallel
option.
The second code snippet shows that the old version of curl was used, and that the cache looks into the wrong directory for the .ltar
files (with something that looks like ascii soup for unicode points)
Note: the .ltar
files were correctly downloaded, and they existed in the directory C:\<chinese translation of user>\<username with chinese characters>\.cache\mathlib\54577063a996ea49.ltar
.
杜坤鸿@LAPTOP-NMJJR92D MINGW64 ~
$ curl --version
curl 7.85.0 (x86_64-w64-mingw32) libcurl/7.85.0 OpenSSL/1.1.1q (Schannel) zlib/1.2.12 brotli/1.0.9 zstd/1.5.2 libidn2/2.3.3 libssh2/1.10.0 nghttp2/1.48.0
Release-Date: 2022-08-31
Protocols: dict file ftp ftps gopher gophers http https imap imaps ldap ldaps mqtt pop3 pop3s rtsp scp sftp smb smbs smtp smtps telnet tftp
Features: alt-svc AsynchDNS brotli HSTS HTTP2 HTTPS-proxy IDN IPv6 Kerberos Largefile libz MultiSSL NTLM SPNEGO SSL SSPI threadsafe TLS-SRP zstd
杜坤鸿@LAPTOP-NMJJR92D MINGW64 ~
$ where curl
D:\Program Files\Git\mingw64\bin\curl.exe
C:\Windows\System32\curl.exe
杜坤鸿@LAPTOP-NMJJR92D MINGW64 ~
$ which curl
/mingw64/bin/curl
杜坤鸿@LAPTOP-NMJJR92D MINGW64 ~
$ /c/Windows/System32/curl.exe --version
curl 7.55.1 (Windows) libcurl/7.55.1 WinSSL
Release-Date: [unreleased]
Protocols: dict file ftp ftps http https imap imaps pop3 pop3s smtp smtps telnet tftp
Features: AsynchDNS IPv6 Largefile SSPI Kerberos SPNEGO NTLM SSL
杜坤鸿@LAPTOP-NMJJR92D MINGW64 ~
$
2:
C:\Users\AŀAA蜮cache\mathlib\c020fbf33e698555.ltar not found
C:\Users\AŀAA蜮cache\mathlib\aae289492a74f633.ltar not found
C:\Users\AŀAA蜮cache\mathlib\5fe50c0c632aa6a5.ltar not found
C:\Users\AŀAA蜮cache\mathlib\0e66957958bc5f08.ltar not found
C:\Users\AŀAA蜮cache\mathlib\709bbe2e441cd1df.ltar not found
C:\Users\AŀAA蜮cache\mathlib\fd251426d3c0d198.ltar not found
C:\Users\AŀAA蜮cache\mathlib\68fd3972ac1d5620.ltar not found
C:\Users\AŀAA蜮cache\mathlib\15182fbd6b1ad3e6.ltar not found
C:\Users\AŀAA蜮cache\mathlib\1ab4a87cc94d2f95.ltar not found
C:\Users\AŀAA蜮cache\mathlib\54577063a996ea49.ltar not found
Warning: recommended `curl` version ≥7.70. Found 7.55.1. Can't use `--parallel`.
Attempting to download 4380 file(s)
Decompressing 4380 file(s)
uncaught exception: leantar failed with error code 1
···
Floris van Doorn (Apr 12 2024 at 10:18):
Is there a way to run lake exe cache unpack
and manually specify the folder with the cache files?
Eric Wieser (Apr 12 2024 at 13:37):
Shouldn't that be \.cache
not \cache
?
Floris van Doorn (Apr 12 2024 at 13:41):
You're right, fixed
Mario Carneiro (Apr 12 2024 at 13:43):
the path C:\Users\AŀAA蜮cache\mathlib\54577063a996ea49.ltar
looks corrupted. (my guess is that those are not the right chinese characters) What exactly does "chinese locale" entail here? Is it even UTF-8?
Mario Carneiro (Apr 12 2024 at 13:45):
wait, why does it even say Users
? Didn't you say that Users
is translated to chinese on this system?
Floris van Doorn (Apr 12 2024 at 13:50):
Yes, the home directory is not in C:\Users\
, I expect that this folder doesn't even exist. So I'm not sure how that got there...
And indeed, this path is corrupted. The correct username is 杜坤鸿
, I believe (shown in the git bash
prompt).
Eric Wieser (Apr 12 2024 at 13:50):
I worry that mingw might be to blame here
Eric Wieser (Apr 12 2024 at 13:50):
And you might do better to stay in the system shell rather than git bash
Mario Carneiro (Apr 12 2024 at 13:51):
what is the contents of %HOMEPATH%
?
Floris van Doorn (Apr 12 2024 at 13:51):
With Chinese locale I mean that the Users
folder is written in Chinese and that the username contains Chinese characters.
Mario Carneiro (Apr 12 2024 at 13:52):
My worry is that it's using one of the old pre-UTF-8 chinese encodings
Floris van Doorn (Apr 12 2024 at 13:53):
Ah, ok. I generally recommend git bash
to new users, since I have had problems with powershell for things that worked correctly in git bash
. It sucks if they mess up unicode encoding.
Mario Carneiro (Apr 12 2024 at 13:54):
AFAICT the string Users
exists nowhere in cache
Mario Carneiro (Apr 12 2024 at 13:54):
so it must be coming from the environment in some way
Mario Carneiro (Apr 12 2024 at 13:58):
here's another test, try replacing the main
function of cache
in Cache/Main.lean
with
def main (args : List String) : IO Unit := do
println! s!"HOMEDRIVE = '{← IO.getEnv "HOMEDRIVE"}'"
println! s!"HOMEPATH = '{← IO.getEnv "HOMEPATH"}'"
println! s!"CACHEDIR = '{CACHEDIR}'"
Floris van Doorn (Apr 12 2024 at 14:01):
What is interesting to me is that lake exe cache get
did download the files to the correct place (but it is possible that was from an earlier try, within VSCode). One random hash indeed matched a file in their cache directory.
Mauricio Collares (Apr 12 2024 at 14:11):
Can you print the value of $USERPROFILE as well as $HOMEDRIVE and $HOMEPATH?
Eric Wieser (Apr 12 2024 at 14:17):
Floris van Doorn said:
What is interesting to me is that
lake exe cache get
did download the files to the correct place (but it is possible that was from an earlier try, within VSCode). One random hash indeed matched a file in their cache directory.
This would suggest that curl
is fine but leantar
is unhappy
Eric Wieser (Apr 12 2024 at 14:18):
Probably because you have an mingw curl binary running within a mingw emulation layer of some kind, but then leantar
expects to be run as a standard windows binary
Sebastian Ullrich (Apr 12 2024 at 14:30):
I believe Windows translates some system path names in the UI while the physical path is still the English one. So Users may exist even if you can't see it in some views.
Sébastien Gouëzel (Apr 12 2024 at 14:39):
Yes, that's true (on my system, Users
is shown as Utilisateurs
in the file explorer, but when I open it in a terminal I can see that the real name is Users
).
Sébastien Gouëzel (Apr 12 2024 at 14:41):
And powershell works perfectly fine for me, so maybe give it a try instead of git bash and see if things are really broken there?
Kevin Buzzard (Apr 12 2024 at 16:26):
Youch -- I encourage students away from powershell and towards git bash by default (and IIRC our docs do this too). I look forward to a time when none of us ever type anything on the command line -- there still seem to be many dangers out there.
Kunhong Du (Apr 12 2024 at 17:05):
Thanks for all the replies. Here are the outputs in GitBash
杜坤鸿@LAPTOP-NMJJR92D MINGW64 ~
$ echo $HOMEPATH
\Users\杜坤鸿
杜坤鸿@LAPTOP-NMJJR92D MINGW64 ~
$ echo $HOMEDRIVE
C:
And this one from Powershell
PS E:\Lean\BonnAnalysis> echo $HOME
C:\Users\杜坤鸿
Eric Wieser (Apr 12 2024 at 17:06):
What does curl --version
give in powershell?
Mario Carneiro (Apr 12 2024 at 17:09):
Does echo $HOMEPATH
also work in powershell?
Mario Carneiro (Apr 12 2024 at 17:10):
Mario Carneiro said:
here's another test, try replacing the
main
function ofcache
inCache/Main.lean
withdef main (args : List String) : IO Unit := do println! s!"HOMEDRIVE = '{← IO.getEnv "HOMEDRIVE"}'" println! s!"HOMEPATH = '{← IO.getEnv "HOMEPATH"}'" println! s!"CACHEDIR = '{CACHEDIR}'"
I also still would like to see this test :point_up: because it will tell us if lean has a different opinion of these env vars than git bash / powershell
Kunhong Du (Apr 12 2024 at 17:14):
Eric Wieser said:
What does
curl --version
give in powershell?
It gives
PS C:\Users\杜坤鸿> curl --version
curl : 未能解析此远程名称: '--version'
所在位置 行:1 字符: 1
+ curl --version
+ ~~~~~~~~~~~~~~
+ CategoryInfo : InvalidOperation: (System.Net.HttpWebRequest:HttpWebRequest) [Invoke-WebRequest],WebExce
ption
+ FullyQualifiedErrorId : WebCmdletWebResponseException,Microsoft.PowerShell.Commands.InvokeWebRequestCommand
It says "Unable to resolve this remote name". Is the command correct?
Actually, I just updated curl. In CMD, I got
C:\Users\杜坤鸿>curl --version
curl 8.7.1 (x86_64-w64-mingw32) libcurl/8.7.1 LibreSSL/3.9.1 zlib/1.3.1 brotli/1.1.0 zstd/1.5.6 WinIDN libpsl/0.21.5 libssh2/1.11.0 nghttp2/1.61.0 ngtcp2/1.4.0 nghttp3/1.2.0
Release-Date: 2024-03-27
Protocols: dict file ftp ftps gopher gophers http https imap imaps ipfs ipns ldap ldaps mqtt pop3 pop3s rtsp scp sftp smb smbs smtp smtps telnet tftp ws wss
Features: alt-svc AsynchDNS brotli HSTS HTTP2 HTTP3 HTTPS-proxy IDN IPv6 Kerberos Largefile libz NTLM PSL SPNEGO SSL SSPI threadsafe UnixSockets zstd
Eric Wieser (Apr 12 2024 at 17:15):
I think powershell
has a builtin curl
that is doing something else
Eric Wieser (Apr 12 2024 at 17:15):
But lake exe cache
won't see that so that's fine
Kunhong Du (Apr 12 2024 at 17:16):
Mario Carneiro said:
Does
echo $HOMEPATH
also work in powershell?
It gives neither reply nor error.
Mario Carneiro (Apr 12 2024 at 17:17):
I think that means it's empty
Eric Wieser (Apr 12 2024 at 17:18):
What does chcp
give in cmd
?
Kunhong Du (Apr 12 2024 at 17:22):
Eric Wieser said:
What does
chcp
give incmd
?
活动代码页: 936
That's "active code page".
Mario Carneiro (Apr 12 2024 at 17:24):
https://en.wikipedia.org/wiki/Code_page_936_(Microsoft_Windows)
Mario Carneiro (Apr 12 2024 at 17:25):
which sounds like
Mario Carneiro said:
My worry is that it's using one of the old pre-UTF-8 chinese encodings
Floris van Doorn (Apr 12 2024 at 19:19):
For me echo $HOMEPATH
also doesn't give any output, however echo $HOME
does give C:\Users\Floris
.
Mario Carneiro (Apr 12 2024 at 19:24):
does it also do this from inside lean as in the test above?
Floris van Doorn (Apr 12 2024 at 19:25):
In a PM Kunhong mentioned that lake exe cache unpack
on Powershell gives the following output:
PS E:\Lean\BonnAnalysis> lake exe cache unpack
installing leantar 0.1.11
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 669k 100 669k 0 0 669k 0 0:00:01 --:--:-- 0:00:01 669k
No cache files to decompress
This was after building Mathlib manually this afternoon. But it seems that lake exe cache get
will probably work well in Powershell, and that git bash is the culprit here.
Floris van Doorn (Apr 12 2024 at 19:27):
On my own laptop I get this when running Mario's test:
Floris@Dell-E MINGW64 ~/projects/mathlib ((b222407684...))
$ lake exe cache
info: [5/9] Building Cache.Main
info: stdout:
.\.\.\Cache\Main.lean:67:10: warning: unused variable `args` [linter.unusedVariables]
info: [8/9] Compiling Cache.Main
info: [9/9] Linking cache.exe
HOMEDRIVE = '(some C:)'
HOMEPATH = '(some \Users\Floris)'
CACHEDIR = 'C:\Users\Floris\.cache\mathlib'
(and the same results in Powershell)
Floris van Doorn (Apr 12 2024 at 19:28):
Floris van Doorn said:
For me
echo $HOMEPATH
also doesn't give any output, howeverecho $HOME
does giveC:\Users\Floris
.
I should have mentioned, this in on Powershell.
Floris van Doorn (Apr 12 2024 at 19:32):
I don't know if this is useful:
echo | git bash | Powershell |
---|---|---|
$HOME | /c/Users/Floris | C:\Users\Floris |
$HOMEPATH | \Users\Floris | - |
$HOMEDRIVE | C: | - |
$USERPROFILE | C:\Users\Floris | - |
Floris van Doorn (Apr 12 2024 at 19:40):
Floris van Doorn said:
In a PM Kunhong mentioned that
lake exe cache unpack
on Powershell gives the following output:
Actually, this might not say anything, if all the Mathlib olean files are up-to-date lake exe cache unpack
might not unpack... @Kunhong Du maybe you can try lake exe cache unpack!
in powershell?
And for Mario's test, you can go to .lake\packages\mathlib\Cache\Main.lean
and edit the def main
with Mario's version above, save, and then run lake exe cache
on the command line (maybe both of them?) to see what output we get?
Kunhong Du (Apr 12 2024 at 21:26):
I cloned a new repository since I don't wanna ruin my built Mathlib lol.
This is the result in Powershell by changing nothing in the files.
PS E:\Lean\Analysis\BonnAnalysis> lake exe cache unpack!
info: mathlib: cloning https://github.com/leanprover-community/mathlib4.git to '.\.lake/packages\mathlib'
info: std: cloning https://github.com/leanprover/std4 to '.\.lake/packages\std'
info: Qq: cloning https://github.com/leanprover-community/quote4 to '.\.lake/packages\Qq'
info: aesop: cloning https://github.com/leanprover-community/aesop to '.\.lake/packages\aesop'
info: proofwidgets: cloning https://github.com/leanprover-community/ProofWidgets4 to '.\.lake/packages\proofwidgets'
info: Cli: cloning https://github.com/leanprover/lean4-cli to '.\.lake/packages\Cli'
info: importGraph: cloning https://github.com/leanprover-community/import-graph.git to '.\.lake/packages\importGraph'
info: [0/9] Downloading proofwidgets cloud release
info: [0/9] Unpacking proofwidgets cloud release
info: [1/9] Building Cache.IO
info: [2/9] Compiling Cache.IO
info: [2/9] Building Cache.Hashing
info: [3/9] Compiling Cache.Hashing
info: [3/9] Building Cache.Requests
info: [6/9] Compiling Cache.Requests
info: [6/9] Building Cache.Main
info: [7/9] Compiling Cache.Main
info: [9/9] Linking cache.exe
No cache files to decompress
So does it mean it's working? Then I tried to build the library :
PS E:\Lean\Analysis\BonnAnalysis> lake build +BonnAnalysis.Common
error: 'BonnAnalysis.Common': no such file or directory (error code: 2)
file: .\.\.\BonnAnalysis\Common.lean
Is the file missing?
Floris van Doorn (Apr 12 2024 at 21:35):
Yeah, there is no file Common.lean
in that repository (you can do lake build
to build everything). But I expect that it didn't work since it didn't decompress any files.
Eric Wieser (Apr 12 2024 at 21:39):
It's possible that docs#IO.getEnv is not Unicode aware
Eric Wieser (Apr 12 2024 at 21:40):
Presumably the expectation is that it returns utf8 bytes irrespective of the system encoding?
Kunhong Du (Apr 12 2024 at 21:43):
This is the result in Powershell after changing the MAIN.lean file.
PS E:\Lean\Analysis\BonnAnalysis> lake exe cache
warning: mathlib: repository '.\.lake/packages\mathlib' has local changes
info: [7/9] Building Cache.Main
info: stdout:
.\.lake/packages\mathlib\.\.\Cache\Main.lean:67:10: warning: unused variable `args` [linter.unusedVariables]
info: [8/9] Compiling Cache.Main
info: [9/9] Linking cache.exe
HOMEDRIVE = '(some C:)'
HOMEPATH = '(some \Users\杜坤鸿)'
CACHEDIR = 'C:\Users\杜坤鸿\.cache\mathlib'
This is the result in GitBash
杜坤鸿@LAPTOP-NMJJR92D MINGW64 /e/Lean/Analysis/BonnAnalysis (master)
$ lake exe cache
warning: mathlib: repository '.\.lake/packages\mathlib' has local changes
HOMEDRIVE = '(some C:)'
HOMEPATH = '(some \Users\▒▒▒▒▒▒)'
CACHEDIR = 'C:\Users\▒▒▒▒▒▒\.cache\mathlib'
Mac Malone (Apr 12 2024 at 22:14):
@Eric Wieser String
assumes utf8 so I would presume so.
Eric Wieser (Apr 12 2024 at 22:24):
I would guess that std::getenv
just returns whatever bytes the system is using, and you need to do some translation to get the utf8 version
Eric Wieser (Apr 12 2024 at 22:24):
Probably Python's windows unicode code is a good template for what to do here
Eric Wieser (Apr 12 2024 at 22:24):
Of course its still very possible that the issue is in msys
Eric Wieser (Apr 12 2024 at 22:26):
Is it possible to get the raw bytes of a string in Lean? Do we have a pretty-printer that uses \xAB
escape sequences? The test above wouldn't be able to detect if Lean strings somehow ended up containing codepage 936 bytestring
Mario Carneiro (Apr 13 2024 at 01:50):
Eric Wieser said:
Is it possible to get the raw bytes of a string in Lean?
No, and I think it's a hole in the API. From the lean side you can only view a string as a sequence of unicode codepoints, and UTF-8-ness is required at all times (so I don't think it is an option to put malformed data in it and unpack it later). This is why Rust has the OsString
type, it holds strings from the OS that are not necessarily UTF-8 encoded (it's internally just a byte buffer), and there are conversion functions and convenience methods to convert to UTF-8 (which can fail)
Mario Carneiro (Apr 13 2024 at 01:53):
But Kunhong Du 's output suggests this is not the case: the CACHEDIR seems to have been constructed correctly, at least when run from powershell. The git bash output is corrupted, although I can't tell if it's corrupted in lean or there is just a printing issue. (EDIT: I'm pretty sure it's a lean issue, there are 6 replacement characters corresponding to the 6 bytes of cp936 data)
Mario Carneiro (Apr 13 2024 at 01:59):
See https://github.com/nim-lang/Nim/issues/20083 , looks like the solution was to use _wgetenv
(which returns a wide string) and manually perform the conversion to UTF-8
Mario Carneiro (Apr 13 2024 at 02:14):
looks like rust does basically the same thing: call GetEnvironmentVariableW
and then the result is converted from UTF-16 to UTF-8
Mario Carneiro (Apr 13 2024 at 03:16):
I started working on this but it's way too annoying to write this much C code when it's ifdef'd out locally. Reported as lean4#3895
Kunhong Du (Apr 18 2024 at 13:53):
Update: Just tried lake exe cache unpack
in Powershell and got the same output as in Git Bash.
C:\Users\AŀAA蜮cache\mathlib\be81d2378127cfae.ltar not found
C:\Users\AŀAA蜮cache\mathlib\38f9f2381afa3ea6.ltar not found
C:\Users\AŀAA蜮cache\mathlib\caddbb5a7d17b77f.ltar not found
C:\Users\AŀAA蜮cache\mathlib\3e258a1e34d1e6a0.ltar not found
C:\Users\AŀAA蜮cache\mathlib\b6cfb78fd842b756.ltar not found
uncaught exception: leantar failed with error code 1
Eric Wieser (Apr 18 2024 at 14:25):
Try chcp 65001
Kunhong Du (Apr 18 2024 at 18:17):
Eric Wieser said:
Try
chcp 65001
It works! Thanks a lot!!!
For those who meet the same problem as I did: changing code page is the right solution. But for me the command seems to do nothing. So I changed the system locale following this. Now those lake exe cache commands can run normaly in Git Bash.
Mac Malone (Apr 20 2024 at 08:51):
Mario Carneiro said:
Eric Wieser said:
Is it possible to get the raw bytes of a string in Lean?
No, and I think it's a hole in the API. From the lean side you can only view a string as a sequence of unicode codepoints, and UTF-8-ness is required at all times (so I don't think it is an option to put malformed data in it and unpack it later).
Good news! You can actually do both. String.toUTF8
returns the raw bytes of the String
(it does not actually decode them as UTF8) and String.fromUTF8Unchecked
packs raw (possibly malformed) bytes into a String
. The UTF8 decoding is only done on demand when fetching Char
s from the String
. The actually bytes of the String
may not be valid UTF8 under the hood.
Eric Wieser (Apr 20 2024 at 08:51):
Mac Malone (Apr 20 2024 at 09:15):
This fact is probably not desirable long-term, though, as it does give us another native_decide
proof of False:
@[simp] theorem String.map_id : String.map id s = s := by
sorry -- left as an exercise to the reader ;)
def badBytes := ByteArray.mk #[35, 32, 195, 10]
theorem badBytes_eq :
(String.fromUTF8Unchecked badBytes).map id = String.fromUTF8Unchecked badBytes
:= by simp
theorem badBytes_ne :
(String.fromUTF8Unchecked badBytes).map id ≠ String.fromUTF8Unchecked badBytes
:= by native_decide
example : False := badBytes_ne badBytes_eq
(Disclaimer: This is not a soundness issue in Lean as the use of native_decide
requires additional axioms.)
Eric Wieser (Apr 20 2024 at 09:16):
Is marking the docs#String.fromUTF8Unchecked unsafe
sensible here?
Eric Wieser (Apr 20 2024 at 09:17):
Oh, I guess it's already opaque
Mac Malone (Apr 20 2024 at 09:18):
Its not the only way to pack bad bytes into a string, though. Many String input IO operations can also do this.
Mac Malone (Apr 20 2024 at 09:21):
There are multiple String interface design choices that would need to to be changed to remove this hole. For example, the String equality here should likely be performing a char comparison rather than a byte comparison.
Mac Malone (Apr 20 2024 at 09:28):
(It is also just a general concern that map id
is not actually semantically the identity funciton on all strings.)
Mario Carneiro (Apr 20 2024 at 09:48):
these should not be opaque functions. They should do actual decoding, and fromUTF8Unchecked
should take a proof that the array is valid
Mario Carneiro (Apr 20 2024 at 09:51):
I don't think it's valid to have invalid UTF8 in a string because the definition of String
says it's a List Char
. You can prove consequences of this that imply that strings of invalid UTF8 have impossible properties
Mario Carneiro (Apr 20 2024 at 09:54):
Mac Malone said:
Its not the only way to pack bad bytes into a string, though. Many String input IO operations can also do this.
This is also a bug
Mario Carneiro (Apr 20 2024 at 09:55):
This is very bad if untrusted input can cause safe lean code to cause UB
Mario Carneiro (Apr 20 2024 at 12:53):
Here are some other functions that can be used to detect inconsistency in bad utf8 strings:
def badStr := String.fromUTF8Unchecked (ByteArray.mk #[0x80, 97, 0xe0])
def fixedStr := String.mk badStr.data
example : badStr = fixedStr := rfl
#eval badStr.utf8ByteSize -- 3
#eval fixedStr.utf8ByteSize -- 5
#eval badStr.get ⟨1⟩ -- 'a'
#eval fixedStr.get ⟨1⟩ -- 'A'
#eval badStr.next ⟨2⟩ -- 5
#eval fixedStr.next ⟨2⟩ -- 3
#eval badStr.toUTF8 -- [128, 97, 224]
#eval fixedStr.toUTF8 -- [194, 128, 97, 195, 160]
In other words it's not sufficient to fix string equality, many functions leak this detail
Mac Malone (Apr 20 2024 at 18:31):
Mario Carneiro said:
This is very bad if untrusted input can cause safe lean code to cause UB
While I agree that there are problems with this approach, I am confused as to how it causes undefined behavior. Everything behaves here in a very well-defined way, just not in the way one might initially assume (which is unfortunate).
Mario Carneiro (Apr 20 2024 at 18:36):
"compiler unsoundness" aka native_decide proofs of false can be used to cause the lean compiler to skip bounds checks and access things out of bounds because facts that are proved to hold do not hold at runtime
Mario Carneiro (Apr 20 2024 at 18:41):
def badStr := String.fromUTF8Unchecked (ByteArray.mk #[0x80, 97, 0xe0])
def fixedStr := String.mk badStr.data
def liar : Bool := badStr = fixedStr
theorem liar_eq_true : liar := decide_eq_true rfl
def foo := #[1][if liar then 0 else 1000]'(if_pos liar_eq_true ▸ by decide)
#eval foo -- 29863130885, or crash, or other random events
Mac Malone (Apr 20 2024 at 18:42):
On a practical level, its is also worth noting that changing this would require a full overhaul/review of all C/C++ code passing strings to/from Lean, as none of them do any validation currently (as all validaiton is done on accesses in the String API). The C code (and much FFI code) makes heavy use of the fact that you can just pack and unpack a C string or std::string
into a Lean String. Futhermore, performing validation could be a significant performance penalty. Hence, why, as you mentioned, Rust has OsString
.
However, I do agree that, ideally, this would be solved. (Maybe once work restarts on the compiler and other low-level code it will be?)
Mario Carneiro (Apr 20 2024 at 18:43):
This is already fixed in my PR lean4#3958
Mac Malone (Apr 20 2024 at 18:45):
That's nice! :tada: But, that only addresses fromUTF8Unchecked
, not the many other ways to pack bad bytes in.
Mario Carneiro (Apr 20 2024 at 18:46):
that's true, I see this
emitLns ["in = lean_box(0);",
"int i = argc;",
"while (i > 1) {",
" lean_object* n;",
" i--;",
" n = lean_alloc_ctor(1,2,0); lean_ctor_set(n, 0, lean_mk_string(argv[i])); lean_ctor_set(n, 1, in);",
" in = n;",
"}"]
which is very :face_with_peeking_eye:
Mario Carneiro (Apr 20 2024 at 20:12):
Mac Malone said:
On a practical level, its is also worth noting that changing this would require a full overhaul/review of all C/C++ code passing strings to/from Lean, as none of them do any validation currently
Done in lean4#3963
Mario Carneiro (Apr 20 2024 at 20:15):
Futhermore, performing validation could be a significant performance penalty.
It would be nice if we could use some SIMD accelerated UTF-8 validator. But note that validation runs when we're allocating a new string, so the overhead is to some extent hidden by the cost of allocating and copying the text into the new array.
Mac Malone (Apr 20 2024 at 20:44):
@Mario Carneiro If all string data is now valid UTF-8, then we can also optimize many of the indexing primitives to no longer do validation. :big_smile:
Mario Carneiro (Apr 20 2024 at 20:45):
Not completely, you can still use String.get
with a misaligned String.Pos
for example
Mario Carneiro (Apr 20 2024 at 20:46):
but I did add String.Pos.isValid
so we could have a version of String.get
which assumes validity
Mario Carneiro (Apr 20 2024 at 20:49):
Ah, the PR is failing because there are some wonky tests like
def �(x:=fun 0=>0)n:=� �
Mario Carneiro (Apr 20 2024 at 20:50):
I don't even know what we expect to happen here
Jeremy Avigad (Sep 03 2024 at 23:20):
I have a Chinese student in my interactive theorem proving class and she hit on exactly the problem that @Kunhong Du describes above. The solution Kunhong described, this one, almost worked: after checking the locale setting in the control panel, we could run lake exe cache get
in a terminal successfully and everything looks good. But on opening VS Code, she still gets an error, saying IIRC that there are dependencies missing, (curl
, I think). Has anyone come across this and does anyone have any ideas for a workaround? We are using leanprover/lean4:v4.11.0-rc2
.
Marc Huisinga (Sep 04 2024 at 06:54):
Jeremy Avigad said:
I have a Chinese student in my interactive theorem proving class and she hit on exactly the problem that Kunhong Du describes above. The solution Kunhong described, this one, almost worked: after checking the locale setting in the control panel, we could run
lake exe cache get
in a terminal successfully and everything looks good. But on opening VS Code, she still gets an error, saying IIRC that there are dependencies missing, (curl
, I think). Has anyone come across this and does anyone have any ideas for a workaround? We are usingleanprover/lean4:v4.11.0-rc2
.
After this error, what do the Troubleshooting: Show Setup Information
and Troubleshooting: Show Output
commands display?
Jz Pan (Sep 04 2024 at 18:20):
Is it possible to specify Lean install path? In my opinion the easiest solution is allow user to specify an install path without Chinese characters (and without spaces, which frequently causes ancient programs failed to run).
Last updated: May 02 2025 at 03:31 UTC