Zulip Chat Archive
Stream: lean4 dev
Topic: Conversion from String to ByteArray
Cayden Codel (Sep 17 2024 at 18:48):
Suppose I have a String
that I know is fully ASCII. Then reading any position in the string is the same as accessing its underlying byte index through getUtf8Byte
, which is implemented by a C function in O(1).
However, it may be more convenient (and performant?) to treat the String
as a Lean ByteArray
instead. The toUTF8
function looks promising, but its C implementation, lean_string_to_utf8
copies the entire array, which is O(n).
Ideally, the string could just be re-cast as a ByteArray
, but in order for Lean to act in a functional way, the copy seems like a sunk cost. But a friend (James Gallicchio) suggested to me that if the string's reference counter was 0 (or 1?), then the header could be discarded, and an array could be returned in O(1).
Thoughts? This seems like a quick PR, but I've not touched Lean's C++ internals before, and I haven't made a PR to a Lean project before, so I don't know the process/next steps.
François G. Dorais (Sep 17 2024 at 19:03):
I'm afraid the two types are not actually compatible at the C-level. The issue is the lean_string_object
's m_length
field which is stored where lean_sarray_object
expects m_data
to start. So if you coerce a lean_string_object
into a lean_sarray_object
, you need to add sizeof(size_t)
bytes to m_size
and m_capacity
, then remember that the first sizeof(size_t)
bytes of the m_data
are the string length rather than string contents.
/* Scalar arrays */
typedef struct {
lean_object m_header;
size_t m_size;
size_t m_capacity;
uint8_t m_data[0];
} lean_sarray_object;
typedef struct {
lean_object m_header;
size_t m_size; /* byte length including '\0' terminator */
size_t m_capacity;
size_t m_length; /* UTF8 length */
char m_data[0];
} lean_string_object;
François G. Dorais (Sep 17 2024 at 19:58):
For example:
def myString := "My string"
unsafe def myBytes : ByteArray := unsafeCast myString
#eval myBytes -- [9, 0, 0, 0, 0, 0, 0, 0, 77, 121]
Cayden Codel (Sep 17 2024 at 20:00):
Yes, I have already experimented with unsafeCast
, and found that the underlying data models are incompatible. So the next question is whether a string s
can be made into a ByteArray
arr
in O(1), given that s
is not referred to after the call to a modified toUTF8
function.
𝚠𝚘𝚓𝚌𝚒𝚎𝚌𝚑 𝚗𝚊𝚠𝚛𝚘𝚌𝚔𝚒 (Sep 17 2024 at 20:01):
I think what Cayden is saying, though, is not that unsafeCast
between the two types should work, but rather that the C++ implementation of toUtf8
could do a zero-copy operation if RC = 1.
EDIT: I got frontrun.
Mario Carneiro (Sep 17 2024 at 20:01):
it can't, because the data models are not compatible
Mario Carneiro (Sep 17 2024 at 20:03):
it could do a memmove
to shift the data back by 8 bytes when going string -> bytes, but vice versa needs 8 extra bytes that may not be allocated, and it still needs a memmove
to bring the data forward by 8 bytes which is O(n)
Mario Carneiro (Sep 17 2024 at 20:04):
plus, it needs to calculate the char count, so that means either an unsafe method or a linear pass
Cayden Codel (Sep 17 2024 at 20:05):
Well, the conversion I'm looking for only needs to be done one way, but I understand the problem now. Ah, the joys of variable-length structs in C
Mario Carneiro (Sep 17 2024 at 20:05):
yeah, if the data was stored in a separate allocation this would not be a problem
𝚠𝚘𝚓𝚌𝚒𝚎𝚌𝚑 𝚗𝚊𝚠𝚛𝚘𝚌𝚔𝚒 (Sep 17 2024 at 20:07):
What happens if a misaligned struct with lean_sarray_object* p = &str.m_header + sizeof(size_t)
is produced, and fields other than m_data
rewritten to have the correct values? Perhaps some deallocation would fail later.
Cayden Codel (Sep 17 2024 at 20:09):
I sense some very careful bookkeeping would need to be kept, and that way lies madness
Mario Carneiro (Sep 17 2024 at 20:14):
yes, that would most likely cause a later free
to segfault
François G. Dorais (Sep 17 2024 at 20:26):
A better, though heavy handed approach, would be to make these (and all other?) types more compatible by adding some padding before m_data
so that the offset of m_data
is always the same.
Mario Carneiro (Sep 17 2024 at 20:27):
or deleting the char count field...
François G. Dorais (Sep 17 2024 at 20:36):
That would make String.length
O(n). Perhaps getting rid of lean_string_object
altogether and defining String
as an extension of ByteArray
with some support to cache String.length
somehow.
François G. Dorais (Sep 17 2024 at 20:39):
Wait: did you mean deleting m_size
or m_length
? Once you know m_length
, m_capacity
is enough for most operations, I think.
Mario Carneiro (Sep 17 2024 at 20:40):
François G. Dorais said:
That would make
String.length
O(n).
So? List.length
is O(n) too. And String char count isn't even the right measurement for most activities you want to do with a string. (What would rust do?)
Mario Carneiro (Sep 17 2024 at 20:41):
I don't like the term "string length" because it's ambiguous. Even worse if it's next to "string size"
Mario Carneiro (Sep 17 2024 at 20:41):
I don't think anyone can reliably guess which one is which
François G. Dorais (Sep 17 2024 at 20:42):
Agreed. Might be worth making an RFC?
Mario Carneiro (Sep 17 2024 at 20:42):
for which part?
François G. Dorais (Sep 17 2024 at 20:43):
Deleting m_length
(equivalently implementing String
as a lean_sarray_object
and removing lean_string_object
).
François G. Dorais (Sep 17 2024 at 20:50):
The RFC would mostly exist to record the idea. It's fairly obvious that it would get the lowest possible priority, at best.
Henrik Böving (Sep 17 2024 at 21:07):
We have floated this idea internally at the FRO before. I think an RFC that tries to address this would at least be of interested, can't make any guarantees on whether it will be accepted of course.
François G. Dorais (Sep 18 2024 at 00:09):
I'll put that RFC on my todo list. It's low priority to me as well so if someone gets around to it before me then I will be happy with that.
Last updated: May 02 2025 at 03:31 UTC