Zulip Chat Archive

Stream: lean4 dev

Topic: Conversion from String to ByteArray

Cayden Codel (Sep 17 2024 at 18:48):

Suppose I have a String that I know is fully ASCII. Then reading any position in the string is the same as accessing its underlying byte index through getUtf8Byte, which is implemented by a C function in O(1).

However, it may be more convenient (and performant?) to treat the String as a Lean ByteArray instead. The toUTF8 function looks promising, but its C implementation, lean_string_to_utf8 copies the entire array, which is O(n).

Ideally, the string could just be re-cast as a ByteArray, but in order for Lean to act in a functional way, the copy seems like a sunk cost. But a friend (James Gallicchio) suggested to me that if the string's reference counter was 0 (or 1?), then the header could be discarded, and an array could be returned in O(1).

Thoughts? This seems like a quick PR, but I've not touched Lean's C++ internals before, and I haven't made a PR to a Lean project before, so I don't know the process/next steps.

François G. Dorais (Sep 17 2024 at 19:03):

I'm afraid the two types are not actually compatible at the C-level. The issue is the lean_string_object's m_length field which is stored where lean_sarray_object expects m_data to start. So if you coerce a lean_string_object into a lean_sarray_object, you need to add sizeof(size_t) bytes to m_size and m_capacity, then remember that the first sizeof(size_t) bytes of the m_data are the string length rather than string contents.

/* Scalar arrays */
typedef struct {
    lean_object   m_header;
    size_t        m_size;
    size_t        m_capacity;
    uint8_t       m_data[0];
} lean_sarray_object;

typedef struct {
    lean_object m_header;
    size_t      m_size;     /* byte length including '\0' terminator */
    size_t      m_capacity;
    size_t      m_length;   /* UTF8 length */
    char        m_data[0];
} lean_string_object;

François G. Dorais (Sep 17 2024 at 19:58):

For example:

def myString := "My string"

unsafe def myBytes : ByteArray := unsafeCast myString

#eval myBytes -- [9, 0, 0, 0, 0, 0, 0, 0, 77, 121]

Cayden Codel (Sep 17 2024 at 20:00):

Yes, I have already experimented with unsafeCast, and found that the underlying data models are incompatible. So the next question is whether a string s can be made into a ByteArray arr in O(1), given that s is not referred to after the call to a modified toUTF8 function.

𝚠𝚘𝚓𝚌𝚒𝚎𝚌𝚑 𝚗𝚊𝚠𝚛𝚘𝚌𝚔𝚒 (Sep 17 2024 at 20:01):

I think what Cayden is saying, though, is not that unsafeCast between the two types should work, but rather that the C++ implementation of toUtf8 could do a zero-copy operation if RC = 1.

EDIT: I got frontrun.

Mario Carneiro (Sep 17 2024 at 20:01):

it can't, because the data models are not compatible

Mario Carneiro (Sep 17 2024 at 20:03):

it could do a memmove to shift the data back by 8 bytes when going string -> bytes, but vice versa needs 8 extra bytes that may not be allocated, and it still needs a memmove to bring the data forward by 8 bytes which is O(n)

Mario Carneiro (Sep 17 2024 at 20:04):

plus, it needs to calculate the char count, so that means either an unsafe method or a linear pass

Cayden Codel (Sep 17 2024 at 20:05):

Well, the conversion I'm looking for only needs to be done one way, but I understand the problem now. Ah, the joys of variable-length structs in C

Mario Carneiro (Sep 17 2024 at 20:05):

yeah, if the data was stored in a separate allocation this would not be a problem

𝚠𝚘𝚓𝚌𝚒𝚎𝚌𝚑 𝚗𝚊𝚠𝚛𝚘𝚌𝚔𝚒 (Sep 17 2024 at 20:07):

What happens if a misaligned struct with lean_sarray_object* p = &str.m_header + sizeof(size_t) is produced, and fields other than m_data rewritten to have the correct values? Perhaps some deallocation would fail later.

Cayden Codel (Sep 17 2024 at 20:09):

I sense some very careful bookkeeping would need to be kept, and that way lies madness

Mario Carneiro (Sep 17 2024 at 20:14):

yes, that would most likely cause a later free to segfault

François G. Dorais (Sep 17 2024 at 20:26):

A better, though heavy handed approach, would be to make these (and all other?) types more compatible by adding some padding before m_data so that the offset of m_data is always the same.

Mario Carneiro (Sep 17 2024 at 20:27):

or deleting the char count field...

François G. Dorais (Sep 17 2024 at 20:36):

That would make String.length O(n). Perhaps getting rid of lean_string_object altogether and defining String as an extension of ByteArray with some support to cache String.length somehow.

François G. Dorais (Sep 17 2024 at 20:39):

Wait: did you mean deleting m_size or m_length? Once you know m_length, m_capacity is enough for most operations, I think.

Mario Carneiro (Sep 17 2024 at 20:40):

François G. Dorais said:

That would make String.length O(n).

So? List.length is O(n) too. And String char count isn't even the right measurement for most activities you want to do with a string. (What would rust do?)

Mario Carneiro (Sep 17 2024 at 20:41):

I don't like the term "string length" because it's ambiguous. Even worse if it's next to "string size"

Mario Carneiro (Sep 17 2024 at 20:41):

I don't think anyone can reliably guess which one is which

François G. Dorais (Sep 17 2024 at 20:42):

Agreed. Might be worth making an RFC?

Mario Carneiro (Sep 17 2024 at 20:42):

for which part?

François G. Dorais (Sep 17 2024 at 20:43):

Deleting m_length (equivalently implementing String as a lean_sarray_object and removing lean_string_object).

François G. Dorais (Sep 17 2024 at 20:50):

The RFC would mostly exist to record the idea. It's fairly obvious that it would get the lowest possible priority, at best.

Henrik Böving (Sep 17 2024 at 21:07):

We have floated this idea internally at the FRO before. I think an RFC that tries to address this would at least be of interested, can't make any guarantees on whether it will be accepted of course.

François G. Dorais (Sep 18 2024 at 00:09):

I'll put that RFC on my todo list. It's low priority to me as well so if someone gets around to it before me then I will be happy with that.

Last updated: May 02 2025 at 03:31 UTC