Zulip Chat Archive
Stream: new members
Topic: How to read binary files into Lean's type?
Yicheng Tao (Feb 12 2025 at 12:53):
I'm trying to reproduce https://jrdngr.github.io/pngme_book/ in Lean, but there seems to be many problems about low-level bytes interpretation.
As an example, I have the following structures defined in Lean:
structure ChunkType where
ancillary : Char
private_ : Char
reserved: Char
safe_to_copy: Char
structure Chunk where
length : UInt32
chunk_type : ChunkType
data : Array UInt8
crc : UInt32
If I want to read a binary file with IO.FS.readBinFile
, it returns ByteArray
. How do I turn this ByteArray
into a Chunk
?
I've tried some FFI to handle the transformation between UInt32
and UInt8
, but it's quite ugly. Is there a right way to turn a UInt32
to 4 UInt8
? More generally, for any data type, how to turn it into a ByteArray
?
Henrik Böving (Feb 12 2025 at 13:43):
You're asking two questions here, one is how do we deserialize binary structures and the other is how do we serialize them.
For deserializing the best thing I can offer is docs#Std.Internal.Parsec.ByteArray.Parser which is currently not a stabilized API but works well for high performance parsing of binary data as for example implemented in the LRAT checker for bv_decide
.
For serializing data there does not really exist a framework yet. Do note that it is possible to turn a UInt32
into a UInt8
in at least two ways (little and big endian) so there is quite some work to be done to support a comprehensive serialisation framework for stuff like this.
Yicheng Tao (Feb 12 2025 at 14:33):
It seems much more complicated than in Rust. I wonder why we don't have implementation for something like .bytes()
and .from_bytes()
for the basic data types.
Currently I don't quite understand how docs#Std.Internal.Parsec.ByteArray.Parser works. At first glance I need to manually implement a parser for my structure.
Anyway thanks for your answer. I will try to figure it out.
Henrik Böving (Feb 12 2025 at 14:49):
It seems much more complicated than in Rust. I wonder why we don't have implementation for something like
.bytes()
and.from_bytes()
for the basic data types.
Because nobody has done it, it is entirely possible but we do not have infinite time.
At first glance I need to manually implement a parser for my structure.
Yes, nobody has implemented automated generation of parsers yet because there was no demand and time is not infinite.
You have to keep in mind that you are asking for a niche feature in an eco system that is currently mostly focused on developing everything you need for proving stuff and has only recently started systematically working on programming related APIs while Rust has had a literal decade of hundreds of people writing open source libraries to do these kind of things. If you want this kind of feature you could be the person that writes the library that people end up using^^
Julian Berman (Feb 12 2025 at 14:53):
Do we have anyone working on the infinite time problem? That seems nice to solve.
Tomas Skrivan (Feb 12 2025 at 16:44):
I should have more or less all the main pieces to get this working. (with one caveat, it can support only fixed size arrays in a struct i.e. you have to use Vector
instead of Array
)
In the library I'm working on, SciLean, I have a class PlainDataType X
allowing you write X
to/from ByteArray
. One restriction is that X
has to have fixed bit/byte size hence you can't provide in instance PlainDataType (Array X)
but you can provide an instance PlainDataType (Vector X n)
.
(Also a warning: there are no proofs and the spec is most likely incomplete and likely inconsistent in some very niche scenarios. Anyway the code runs and so far I have not found any bugs)
PlainDataType
To write structures from/to byte arrays can use mathlib's proxy_equiv%
which generates equivalence between structure and some canonical type. For example, proxy_equiv% ChunkType
returns (_ : Char) × (_ : Char) × (_ : Char) × Char ≃ ChunkType
To get this working we need two pieces.
PlainDataType
instance for sigma type(dependent product(_ : α) × β
)- translate instance
PlainDataType
along an equivalence
With all of this you can do this:
(I switched from Char
to UInt64
as for Char
I would have to provide BitType
instead of ByteType
. This way it made it a bit simpler for me.)
structure ChunkType where
ancillary : UInt64
private_ : UInt64
reserved: UInt64
safe_to_copy: UInt64
-- translate instance `PlainDataType ((_ : UInt64) × (_ : UInt64) × (_ : UInt64) × UInt64)` to `PlainDataType ChunkType `
instance : PlainDataType ChunkType := PlainDataType.ofEquiv (proxy_equiv% ChunkType)
-- turn `ChunkType` into `ByteArray` (the function is a bit of a mess because of missing API)
def ChunkType.toByteArray (x : ChunkType) : ByteArray :=
(PlainDataType.btype (α := ChunkType)).getRight (by decide)
|>.toByteArray (ByteArray.mk (Array.mkArray (4*8) 6)) 0 (sorry_proof) x
-- [1, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 3, 0, 3, 0, 0, 0, 0, 0, 4, 0, 0, 4, 0, 0, 0, 0]
#eval (ChunkType.mk 1 (2+2*256) (3+3*65536) (4+4*16777216)).toByteArray
mwe using SciLean
(you can just rip this file out of the project, you will have to remove the import SciLean.Data.IndexType
and the corresponding instance somewhere in the file)
code
Tomas Skrivan (Feb 12 2025 at 16:50):
So for your application you need to
- do a complete implementation of the instance
instPlainDataTypeSigma
. This should be just copy/paste ofinstPlainDataTypeProd
where you switch from(,)
to⟨,⟩
where appropriate - provide instance
PlainDataType α → PlainDataType (Vector α n)
- redefine
Chunk
to
structure Chunk where
length : UInt32
chunk_type : ChunkType
data_size : UInt32
data : Vector UInt8 data_size.toNat
crc : UInt32
Tomas Skrivan (Feb 12 2025 at 16:54):
I would be happy for a PR to SciLean doing this :) also together with a deriving
extension such that you can just write
structure Chunk where
length : UInt32
chunk_type : ChunkType
data : Vector UInt8 length.toNat
crc : UInt32
deriving PlainDataType
Yicheng Tao (Feb 13 2025 at 05:49):
Tomas Skrivan said:
So for your application you need to
- do a complete implementation of the instance
instPlainDataTypeSigma
. This should be just copy/paste ofinstPlainDataTypeProd
where you switch from(,)
to⟨,⟩
where appropriate- provide instance
PlainDataType α → PlainDataType (Vector α n)
- redefine
Chunk
tostructure Chunk where length : UInt32 chunk_type : ChunkType data_size : UInt32 data : Vector UInt8 data_size.toNat crc : UInt32
It looks promising, I'll try. Nice idea to have a PlainDataType
class.
Tomas Skrivan (Feb 13 2025 at 15:41):
Ah I missed one thing, Chunk
still does not have a fixed byte size as defined which is necessary for PlainDataType
. There are two options:
- Work with
Chunk data_size
i.e. parametrizeChunk
by the size ofdata
. Now you have a type with a fixed byte size. - Generalize
PlainDataType
to types that don't have fixed byte size. Probably every type will have a well defined minimal byte size and from reading this minimal part you can infer the actual size of the element.
Of course you can kind of mix 1 and 2. Store at the beginning ByteArray
the value of data_size
and after that store Chunk data_size
.
Last updated: May 02 2025 at 03:31 UTC