Zulip Chat Archive

Stream: new members

Topic: How to read binary files into Lean's type?

Yicheng Tao (Feb 12 2025 at 12:53):

I'm trying to reproduce https://jrdngr.github.io/pngme_book/ in Lean, but there seems to be many problems about low-level bytes interpretation.
As an example, I have the following structures defined in Lean:

structure ChunkType where
  ancillary : Char
  private_ : Char
  reserved: Char
  safe_to_copy: Char

structure Chunk where
  length     : UInt32
  chunk_type : ChunkType
  data       : Array UInt8
  crc        : UInt32

If I want to read a binary file with IO.FS.readBinFile, it returns ByteArray. How do I turn this ByteArray into a Chunk?
I've tried some FFI to handle the transformation between UInt32 and UInt8, but it's quite ugly. Is there a right way to turn a UInt32 to 4 UInt8? More generally, for any data type, how to turn it into a ByteArray?

Henrik Böving (Feb 12 2025 at 13:43):

You're asking two questions here, one is how do we deserialize binary structures and the other is how do we serialize them.

For deserializing the best thing I can offer is docs#Std.Internal.Parsec.ByteArray.Parser which is currently not a stabilized API but works well for high performance parsing of binary data as for example implemented in the LRAT checker for bv_decide.

For serializing data there does not really exist a framework yet. Do note that it is possible to turn a UInt32 into a UInt8 in at least two ways (little and big endian) so there is quite some work to be done to support a comprehensive serialisation framework for stuff like this.

Yicheng Tao (Feb 12 2025 at 14:33):

It seems much more complicated than in Rust. I wonder why we don't have implementation for something like .bytes() and .from_bytes() for the basic data types.
Currently I don't quite understand how docs#Std.Internal.Parsec.ByteArray.Parser works. At first glance I need to manually implement a parser for my structure.
Anyway thanks for your answer. I will try to figure it out.

Henrik Böving (Feb 12 2025 at 14:49):

It seems much more complicated than in Rust. I wonder why we don't have implementation for something like .bytes() and .from_bytes() for the basic data types.

Because nobody has done it, it is entirely possible but we do not have infinite time.

At first glance I need to manually implement a parser for my structure.

Yes, nobody has implemented automated generation of parsers yet because there was no demand and time is not infinite.

You have to keep in mind that you are asking for a niche feature in an eco system that is currently mostly focused on developing everything you need for proving stuff and has only recently started systematically working on programming related APIs while Rust has had a literal decade of hundreds of people writing open source libraries to do these kind of things. If you want this kind of feature you could be the person that writes the library that people end up using^^

Julian Berman (Feb 12 2025 at 14:53):

Do we have anyone working on the infinite time problem? That seems nice to solve.

Tomas Skrivan (Feb 12 2025 at 16:44):

I should have more or less all the main pieces to get this working. (with one caveat, it can support only fixed size arrays in a struct i.e. you have to use Vector instead of Array)

In the library I'm working on, SciLean, I have a class PlainDataType X allowing you write X to/from ByteArray. One restriction is that X has to have fixed bit/byte size hence you can't provide in instance PlainDataType (Array X) but you can provide an instance PlainDataType (Vector X n).

(Also a warning: there are no proofs and the spec is most likely incomplete and likely inconsistent in some very niche scenarios. Anyway the code runs and so far I have not found any bugs)

PlainDataType

structure BitType (α : Type*) where
  bits : UInt8
  h_size : bits ≤ 8  -- we consider only types fitting into a single byte
  fromByte (b : UInt8) : α
  toByte   (a : α) : UInt8
  -- TODO: Add condition that toByte sets all unused bits to zero
  -- TODO: Add condition that fromByte does not use any of the unused bits
  fromByte_toByte : ∀ a, fromByte (toByte a) = a

structure ByteType (α : Type*) where
  bytes : USize
  h_size : 1 < bytes  -- for one byte types use BitType
  /-- Read `a : α` from byte array `b` starting at the byte `i` -/
  fromByteArray (b : ByteArray) (i : USize) (h : (i+bytes).toNat ≤ b.size) : α
  /-- Write `a : α` from byte array `b` starting at the byte `i` -/
  toByteArray   (b : ByteArray) (i : USize) (h : (i+bytes).toNat ≤ b.size) (a : α) : ByteArray

  /-- `toByteArray` does not modify ByteArray size -/
  toByteArray_size : ∀ b i h a, (toByteArray b i h a).size = b.size
  /-- we can recover `a` from bytes -/
  fromByteArray_toByteArray : ∀ a b i h h', fromByteArray (toByteArray b i h a) i h' = a
  /-- `toByteArray` does not affect other bytes -/
  fromByteArray_toByteArray_other : ∀ a b i j h, (j < i) ∨ (i + bytes) ≤ j → (toByteArray b i h a).uget j sorry_proof = b.uget j sorry_proof

class PlainDataType (α : Type*) where
  btype : BitType α ⊕ ByteType α

To write structures from/to byte arrays can use mathlib's proxy_equiv% which generates equivalence between structure and some canonical type. For example, proxy_equiv% ChunkType returns (_ : Char) × (_ : Char) × (_ : Char) × Char ≃ ChunkType

To get this working we need two pieces.

PlainDataType instance for sigma type(dependent product (_ : α) × β)
translate instance PlainDataType along an equivalence

With all of this you can do this:
(I switched from Char to UInt64 as for Char I would have to provide BitType instead of ByteType. This way it made it a bit simpler for me.)

structure ChunkType where
  ancillary : UInt64
  private_ : UInt64
  reserved: UInt64
  safe_to_copy: UInt64

-- translate instance `PlainDataType ((_ : UInt64) × (_ : UInt64) × (_ : UInt64) × UInt64)` to `PlainDataType ChunkType `
instance : PlainDataType ChunkType := PlainDataType.ofEquiv (proxy_equiv% ChunkType)

-- turn `ChunkType` into `ByteArray` (the function is a bit of a mess because of missing API)
def ChunkType.toByteArray (x : ChunkType) : ByteArray :=
  (PlainDataType.btype (α := ChunkType)).getRight (by decide)
    |>.toByteArray (ByteArray.mk (Array.mkArray (4*8) 6)) 0 (sorry_proof) x

-- [1, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 3, 0, 3, 0, 0, 0, 0, 0, 4, 0, 0, 4, 0, 0, 0, 0]
#eval (ChunkType.mk 1 (2+2*256) (3+3*65536) (4+4*16777216)).toByteArray

mwe using SciLean
(you can just rip this file out of the project, you will have to remove the import SciLean.Data.IndexType and the corresponding instance somewhere in the file)

code

import SciLean.Data.DataArray.PlainDataType
import Mathlib.Tactic.ProxyType

open SciLean


def Prod.byteTypeSigma {α β} (ta : ByteType α) (tb : ByteType β) : ByteType ((_ : α) × β) :=
  {
    bytes := ta.bytes + tb.bytes
    h_size := sorry_proof

    fromByteArray := λ arr i _ =>
      ⟨ta.fromByteArray arr i sorry_proof,
       tb.fromByteArray arr (i+ta.bytes) sorry_proof⟩
    toByteArray := λ arr i _ ⟨a,b⟩ =>
      arr |> (ta.toByteArray · (i) sorry_proof a)
          |> (tb.toByteArray · (i+ta.bytes) sorry_proof b)

    toByteArray_size := sorry_proof
    fromByteArray_toByteArray := sorry_proof
    fromByteArray_toByteArray_other := sorry_proof
  }

local instance [Inhabited α] : Inhabited (ByteType α) := {
  default := {
    bytes := panic! "not implemented"
    h_size := sorry_proof
    fromByteArray b i h := panic! "not implemented"
    toByteArray b i h a := panic! "not implemented"
    toByteArray_size := sorry_proof
    fromByteArray_toByteArray := sorry_proof
    fromByteArray_toByteArray_other := sorry_proof
  }
}

instance instPlainDataTypeSigma [Inhabited α] [Inhabited β]
    [ta : PlainDataType α] [tb : PlainDataType β] :
    PlainDataType ((_:α)×β) where
  btype :=
    match ta.btype, tb.btype with
    | .inl aBitType,  .inl bBitType  => .inr <| panic! "not implemented"
    | .inl aBitType,  .inr bByteType => .inr <| panic! "not implemented"
    | .inr aByteType, .inl bBitType  => .inr <| panic! "not implemented"
    | .inr aByteType, .inr bByteType => .inr <| Prod.byteTypeSigma aByteType bByteType


def PlainDataType.ofEquiv {α β : Type*} [p : PlainDataType α] (f : α ≃ β) : PlainDataType β where
  btype :=
    match p.btype with
    | .inl bt => .inl {
      bits := bt.bits
      h_size := bt.h_size
      fromByte b := f (bt.fromByte b)
      toByte a := bt.toByte (f.symm a)
      fromByte_toByte := sorry_proof
    }
    | .inr bt => .inr {
      bytes := bt.bytes
      h_size := bt.h_size
      fromByteArray b i h := f (bt.fromByteArray b i h)
      toByteArray b i h a := bt.toByteArray b i h (f.symm a)
      toByteArray_size := sorry_proof
      fromByteArray_toByteArray := sorry_proof
      fromByteArray_toByteArray_other := sorry_proof
    }

structure ChunkType where
  ancillary : UInt64
  private_ : UInt64
  reserved: UInt64
  safe_to_copy: UInt64

instance : PlainDataType ChunkType := PlainDataType.ofEquiv (proxy_equiv% ChunkType)

def ChunkType.toByteArray (x : ChunkType) : ByteArray :=
  (PlainDataType.btype (α := ChunkType)).getRight (by decide)
    |>.toByteArray (ByteArray.mk (Array.mkArray (4*8) 6)) 0 (sorry_proof) x

#eval (ChunkType.mk 1 (2+2*256) (3+3*65536) (4+4*16777216)).toByteArray

Tomas Skrivan (Feb 12 2025 at 16:50):

So for your application you need to

do a complete implementation of the instance instPlainDataTypeSigma. This should be just copy/paste of instPlainDataTypeProd where you switch from (,) to ⟨,⟩ where appropriate
provide instance PlainDataType α → PlainDataType (Vector α n)
redefine Chunk to

structure Chunk where
  length     : UInt32
  chunk_type : ChunkType
  data_size  : UInt32
  data       : Vector UInt8 data_size.toNat
  crc        : UInt32

Tomas Skrivan (Feb 12 2025 at 16:54):

I would be happy for a PR to SciLean doing this :) also together with a deriving extension such that you can just write

structure Chunk where
  length     : UInt32
  chunk_type : ChunkType
  data       : Vector UInt8 length.toNat
  crc        : UInt32
  deriving PlainDataType

Yicheng Tao (Feb 13 2025 at 05:49):

Tomas Skrivan said:

So for your application you need to

do a complete implementation of the instance instPlainDataTypeSigma. This should be just copy/paste of instPlainDataTypeProd where you switch from (,) to ⟨,⟩ where appropriate

provide instance PlainDataType α → PlainDataType (Vector α n)

redefine Chunk to
structure Chunk where
  length     : UInt32
  chunk_type : ChunkType
  data_size  : UInt32
  data       : Vector UInt8 data_size.toNat
  crc        : UInt32

It looks promising, I'll try. Nice idea to have a PlainDataType class.

Tomas Skrivan (Feb 13 2025 at 15:41):

Ah I missed one thing, Chunk still does not have a fixed byte size as defined which is necessary for PlainDataType. There are two options:

Work with Chunk data_size i.e. parametrize Chunk by the size of data. Now you have a type with a fixed byte size.
Generalize PlainDataType to types that don't have fixed byte size. Probably every type will have a well defined minimal byte size and from reading this minimal part you can infer the actual size of the element.

Of course you can kind of mix 1 and 2. Store at the beginning ByteArray the value of data_size and after that store Chunk data_size.

Last updated: May 02 2025 at 03:31 UTC