Zulip Chat Archive

Stream: lean4

Topic: Core Stream class


François G. Dorais (Jan 16 2023 at 21:41):

The core Stream class is intended to be usable in monadic code, but the return type of Stream.next? looks incorrect to me. Right now it returns Option (value × stream) but it should return Option value × stream. The idea is that next? could have side-effects even if there is no output, so it should update the stream state. For example, if a stream processes data from another input stream, it receives some data from the input but not enough to produce an output, returning none simply loses the partial data.

Is there a better stream class elsewhere? Or is this a defect with the current Stream class?

Mario Carneiro (Jan 16 2023 at 22:13):

I think the class is okay as is. If a stream has a "partial-next" then I think it is difficult to describe what the output type of that partial result is, and it is better to just have a special (non-typeclass) method on the object to express that.

François G. Dorais (Jan 16 2023 at 22:42):

I thought the idea of making Stream a class was to keep it lightweight but still usable in monadic code. There are definitely many monadic StreamM and even StreamT options, but that's what I'm trying to avoid. (Am I falling into an anti-pattern?)

Since Stream is a pure class, Stream.next? can't have side-effects on its own and so any "partial-next" would have to be stored in the stream state. It's the job of the user to work with the enclosing monad to keep track of the state change.

My use case is with unicode normalization where it makes perfect sense to gradually process a string, especially inside monadic code like inside IO, but it makes little sense to add a layer the monad stack just for that.

Mario Carneiro (Jan 16 2023 at 23:30):

Note that Stream is used by do notation, so its type is partially affected by the use case there

Mario Carneiro (Jan 16 2023 at 23:32):

What I'm saying is that you could have a UnicodeStream.partialNext? : UnicodeStream -> Except PartialChar (Char × UnicodeStream) and then wrap that for the typeclass next? operation

François G. Dorais (Jan 16 2023 at 23:32):

Yes, it's embedded really deep! I'm toying with the idea of adding a Std.Stream class but I'm not sure, that's why I'm asking.

François G. Dorais (Jan 16 2023 at 23:36):

(Out of order reply) That makes sense but there's lots of normalizations and intermediate steps, each of which has their own PartialChar type. Some of these are internal where it doesn't matter, but some are user facing. There's also the issue when the input reaches a hard ending but the last character hasn't been output yet...

Mario Carneiro (Jan 16 2023 at 23:38):

I thought the last thing was what PartialChar was

Mario Carneiro (Jan 16 2023 at 23:38):

If you run out of the current buffer but you aren't at the actual end, you shouldn't be returning none

François G. Dorais (Jan 16 2023 at 23:39):

Exactly!

Mario Carneiro (Jan 16 2023 at 23:39):

I'm not even sure you should be using a stream for that

François G. Dorais (Jan 16 2023 at 23:39):

That's why I asked for another option.

Mario Carneiro (Jan 16 2023 at 23:40):

the return value can be a sum type representing the possible completions, e.g. "need more input", "invalid UTF8", "input ended with partial char", "complete char"

François G. Dorais (Jan 16 2023 at 23:41):

There's no need for that complexity.

Mario Carneiro (Jan 16 2023 at 23:41):

if you want to make a state machine for utf8 then that's roughly what you would need to have internally

François G. Dorais (Jan 16 2023 at 23:42):

Normalization works with code points, not utf8 specifically.

Mario Carneiro (Jan 16 2023 at 23:42):

I'm not sure I would expose any of it to the user except for a basic stream of chars

Mario Carneiro (Jan 16 2023 at 23:43):

normalization has similar characteristics to utf8 chars in this regard I think

François G. Dorais (Jan 16 2023 at 23:43):

The Unicode standard recommends it (and I recommend it too!)

Mario Carneiro (Jan 16 2023 at 23:43):

why should the user need to care about it?

François G. Dorais (Jan 16 2023 at 23:44):

utf8 is a specific encoding of Unicode characters. It has nothing to do with what I care about.

Mario Carneiro (Jan 16 2023 at 23:45):

sure, my point is only that normalization involves looking at some units and merging them together

Mario Carneiro (Jan 16 2023 at 23:45):

so if you are in a streaming context then you have to remember whether you are mid-merge

François G. Dorais (Jan 16 2023 at 23:45):

It's about accented characters and their many equivalent representations as actual String type.

Mario Carneiro (Jan 16 2023 at 23:46):

Obviously it's going to be a lot simpler to do normalization as a bulk operation

François G. Dorais (Jan 16 2023 at 23:46):

What bulk?

Mario Carneiro (Jan 16 2023 at 23:46):

the string

Mario Carneiro (Jan 16 2023 at 23:47):

input: unnormalized string, output: normalized string

Mario Carneiro (Jan 16 2023 at 23:47):

simple type, hard to misuse

François G. Dorais (Jan 16 2023 at 23:49):

Yes, but you're thinking of normalization is a one-step procedure. It's not, there's several normalization types and steps and the intermediate steps have their own uses, so it's optimal to process as a stream, piggy backing each step to achieve linear-ish processing.

François G. Dorais (Jan 16 2023 at 23:51):

The stream type I originally described is perfect for this.

Mario Carneiro (Jan 16 2023 at 23:53):

Yes, but you're thinking of normalization is a one-step procedure. It's not, there's several normalization types and steps and the intermediate steps have their own uses, so it's optimal to process as a stream, piggy backing each step to achieve linear-ish processing.

From a user standpoint, it is simpler to present it as a one-step procedure. Anything more than that is decidedly advanced usage

Mario Carneiro (Jan 16 2023 at 23:55):

Moreover, I'm fairly certain that layering a bunch of option functions processing characters will result in a slower code than a bunch of passes over the data. You should benchmark the options first

François G. Dorais (Jan 16 2023 at 23:55):

Yes, but as I said before, the Unicode standard recommends exposing some (but not all) intermediate steps to allow users to implement optimizations (e.g. locale-specific optimizations).

Mario Carneiro (Jan 16 2023 at 23:57):

anyway, you don't need a stream typeclass to do what you want. Just have functions that return whatever you find preferable

François G. Dorais (Jan 17 2023 at 00:00):

Mario Carneiro said:

Moreover, I'm fairly certain that layering a bunch of option functions processing characters will result in a slower code than a bunch of passes over the data. You should benchmark the options first

I thought so too but no, the most efficient way is to skip characters where the transformation stack would be useless (there's always a table for that) and only apply the transformations when necessary. So the optimizations are always handled in the earliest transformation step. This is built-into the standard and annexes.

Mario Carneiro (Jan 17 2023 at 00:03):

I'm thinking more about the cost of the operations themselves in lean's runtime rather than the algorithmic cost. Again, benchmarking is key

François G. Dorais (Jan 17 2023 at 00:04):

Mario Carneiro said:

anyway, you don't need a stream typeclass to do what you want. Just have functions that return whatever you find preferable

Yes, it's easy to implement a Unicode.Stream class that does the right thing. Which brings me back to my original question: Is there a better steam class elsewhere?

Mario Carneiro (Jan 17 2023 at 00:04):

I don't think you need a class at all

Mario Carneiro (Jan 17 2023 at 00:05):

just plain functions

Mario Carneiro (Jan 17 2023 at 00:05):

I'm not ready to generalize from this example to a typeclass with a big API

François G. Dorais (Jan 17 2023 at 00:06):

What big API? There's just one function!?

Mario Carneiro (Jan 17 2023 at 00:18):

If the one function is all you want, then why bother with a class?

François G. Dorais (Jan 17 2023 at 00:38):

I think we're running in circles here.

  • I want to use a class because (1) a unified interface is useful (2) it's reused in the codebase a lot and (3) there are user facing applications. (The last point is why I'm asking about existing options.)
  • I don't want to use a monad or anything heavyweight or too specialized because the API has only one nontrivial function (which is exactly the same signature in all use cases, but different implementations).
  • It's only a tiny variation of an existing class. (Which is baked hard in core, so hard to change.)

Mario Carneiro (Jan 17 2023 at 01:06):

In that case you should make your own class and test it out

François G. Dorais (Jan 17 2023 at 01:11):

I did and it works fine! I wouldn't have asked this without testing it out first!

If there are no existing alternatives that meet my needs, then the second question still stands: is this a defect in the core Stream class? Or is it really intended as a "fused stream"?

François G. Dorais (Jan 17 2023 at 01:15):

(That's "fused" as in "fuse box", once it blows you have to replace it.)

Mario Carneiro (Jan 17 2023 at 02:05):

It is clearly a fused stream

James Gallicchio (Jan 17 2023 at 04:15):

(which is a very useful class for representing iteration, since it has a clear indication of the end of iteration)

James Gallicchio (Jan 17 2023 at 04:18):

I suspect eventually we'll have a library of classes for expressing transformer pipelines, but like Mario said it feels super unclear what that should look like without examples.

I'm not sure I followed the Unicode example, but if you have the code pushed to a repository somewhere I'm interested in taking a look :D

Sebastian Ullrich (Jan 17 2023 at 09:24):

I didn't know this was allowed in e.g. Rust, but it still sounds crazy to me. Would be interesting to know how much they rely on it.

Individual iterator implementations may choose to resume iteration, and so calling next() again may or may not eventually start returning Some(Item) again at some point. https://doc.rust-lang.org/std/iter/trait.Iterator.html#tymethod.next

François G. Dorais (Jan 17 2023 at 12:43):

@James Gallicchio This is the basic code I am using:

/-- Stream class appropriate for monadic use. -/
class MStream (σ : Type _) (m : Type _  Type _) (α : outParam (Type _)) where
  /-- Attempt to get the next item -/
  next? : σ  m (Option α × σ)

instance (priority := low) (σ α m) [Monad m] [Stream σ α] : MStream σ m α where
  next? (s) := match Stream.next? s with
    | some (x, s) => pure (some x, s)
    | none => pure (none, s)

/-- `ForIn` for `MStream` -/
protected partial def MStream.forIn [Monad m] [MStream ρ m α] (s : ρ) (b : β) (f : α  β  m (ForInStep β)) : m β := do
  let _ : Inhabited (m β) := pure b
  let rec visit (s : ρ) (b : β) : m β := do
    match  next? s with
    | (some a, s) => match ( f a b) with
      | ForInStep.done b  => return b
      | ForInStep.yield b => visit s b
    | (none, _) => return b
    return b
  visit s b

instance (priority := low) [MStream ρ m α] : ForIn m ρ α where
  forIn := MStream.forIn

The monad m parameter would also be helpful addition to core Stream. As far as I can tell, these two instances are impossible with the current Stream class.

/-- IO.FS.Stream as a byte stream -/
instance : MStream IO.FS.Stream IO UInt8 where
  next? s := do
    let bs  s.read 1
    if h : bs.size > 0 then
      return (some <| bs.get 0,h⟩, s)
    else
      return (none, s)

/-- IO.FS.Stream as a line stream -/
instance : MStream IO.FS.Stream IO String where
  next? s := do
    let ln  s.getLine
    if ln.isEmpty then
      return (none, s)
    else
      return (some <| ln.dropRightWhile (·=='\n'), s)

@Sebastian Ullrich Rust gave me the idea to use the different signature.

François G. Dorais (Jan 17 2023 at 13:10):

One small advantage of using Option α × σ over Option (α × σ) is to avoid partial defs. For example:

structure LineStreamOfCharStream (σ) where
  stream : σ
  buffer : String := ""

instance [Monad m] [MStream σ m Char] : MStream (LineStreamOfCharStream σ) m String where
  next? s := do
    match  MStream.next? s.stream with
    | (some c, stream) =>
      if c == '\n' then
        return (some s.buffer, {stream := stream, buffer := ""})
      else
        return (none, {stream := stream, buffer := s.buffer.push c})
    | (none, stream) =>
      return (none, {s with stream := stream})

Since there's no way to know whether or when a newline will occur, a fused version of this would have to be partial or somehow break long lines.

James Gallicchio (Jan 17 2023 at 13:21):

Sebastian Ullrich said:

I didn't know this was allowed in e.g. Rust, but it still sounds crazy to me. Would be interesting to know how much they rely on it.

https://www.reddit.com/r/rust/comments/sbdb9t/comment/htzl4rp/ It kinda sounds like the API is a source of problems... But I've asked the local Rustaceans if they had any idea why it's like that, will update y'all :)

James Gallicchio (Jan 17 2023 at 13:30):

François G. Dorais said:

One small advantage of using Option α × σ over Option (α × σ) is to avoid partial defs. For example:

I have some infrastructure lying around for dealing with specifically finite iterators (it gives you a well-foundedness proof that the iteration eventually terminates, which satisfies Lean). And for potentially infinite iterators I assume eventually we'll have infrastructure around coinduction but I'm not sure unfused streams are the right way to go...

Eric Wieser (Jan 17 2023 at 13:31):

I guess in principle Python's __iter__ / __next__ framework allows these resumable streams too; but note that the coroutine syntax with yield does not support them, nor can I think of any examples in the standard library that behave in this way.

François G. Dorais (Jan 17 2023 at 14:01):

I agree that non-fused streams are strange. But after some thought I realized that the issue with Stream.next? is not the output type but the missing monad parameter. I think the best type would be this:

class Stream (σ : Type _) (m : Type _  Type _) (α : outParam (Type _)) where
  next : σ  m (α × σ)

That way Stream σ Id α is an infinite stream, Stream σ Option α is the same as the current class, and Stream σ (Except σ) α would basically be the same as the type I originally suggested. It also provides a direct way to turn IO.FS.Stream into a stream within IO.


Last updated: Dec 20 2023 at 11:08 UTC