Documentation

Lean.Language.Lean

Note [Incremental Parsing] #

In the language server, we want to minimize the work we do after each edit by reusing previous state where possible. This is true for both parsing, i.e. reusing syntax trees without running the parser, and elaboration. For both, we currently assume that we have to reprocess at least everything from the point of change downwards. This note is about how to find the correct starting point for incremental parsing; for elaboration, we then start with the first changed syntax tree.

One initial thought about incremental parsing could be that it's not necessary as parsing is very fast compared to elaboration; on mathlib we average 41ms parsing per 1000 LoC. But there are quite a few files >= 1kloc (up to 4.5kloc) in there, so near the end of such files lag from always reparsing from the beginning may very well be noticeable.

So if we do want incremental parsing, another thought might be that a user edit can only invalidate commands at or after the location of the change. Unfortunately, that's not true; take the (partial) input def a := b private def c. If we remove the space after private, the two commands syntactically become one with an application of privatedef to b even though the edit was strictly after the end of the first command.

So obviously we must include at least the extent of the token that made the parser stop parsing a command as well such that invalidating the private token invalidates the preceding command. Unfortunately this is not sufficient either, given the following input:

structure a where /-- b -/ @[c] private axiom d : Nat

This is a syntactically valid sequence of a field-less structure and a declaration. If we again delete the space after private, it becomes a syntactically correct structure with a single field privateaxiom! So clearly, because of uses of atomic in the grammar, an edit can affect a command syntax tree even across multiple tokens.

What we did in Lean 3 was to always reparse the last command completely preceding the edit location. If its syntax tree is unchanged, we preserve its data and reprocess all following commands only, otherwise we reprocess it fully as well. This worked well but did seem a bit arbitrary given that even if it works for a grammar at some point, it can certainly be extended in ways that break the assumption.

With grammar changes in Lean 4, we found that the following example indeed breaks this assumption:

structure Signature where
  /-- a docstring -/
  Sort : Type
    --^ insert: "s"

As the keyword Sort is not a valid start of a structure field and the parser backtracks across the docstring in that case, this is parsed as the complete command structure Signature where followed by the partial command /-- a docstring -/ <missing>. If we insert an s after the t, the last command completely preceding the edit location is the partial command containing the docstring. Thus we need to go up two commands to ensure we reparse the structure command as well. This kind of nested docstring is the only part of the grammar to our knowledge that requires going up at least two commands; as we never backtrack across more than one docstring, going up two commands should also be sufficient.

Finally, a more actually principled and generic solution would be to invalidate a syntax tree when the parser has reached the edit location during parsing. If it did not, surely the edit cannot have an effect on the syntax tree in question. Sadly such a "high-water mark" parser position does not exist currently and likely it could at best be approximated by e.g. "furthest tokenFn parse". Thus we remain at "go up two commands" at this point.

Note [Incremental Command Elaboration] #

Because of Lean's use of persistent data structures, incremental reuse of fully elaborated commands is easy because we can simply snapshot the entire state after each command and then restart elaboration using the stored state at the next command above the point of change. However, incrementality within elaboration of a single command such as between tactic steps is much harder because the existing control flow does not allow us to simply return from those points to the language processor in a way that we can later resume from there. Instead, we exchange the need for continuations with some limited mutability: by allocating an IO.Promise "cell" in the language processor, we can both pass it to the elaborator to eventually fill it using Promise.resolve as well as convert it to a Task that will wait on that resolution using Promise.result and return it as part of the command snapshot created by the language processor. The elaborator can then in turn create new promises itself and store their result when resolving an outer promise to create an arbitrary tree of promise-backed snapshot tasks. Thus, we can enable incremental reporting and reuse inside the elaborator using the same snapshot tree data structures as outside without having to change the elaborator's control flow.

While ideally we would decide what can be reused during command elaboration using strong hashes over the full state and inputs, currently we rely on simpler syntactic checks: if all the syntax inspected up to a certain point is unchanged, we can assume that the old state can be reused. The central SnapshotBundle type passed inwards through the elaborator for this purpose combines the following data:

Note that as we do not wait for the previous run to finish before starting to elaborate the next one, the old SnapshotTask task may not be finished yet. Indeed, if we do find that we can reuse the contained state because of a successful syntax comparison, we always want to explicitly wait for the task instead of redoing the work. On the other hand, the Syntax is not surrounded by a task so that we can immediately access it for comparisons, even if the snapshot task may, eventually, give access to the same syntax tree.

For the most part, inside an elaborator participating in incrementality, we just have to ensure that we stop forwarding the old run's data as soon as we notice a relevant difference between old and new syntax tree. For example, allowing incrementality inside the cdot tactic combinator is as simple as

builtin_initialize registerBuiltinIncrementalTactic ``cdot
@[builtin_tactic cdot] def evalTacticCDot : Tactic := fun stx => do
  ...
  closeUsingOrAdmit do
    -- save state before/after entering focus on `·`
    ...
    Term.withNarrowedArgTacticReuse (argIdx := 1) evalTactic stx

The Term.withNarrowedArgTacticReuse combinator will focus on the given argument of stx, which in this case is the nested tactic sequence, and run evalTactic on it. But crucially, it will first compare all preceding arguments, in this case the cdot token itself, with the old syntax in the current snapshot bundle, which in the case of tactics is stored in Term.Context.tacSnap?. Indeed it is important here to check if the cdot token is identical because its position has been saved in the info tree, so it would be bad if we later restored some old state that uses a different position for it even if everything else is unchanged. If there is any mismatch, the bundle's old value is set to none in order to prevent reuse from this point on. Note that in any case we still want to forward the "new" promise in order to provide incremental reporting as well as to construct a snapshot tree for reuse in future document versions! Note also that we explicitly opted into incrementality using registerBuiltinIncrementalTactic as any tactic combinator not written with these concerns in mind would likely misbehave under incremental reuse.

While it is generally true that we can provide incremental reporting even without reuse, we generally want to avoid that when it would be confusing/annoying, e.g. when a tactic block is run multiple times because otherwise the progress bar would snap back and forth multiple times. For this purpose, we can disable both incremental modes using Term.withoutTacticIncrementality, assuming we opted into incrementality because of other parts of the combinator. induction is an example of this because there are some induction alternatives that are run multiple times, so we disable all of incrementality for them.

Using induction as a more complex example than cdot as it calls into evalTactic multiple times, here is a summary of what it has to do to implement incrementality:

Lean-specific processing context.

Instances For
    @[reducible, inline]
    abbrev Lean.Language.Lean.LeanProcessingT (m : TypeType u_1) (α : Type) :
    Type u_1

    Monad transformer holding all relevant data for Lean processing.

    Equations
    Instances For
      @[reducible, inline]

      Monad holding all relevant data for Lean processing.

      Equations
      Instances For
        Equations
        • One or more equations did not get rendered due to their size.

        Embeds a LeanProcessingM action into ProcessingM, optionally using the old input string to speed up reuse analysis and supplying a cancellation token that should be triggered as soon as reuse is ruled out.

        Equations
        • One or more equations did not get rendered due to their size.
        Instances For

          Returns true if there was a previous run and the given position is before any textual change compared to it.

          Equations
          Instances For

            Result of retrieving additional metadata about the current file after parsing imports. In the language server, these are derived from the lake setup-file result. On the cmdline and for similar simple uses, these can be computed eagerly without looking at the imports.

            • mainModuleName : Lean.Name

              Module name of the file being processed.

            • Options provided outside of the file content, e.g. on the cmdline or in the lakefile.

            • trustLevel : UInt32

              Kernel trust level.

            Instances For

              Performance option used by cmdline driver.

              Parses values of options registered during import and left by the C++ frontend as strings, fails if any option names remain unknown.

              Equations
              • One or more equations did not get rendered due to their size.
              Instances For

                Entry point of the Lean language processor.

                The setupImports function is called after the header has been parsed and before the first command is parsed in order to supply additional file metadata (or abort with a given terminal snapshot); see SetupImportsResult.

                old? is a previous resulting snapshot, if any, to be reused for incremental processing.

                Equations
                • One or more equations did not get rendered due to their size.
                Instances For

                  Convenience function for tool uses of the language processor that skips header handling.

                  Equations
                  • One or more equations did not get rendered due to their size.
                  Instances For

                    Waits for and returns final command state, if importing was successful.

                    Equations
                    Instances For