Zulip Chat Archive

Stream: lean4

Topic: Java AST in Lean4

ohhaimark (Sep 01 2021 at 04:12):

What's the laziest way I can get a Java AST in Lean4?

Mario Carneiro (Sep 01 2021 at 04:22):

Have a java parser and print out the ast in s expressions or json or something

Mario Carneiro (Sep 01 2021 at 04:25):

Or write a java parser yourself in lean 4

Andrew Kent (Sep 01 2021 at 16:01):

I'm not sure what the laziest way is per se, but if you are wanting to define some types to represent the AST, the Java Language Specification isn't the worst document to work from at least.

If you're wanting a parser... these ANTLR Java grammars might be a helpful starting point. I don't recall if in Java you have to interleave parsing with type checking and other analyses (like you do for C or C++)... hopefully not... since that is an absolute nightmare to get right.

ohhaimark (Sep 01 2021 at 16:43):

I'm trying out the lean4 syntax stuff, but I'm not sure how to solve something:

I have the following setup:

syntax "java" "{" java_file "}" : term
syntax (pkg_decl)? import_decl* ty_decl* : java_file
syntax ident ("." ident)* : q_name
syntax "package" q_name ";" : pkg_decl
syntax "import" (ident ".")+ ("*" <|> ident) ";" : import_decl
syntax "public" <|> "protected" <|> "private" <|> "abstract" <|> "static" <|> "final" : modifier
syntax modifier* "class" ident noWs (ty_params)? (ty_super)? class_body : ty_decl
syntax "<" (ident (ty_super)?),* ">" : ty_params
syntax "<" (ty_ref),* ">" : ty_args
syntax ("extends" ty_ref)? ("implements" ty_ref,*)? : ty_super
syntax q_name (ty_args)? : ty_ref
syntax "{" class_body_decl* "}" : class_body
syntax modifier* (field_decl <|> ctor_or_method_decl) : class_body_decl
syntax ty ident ("=" expr)? ";" : field_decl
syntax (ty_params)? (ty_ret)? ident params (ty_throws)? : ctor_or_method_decl
syntax "void" <|> ty : ty_ret
syntax "(" param,* ")" : params
syntax ty ident : param
syntax "throws" ty_ref,+ : ty_throws

def test := java {
  package org.stnreu001.csc2002s;
  import java.util.*;

  public class HelloWorld<A> extends Comparator<A> {
    public int x = 0;
    public HelloWorld() {
    }
  }
}

But I don't know how to get the parser to distinguish between methods and fields.

Sebastian Ullrich (Sep 01 2021 at 16:49):

<|> does not backtrack if the LHS has consumed input, so you'll want to split

syntax modifier* (field_decl <|> ctor_or_method_decl) : class_body_decl

into two rules:

syntax modifier* field_decl : class_body_decl
syntax modifier* ctor_or_method_decl : class_body_decl

optionally factoring out the common modifier* prefix using yet another syntax category

ohhaimark (Sep 01 2021 at 17:05):

Is there a syntax category for number literals in lean? Where can I discover what syntax categories already exist?

Mario Carneiro (Sep 01 2021 at 17:06):

num

Mario Carneiro (Sep 01 2021 at 17:07):

It's not actually a syntax category, it's a parser (alias)

Mario Carneiro (Sep 01 2021 at 17:08):

They are sort of scattered all over, but that one is declared in a block with several others at the top of Lean/Parser.lean

ohhaimark (Sep 01 2021 at 18:03):

How do parsers deal with the parsing of Function<A,Function<B,C>> so you don't have to do c++ style `Function<A,Function<B,C> >?

Mario Carneiro (Sep 01 2021 at 19:27):

You put a hack in so that >> counts as two > when needed

Mario Carneiro (Sep 01 2021 at 19:28):

and I guess, >>> counts as three > (since java has >>> too IIRC)

Mario Carneiro (Sep 01 2021 at 19:31):

For a lean 4 parser, I would suggest replacing ">" in the ty_args and ty_params rules with a custom parser that consumes a > character and doesn't use tokens

Mac (Sep 01 2021 at 23:58):

Mario Carneiro said:

For a lean 4 parser, I would suggest replacing ">" in the ty_args and ty_params rules with a custom parser that consumes a > character and doesn't use tokens

This is actually much more difficult to do than it sounds since parsers to run for syntax categories are selected based on their tokens and their is no way (that I know of to so) to say declare that a parser can eat any token that begins with a '>' . This means that if another parser that consumes the full token exists, the parser will likely choose that first. You may be able to get around this by having your parser having no declared firstToken and a very high priority.

ohhaimark (Sep 02 2021 at 02:53):

Yay, I caused a stack overflow. It happened when I added ty_array.

declare_syntax_cat java_file
syntax "java" "{" java_file "}" : term

declare_syntax_cat q_name
syntax ident ("." ident)* : q_name

declare_syntax_cat pkg_decl
syntax "package" q_name ";" : pkg_decl

declare_syntax_cat import_decl
syntax "import" (ident ".")+ ("*" <|> ident) ";" : import_decl

declare_syntax_cat ty_ref
declare_syntax_cat ty_args
syntax ident (ty_args)? : ty_ref
syntax "<" (ty_ref),* ">" : ty_args

declare_syntax_cat ty_prim
syntax "int" <|> "long" <|> "float" <|> "double" <|> "boolean" : ty_prim

declare_syntax_cat ty
declare_syntax_cat ty_array
syntax ty "[" "]" : ty_array

syntax ty_array : ty
syntax ty_prim : ty
syntax ty_ref : ty

declare_syntax_cat ty_ret
syntax "void" : ty_ret
syntax ty : ty_ret


syntax q_name (ty_args)? : ty_ref

declare_syntax_cat ty_super
syntax ("extends" ty_ref)? ("implements" ty_ref,*)? : ty_super

declare_syntax_cat mod_access
syntax "public" <|> "protected" <|> "private" : mod_access

declare_syntax_cat mod_class_method
syntax (mod_access)? "abstract"? "final"? : mod_class_method

declare_syntax_cat mod_static_method
syntax (mod_access)? "static" "final"? : mod_static_method

declare_syntax_cat mod_method
syntax mod_class_method : mod_method
syntax mod_static_method : mod_method

declare_syntax_cat mod_field
syntax (mod_access)? "static"? "final"? : mod_field

declare_syntax_cat mod_local_var
syntax "final"? : mod_field


declare_syntax_cat ty_params
syntax "<" (ident (ty_super)?),* ">" : ty_params

declare_syntax_cat lit
syntax num : lit

declare_syntax_cat expr
syntax lit : expr
syntax ident : expr

declare_syntax_cat var_decl
syntax ty ident ("=" expr)? ";" : var_decl

declare_syntax_cat field_decl
syntax mod_field var_decl : field_decl

declare_syntax_cat class_body_decl
syntax field_decl : class_body_decl

declare_syntax_cat class_body
syntax "{" class_body_decl* "}" : class_body

declare_syntax_cat param
syntax ty ident : param

declare_syntax_cat params
syntax "(" param,* ")" : params

declare_syntax_cat ty_throws
syntax "throws" ty_ref,+ : ty_throws

declare_syntax_cat local_var_decl
syntax mod_local_var var_decl : local_var_decl

declare_syntax_cat assign
syntax expr "=" expr ";" : assign

declare_syntax_cat stmt
declare_syntax_cat block
syntax "{" stmt* "}" : block
syntax block : stmt
syntax var_decl : stmt
syntax local_var_decl : stmt
syntax assign : stmt

declare_syntax_cat method_body
syntax ";" : method_body
syntax block : method_body

declare_syntax_cat method_decl
syntax mod_method (ty_params)? ty_ret ident params (ty_throws)? method_body : method_decl

declare_syntax_cat ctor_decl
syntax mod_class_method (ty_params)? (ty_ret)? ident params (ty_throws)? method_body : ctor_decl

declare_syntax_cat ty_decl
syntax mod_method "class" ident (ty_params)? (ty_super)? class_body : ty_decl

syntax (pkg_decl)? import_decl* ty_decl* : java_file

syntax field_decl : class_body_decl
syntax ctor_decl : class_body_decl
syntax method_decl : class_body_decl

syntax num : lit


def test := java {
  package org.stnreu001.csc2002s;
  import java.util.*;

  public class HelloWorld<A extends Object implements List<B> > extends Comparator<A>{
    public static final int x = 0;

    public static void main(String[] args) {

    }

    public void HelloWorld() throws Exception {
      int y = 3;
      y = 4;
    }
  }
}

ohhaimark (Sep 02 2021 at 04:09):

Ah, fixed

declare_syntax_cat ty_no_array
syntax ty_prim : ty_no_array
syntax ty_ref : ty_no_array

declare_syntax_cat ty
syntax ty_no_array ("[" "]")* : ty

ohhaimark (Sep 02 2021 at 04:30):

I don't have to interleave parsing and typechecking at least, but I need to be able to differentiate type names from expression names. Would it be possible to reuse Lean4's existing binders and unification to resolve names in context and propagate information when a name is used like a type?

Mario Carneiro (Sep 02 2021 at 07:26):

Mac said:

Mario Carneiro said:

For a lean 4 parser, I would suggest replacing ">" in the ty_args and ty_params rules with a custom parser that consumes a > character and doesn't use tokens

This is actually much more difficult to do than it sounds since parsers to run for syntax categories are selected based on their tokens and their is no way (that I know of to so) to say declare that a parser can eat any token that begins with a '>' . This means that if another parser that consumes the full token exists, the parser will likely choose that first. You may be able to get around this by having your parser having no declared firstToken and a very high priority.

You can say that your parser will eat tokens >, >>, >>>

Mario Carneiro (Sep 02 2021 at 07:30):

@ohhaimark Where in the grammar do you need to distinguish type names from expression names? I thought java didn't inherit this cursed feature from C++

ohhaimark (Sep 02 2021 at 07:35):

In distinguishing method calls from static method calls.

typenamenotfollowingjavaconventions.staticcall()
localvariableinscope.membercall()

ohhaimark (Sep 02 2021 at 07:38):

The java spec talks about having different levels of knowledge about a name in its grammar. AmbigiousName, TypeName, PackageOrTypeName, etc.

Mario Carneiro (Sep 02 2021 at 07:45):

It seems like you can still parse that without knowing which is which, just have an ambiguous production and determine the difference at typechecking time

Sebastian Ullrich (Sep 02 2021 at 08:12):

Mac said:

Mario Carneiro said:

For a lean 4 parser, I would suggest replacing ">" in the ty_args and ty_params rules with a custom parser that consumes a > character and doesn't use tokens

This is actually much more difficult to do than it sounds since parsers to run for syntax categories are selected based on their tokens and their is no way (that I know of to so) to say declare that a parser can eat any token that begins with a '>' .

I don't see the issue, there are no category parsers starting with > here

Sebastian Ullrich (Sep 02 2021 at 08:41):

@ohhaimark Btw, instead of a single-parser syntax category, you can also use

syntax ty := ty_no_array ("[" "]")*

ohhaimark (Sep 02 2021 at 09:15):

Yay

syntax q_name := ident ("." ident)*

syntax pkg_decl := "package" q_name ";"

syntax import_decl := "import" (ident ".")+ ("*" <|> ident) ";"

declare_syntax_cat ty_ref
declare_syntax_cat ty_args
syntax ident (ty_args)? : ty_ref
syntax "<" (ty_ref),* ">" : ty_args

syntax ty_prim := "int" <|> "long" <|> "float" <|> "double" <|> "boolean"

declare_syntax_cat ty_no_array
syntax ty_prim : ty_no_array
syntax ty_ref : ty_no_array

syntax ty := ty_no_array ("[" "]")*

declare_syntax_cat ty_ret
syntax "void" : ty_ret
syntax ty : ty_ret

syntax q_name (ty_args)? : ty_ref

syntax ty_super := ("extends" ty_ref)? ("implements" ty_ref,*)?

syntax mod_access := "public" <|> "protected" <|> "private"

syntax mod_class_method := (mod_access)? "abstract"? "final"?

syntax mod_static_method := (mod_access)? "static" "final"?

declare_syntax_cat mod_method
syntax mod_class_method : mod_method
syntax mod_static_method : mod_method

syntax mod_field := (mod_access)? "static"? "final"?

syntax mod_local_var := "final"?

syntax ty_params := "<" (ident (ty_super)?),* ">"

declare_syntax_cat lit
syntax num : lit

syntax param := ty ident
syntax params := "(" param,* ")"

declare_syntax_cat expr

syntax args := "(" expr,* ")"

syntax ident_or_invoke := (ty_args)? ident (args)?
syntax dot := ident_or_invoke ("." ident_or_invoke)*

syntax lit : expr
syntax dot : expr

syntax var_decl := ty ident ("=" expr)? ";"

syntax field_decl := mod_field var_decl

declare_syntax_cat class_body_decl
syntax field_decl : class_body_decl

declare_syntax_cat ty_throws
syntax "throws" ty_ref,+ : ty_throws

declare_syntax_cat method_body

declare_syntax_cat ctor_decl
syntax ctor_decl : class_body_decl
syntax mod_class_method (ty_params)? ident params (ty_throws)? method_body : ctor_decl

declare_syntax_cat method_decl
syntax method_decl : class_body_decl
syntax mod_method (ty_params)? ty_ret ident params (ty_throws)? method_body : method_decl

syntax class_body := "{" class_body_decl* "}"

syntax local_var_decl := mod_local_var var_decl

syntax assign := expr "=" expr ";"

declare_syntax_cat stmt
syntax block := "{" stmt* "}"
syntax block : stmt
syntax var_decl : stmt
syntax local_var_decl : stmt
syntax assign : stmt

syntax ";" : method_body
syntax block : method_body

syntax ty_decl := mod_method "class" ident (ty_params)? (ty_super)? class_body

syntax num : lit

syntax java_compilation_unit := (pkg_decl)? import_decl* ty_decl*

syntax "java" "{" java_compilation_unit "}" : term

I should probably start putting this in a Gist or something.

ohhaimark (Sep 02 2021 at 09:16):

def test := java {
  package org.stnreu001.csc2002s;
  import java.util.*;

  public class Test <A extends Object implements List<B> > extends Comparator<A>{
    public static final int x = 0;

    public static void main(String[] args) {
      List<Map<String,Double> > listOfMaps;
    }

    public void HelloWorld() throws Exception {
      final int y = 3;
      y = 4;
      y.x = f(y.z);
    }
  }
}

ohhaimark (Sep 02 2021 at 09:28):

Can precedence help a production rule that's left recursive? e.g.

...
syntax ident_or_invoke := (ty_args)? ident (args)?
declare_syntax_cat expr
syntax dot := expr "." ident_or_invoke
syntax dot : expr
syntax ident_or_invoke : expr

Sebastian Ullrich (Sep 02 2021 at 10:17):

No. A syntax rule must be syntactically left-recursive to be handled correctly.

syntax expr "." ident_or_invoke : expr

ohhaimark (Sep 02 2021 at 10:46):

Can macro_rules for translating to AST give you pretty printing?

Sebastian Ullrich (Sep 02 2021 at 11:22):

Not directly, if I understood your question correctly, but the parser itself can contain hints for Lean's built-in pretty printer https://github.com/leanprover/lean4/blob/aba0a479ec221ad2e31de09c6e71ce82a705ec38/src/Lean/Parser/Extra.lean#L63-L78
https://github.com/leanprover/lean4/blob/aba0a479ec221ad2e31de09c6e71ce82a705ec38/src/Init/Notation.lean#L143

ohhaimark (Sep 02 2021 at 11:56):

Whats happening here

syntax "lolol" : term

inductive testlol := | lol

macro_rules
  | `(lolol) => ``(testlol.lol)


def yee : testlol := lolol

#eval yee

failed to synthesize
  Lean.MetaEval testlol

ohhaimark (Sep 02 2021 at 11:58):

Oh lol, has nothing to do with the macro stuff.

ohhaimark (Sep 02 2021 at 11:58):

inductive testlol := | lol

#eval (testlol.lol)

ohhaimark (Sep 02 2021 at 12:07):

Also with https://github.com/leanprover/lean4/blob/aba0a479ec221ad2e31de09c6e71ce82a705ec38/src/Init/Notation.lean#L410-L414, I see that the trivial tactic can macro expand into various other tactics. Does that mean the macro expander is backtracking?

Mario Carneiro (Sep 02 2021 at 13:18):

Sebastian Ullrich said:

ohhaimark Btw, instead of a single-parser syntax category, you can also use
syntax ty := ty_no_array ("[" "]")*

I thought you said that ("a" "b")* was bad practice?

ohhaimark (Sep 02 2021 at 14:56):

@Mario Carneiro Why is there no MetaEval instance for the inductive definitions I make?

Mario Carneiro (Sep 02 2021 at 14:56):

because one hasn't been defined

Mario Carneiro (Sep 02 2021 at 14:57):

What is it you want to get out of evaluating the expression?

Mario Carneiro (Sep 02 2021 at 14:57):

If you want to print it then use deriving Repr and then #eval repr testlol.lol

ohhaimark (Sep 02 2021 at 14:58):

Is there a deriving mechanism for it?
I expect inductive definitions to evaluate to themselves.
Oh, so it's also the printed output?
Can you select arbitrary reductions to be used?

Mario Carneiro (Sep 02 2021 at 14:59):

You can evaluate expressions of type testlol, but let _ := testlol.lol; () isn't very useful, you presumably want to print it at the end or something

ohhaimark (Sep 02 2021 at 15:00):

So it's just defining how to pretty print?

Mario Carneiro (Sep 02 2021 at 15:00):

and for that you need deriving Repr or ToString (I forget if that has a deriving instance)

Mario Carneiro (Sep 02 2021 at 15:00):

yes

Mario Carneiro (Sep 02 2021 at 15:00):

You can define a printer yourself too

Mario Carneiro (Sep 02 2021 at 15:01):

Or just use the inductive as an intermediate type and construct some base type at the end like strings or tuples or something

ohhaimark (Sep 02 2021 at 15:04):

Can you have deriving for mutual inductive types?

ohhaimark (Sep 02 2021 at 15:06):

Nvm, I just didn't put enough deriving Reprs.

ohhaimark (Sep 02 2021 at 15:19):

How would I write a syntax extension that takes

def Flags := List Flag
   deriving Repr Flags

and produces something like

def Flags := List Flag
instance : Repr Flags := inferInstanceAs (Repr (List Flag))

or find something which doesn't require something like this in the first place.

Mario Carneiro (Sep 03 2021 at 02:27):

cc: @Sebastian Ullrich , can we get this in lean core? Mathlib wants this too

Mac (Sep 03 2021 at 03:38):

@Mario Carneiro Isn't tis the same problem as discussed in the previous inferInstance Zulip thread where @Sebastian Ullrich said that "Ah, well, the current API doesn't quite allow for that yet".

Mario Carneiro (Sep 03 2021 at 03:38):

yes, this is a syntax extension

Mario Carneiro (Sep 03 2021 at 03:39):

The API issue might have to do with needing to use terms instead of class names

Mario Carneiro (Sep 03 2021 at 03:40):

but for this basic example it could just be

def Flags := List Flag
   deriving Repr

Leonardo de Moura (Sep 03 2021 at 18:37):

@Mario Carneiro I added basic support for deriving at def. We can now write

def Boo := List (String × String)
  deriving BEq, Repr, DecidableEq

def M := ReaderT String (StateT Nat IO)
  deriving Monad, MonadState, MonadReader

It is very simple for now. It only supports

Type "aliases".
Classes with a prefix of outParams followed by one parameter.

Mario Carneiro (Sep 03 2021 at 23:49):

@Leonardo de Moura Thanks!

Alcides Fonseca (Sep 08 2021 at 12:59):

A completely different take is to use an existing Java parser (https://javaparser.org or https://spoon.gforge.inria.fr) and create a tool that translates the parsed instances of the Java AST to a similar Java AST in Lean.

Last updated: May 02 2025 at 03:31 UTC