An example of using Java Reflection in a text parser. Mar 25th 2020


At the age of 81, and with a heart bypass, I figure that if I catch COVID 19
I'm a dead duck. So I'm going to publish what I've been working on for the
last 6 months. It's an approach to text analysis I've not seen elsewhere.
I know parsers are ten-a-penny, but do you know one that will handle matrices
in its input?

The idea is that you describe any parsable data input format in terms of valid
Java Classes, and a program called Scan performs the parsing by traversing the
class structure using Java reflection. Initialised char and String fields
describe the required constants and the other fields get filled in by Scan
with variable data from the input. Here is an example application:


class Shapes    // Example parser using Scan
{
        Shape[]shape;   // We expect the input to be an array of Shapes

public static void main(String[]args)   // sets the sample input data
{
        Shapes shapes=new Shapes("triangle([1,2],[3, 4 ],[5,6])line([7,8],[9,10])triangle([11,/* */ 12],[13,14],[15,16])");
}

Shapes(){}

Shapes(String data)
{
//      Scan.setTrace(true);            // When trace is true you get output every time a field is set
        new Scan(data,this);            // Call Scan with the input data and the Object (this) you want it to match
        System.out.println("");
        for(Shape s:shape)
         System.out.println(s+"");      // show the captured output
}

class Shape implements Choice           // 'Choice' says if a field fails to match the input, then the next field is tried
{
        Line line=null;                 // i.e. Line | Triangle
        Triangle triangle=null;

public String toString(){if(line==null)return(triangle+"");else return(line+"");}
}

class Line                              // I.e. line(pnt,pnt)  If any field fails then the whole construct fails
{
        String name="line";             // String and char fields must match
        char open='(';
        Pnt pnt0;                       // This field gets filled in with a 'Pnt'
        char comma=',';
        Pnt pnt1;
        char close=')';

public String toString(){return("Line("+pnt0+","+pnt1+")");}
}

class Triangle                          // I.e. triangle(pnt,pnt,pnt)
{
        String name="triangle";
        char open='(';
        Pnt pnt0;
        char comma=',';
        Pnt pnt1;
        char comma2=',';
        Pnt pnt2;
        char close=')';

// A 'success' callback can be put in any class. It's called if the class succeeds.
void success(String match){System.out.println("match="+match);}        // Just for illustration

public String toString(){return("Triangle("+pnt0+","+pnt1+","+pnt2+")");}
}

class Pnt                               // I.e. [x,y]   where x and y are any integers
{
        char open='[';
        int x;
        char comma=',';
        int y;
        char close=']';

public String toString(){return("["+x+","+y+"]");}
}

}


Output: The first 2 lines show the 'match' displayed in Triangle's success callback

match=triangle([1,2],[3,4],[5,6])
match=triangle([11,/* */ 12] , [13,14],[15,16])

Triangle([1,2],[3,4],[5,6])
Line([7,8],[9,10])
Triangle([11,12],[13,14],[15,16])


Reading from the top it says that the input should be an array of Shape(s).
Main sets the sample input. The constructor Shapes passes the data to Scan
along with 'this', which is the Object it must match. Of the next
two Classes, Shape and Line, Line shows the 'normal' use, in which each field
is scanned sequentially. If any field fails then the whole Object fails. The
Shape Class has the Choice marker, which says the fields are alternatives.
There are other interface markers for Optional and Not.

When initialised, String, StringBuffer, String[], char and char[] fields are
treated as required 'punctuation'. However, null (uninitialised) String and
String[] fields are treated as terminated Strings, that is to say any text up
to a terminator, which is normally one of ",;)]}", appearing outside nested
brackets. int, double, and other primitives plus classes are filled in from
the data. All Classes used in the scan must have no-argument constructors.
Fields which are private or protected are not scanned and give the user
working variables. The Triangle Class contains a 'success' callback which is
invoked after the successful scan. It gives the user access to the total
string matched by this Object.

The output shows the original blanks when 'match' is printed by the 'success'
method in Triangle, but these don't appear in the collected fields.
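
To make the mechanism concrete, here is a minimal sketch of the same idea in
miniature. It is not the real Scan: the class MiniScan and its match method are
hypothetical names, and it only handles char constants and int data, which is
enough to match a Pnt. It assumes fields come back in declaration order, which
common JVMs do but the reflection API does not promise.

```java
import java.lang.reflect.Field;

// Hypothetical miniature of the idea; MiniScan is NOT the author's Scan class.
class MiniScan {
    private final String text;
    private int pos = 0;

    MiniScan(String text) { this.text = text; }

    // Same layout as Pnt above: initialised chars are constants, ints are data.
    static class Pnt {
        char open = '[';
        int x;
        char comma = ',';
        int y;
        char close = ']';
    }

    /** Returns true if every field of target matches the text in turn. */
    boolean match(Object target) {
        try {
            // Assumption: declaration order, as returned by common JVMs. The
            // getDeclaredFields() contract itself does not guarantee an order.
            for (Field f : target.getClass().getDeclaredFields()) {
                if (f.isSynthetic()) continue;      // skip compiler-added fields
                f.setAccessible(true);
                skipBlanks();
                if (f.getType() == char.class) {
                    // A char field is a constant that the input must contain.
                    if (pos >= text.length() || text.charAt(pos) != f.getChar(target)) return false;
                    pos++;
                } else if (f.getType() == int.class) {
                    // An int field is filled in from the input.
                    int start = pos;
                    if (pos < text.length() && text.charAt(pos) == '-') pos++;
                    int firstDigit = pos;
                    while (pos < text.length() && Character.isDigit(text.charAt(pos))) pos++;
                    if (pos == firstDigit) return false;
                    f.setInt(target, Integer.parseInt(text.substring(start, pos)));
                } else {
                    return false;                   // other types are outside this sketch
                }
            }
            return true;
        } catch (IllegalAccessException e) {
            throw new IllegalStateException(e);
        }
    }

    private void skipBlanks() {
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) pos++;
    }

    public static void main(String[] args) {
        Pnt p = new Pnt();
        boolean ok = new MiniScan("[3, 4 ]").match(p);
        System.out.println(ok + " x=" + p.x + " y=" + p.y);
    }
}
```

The blanks are skipped before each field, which is why they can show up in a
whole-match string without ever reaching the collected fields.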

Other Examples: (These are all contained in the zip file scan.zip)


1) Points.java: This is exactly the same as Shapes except that Pnt is replaced
by java.awt.Point. A Point contains 2 modifiable fields, x and y, but of course
no comma between them. This sounds as though it could be very useful but
unfortunately rather few Java classes have accessible fields (Insets,
Rectangle, Dimension...).

2) Shr.java: This contains a String[] field 'name', and a char[] field 'chr'.
When these are encountered Scan requires the input to match one of the array
values. The user would clearly like to know which value matched, and the static
variable Scan.index holds this index, but only until it gets overwritten
by the next array choice. You can get round this by either splitting the Assign
class into two with two success callbacks or, as here, by inserting a Method
between the two uses of index. A Method can be put anywhere and is not part of
the data scan. Instead it is called at the point it is (successfully)
encountered, i.e. it's a user-supplied callback. Scan provides a helper
routine method(Class cl,String name) which will create such a Method with a
single String parameter. Shr extends Scan but if it did not you would have to use
Scan.method (and Scan.index).
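
The array-choice idea can be sketched as follows. This is an illustration only:
ArrayChoice and matchOneOf are hypothetical names, and trying the options in
array order (so that a longer option must be listed before its own prefix) is
an assumption, not a statement of how Scan itself resolves overlaps.

```java
// Hypothetical sketch of matching one of several String alternatives.
class ArrayChoice {
    /**
     * Returns the index of the first option found at position pos in text,
     * or -1 if none of them occurs there.
     */
    static int matchOneOf(String text, int pos, String[] options) {
        for (int i = 0; i < options.length; i++) {
            if (text.startsWith(options[i], pos)) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] ops = {">>>", ">>", "<<"};
        // The returned index plays the role that Scan.index plays in the text.
        System.out.println(matchOneOf("a>>b", 1, ops));
    }
}
```

Returning the index to the caller, rather than keeping it in a shared static,
is one way to avoid the overwriting problem described above.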

3) ExprTest.java: Expr is a builtin class. Rather than coding the enormous
complexity of a Java expression it cheats by assuming that {}, [], and () are
properly nested and, as with uninitialised Strings, that a top level , ; ) ]
or } terminates the expression. The terminators can be changed by setting
Scan.terminators. This seems to work surprisingly well.
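
The nesting shortcut is simple to sketch. The code below is an illustration
under assumptions, not the Expr class itself: TopLevel and findTerminator are
hypothetical names, and it ignores string literals and comments, which a real
scanner would probably have to consider.

```java
// Hypothetical sketch of "stop at the first terminator outside any brackets".
class TopLevel {
    /**
     * Returns the index of the first terminator character that is not inside
     * a (), [] or {} pair, searching from 'from', or -1 if there is none.
     */
    static int findTerminator(String text, int from, String terminators) {
        int depth = 0;
        for (int i = from; i < text.length(); i++) {
            char c = text.charAt(i);
            if (depth == 0 && terminators.indexOf(c) >= 0) return i;
            if (c == '(' || c == '[' || c == '{') {
                depth++;
            } else if ((c == ')' || c == ']' || c == '}') && depth > 0) {
                depth--;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "f(a,b)+g[1,2],next";
        int end = findTerminator(text, 0, ",;)]}");
        System.out.println(text.substring(0, end));
    }
}
```

The commas inside f(a,b) and g[1,2] are skipped because the depth is non-zero
there, so the expression ends at the comma before "next".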

4) Builtin.java, Staff.java: Other builtin classes are Atom (a text 'token'
like "level42" or "123" or ";"), Id (a java Identifier) and Mop (the
whitespace between atoms, which may be needed to recreate the format of the
input). Other semi-builtin classes include ExprList, which provides a match
for an optional comma-divided list of exprs. Staff shows the use of an
uninitialised String in an imaginary staff listing. This swallows all data up
to and including a terminator. The 'list' builtins like IdList have been
modified to:

class IdList                  // e.g.  a.b.contains  or java.awt.Point
{
        Id id;
        DotId[]dotid;

class DotId implements Optional
{
        char dot='.';
        Id id;
}
}

The array dotid calls an Optional construct, which should result in an instant
loop since DotId will never fail. However, this is such a common problem that
I've taken the decision for the array construct to fail if it finds that the
text pointer is not moving.
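
One way to read that rule is as a progress check inside the array loop. The
sketch below assumes an element matcher that maps a start position to an end
position (or -1 on failure). RepeatGuard, repeat and optionalDotId are
hypothetical names used for illustration and are not part of Scan.

```java
import java.util.function.IntUnaryOperator;

// Hypothetical sketch of an array loop that stops when no text is consumed.
class RepeatGuard {
    /**
     * Applies 'element' repeatedly from 'pos'. Returns {count, endPosition}.
     * The loop ends on failure (-1) or when the position does not advance.
     */
    static int[] repeat(IntUnaryOperator element, int pos) {
        int count = 0;
        while (true) {
            int next = element.applyAsInt(pos);
            if (next < 0 || next == pos) break;  // failed, or pointer not moving
            pos = next;
            count++;
        }
        return new int[] {count, pos};
    }

    /** An Optional ".id": succeeds without consuming anything if absent. */
    static int optionalDotId(String text, int pos) {
        if (pos >= text.length() || text.charAt(pos) != '.') return pos;
        int i = pos + 1;
        while (i < text.length() && Character.isJavaIdentifierPart(text.charAt(i))) i++;
        return i > pos + 1 ? i : pos;
    }

    public static void main(String[] args) {
        String text = "java.awt.Point rest";
        int[] r = repeat(p -> optionalDotId(text, p), 4);   // start after "java"
        System.out.println(r[0] + " elements, ended at " + r[1]);
    }
}
```

Without the next == pos test the optional element would succeed for ever at
the same position, which is the instant loop described above.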

5) Types.java: This is a list of different data types to show the use of trace.
The data includes a Vector which, like ArrayList and List etc., is an
AbstractList. There is a serious problem with this form of data in that it
does not carry a ComponentType with it as arrays do. You may declare
Vector<String>v but the type is only used at compile time.
Scan overcomes this by assuming that it is a List of Objects. This is not
foolproof since it recognises the Object based on a text String.
   1) boolean[]bool appears to be set twice. The first is instantiation, the
   second is when the array completes.
   2) The output at the bottom is from Scan.format which uses the class Out
   to format any Object in a hopefully human-readable way.
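
The difference between arrays and Lists can be seen directly with reflection.
A small sketch follows; ElementTypes and describe are hypothetical names. It
also shows one hedge on the point above: a Vector object carries no element
type at runtime, but the declaration of a field does keep its generic
signature, and Field.getGenericType() can read it. Whether that would suit
Scan is a separate question.

```java
import java.lang.reflect.Field;
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.Vector;

// Hypothetical sketch comparing what reflection knows about two fields.
class ElementTypes {
    String[] names;          // an array knows its component type
    Vector<String> words;    // the Vector object does not, the field does

    /** Describes the element type reflection can recover for a field. */
    static String describe(String fieldName) {
        try {
            Field f = ElementTypes.class.getDeclaredField(fieldName);
            Class<?> raw = f.getType();
            if (raw.isArray()) {
                return "array of " + raw.getComponentType().getName();
            }
            Type generic = f.getGenericType();
            if (generic instanceof ParameterizedType) {
                Type arg = ((ParameterizedType) generic).getActualTypeArguments()[0];
                return raw.getName() + " of " + arg.getTypeName();
            }
            return raw.getName();
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("names"));
        System.out.println(describe("words"));
        // The object alone only reports the raw class:
        System.out.println(new Vector<String>().getClass().getName());
    }
}
```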

6) Base.java: This shows the simplest form of a Class scan using reflection.
It takes any source Object and produces a listing of the Classes and their
declared fields.
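
A listing of this kind can be sketched in a few lines. The code below is an
assumption about the general shape, not the contents of Base.java: FieldLister,
list and the two sample classes are hypothetical, and it starts from a Class
rather than an Object for brevity.

```java
import java.lang.reflect.Field;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch: list a class, its declared fields, and the classes
// those fields refer to.
class FieldLister {
    static class Pnt { int x; int y; }
    static class Line { String name = "line"; Pnt pnt0; Pnt pnt1; }

    static String list(Class<?> root) {
        StringBuilder out = new StringBuilder();
        Set<Class<?>> seen = new LinkedHashSet<>();
        walk(root, seen, out);
        return out.toString();
    }

    private static void walk(Class<?> c, Set<Class<?>> seen, StringBuilder out) {
        while (c.isArray()) c = c.getComponentType();   // look inside arrays
        // Skip primitives, JDK classes, and anything already listed.
        if (c.isPrimitive() || c.getName().startsWith("java.") || !seen.add(c)) return;
        out.append("class ").append(c.getSimpleName()).append('\n');
        Field[] fields = c.getDeclaredFields();
        for (Field f : fields) {
            if (f.isSynthetic()) continue;
            out.append("  ").append(f.getType().getSimpleName())
               .append(' ').append(f.getName()).append('\n');
        }
        for (Field f : fields) {
            if (!f.isSynthetic()) walk(f.getType(), seen, out);
        }
    }

    public static void main(String[] args) {
        System.out.print(list(Line.class));
    }
}
```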

7) SB.java: This shows the use of StringBuffer(String target) and
StringBuffer[]. Scan searches the text for the next occurrence of target
ignoring atomisation and fails if it can't find one. For an array of
StringBuffers it finds the earliest match and returns that (and the preceding
text) and sets Scan.index to show which matched. SB also shows the use of
Method m=Scan.check. This builtin callback (which can be used as often as you
like) gives a message which shows if it was reached successfully. The example
also shows the use of protected String match; This is a special case and is
not ignored like other protected fields but instead is filled in with the
current match at the point it's encountered. (Obviously the field is part of
the containing Class and could get in the way of programs that display or use
that Class.)
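
The "earliest of several targets" search can be sketched with indexOf alone.
Earliest and find are hypothetical names, and the tie-break (the first target
in the array wins when two start at the same place) is an assumption made for
the sketch.

```java
// Hypothetical sketch of searching ahead for the earliest of several targets.
class Earliest {
    /**
     * Returns {indexOfMatchingTarget, positionInText} for the target that
     * occurs first at or after 'from', or null if none of them occurs.
     */
    static int[] find(String text, int from, String... targets) {
        int best = -1;
        int bestPos = -1;
        for (int i = 0; i < targets.length; i++) {
            int p = text.indexOf(targets[i], from);
            if (p >= 0 && (bestPos < 0 || p < bestPos)) {
                best = i;
                bestPos = p;
            }
        }
        return best < 0 ? null : new int[] {best, bestPos};
    }

    public static void main(String[] args) {
        String text = "alpha; beta -> gamma";
        int[] hit = find(text, 0, "->", ";");
        // hit[0] is the analogue of Scan.index; the text before hit[1] is the
        // preceding text that gets returned along with the match.
        System.out.println("target " + hit[0] + " after '" + text.substring(0, hit[1]) + "'");
    }
}
```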

8) AtomObject.java: This shows the difference between scanning in an Atom
(always a String), and an Object which I've interpreted as either a Primitive
or a String (e.g. true is assumed to be Boolean, 123.4f is assumed Float).
This is rewriting the rules of Java to some extent because there exists no new
Object(String) constructor although at compile time Object[]obj={true,123.4f};
is interpreted in this way. (Please correct me if such a feature already
exists at runtime.) However these typed Objects are enormously useful since
your data can become a simple array of different types and if plotting it, for
example, you can interpret instanceof Double as a data point and "May" as a
date marker.
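
A text-to-typed-Object conversion of this sort might look like the sketch
below. The exact rules are assumptions chosen to mirror Java literals (true and
false become Boolean, a trailing f means Float, a decimal point means Double).
Typed and parse are hypothetical names, not part of Scan.

```java
// Hypothetical sketch: guess a primitive wrapper type from the text, else String.
class Typed {
    private static final String DECIMAL = "[-+]?(\\d+\\.?\\d*|\\.\\d+)([eE][-+]?\\d+)?";

    static Object parse(String s) {
        if (s.equals("true") || s.equals("false")) return Boolean.valueOf(s);
        try {
            return Integer.valueOf(s);              // plain integers first
        } catch (NumberFormatException e) {
            // not an int, fall through to the other forms
        }
        if (s.matches(DECIMAL + "[fF]")) {
            return Float.valueOf(s.substring(0, s.length() - 1));
        }
        if (s.matches(DECIMAL)) return Double.valueOf(s);
        return s;                                   // anything else stays a String
    }

    public static void main(String[] args) {
        for (String s : new String[] {"true", "123.4f", "123.4", "42", "May"}) {
            Object o = parse(s);
            System.out.println(s + " -> " + o.getClass().getSimpleName());
        }
    }
}
```

With values typed this way a plotting routine can test instanceof Double for
data points and treat a String such as "May" as a label.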

9) GOTCHA! Fails.java: Consider the following: Y and Z are inner classes of X:
class X{
class Y{}
class Z{}
}
If you attempt to scan an instance of Y, which happens to refer to an instance
of Z, then Z's instantiation will fail. This is because the instantiation of
an inner class needs an Object of its parent (in this case X). But Scan has no
access to X. You could put Z inside Y, but Z may also be referred to outside
Y. To get round this I've added to Scan a constructor which takes the data and
the source Object to be scanned as usual, plus an array of other Objects.
Include an instance of X, probably 'this', new Scan(data,new Y(),this), and
the problem goes away. This is illustrated in Fails.java.
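
The underlying reflection behaviour is easy to demonstrate. In the sketch
below, InnerDemo, hasNoArgConstructor and newInner are hypothetical names, and
X is nested inside InnerDemo only to keep the example in one class. It shows
that an inner class has no true no-argument constructor as far as reflection is
concerned: its constructor takes the enclosing instance as a hidden first
parameter.

```java
import java.lang.reflect.Constructor;

// Hypothetical sketch of why inner classes need an instance of their parent.
class InnerDemo {
    static class X {
        class Y {}
        class Z {}
    }

    /** True if reflection can find a constructor with no parameters at all. */
    static boolean hasNoArgConstructor(Class<?> c) {
        try {
            c.getDeclaredConstructor();
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    /** Creates an inner-class instance, passing the enclosing object explicitly. */
    static Object newInner(Class<?> inner, Object outer) {
        try {
            Constructor<?> ctor = inner.getDeclaredConstructor(inner.getDeclaringClass());
            ctor.setAccessible(true);
            return ctor.newInstance(outer);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(hasNoArgConstructor(X.Z.class));
        Object z = newInner(X.Z.class, new X());
        System.out.println(z.getClass().getSimpleName());
    }
}
```

This is why supplying an instance of X alongside the data gives a scanner what
it needs to build a Z.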

10) Constr.java: This uses Annotation to add a marker, @construct(), to any
field to say it matches a 'construct' in the text. This has the advantage of
allowing you to create typed (non-string) data such as a Date in the output.

11) Matrix.java: This shows preset arrays. Point[]p=new Point[2]; tells Scan
to find 2 points in the input. It also shows 9 input integers being turned
into a 3*3 matrix (int[][]m=new int[3][3];).
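
How a preset array can drive the number of values read is sketched below with
java.lang.reflect.Array. PresetArray and fill are hypothetical names, the
input is taken as an already-parsed int[] for brevity, and only int elements
are handled.

```java
import java.lang.reflect.Array;

// Hypothetical sketch: the dimensions of a preset array decide how many
// values are consumed from the input.
class PresetArray {
    /**
     * Fills every int slot of an already dimensioned array, row by row, from
     * values[next...]. Returns the index of the next unused value.
     */
    static int fill(Object array, int[] values, int next) {
        int length = Array.getLength(array);
        boolean nested = array.getClass().getComponentType().isArray();
        for (int i = 0; i < length; i++) {
            if (nested) {
                next = fill(Array.get(array, i), values, next);  // descend a dimension
            } else {
                Array.setInt(array, i, values[next++]);
            }
        }
        return next;
    }

    public static void main(String[] args) {
        int[][] m = new int[3][3];
        int used = fill(m, new int[] {1, 2, 3, 4, 5, 6, 7, 8, 9}, 0);
        System.out.println(used + " values used, m[1][2]=" + m[1][2]);
    }
}
```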

Summary of Features

Tabulated by data type

As a form of syntax definition Scan input is not as brief as BNF but it
does have the merit that when you come back to an old program it's easy to see
what it's doing. It gathers the data analysed (which formally BNF does not)
and it resets changes when backtracks occur. It avoids the nightmare of
'indexOf' and 'substring'. I would expect it to be used for local formatted
data, or for input extracted from HTML, for example.

I have used the system to implement a fairly ambitious macro extension to Java
which allows normal arrays to support the List methods add, set, remove,
resize, contains, and indexOf, e.g. String[]array={"a","b"}; array.add("c");

SERIOUS QUESTION: Can anyone think of a general application which requires
external non-text data to be captured in a Java Class structure?

Comments? Email Chris Paradine
