README

The Montreal Transducer module for GATE

User guide

Copyright Luc Plamondon, Université de Montréal, 2004.
plamondl@iro.umontreal.ca


Table of contents

  1. What is GATE?
  2. What is the Montreal Transducer?
  3. Getting help
  4. Installation procedure
  5. How to use it with the GATE GUI?
  6. How to use it in a standalone GATE program?
  7. Changes to the JAPE language
  8. For developers
  9. Licence
  10. Change log

1) What is GATE?

GATE is a development environment for language engineering. It is open source and it can be downloaded from http://gate.ac.uk. The processing of a document is divided into small tasks that are performed by independent JavaBeans modules. The Montreal Transducer is one of those modules.

2) What is the Montreal Transducer?

A transducer has 2 inputs: a document and a human-readable grammar. Generally, the output is a document with annotations added according to the grammar, but it could be anything else because the grammar allows Java code to be executed upon the parsing of a rule. A transducer can be used to identify named entities in a document, for example.

The GATE framework comes with a basic "Jape Transducer", which is fully described in the GATE user guide. The JAPE grammar language understood by the transducer is also explained there. There is also an "Ontology Aware Transducer" that is a wrapper around the Jape Transducer (in fact, the latter's core is already ontology aware). And there is an "ANNIE Transducer" that is nothing more than a Jape Transducer that loads with a named-entity recognition grammar.

The Montreal Transducer is an improved Jape Transducer. It is intended to make grammar authoring easier by providing a more flexible version of the JAPE language and it also fixes a few bugs.

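If you write JAPE grammars, see section Changes to the JAPE language for all the details. Otherwise, here is a short description of the enhancements:

a) The improvements

- More comparison operators in left hand side constraints: "!=", ">", "<", ">=", "<=", "=~" and "!~", in addition to "==".
- A negation operator: "!".
- Conjunctions of constraints on different types of annotation.

b) The bugs fixed

- The "*" and "+" Kleene operators are guaranteed to be greedy. In a rule such as the following, the first binding takes as many annotations as possible before the second one starts:

({Lookup.majorType == title})+:titles ({Token.orth == upperInitial})*:names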

3) Getting help

The reader should be familiar with the JAPE language. See the GATE user guide, more specifically section JAPE: Regular Expressions Over Annotations and appendix JAPE: Implementation.

The Montreal Transducer sources are freely available, so user support will be very limited.  You may find what you are looking for on the project homepage.

Developers will find comments on classes and methods through the javadoc pages: doc/javadoc/index.html.

4) Installation procedure

Java 1.4 or higher is required. The Montreal Transducer has been tested on GATE 2.1, 2.2 and 3.0. If you are using GATE 2.x, put the MtlTransducer.jar and creole.xml files in any directory (as long as they are in the same directory). If you are using GATE 3.0, put the 2 files in your plugin directory (more about plugins in the GATE user guide, section Use (CREOLE) Plug-ins).

Note that the directory must be accessible by the embedding application via the "file:" protocol. Unlike for most GATE modules, the directory (also known as a repository in GATE 2.x) of a transducer cannot be a web URL ("http://www..."). This is because the transducer compiles Java code (the grammar rules) every time it is loaded, and the resource jar file must be part of the classpath when compiling, but only regular file URLs are allowed in the classpath. The resource will try to add the jar file to the classpath automatically.

If problems arise when loading the transducer, add the jar file to the classpath manually prior to running the application.
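
As a quick sanity check before registering a repository, the protocol of its URL can be inspected. The sketch below is only an illustration: the class RepositoryUrlCheck and its method isFileUrl are hypothetical names that belong neither to GATE nor to the Montreal Transducer, and the code relies on nothing but java.net.URL.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper for illustration; not part of GATE or the Montreal Transducer.
final class RepositoryUrlCheck {

    /** Returns true only when spec is a well-formed URL that uses the "file" protocol. */
    static boolean isFileUrl(String spec) {
        try {
            return "file".equalsIgnoreCase(new URL(spec).getProtocol());
        } catch (MalformedURLException e) {
            return false; // not a URL at all, so not usable as a repository
        }
    }

    public static void main(String[] args) {
        System.out.println(isFileUrl("file:MtlTransducer/build"));              // true
        System.out.println(isFileUrl("http://www.example.org/MtlTransducer/")); // false
    }
}
```

A check like this only reports whether the URL has the right form; it does not verify that the directory exists or that it contains MtlTransducer.jar and creole.xml.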

If you plan to use the transducer with the GATE GUI, see section How to use it with the GATE GUI. If you plan to use it in a standalone program, jump to section How to use it in a standalone GATE program.

5) How to use it with the GATE GUI?

GATE 2.x: In the GUI menu, click on File / Load a CREOLE Repository, then enter the URL of the directory where the MtlTransducer.jar and creole.xml files live. The path must begin with "file:". It cannot be a web URL (see Installation procedure).

GATE 3.0: In the GUI menu, click on File / Manage CREOLE plugins, find the Montreal Transducer and tick the "Load now" or "Load always" box.

Then, for all versions of GATE: Click on File / New processing resource and choose Montreal Transducer. The only mandatory field is the Grammar URL: enter the path of a main.jape file in the same manner as for a regular Jape Transducer (this URL can point to a file on the web). Add the new module to a processing pipeline. It may be necessary to run a tokeniser and gazetteer before the transducer if the grammar uses Token and Lookup annotations.

6) How to use it in a standalone GATE program?

Note: this section was written for GATE 2.x. If you are using GATE 3.0, repository management (setting the plugin directory) may work differently.

A good starting point is the example code here. The following code registers a repository (the directory where the MtlTransducer.jar and creole.xml files live; the directory cannot be a web URL, see Installation procedure), then creates a Montreal Transducer with specific parameters (the grammarURL parameter is mandatory and it should point to a main.jape file like for a regular Jape Transducer), and finally adds the resource to a pipeline. It may be necessary to run a tokeniser and gazetteer before the transducer if the grammar uses Token and Lookup annotations.

// Imports needed by this snippet
import java.net.URL;
import gate.Factory;
import gate.FeatureMap;
import gate.Gate;
import gate.ProcessingResource;
import gate.creole.SerialAnalyserController;

// Gate.init() must have been called before this point.

// Create a pipeline
SerialAnalyserController annieController = (SerialAnalyserController) Factory.createResource(
   "gate.creole.SerialAnalyserController",
   Factory.newFeatureMap(), Factory.newFeatureMap(), "ANNIE_" + Gate.genSym());

// Load a tokeniser, gazetteer, etc. here

// Register the external repository (a "file:" URL) where MtlTransducer.jar and creole.xml live
Gate.getCreoleRegister().registerDirectories(new URL("file:MtlTransducer/build"));

// Create an instance of the transducer after having set the grammar URL
FeatureMap params = Factory.newFeatureMap();
params.put("grammarURL", new URL("file:creole/NE/main.jape"));
// Read the input annotations from the "Original markups" annotation set
params.put("inputASName", "Original markups");
ProcessingResource transducerPR = (ProcessingResource)
   Factory.createResource("ca.umontreal.iro.rali.gate.MtlTransducer", params);

// Add the transducer to the pipeline
annieController.add(transducerPR);

7) Changes to the JAPE language

The Montreal Transducer is based on the Transducer from the ANNIE suite but with the following added features:


More comparison operators

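The Montreal Transducer offers more comparison operators to put in left hand side constraints of a JAPE grammar. The standard ANNIE transducer allows only tests on the annotation type and equality tests on an attribute, such as {Annot} and {Annot.attribute == value}.

The Montreal Transducer additionally allows constraints that use the "!=", ">", "<", ">=", "<=", "=~" and "!~" operators and the negation operator "!", such as {Annot.attribute != value}, {MyAnnot.attrib > 2}, {Annot.attribute =~ "value"} and {!Organization}.

See the notes on the equality operators, comparison operators, pattern matching operators and negation operator.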

Notes on equality operators: "==" and "!="

The "!=" operator is the negation of the "==" operator, that is to say: {Annot.attribute != value} is equivalent to {!Annot.attribute == value}.

When a constraint on an attribute cannot be evaluated because an annotation does not have a value for the attribute, the equality operator returns false (and the difference operator returns true).

If the constraint's attribute is a string, then the String.equals method is called with the annotation's attribute as a parameter. If the constraint's attribute is an integer, then the Long.equals method is called. If the constraint's attribute is a float, then the Double.equals method is called. And if the constraint's attribute is a boolean, then the Boolean.equals method is called. The grammar parser does not allow other types of constraints.

Normally, when the types of the constraint's and the annotation's attribute differ, they cannot be equal. However, because some ANNIE processing resources (namely the tokeniser) set all attribute values as strings even when they are numbers (Token.length is set to a string value, for example), the Montreal Transducer can convert the string to a Long/Double/Boolean before testing for equality. In other words, for the token "dog", whose length attribute holds the string "3", both {Token.length == "3"} and {Token.length == 3} are true.
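
The equality rules described in these notes can be pictured with plain Java. The following is a minimal sketch and not the transducer's actual code: the class EqualitySketch and its methods are hypothetical, and the strict handling of booleans (only "true" and "false" are converted) is an assumption made for the example.

```java
// Hypothetical model of the "==" and "!=" rules; not the Montreal Transducer's real classes.
final class EqualitySketch {

    /** "==": a missing annotation value never equals anything; strings may be converted first. */
    static boolean constraintEquals(Object constraintValue, Object annotationValue) {
        if (annotationValue == null) {
            return false;
        }
        Object candidate = annotationValue;
        if (annotationValue instanceof String && !(constraintValue instanceof String)) {
            // e.g. Token.length is stored as the string "3" by the tokeniser
            candidate = convert((String) annotationValue, constraintValue);
        }
        // Long.equals, Double.equals, Boolean.equals or String.equals, depending on the constraint
        return constraintValue.equals(candidate);
    }

    /** "!=" is simply the negation of "==". */
    static boolean constraintDiffers(Object constraintValue, Object annotationValue) {
        return !constraintEquals(constraintValue, annotationValue);
    }

    /** Converts text to the constraint's type, or returns null when that is impossible. */
    private static Object convert(String text, Object constraintValue) {
        try {
            if (constraintValue instanceof Long) return Long.valueOf(text);
            if (constraintValue instanceof Double) return Double.valueOf(text);
        } catch (NumberFormatException e) {
            return null;
        }
        if (constraintValue instanceof Boolean) {
            // Assumption: only the literal words are accepted as booleans
            if (text.equalsIgnoreCase("true")) return Boolean.TRUE;
            if (text.equalsIgnoreCase("false")) return Boolean.FALSE;
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(constraintEquals(3L, "3"));    // true: "3" is converted to a Long
        System.out.println(constraintEquals("3", "3"));   // true: plain String.equals
        System.out.println(constraintEquals(3L, "dog"));  // false: "dog" is not a number
        System.out.println(constraintDiffers(3L, null));  // true: missing attribute
    }
}
```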

Notes on comparison operators: ">", "<", ">=" and "<="

If the constraint's attribute is a string, then the String.compareTo method is called with the annotation's attribute as a parameter (strings can be compared alphabetically). If the constraint's attribute is an integer, then the Long.compareTo method is called. If the constraint's attribute is a float, then the Double.compareTo method is called. The transducer issues a warning if an attempt is made to compare two Booleans because this type does not implement the Comparable interface and thus has no compareTo method.

The transducer issues a warning when it encounters an annotation's attribute that cannot be compared to the constraint's attribute because the value types are different, or because one value is null. For example, given a constraint {MyAnnot.attrib > 2}, a warning is issued for any MyAnnot in the document for which attrib is not an integer, such as attrib = "dog" because we cannot evaluate "dog" > 2. Similarly, {MyAnnot.attrib > 2} cannot be compared to attrib = 2.5 because 2.5 is a float. In this case, force 2 as a float with {MyAnnot.attrib > 2.0}.

The transducer does not issue a warning when the constraint's attribute is an integer/float and the annotation's attribute is a string that can be parsed as an integer/float. Some ANNIE processing resources (namely the tokeniser) set all attribute values as strings even when they are numbers (Token.length is set to a string value, for example), and because {Token.length < "10"} would lead to an alphabetical comparison, a workaround was needed so we could write {Token.length < 10}.
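
The difference between an alphabetical and a numeric comparison, and the cases that lead to a warning, can be sketched as follows. This is an illustrative model only: ComparisonSketch is a hypothetical class, the empty result stands in for "the transducer issues a warning", and the direction of the comparison (annotation value on the left) is an assumption of the example.

```java
import java.util.OptionalInt;

// Hypothetical model of the ">", "<", ">=" and "<=" rules; not the transducer's real code.
final class ComparisonSketch {

    /**
     * Compares the annotation value (left) with the constraint value (right).
     * An empty result models the cases where a warning would be issued.
     */
    static OptionalInt compare(Object annotationValue, Object constraintValue) {
        if (annotationValue == null || constraintValue == null) {
            return OptionalInt.empty();
        }
        Object left = annotationValue;
        if (left instanceof String && !(constraintValue instanceof String)) {
            left = parseAs((String) left, constraintValue);
        }
        if (left instanceof String && constraintValue instanceof String) {
            return OptionalInt.of(((String) left).compareTo((String) constraintValue));
        }
        if (left instanceof Long && constraintValue instanceof Long) {
            return OptionalInt.of(((Long) left).compareTo((Long) constraintValue));
        }
        if (left instanceof Double && constraintValue instanceof Double) {
            return OptionalInt.of(((Double) left).compareTo((Double) constraintValue));
        }
        return OptionalInt.empty(); // different types, or Booleans
    }

    /** Parses text as the constraint's numeric type; returns the text unchanged on failure. */
    private static Object parseAs(String text, Object constraintValue) {
        try {
            if (constraintValue instanceof Long) return Long.valueOf(text);
            if (constraintValue instanceof Double) return Double.valueOf(text);
        } catch (NumberFormatException e) {
            // not a number: fall through and keep the string
        }
        return text;
    }

    public static void main(String[] args) {
        System.out.println(compare("5", 10L).getAsInt() < 0);   // true: 5 < 10 numerically
        System.out.println(compare("5", "10").getAsInt() < 0);  // false: "5" sorts after "10"
        System.out.println(compare("dog", 2L).isPresent());     // false: cannot evaluate "dog" > 2
        System.out.println(compare(2.5, 2L).isPresent());       // false: float against integer
        System.out.println(compare(2.5, 2.0).getAsInt() > 0);   // true: 2.5 > 2.0
    }
}
```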

Notes on pattern matching operators: "=~" and "!~"

The "!~" operator is the negation of the "=~" operator, that is to say: {Annot.attribute !~ "value"} is equivalent to {!Annot.attribute =~ "value"}.

When a constraint on an attribute cannot be evaluated because an annotation does not have a value for the attribute, the value defaults to an empty string ("").

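The regular expression must be enclosed in double quotes, otherwise the transducer issues a warning. For example, {Annot.attribute =~ "value"} is accepted, whereas {Annot.attribute =~ value} triggers the warning.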

The regular expression must be a valid java.util.regex.Pattern, otherwise a warning is issued.

To have a match, the regular expression must cover the entire attribute string, not only a part of it. For example, the expression "[Dd]og" matches the attribute value "dog" but not "dogs".
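
In java.util.regex terms, this whole-string behaviour corresponds to Pattern.matches (or Matcher.matches) rather than Matcher.find. The sketch below illustrates the rules of this note under that assumption; PatternConstraintSketch is a hypothetical class, and returning false for an invalid expression stands in for the warning issued by the transducer.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical model of the "=~" and "!~" rules; not the transducer's real code.
final class PatternConstraintSketch {

    /** "=~": the expression must cover the entire attribute value. */
    static boolean matches(String regex, String attributeValue) {
        // A missing attribute value defaults to the empty string
        String value = (attributeValue == null) ? "" : attributeValue;
        try {
            return Pattern.matches(regex, value);
        } catch (PatternSyntaxException e) {
            return false; // the transducer would issue a warning here
        }
    }

    /** "!~" is the negation of "=~". */
    static boolean doesNotMatch(String regex, String attributeValue) {
        return !matches(regex, attributeValue);
    }

    public static void main(String[] args) {
        System.out.println(matches("[Dd]og", "dog"));    // true
        System.out.println(matches("[Dd]og", "dogs"));   // false: "s" is not covered
        System.out.println(matches("[Dd]ogs?", "dogs")); // true
        System.out.println(matches(".*", null));         // true: null is treated as ""
    }
}
```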

Notes on the negation operator: "!"

Bindings: when a constraint contains both negated and regular elements, the negated elements do not affect the bindings of the regular elements. Thus, {Person, !Organization} binds to the same annotations (amongst those that start at the current node in the annotation graph) as {Person}; the difference between the two is that the first will simply not match if one of the annotations starting at the current node is an Organization. On the other hand, when a constraint contains only negated elements, such as {!Organization}, it binds to all annotations starting at the current node. It is important to keep that in mind, especially when a rule ends with a constraint with negated elements only: the longest annotation at the current node will be preferred.
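
The binding behaviour of negated elements can be modelled in a few lines. This is a deliberately simplified sketch: annotations starting at the current node are reduced to their type names, attribute tests are ignored, and NegationBindingSketch is a hypothetical class, not one of the transducer's.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical, simplified model of bindings with negated elements.
final class NegationBindingSketch {

    /**
     * typesAtNode: types of the annotations that start at the current node.
     * Returns the bound annotation types, or empty when the constraint does not match.
     */
    static Optional<List<String>> bind(List<String> typesAtNode,
                                       Set<String> required,
                                       Set<String> negated) {
        for (String type : negated) {
            if (typesAtNode.contains(type)) {
                return Optional.empty(); // a negated type is present: no match
            }
        }
        if (!typesAtNode.containsAll(required)) {
            return Optional.empty();     // a regular element is missing: no match
        }
        List<String> bound = new ArrayList<>();
        for (String type : typesAtNode) {
            // Only negated elements: bind everything. Otherwise bind the regular elements only.
            if (required.isEmpty() || required.contains(type)) {
                bound.add(type);
            }
        }
        return Optional.of(bound);
    }

    public static void main(String[] args) {
        List<String> node = Arrays.asList("Person", "Token");
        Set<String> person = Collections.singleton("Person");
        Set<String> organization = Collections.singleton("Organization");
        Set<String> none = Collections.emptySet();

        System.out.println(bind(node, person, organization)); // Optional[[Person]]
        System.out.println(bind(node, person, none));         // Optional[[Person]]
        System.out.println(bind(node, none, organization));   // Optional[[Person, Token]]
    }
}
```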

Conjunctions of constraints on different types of annotation

The Montreal Transducer allows constraints on different types of annotation. Though the JAPE implementation described in the GATE 2.1 User Guide details an algorithm that would allow such constraints, the ANNIE transducer does not implement it. This transducer does. Constraints such as {Person, Organization} and {Person, !Organization} do not work as expected with the ANNIE transducer but do with this transducer.

As described in the algorithm, the first constraint above matches points in the document (or nodes in the annotation graph) where both a Person and an Organization annotation begin, even if they do not end at the same point in the document and even if other annotations begin at the same point. When a negation is involved, as in the second constraint above, no annotation of that kind must begin at a given point for a match to occur (see the note on the negation operator above).
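
The idea of matching at points where several annotation types begin can be pictured with a small model. The sketch below is an assumption-laden illustration: ConjunctionSketch and its nested Ann class are hypothetical, annotations are reduced to a type and two offsets, and the sample offsets are made up.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical model: find the offsets where all required annotation types begin.
final class ConjunctionSketch {

    /** Minimal stand-in for an annotation: a type plus start and end offsets. */
    static final class Ann {
        final String type;
        final int start;
        final int end;

        Ann(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** Returns the start offsets at which an annotation of every required type begins. */
    static SortedSet<Integer> startsWhereAllBegin(List<Ann> annotations, Set<String> requiredTypes) {
        Map<Integer, Set<String>> typesByStart = new TreeMap<>();
        for (Ann a : annotations) {
            typesByStart.computeIfAbsent(a.start, k -> new HashSet<>()).add(a.type);
        }
        SortedSet<Integer> result = new TreeSet<>();
        for (Map.Entry<Integer, Set<String>> entry : typesByStart.entrySet()) {
            // End offsets and other annotation types at the same node are irrelevant
            if (entry.getValue().containsAll(requiredTypes)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    /** Made-up sample data: the two annotations at offset 0 end at different points. */
    static List<Ann> sample() {
        return Arrays.asList(
            new Ann("Person", 0, 12),
            new Ann("Organization", 0, 30),
            new Ann("Token", 0, 4),
            new Ann("Person", 40, 50),
            new Ann("Organization", 45, 60));
    }

    public static void main(String[] args) {
        Set<String> both = new HashSet<>(Arrays.asList("Person", "Organization"));
        System.out.println(startsWhereAllBegin(sample(), both)); // [0]
    }
}
```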

Greedy Kleene operators: "*" and "+"

The ANNIE transducer does not behave consistently regarding the "*" and "+" Kleene operators. Suppose we have the following rule with 2 bindings:

({Lookup.majorType == title})+:titles ({Token.orth == upperInitial})*:names

Given the sentence "the Honourable Mr. John Atkinson", we expect the "titles" binding to cover both titles ("Honourable" and "Mr.") and the "names" binding to cover "John Atkinson". But the ANNIE transducer could end the "titles" binding after the first title and leave the rest to the "names" binding. This is not incorrect, but according to convention, "*" and "+" operators match as many tokens as possible before moving on to the next constraint. The Montreal Transducer guarantees that "*" and "+" are greedy.
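
The difference between greedy and non-greedy operators can be reproduced with ordinary Java regular expressions. The following is an analogy only, not JAPE: each character stands for one word, with "T" for a title that is also capitalised and "N" for a capitalised word that is not a title, so "TTNN" is an assumed encoding of "Honourable Mr. John Atkinson". GreedyKleeneAnalogy is a hypothetical class.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Analogy with java.util.regex quantifiers; this is not how JAPE rules are executed.
final class GreedyKleeneAnalogy {

    /** Returns the text captured by groups 1 and 2, or null when the input does not match. */
    static String[] split(String regex, String codedWords) {
        Matcher m = Pattern.compile(regex).matcher(codedWords);
        return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        // Greedy "+": the first group takes every title before the second group starts
        System.out.println(Arrays.toString(split("(T+)([TN]*)", "TTNN")));  // [TT, NN]
        // Reluctant "+?": the first group stops after one title
        System.out.println(Arrays.toString(split("(T+?)([TN]*)", "TTNN"))); // [T, TNN]
    }
}
```

Both results cover the whole input, which is why the second one is "not incorrect"; the greedy form is simply the conventional choice.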
 

8) For developers

Developers will find comments on classes and methods through the javadoc pages: doc/javadoc/index.html. Most of the source code comes from the Jape Transducer in GATE. It was necessary to copy entire packages instead of overriding a few methods because many class attributes and members were not accessible outside the gate.xxx package. The Montreal Transducer needs 4 packages:

a) ca.umontreal.iro.rali.gate.creole

Contains only the MtlTransducer class, which is the module's interface with the outside world. The MtlTransducer class is almost exactly the same as gate.creole.Transducer (the basic Jape Transducer). The code of OntologyAwareTransducer is also included in MtlTransducer. It was impossible to simply extend any of those transducers because some members are private or package-protected.

b) ca.umontreal.iro.rali.gate.fsm

Same as the gate.fsm package. This package models the grammar as a finite state machine. Only the convertComplexPE private method of the FSM class has been substantially modified.

c) ca.umontreal.iro.rali.gate.jape

Almost the same as the gate.jape package. Significant modifications were made to the SinglePhaseTransducer, Constraint and JdmAttribute classes.

d) ca.umontreal.iro.rali.gate.jape.parser

Almost the same as gate.jape.parser package. Modifications were made to ParseCpsl.jj so that the JAPE language could be extended. This file is to be compiled with javacc. The other classes of the package are automatically generated by javacc.

9) Licence

This work is a modification of some GATE libraries and therefore the binaries and source code are distributed under the same licence as GATE itself. GATE is licensed under the GNU Library General Public License, version 2 of June 1991. That licence is distributed with this module in the file LICENCE.html. GATE binaries and source code are available at http://gate.ac.uk. Modifications to the original source code are detailed in the header of each file.

Basically, the Montreal Transducer source code and binaries are free. A work that would be a modification of it should also be free. However, a work that would only USE the Montreal Transducer would be exempted from the terms of the licence, provided the GATE and Montreal Transducer binaries, source code and licence are distributed with the embedding work and provided the use of that software is acknowledged. For additional help on the interpretation of the GATE licence, see http://www.gate.ac.uk/gate/doc/index.html.

10) Change log

1.2:
- Updated documentation to address GATE 3.0 plugin management.

1.1:
- Bug fixed: a constraint with multiple negated tests on the same attribute of a given annotation type would match when at least one test succeeds, but it should match only when ALL negated tests succeed.

1.0:
- Initial release.