Deriving Linguistic Resources from Treebanks
This page is under construction.
Welcome to the Dublin-Essex-Saarbrücken Treebank project!
We have two papers accepted for ACL-04 in Barcelona, in July.
We invite you to revisit this page since we are
adding new material all the time. Recent additions (beginning of August 2001) include copies of all our papers written so far. We are also very pleased to announce that Josef
Van Genabith and Andy Way were recently awarded £100,000 by Enterprise Ireland under their Basic Research scheme to carry out further research in this area over the next 3 years. Two postgraduate students, Aoife Cahill and Mairead McCarthy, started their programmes of study in October 2001.
Introduction
This page describes work done by Josef
Van Genabith, Louisa
Sadler, Anette Frank and Andy Way on manipulating Treebanks to develop other
linguistic resources.
Probabilistic Unification Grammars (e.g. LFG-DOP: Bod and Kaplan, 1998) require large, high-quality training
corpora. These corpora have to provide tree structures with feature structure annotations.
Such corpora are expensive to construct and hard to come by. The traditional procedure for
constructing such corpora is to use a large-scale unification grammar (in the real world, this often means writing one yourself!) and parse
text. Typically for each string in the input text the grammar will produce hundreds or
thousands of candidate tree-feature structure pairs from which a highly trained linguist has
to pick the best analysis for inclusion in the training corpus. This is time-consuming and
error-prone. We have developed an alternative method. The basic idea is extremely simple.
As input our method requires a treebank. From this we automatically compile the CF-PSG
following the method of [Charniak, 96]. We then manually annotate the CF-PSG with
f-structure equations and provide macros for the lexical categories. Then (and this is the
trick) we "reparse" the treebank entries (not the strings), simply following the annotations
put in there by the original human annotators, and as we do so we solve the f-equations on
the rules encountered in that process. This results in an f-structure induced by the
best-fitting tree for the example at hand. If the f-structure annotations are deterministic,
then so is the whole process, and we do not have to choose from hundreds or thousands of
alternatives.
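The idea can be illustrated with a minimal sketch in Python. It is not our implementation: the tree encoding, the toy rule annotations, the lexical macros and all function names below are assumptions made purely for illustration. Rules are read off a treebank tree, each rule carries one annotation per daughter ("=" for up=down, otherwise a grammatical function F for (up F)=down), and the f-structure is built by walking the existing tree rather than parsing the string.

```python
# Trees are nested tuples: (category, child, ...) for phrases, (tag, word) for leaves.
TREE = ("S",
        ("NP", ("NNP", "Mary")),
        ("VP", ("VBD", "saw"), ("NP", ("NNP", "John"))))

# Hypothetical hand-written annotations, one entry per daughter of each rule.
# "=" means up=down (head); any other string F means (up F)=down.
ANNOTATIONS = {
    ("S", ("NP", "VP")): ("SUBJ", "="),
    ("VP", ("VBD", "NP")): ("=", "OBJ"),
    ("NP", ("NNP",)): ("=",),
}

# Hypothetical lexical macros: tag -> function building the word's f-structure.
LEXICAL_MACROS = {
    "NNP": lambda word: {"PRED": word, "NUM": "sg"},
    "VBD": lambda word: {"PRED": word, "TENSE": "past"},
}


def is_leaf(node):
    """A leaf is a (tag, word) pair whose second element is a string."""
    return len(node) == 2 and isinstance(node[1], str)


def rule_of(node):
    """Read the CFG rule (lhs, rhs) off a non-leaf tree node."""
    return (node[0], tuple(child[0] for child in node[1:]))


def extract_rules(node, rules=None):
    """Collect every rule instance used in the tree, in top-down order."""
    rules = [] if rules is None else rules
    if not is_leaf(node):
        rules.append(rule_of(node))
        for child in node[1:]:
            extract_rules(child, rules)
    return rules


def unify(f, g):
    """Merge g into f in place; raise ValueError if atomic values clash."""
    for key, value in g.items():
        if key not in f:
            f[key] = value
        elif isinstance(f[key], dict) and isinstance(value, dict):
            unify(f[key], value)
        elif f[key] != value:
            raise ValueError(f"clash on {key}: {f[key]!r} vs {value!r}")
    return f


def f_structure(node):
    """Follow the given tree and solve the annotations on each rule met."""
    if is_leaf(node):
        tag, word = node
        return LEXICAL_MACROS[tag](word)
    result = {}
    for child, annotation in zip(node[1:], ANNOTATIONS[rule_of(node)]):
        child_fs = f_structure(child)
        if annotation == "=":
            unify(result, child_fs)                 # up = down
        else:
            unify(result, {annotation: child_fs})   # (up F) = down
    return result


print(extract_rules(TREE))
print(f_structure(TREE))
```

Because the tree is given, each node selects exactly one rule, so with deterministic annotations exactly one f-structure is produced per treebank entry. A failed unification (the ValueError above) corresponds to an unresolvable set of descriptions.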
Papers
Papers have been presented at LFG-99,
at the EACL Workshop
on Linguistically Interpreted Corpora, at the ATALA Workshop on Treebanks, at LFG-2000, at LFG-2001, at LREC-2002, at LFG-2002, and at Treebanks and Linguistic Theories.
- LFG-1999 (full paper (in html), psnup-ed postscript version) (438K)
- This inaugural paper, on Semi-Automatic Generation of F-Structures from Treebanks, sets out our initial ideas, together with a number of insights from a set of experiments.
- EACL Workshop on Linguistically Interpreted Corpora (.ps, abstract (html))
- This paper builds on the LFG-99 paper by adding to the annotated grammars produced there in order to permit LFG analyses of strings, rather than treebank trees. This requires adding semantic forms (subcategorisation frames) and grammatical checking (completeness and coherence). This paper shows how semantic forms may be compiled automatically from the LFG-annotated treebanks produced in the LFG-99 paper.
- ATALA Workshop on Treebanks (.pdf, .ps, abstract (html))
- The main work involved in the approaches outlined in the two previous papers was the manual annotation of the CFGs extracted automatically from the treebank. In this paper we begin our investigation of writing annotation principles which are then applied to the extracted CFG to automatically annotate the grammar with LFG functional schemata. A second approach, which operates directly on constraint set encodings of treebank trees and annotates them directly with f-structure information, is also presented. We give results for Precision and Recall for the automatic annotation compared to the `gold standard' reference grammar constructed manually, as well as for a second experiment on the alternative approach. This new work has the potential to reduce the manual overhead considerably, yielding huge savings in time and cost for treebank annotation and grammar development.
- LFG-2000 (.pdf, .ps, abstract (html))
- This paper explains the first methodology outlined in the ATALA paper in much more detail, based on the original ideas in the LFG-99 paper. Interested parties may wish to consult the companion paper (.ps), which describes our alternative approach in more detail than the ATALA paper does.
- LFG-2001 (.ps, abstract (html))
- This paper examines Treebank vs. X-BAR based Automatic F-Structure Annotation. Treebank trees and CFGs extracted from treebank resources do not tend
to follow strongly hierarchical and recursive X-BAR design principles
but instead feature a large number of rather flat trees and CFG rules. Grammars conforming more to X-BAR principles should allow better generalisations to be drawn in our annotation principles. We examine how treebank grammars may be ported automatically to X-BAR grammars, and provide some analysis of how the automatic annotation process compares with both types of grammar.
- LREC-2002 (.ps)
- This paper presents a new automatic annotation method that scales and
has been applied to a complete treebank, the WSJ section
of Penn-II, with more than 1,000,000 words in about 50,000 sentences.
- LFG-2002 (Slides (.ppt) - paper not currently available)
- Treebanks and Linguistic Theories 2002 (.ps)
- This paper presents a number of automatic evaluation
methodologies for assessing the effectiveness of the techniques we
have developed. They include quantitative and qualitative
metrics. Quantitative metrics do not involve a `gold standard', while
qualitative metrics do. For
the quantitative evaluation, we demonstrate the coverage of our
annotation algorithm with respect to rule types and tokens, and we
also provide details of fragmentation, as well as annotation failure
where a set of unresolvable descriptions results in no unified
f-structure being produced. The qualitative measures compare the
f-structure annotations generated by our automatic annotation
procedure against those contained in a manually constructed `gold
standard' set of f-structures. 105 trees from section 23 of the
Wall Street Journal part of the treebank were
randomly selected and manually annotated with f-structure descriptions. We then
use two measures to compare the automatically generated set of
equations against the gold standard set: firstly, we use
the standard evalb test on the annotated trees, and secondly, we
calculate Precision and Recall on the flat set descriptions of
the f-structures generated.
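As a rough illustration of the second qualitative measure, the sketch below computes Precision and Recall over flat set descriptions of f-structures. The encoding used here (each atomic value paired with the attribute path leading to it) and the function names are assumptions made for this example; the exact description format used in the paper is not reproduced here.

```python
def flat_description(fs, path=()):
    """Flatten a nested f-structure (dict) into a set of (path, value) pairs."""
    facts = set()
    for attribute, value in fs.items():
        if isinstance(value, dict):
            facts |= flat_description(value, path + (attribute,))
        else:
            facts.add((path + (attribute,), value))
    return facts


def precision_recall(generated, gold):
    """Precision and Recall of a generated description set against a gold set."""
    matched = len(generated & gold)
    precision = matched / len(generated) if generated else 0.0
    recall = matched / len(gold) if gold else 0.0
    return precision, recall


# Toy data: the generated f-structure is correct but lacks the TENSE feature.
gold_fs = {"PRED": "saw", "TENSE": "past", "SUBJ": {"PRED": "Mary", "NUM": "sg"}}
auto_fs = {"PRED": "saw", "SUBJ": {"PRED": "Mary", "NUM": "sg"}}

precision, recall = precision_recall(flat_description(auto_fs),
                                     flat_description(gold_fs))
print(precision, recall)
```

In this toy case every generated fact is in the gold set, so Precision is perfect, while the missing TENSE feature lowers Recall.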
Downloadable Results
The publicly available subset of the AP Treebank consists of 100 sentences of newswire reports.
These are available here as:
We are beginning to make available the f-structures we have derived semi-automatically
from the AP Treebank, developed at Lancaster.
Again, there are several sets of f-structures:
- those derived from a deterministic grammar:
- those derived from a non-deterministic grammar (where a choice of f-structure is available for a sentence, we have selected here what we consider to be the better one)
- those derived from a collapsed (i.e. generalised) grammar (not yet available)
These will be made available in various other formats, including LaTeX, for optimal reusability.
Other potentially useful resources we have developed include:
- the lexicon (see the LOB Tagset for a description of the tags used)
- a file showing which words and rules are used in which sentences
- some LFG semantic forms (i.e. subcategorisation frames) derived from:
- a (very slow!!) DCG derived from the grammar rules
- a Probabilistic Grammar derived from the grammar rules
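For readers unfamiliar with how a probabilistic grammar can be derived from treebank rules, here is a minimal sketch. It assumes plain relative-frequency estimation over rule instances, which is the usual approach for treebank grammars; this page does not state how our own probabilistic grammar was estimated, and the rule list and function name below are invented for illustration.

```python
from collections import Counter


def rule_probabilities(rule_instances):
    """Estimate P(lhs -> rhs) as count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter(rule_instances)
    lhs_counts = Counter(lhs for lhs, _ in rule_instances)
    return {rule: count / lhs_counts[rule[0]]
            for rule, count in rule_counts.items()}


# Toy rule instances, as they might be read off a handful of treebank trees.
instances = [
    ("S", ("NP", "VP")),
    ("NP", ("NNP",)),
    ("NP", ("NNP",)),
    ("NP", ("DT", "NN")),
    ("VP", ("VBD", "NP")),
]

for (lhs, rhs), probability in rule_probabilities(instances).items():
    print(f"{lhs} -> {' '.join(rhs)}  {probability:.3f}")
```

The probabilities of all rules sharing a left-hand side sum to one, which is what makes the result a proper probabilistic context-free grammar.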
Once we are satisfied that we have finished our work, we intend to make the grammars themselves available.
Other, newer
results include work on automatically annotating the Penn-II Treebank
with LFG functional annotations. We reported on this work at the LREC-2002 workshop on
Linguistic Knowledge Acquisition and Representation: Bootstrapping
Annotated Language Data, and at LFG-2002. We are making
the current, draft resources available:
Finally, you may wish to access some other treebanks. Some of the better known ones include:
Some of you may be interested in a new treebank project site set up in Paris, as well as a new book on Treebanks.
Andy Way. Last edited: 31st July, 2002