TREEBANK UTILS

SYNOPSIS

A few commands useful when dealing with trees in Penn Treebank format.
Input is assumed to be UTF-8. Output is UTF-8.
If you would like more commands implemented, bug me.

INSTALLATION

There is an executable compiled for Linux x86. If you need to compile
it yourself, read on.

BUILDING

You will need a recent version of the GHC Haskell compiler. The
procedure is as follows:

  runghc Setup.hs configure --prefix=PREFIX    (e.g. $HOME)
  runghc Setup.hs build
  runghc Setup.hs install

The file PREFIX/bin/treebank is the main executable.

Usage: treebank command [OPTION...]

print: print trees in specified format
      --format=line|page     tree format
      --wrap=string          category of top-level node to wrap parse tree in
      --range=START:NUMBER   tree range starting at START and containing
                             the following NUMBER trees
      --max-len=number       maximum length of yield of trees to output
      --lgs                  repair LGS: move LGS tags from PP to NP nodes
      --sort                 sort trees by their yields lexicographically
      --token=charniak|bikel output tokens only in specified format
      --fix-formatting       try fixing common formatting errors
      --add-lemmas=string    append lemmas to word forms using 'string' as
                             separator - this option requires the
                             --lemma-file option
      --lemma-file=path      path to file with lemmas

split: split n-best list into separate files
      --file=FILE-PATH       path to n-best file
  -n INTEGER  --n=INTEGER    n in n-best list

nbest-head: display first n sentences from charniak nbest output

span-overlap: calculate precision, recall and f-score of node-spans
between trees

node-count: print total number of non-empty nodes in trees read from stdin
      --phrasal              only count non-terminal nodes
      --skip-root            don't count root node

Examples:

  treebank split --n=100 --file=tree-file
  treebank print --format=page --range=1:10 < tree-file
  treebank print --add-lemmas=# --lemma-file=path/to/lemma/file < tree-file
  treebank print --token=charniak < tree-file

CONTACT

Grzegorz Chrupała
National Center for Language Technology (L2.08)
School of Computing, Dublin City University
Glasnevin, Dublin 9
Ireland, EU

http://computing.dcu.ie/~gchrupala
Phone: +353 1 700 6913
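
EXAMPLE: SPAN OVERLAP SKETCH

The span-overlap command is described above only as calculating precision,
recall and f-score of node-spans between trees. As a rough illustration of
that kind of metric, here is a minimal Python sketch. It is not the tool's
actual implementation and may differ from it in detail. Assumptions made
here: each tree is one bracketed string, spans are unlabeled, preterminal
nodes such as (DT the) are not counted, the root is counted, and repeated
spans are matched as a multiset. The function names node_spans and
span_overlap are hypothetical.

```python
from collections import Counter


def node_spans(tree_text):
    """Collect (start, end) word-index spans of phrasal nodes in one tree."""
    tokens = tree_text.replace("(", " ( ").replace(")", " ) ").split()
    spans = Counter()

    def walk(pos, start):
        # tokens[pos] is "(": skip it, then skip the category label if there
        # is one (a wrapper node such as "( (S ...))" has no label).
        pos += 1
        if tokens[pos] not in ("(", ")"):
            pos += 1
        end = start
        has_subtree = False
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                pos, end = walk(pos, end)
                has_subtree = True
            else:
                pos += 1  # a word: advance past it
                end += 1  # and extend the yield by one
        if has_subtree:  # assumption: preterminals are not counted as spans
            spans[(start, end)] += 1
        return pos + 1, end

    walk(0, 0)
    return spans


def span_overlap(gold_text, test_text):
    """Return (precision, recall, f_score) over unlabeled node spans."""
    gold, test = node_spans(gold_text), node_spans(test_text)
    matched = sum((gold & test).values())  # multiset intersection
    precision = matched / sum(test.values()) if test else 0.0
    recall = matched / sum(gold.values()) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


if __name__ == "__main__":
    gold = "(S (NP (DT the) (NN dog)) (VP (VBD saw) (NP (DT a) (NN cat))))"
    test = "(S (NP (DT the) (NN dog) (VBD saw)) (NP (DT a) (NN cat)))"
    print(span_overlap(gold, test))
```

Unlabeled spans keep the sketch short. Comparing (label, start, end)
triples instead would give a labeled variant of the same measure.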