- Tuning Statistical Machine Translation on character sequences. An extension to SMT toolkits to tune on character sequences (rather than the default word sequences as used by e.g. BLEU) by adding the chrF1 evaluation metric. Available as part of the Joshua SMT toolkit. More details in our WMT16 paper.
- DELiC4MT. A software tool for diagnostic evaluation of Machine Translation systems over linguistic checkpoints. Visit its website.
- Webservices for aligners.
Source code of webservices for several sentential and subsentential
aligners and workflows. Details can be found in my paper "Towards a
User-Friendly Webservice Architecture for Statistical Machine
Translation in the PANACEA project (EAMT 2011)". [tgz].
- Catalan<->Italian Machine Translation system. Collaboration with the
Apertium project to create a
translator for the language pair Catalan-Italian by exploiting the
translators for the pairs Spanish<->Catalan and Spanish<->Italian. You can
download the latest data from
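
As a rough illustration of what the chrF metric in the first item measures, the sketch below computes a simplified sentence-level chrF score in Python. It is not the Joshua implementation: the function names `char_ngrams` and `chrf`, the maximum n-gram order of 6, and the choice to treat whitespace as ordinary characters are assumptions made for this example. Setting `beta=1.0` gives the balanced F-score that the name chrF1 refers to.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count the character n-grams of order n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def chrf(hypothesis, reference, max_n=6, beta=1.0):
    """Simplified chrF: F-beta over n-gram precision and recall averaged across orders.

    Assumption: whitespace is kept and counted like any other character.
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # this order is longer than one of the strings
        # Clipped overlap: each n-gram counts at most as often as in the other string.
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

With this sketch, `chrf("abc", "abc")` returns `1.0`, and two strings that share no characters score `0.0`. Because matching is done on characters, a hypothesis that differs from the reference only in a word ending still earns partial credit, which is what distinguishes it from word-level metrics such as BLEU.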
Older stuff (warning: not actively maintained!)
- Linking Wikipedia categories to WordNet synsets. A set of
polysemous nouns from WordNet 2.1 that are mapped to Wikipedia
categories. The disambiguation task is then to identify,
for each noun, which of its senses, if any, corresponds to the mapped
category or categories. Download the evaluation data
- tfidfwrap. Provides a TF-IDF C++ class by wrapping tfidf
(a TF-IDF library in Python, http://code.google.com/p/tfidf/), which:
"constructs an IDF corpus and stopword list either
from documents specified by the client, or by reading from input files.
It computes IDF for a specified term based on the corpus, or generates
keywords ordered by tf-idf for a specified document".
Download the source [tgz]
and the corpus and stopword TF-IDF files generated from the English
Wikipedia (dump from January 2008) [tgz],
the Italian Wikipedia [tgz]
and the Spanish Wikipedia [tgz].
- wiki_db_access (C++ Wikipedia API). A free software (GPL
licensed) package that
includes a C++ API (tested under GNU/Linux and Win32) to
access Wikipedia in database format, plus
utilities to download and import the required data. Download the source
- Manually disambiguated mappings between WordNet 1.5 and
WordNet 3.0: a set of manually disambiguated mappings that resulted
from upgrading the Inter-Lingual connections of the Italian WordNet
from version 1.5 to 3.0.
Download the mappings [tgz].
- DRAMNERI. A free software
(GPL licensed) application for Named
Entity Recognition (NER). It is a knowledge-based and
customizable NER tool. It is fully documented and has been
successfully tested under GNU/Linux and Win32, although it should work on
any platform where a C++ compiler with STL support is available.
Download the source with documentation and examples [tgz],
Win32 binaries [tgz].
- WinDRAMNERI. A freeware
Spanish frontend for
DRAMNERI (v.0.2.x) which runs under Win32 platforms. It may need
additional DLL and OCX files (the application will print a message
asking for them if so). Developed and contributed by Carlos
Leonel Chinchilla Calvo (clchinchilla(--at--)gmail.com). Download the
- Tagged entries of the Simple
English Wikipedia. 3517 randomly
selected entries manually tagged with NER categories (NONE, LOC, ORG,
PER). Download as a compressed plain text file [tgz].
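
The behaviour quoted in the tfidfwrap entry (IDF for a term computed from a corpus, and keywords ordered by tf-idf for a document) can be sketched as follows. This is a minimal Python illustration, not the API of tfidf or tfidfwrap: the names `idf` and `keywords`, the pre-tokenized input, and the smoothed IDF formula log((1 + N) / (1 + df)) are assumptions chosen for the example.

```python
import math
from collections import Counter


def idf(term, corpus):
    """Smoothed inverse document frequency of term.

    corpus is a list of documents, each a list of tokens (an assumption).
    """
    doc_freq = sum(1 for doc in corpus if term in doc)
    return math.log((1 + len(corpus)) / (1 + doc_freq))


def keywords(document, corpus, stopwords=frozenset()):
    """Return (term, tf-idf) pairs for document, highest score first."""
    counts = Counter(t for t in document if t not in stopwords)
    total = sum(counts.values())
    # Term frequency is the relative count within the document.
    scores = {t: (c / total) * idf(t, corpus) for t, c in counts.items()}
    # Sort by descending score; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    corpus = [["the", "cat"], ["the", "dog"], ["the", "bird"]]
    for term, score in keywords(["the", "cat", "cat"], corpus):
        print(term, round(score, 3))
```

In this toy corpus "the" occurs in every document, so its smoothed IDF is zero and it ranks below "cat". Passing a stopword set removes such terms before scoring, which mirrors the role of the stopword lists distributed with the corpus files.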