Sunday, August 31, 2008

Part of Speech Taggers

Freely downloadable
Stanford POS tagger
Loglinear tagger in Java (by Kristina Toutanova)
MBT: Memory-based Tagger
Based on TiMBL
TreeTagger
A decision-tree-based tagger from the University of Stuttgart (Helmut Schmid). It's language independent, but comes complete with parameter files for English, German, Italian, Dutch, French, Old French, Spanish, Bulgarian, and Russian. (Linux, Sparc-Solaris, Windows, and Mac OS X versions. Binary distribution only.) Page has links to sites where you can run it online.
SVMTool
POS Tagger based on SVMs (uses SVMlight). LGPL.
ACOPOST (formerly ICOPOST)
Open source C taggers originally written by Ingo Schröder. Implements maximum entropy, HMM trigram, and transformation-based learning. C source available under the GNU General Public License.
MXPOST: Adwait Ratnaparkhi's Maximum Entropy part of speech tagger
Java POS tagger. A sentence boundary detector (MXTERMINATOR) is also included. The original version ran only on JDK 1.1; a later version worked with JDK 1.3+. Class files, not source.
fnTBL
A fast and flexible implementation of Transformation-Based Learning in C++. Includes a POS tagger, but also NP chunking and general chunking models.
mu-TBL
An implementation of a Transformation-based Learner (a la Brill) by Torbjörn Lager, usable for POS tagging and other tasks. Written in Prolog. Web demo also available.
YamCha
SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++ open source. Won CoNLL 2000 shared task. (Less automatic than a specialized POS tagger for an end user.)
QTAG Part of speech tagger
An HMM-based Java POS tagger from Birmingham U. (Oliver Mason). English and German parameter files. [Java class files, not source.]
The TOSCA/LOB tagger.
Currently available for MS-DOS only. But the decision to make this famous system available is very interesting from an historical perspective, and for software sharing in academia more generally. LOB tag set.
Brill's Transformation-based learning Tagger
A symbolic tagger, written in C.
Original Xerox Tagger
A Common Lisp HMM tagger available by ftp.
Lingua-EN-Tagger
Perl POS tagger by Maciej Ceglowski and Aaron Coburn. Version 0.11. (A bigram HMM tagger.)
Free, but require registration
TATOO
The ISSCO tagger. HMM tagger. Need to register to download.
PoSTech Korean morphological analyzer and tagger
Online registration.
TnT - A Statistical Part-of-Speech Tagger
Trainable for various languages, comes with English and German pre-compiled models. Runs on Solaris and Linux.
Usable by email or on the web, but not distributed freely
Memory-based tagger
From ILK group, Catholic University Brabant (Jakub Zavrel/Walter Daelemans). Does Dutch, English, Spanish, Swedish, Slovene. Other MBL demos are also available.
Birmingham tagger
Accepts only plain ASCII email message contents. The tagset used is similar to the Brown/LOB/Penn set.
CLAWS tagger
The UCREL CLAWS tagger is available for trial use on the web. (It's limited to 300 words though -- this site is more of an advertisement for licensing the real thing -- available as software for Suns or as a paid service.) You can also find info on CLAWS tagsets, though that page doesn't seem to link to the C7 tagset.
The AMALGAM tagger
The AMALGAM Project also has various other useful resources, in particular a web guide to different tag sets in common use. The tagging is actually done by a (retrained) version of the Brill tagger (q.v.).
Xerox XRCE MLTT Part Of Speech Taggers
Tags any of 14 languages (European and Arabic), online on the web.
Portuguese taggers on the web: Projecto Natura and a QTAG adaptation.
Not free
Lingsoft
Lingsoft in Finland has (symbolic) analysis tools for many European languages. More information can be obtained by emailing info@lingsoft.fi. There is an online demo.
Conexor
Conexor in Finland has demonstrations of EngCG-style taggers and parsers, for English, Swedish, and Spanish.
Xerox
Xerox has morphological analyzers and taggers for many languages. There are demos of some of their tools on the web. More information can be obtained by contacting Daniella Russo.
Infogistics
Infogistics, an Edinburgh spinoff, has a tagger and NP/verb group chunker available commercially, including an evaluation version.
No longer available
LT POS and LT TTT
The Edinburgh Language Technology Group tagger and text tokenizer (and sentence splitter) were binary-only Solaris tools that no longer seem to be available.