Friday, July 18, 2008
Accepted papers of ACM CIKM 2006
November 6-11, 2006
http://sa1.sice.umkc.edu/cikm2006/
Sheraton Crystal City Hotel, Arlington, VA 22202
Sponsored by: ACM SIGIR and SIGWEB
*****************************************************************************
Number of submissions: 537
We accepted 15% as full papers and 10% as poster papers.
Accepted full papers: 81
Accepted poster papers: 56
Accepted Full Papers
---------------------------
Automatic Computation of Semantic Proximity Using Taxonomic Knowledge
Cai-Nicolas Ziegler
Kai Simon
Georg Lausen
Secure search in enterprise webs: tradeoffs in efficient implementation for
document level security
Peter Bailey
David Hawking
Brett Matson
Performance Thresholding in Practical Text Classification
Hinrich Schuetze
Efficient Model Selection for Regularized Linear Discriminant Analysis
Jieping Ye
Tao Xiong
Qi Li
Ravi Janardan
Jinbo Bi
Vladimir Cherkassky
Chandra Kambhamettu
Ranking Web Objects from Multiple Communities
Le Chen
Lei Zhang
Feng Jing
Ke-Feng Deng
Wei-Ying Ma
Window Join Approximation over Data Streams with Importance Semantics
Qiang Zhu
Wen Chi Hou
Adegoke Ojewole
Ranking Robustness: A Novel Framework to Predict Query Performance
Yun Zhou
W. Bruce Croft
Annotation Propagation Revisited for Key Preserving Views
Floris Geerts
Gao Cong
Wenfei Fan
Text Classification Improved through Multigram Models
Dou Shen
Improving Novelty Detection for General Topics Using Sentence Level
Information Patterns
Xiaoyan Li
An Integer Programming Approach for Frequent Itemset Hiding
Vassilios Verykios
Aris Gkoulalas-Divanis
Vector and Matrix Operations Programmed with UDFs in a Relational DBMS
Carlos Ordonez
Javier Garcia-Garcia
Exploiting Asymmetry in Hierarchical Topic Extraction
Sreenivas Gollapudi
Rina Panigrahy
3DString: A Feature String Kernel for 3D Object Classification on Voxelized
Data
Karsten Borgwardt
Johannes Aßfalg
Hans-Peter Kriegel
KDDCS: A Load-Balanced In-Network Data-Centric Storage Scheme for Sensor
Networks
Mohamed Aly
Kirk Pruhs
Panos K. Chrysanthis
Efficient Processing of Complex Similarity Queries in RDBMS through Query
Rewriting
Caetano Traina
Agma Traina
Marcos Rodrigues Vieira
Adriano Siqueira Arantes
Christos Faloutsos
Movie review mining and summarization
Li Zhuang
Feng Jing
Xiao-Yan Zhu
Lei Zhang
Capturing Community Search Expertise for Personalized Web Search using
Snippet-Indexes
Oisin Boydell
Barry Smyth
Validating Associations in Biological Databases
Francisco M Couto
Pedro M Coutinho
Mario Silva
Query Optimization using Restructured Views
Rada Chirkova
Fereidoon Sadri
Estimating Corpus Size via Queries
Ravi Kumar
Andrew Tomkins
Andrei Broder
Marcus Fontoura
Vanja Josifovski
Rajeev Motwani
Ying Xu
Rina Panigrahy
Shubha Nabar
Concept Similarity Mining without Frequency Information from Domain
Describing Taxonomies
Jong Wook Kim
K. Selcuk Candan
In Search of Meaning for Time Series Subsequence Clustering: Matching
Algorithms Based on a New Distance Measure
Dina Goldin
Ricardo Mardales
George Nagy
Distributed Spatio-Temporal Similarity Search
Demetrios Zeinalipour-Yazti
Song Lin
Dimitrios Gunopulos
Concept Frequency Distribution in Biomedical Text Summarization
Lawrence Reeve
Hyoil Han
Saya V. Nagori
Jonathan Yang
Tami Schwimmer
Ari D. Brooks
Mining Compressed Commodity Workflows From Massive RFID Data Sets
Hector Gonzalez
Jiawei Han
Xiaolei Li
Topic Evolution and Social Interactions: How Authors effect Research
Ding Zhou
Xiang Ji
Hongyuan Zha
C. Lee Giles
Improve query I/O performance by permuting and refining block request
sequences
Xiaoyu Wang
Mitch Cherniack
Mining Blog Stories Using Community-based and Temporal Clustering
Arun Qamra
Belle Tseng
Edward Chang
Concept-based Document Readability in Domain Specific Information Retrieval
Xin Yan
Dawei Song
Xue Li
Privacy Preserving Sequential Pattern Mining In Distributed Databases
Vishal Kapoor
Pascal Poncelet
Francois Trousset
M. Teisseire
Cache-Oblivious Nested-Loop Joins
Bingsheng He
Qiong Luo
A Dictionary for Approximate String Search and Longest Prefix Search
Sreenivas Gollapudi
Rina Panigrahy
Discovering and Exploiting Keyword and Attribute-Value Co-occurrences to
Improve P2P Routing Indices
Matthias Bender
Sebastian Michel
Nikos Ntarmos
Peter Triantafillou
Gerhard Weikum
Christian Zimmer
A Document-Centric Approach to Static Index Pruning in Text Retrieval
Systems
Stefan Buettcher
Charles Clarke
Effective and Efficient Classification on a Search-Engine Model
Kunal Punera
Aris Anagnostopoulos
Andrei Broder
SaLSa: Computing the Skyline without Scanning the Whole Sky
Marco Patella
Ilaria Bartolini
Paolo Ciaccia
Query Result Ranking over E-commerce Web Databases
Weifeng Su
Adaptive Non-linear Clustering in Data Streams
Ankur Jain
Zhihua Zhang
Edward Chang
Constrained Subspace Skyline Computation
Evangelos Dellis
Ilya Vladimirskiy
Bernhard Seeger
Yannis Theodoridis
Akrivi Vlachou
An Approximate Multi-Word Matching Algorithm for Robust Document Retrieval
Atsuhiro Takasu
Document Re-ranking Using Cluster Validation and Label Propagation
Lingpeng Yang
Donghong Ji
Guodong Zhou
Classification spanning correlated data streams
Rong She
Yabo Xu
Ke Wang
Jian Pei
Processing Relaxed Skylines in PDMS Using Distributed Data Summaries
Katja Hose
Christian Lemke
Kai-Uwe Sattler
Structure-Based Querying of Proteins Using Wavelets
Parthasarathy Srinivasan
Keith Marsolo
Kotagiri Ramamohanarao
A combination of trie-trees and inverted files for the indexing of set-valued
attributes
Manolis Terrovitis
Spyros Passas
Panos Vassiliadis
Timos Sellis
Efficient Join Processing over Uncertain Data
Sarvjeet Singh
Reynold Cheng
Yuni Xia
Sunil Prabhakar
Rahul Shah
Jeffrey Vitter
Optimisation methods for ranking functions with multiple parameters
Stephen Robertson
Michael Taylor
Hugo Zaragoza
Nick Craswell
Chris Burges
On GMAP
Stephen Robertson
A Probabilistic Relevance Propagation Model for Hypertext Retrieval
Azadeh Shakery
Chengxiang Zhai
Summarizing Local Context to Personalize Global Web Search
Paul-Alexandru Chirita
Wolfgang Nejdl
Claudiu Firan
Describing Differences between Databases
Heiko Müller
Johann-Christoph Freytag
Ulf Leser
Voting for Candidates: Adapting Data Fusion Techniques for an Expert Search
Task
Craig Macdonald
Iadh Ounis
Finding Highly Correlated Pairs Efficiently with Powerful Pruning
Jian Zhang
Joan Feigenbaum
Investigating the Exhaustivity Dimension in Content-Oriented XML Element
Retrieval Evaluation
Paul Ogilvie
A Study on the Effects of Personalization and Task Information on Implicit
Feedback Performance
Ryen White
Diane Kelly
POLESTAR - Collaborative Knowledge Management and Sensemaking Tools for
Intelligence Analysts
Nicholas Pioch
John Everett
Incremental Hierarchical Clustering of Text Documents
Nachiketa Sahoo
Jamie Callan
Ramayya Krishnan
George Duncan
Rema Padman
Multi-Evidence, Multi-Criteria, Lazy Associative Document Classification
Adriano Veloso
Wagner Meira Jr.
Marco Cristo
Mohammed Zaki
Marcos Goncalves
Coupling Feature Selection and Machine Learning Methods for Navigational
Query Identification
Yumao Lu
Xin Li
Fuchun Peng
Nawaaz Ahmed
Utility Scoring of Product Reviews
Zhu Zhang
Balaji Varadarajan
Task-based Process Know-how Reuse and Proactive Information Delivery in
TaskNavigator
Oleg Rostanin
Harald Holz
Takeshi Suzuki
Kaoru Maeda
Andreas Dengel
Katsumi Kanasaki
On the Structural Properties of Massive Telecom Call Graphs: Findings and
Implications
Dipanjan Chakraborty
Gautam Das
Siva Gurumurthy
Koustuv Dasgupta
Sougata Mukherjea
Anupam Joshi
Amit A. Nanavati
Processing Range-Constrained Distance Queries and Searching Nearest
Neighbors on Wavelet Synopses over Multiple Streams
Ming-Syan Chen
Hao-Ping Hung
Term Context Models for Information Retrieval
Jeremy Pickens
Andrew MacFarlane
A Fast and Robust Method for Web Page Template Detection and Removal
Altigran Silva
Edleno Moura
Joao Cavalcanti
Karane Vieira
Juliana Freire
Nick Pinto
Bayesian Adaptive User Profiling with Explicit & Implicit Feedback
Philip Zigoris
Yi Zhang
Evaluation by comparing result sets in context
Paul Thomas
David Hawking
A Data Stream Language and System Designed for Power and Extensibility
Yijian Bai
Hetal Thakkar
Chang Luo
Haixun Wang
Carlo Zaniolo
A System for Query-Specific Document Summarization
Ramakrishna Varadarajan
Vagelis Hristidis
Noun Phrase Semantic Interpretation with Cross-linguistic Evidence
Roxana Girju
Efficiently Clustering Transactional Data with Weighted Coverage Density
Hua Yan
keke Chen
Ling Liu
Joonsoo Bae
Incorporating Query Difference for Learning Retrieval Functions in World
Wide Web Search
Hongyuan Zha
Zhaohui Zheng
Haoying Fu
Gordon Sun
Heuristic Containment Check of Partial Tree-Pattern Queries in the Presence
of Index Graphs
Dimitri Theodoratos
Stefanos Souldatos
Pawel Placek
Timos Sellis
Theodore Dalamagas
Tracking Dragon-Hunters with Language Models
Anton Leuski
Victor Lavrenko
TRIPS and TIDES: New Algorithms for Tree Mining
Parthasarathy Srinivasan
Shirish Tatikonda
Tahsin Kurc
Estimating Average Precision with Incomplete and Imperfect Judgments
Javed Aslam
Emine Yilmaz
Knowing a Web Page by the Company It Keeps
Xiaoguang Qi
Brian Davison
Designing Semantics-Preserving Cluster Representatives for Scientific Input
Conditions
Aparna Varde
Mohammed Maniruzzaman
Elke Rundensteiner
Carolina Ruiz
David Brown
Richard Sisson
Eigen-Trend: Trend Analysis in the Blogosphere based on Singular Value
Decompositions
Yun Chi
Junichi Tatemura
Belle Tseng
Pruning Strategies for Mixed-Mode Querying
Vo Anh
Alistair Moffat
Accepted Poster Papers
-----------------------------
Retrieval Evaluation With Incomplete Relevance Data: A Comparative Study of
Three Measures
Leif Grönqvist
Per Ahlgren
Semi-automatic Annotation and MPEG-7 Authoring of Dance Videos
KANNAN RAJKUMAR
Balakrishnan Ramadoss
Resource-Aware Kernel Density Estimators over Streaming Data
Christoph Heinz
Bernhard Seeger
An Efficient One-Phase Holistic Twig Join Algorithm for XML Data
Zhewei Jiang
Cheng Luo
Wen Chi Hou
Qiang Zhu
Chi-Fang Wang
Representing Documents with Named Entities for Story Link Detection (SLD)
Chirag Shah
W. Bruce Croft
David Jensen
Query Taxonomy Generation for Web Search
ChenMing Hung
Pu-Jeng Cheng
Multi-Task Text Segmentation and Alignment Based on Weighted Mutual
Information
Bingjun Sun
Ding Zhou
Hongyuan Zha
John Yen
Approximate Reverse k-Nearest Neighbor Queries in General Metric Spaces
Peer Kröger
Elke Achtert
Christian Böhm
Peter Kunath
Matthias Renz
Alexey Pryakhin
Estimation, Sensitivity, and Generalization in Parameterized Retrieval
Models
Donald Metzler
A structure-oriented relevance feedback method from XML retrieval
Lobna Hlaoua
Karen Sauvagnat
Mohand Boughanem
Q-Rank: Re-Ranking Search Results Using Query Logs
Silviu Cucerzan
Ziming Zhuang
Integration of Cluster Ensemble and EM based Text Mining for Microarray
Gene Cluster Identification and Annotation
Xiaohua Hu
Xiaodan Zhang
Xiaohua Zhou
Best-k Queries on Database Systems
Tao Tao
Chengxiang Zhai
Mapping directories and OWL ontologies with AROMA
Jérôme David
Fabrice GUILLET
Henri BRIAND
Practical Private Data Matching Deterrent to Spoofing Attacks
Yanjiang Yang
Robert Deng
Feng Bao
The Visual Funding Navigator: Analysis of the NSF Funding Information
Shixia Liu
Nan Cao
Hao Lv
Hui Su
Towards Interactive Indexing for Large Chinese Calligraphic Character
Databases
Yi Zhuang
Yueting Zhuang
Qing Li
Lei Chen
Combining Classifiers to Organize Online Databases
Juliana Freire
Luciano Barbosa
Introduction to a new Farsi Stemmer
Alireza Mokhtaripour
Matching and Evaluation of Disjunctive Predicates for Data Stream Sharing
Richard Kuntschke
Alfons Kemper
Mining Coherent Patterns from Heterogeneous Microarray Data
Xiang Zhang
Wei Wang
Probabilistic Document-Context Based Relevance Feedback with Limited
Relevance Judgments
H. C. Wu
Robert W. P. Luk
K. F. Wong
K. L. Kwok
Rank Synopses for Efficient Time Travel on the Web Graph
Klaus Berberich
Srikanta Bedathur
Gerhard Weikum
A Neighborhood-Based Approach for Clustering of Linked Document Collections
Ralitsa Angelova
Stefan Siersdorfer
Adapting Association Patterns for Text Categorization: Weaknesses and
Enhancements
Tieyun Qian
Hui Xiong
Yuanzhen Wang
Enhong Chen
A Comparative Study on Classifying the Functions of Web Page Blocks
Xiangye Xiao
Qiong Luo
Xing Xie
Wei-Ying Ma
Effective and Efficient Similarity Search in Time Series
Andrea Tagarelli
Sergio Greco
Massimiliano Ruffolo
The Query Vector Document Model
Fabrizio Silvestri
Diego Puppin
Ranking in Context using Vector Spaces
Massimo Melucci
Hierarchical, Perceptron-like Learning for Ontology Based Information
Extraction
Yaoyong Li
Kalina Bontcheva
Hamish Cunningham
Direct Comparison of Commercial and Academic Retrieval System: an initial
study
Yefei Peng
Daqing He
Boosting Relevance Model Performance with Query Term Dependence
Koji Eguchi
W. Bruce Croft
Amnesic Online Synopses for Moving Objects
Michalis Potamias
Kostas Patroumpas
Timos Sellis
Collaborative Filtering in Dynamic Usage Environments
Olfa Nasraoui
Jeff Cerwinske
Carlos Rojas
Fabio Gonzalez
Multi-Query Optimization of Sliding Window Aggregates by Schedule
Synchronization
Lukasz Golab
Kumar Gaurav Bijay
M. Tamer Ozsu
Robust Periodicity Detection Algorithms
Parthasarathy Srinivasan
Sameep Mehta
Soundararajan Srinivasan
Search Result Summarization and Disambiguation via Contextual Dimensions
Sachindra Joshi
Krishna P Chitrapura
Raghuram Krishnapuram
PEPX: A Query-Friendly Probabilistic XML Database
Yi Chen
Te Li
Qihong Shao
Processing Information Intent via Weak Labeling
Anthony Tomasic
John Zimmerman
Isaac Simmons
Maximizing the sustained throughput of distributed continuous queries
Themis Palpanas
Ioana Stanoi
George Mihaila
Christian Lang
Query-specific clustering of Search Results based on Document-Context
Similarity Scores
Edward Dang
Integrated RFID Data Modeling: An Approach for Querying Physical Objects in
Pervasive Computing
Shaorong Liu
Fusheng Wang
Peiya Liu
Modeling Performance-Driven Workload Characterization of Web Search Systems
Claudine Badue
Ricardo Baeza-Yates
Artur Ziviani
Nivio Ziviani
Berthier Ribeiro-Neto
Constructing Better Document and Query Models with Markov Chains
Guihong Cao
Jian-Yun Nie
Jing Bai
IR Principles for Content-based Indexing and Retrieval of Brain Images
Bing Bai
Paul Kantor
Nicu Cornea
Deborah Silver
Exploring Feature Selection for Multi-Label Text Classification using
Ranked Retrieval Measures
J. Scott Olsson
Douglas Oard
Improving Query Translation with Confidence Estimation for Cross Language
Information Retrieval
Youssef Kadri
HUX: A Schema-centric Approach for Updating XML Views
Ling Wang
Elke Rundensteiner
Murali Mani
Ming Jiang
Pisa : Progressive Mining of Sequential Patterns
Ming-Syan Chen
Jian-Chih Ou
Jen-Wei Huang
Chi-Yao Tseng
Mining Multiple Private Databases using a Privacy Preserving kNN Classifier
Li Xiong
Subramanyam Chitti
Ling Liu
Pseudo-Anchor Text Extraction for Searching Vertical Objects
Shuming Shi
Zaiqing Nie
Ji-Rong Wen
Fei Xing
Mingjie Zhu
Continuous Keyword Search on Multiple Text Streams
Vagelis Hristidis
Oscar Valdivia
Michail Vlachos
Philip Yu
Information Retrieval from Relational Databases using Semantic Queries
Anand Ranganathan
Zhen Liu
Filtering or Adapting: Two Strategies to Exploit Noisy Parallel Corpora for
CLIR
Lixin Shi
Jian-Yun Nie
Efficient Mining of Max Frequent Patterns in a Generalized Environment
Donghui Zhang
Daniel Kunkle
Gene Cooperman
Measuring the Meaning in Time Series Clustering of Text Search Queries
Kristina Klinkner
Bing Liu
Rosie Jones
Friday, July 11, 2008
Tutorials and Reviews on Boosting
- A short introduction to boosting by Freund and Schapire, 1999. An introduction to the theory and application of boosting.
- AdaBoost in action by Freund. An applet that shows how AdaBoost behaves during the training phase.
- Schapire's "boosting approach to machine learning".
- Ron Meir and Gunnar Ratsch's introduction to boosting and leveraging.
Reading Lists
I've highlighted the most important papers, even if they are earlier in the progression.
The AdaBoost Approach
- A decision-theoretic generalization of on-line learning and an application to boosting. Freund and Schapire 1995/7. You have to look at this paper, but you don't have to read it.
- Improved Boosting Algorithms Using Confidence-rated Predictions. Schapire and Singer 1999. Sections 1-4 are an absolute must read. Very concise and packed with useful interpretations.
- The boosting approach to machine learning: An overview. Schapire 2002. Section 6. Supplement with the original Friedman, Hastie, Tibshirani paper if desired. Describes an alternative and gentler loss (a.k.a. potential, potential loss, cost, etc) function.
- Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods. Schapire et al. 1998. Sections 1,5,6. Understanding the idea is more important than the actual proofs.
- Boosting Algorithms as Gradient Descent. Mason et al. 2000. Similar in spirit to the view of Friedman, Hastie, and Tibshirani 1998. Sections 1 and 2 develop the AnyBoost framework, which is a helpful generalization of AdaBoost. (A minimal code sketch of plain AdaBoost follows this list.)
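As a concrete reference point for the papers above, here is a minimal sketch (mine, not from any of the papers) of discrete AdaBoost with decision stumps; all names are made up for illustration:

    # Minimal discrete AdaBoost with decision stumps; illustrative sketch only.
    import numpy as np

    def train_adaboost(X, y, n_rounds=50):
        """X: (n, d) feature matrix; y: labels in {-1, +1}."""
        n, d = X.shape
        w = np.full(n, 1.0 / n)              # distribution over examples
        ensemble = []
        for _ in range(n_rounds):
            best = None                      # stump minimizing weighted error
            for j in range(d):
                for thr in np.unique(X[:, j]):
                    for sign in (1, -1):
                        pred = sign * np.where(X[:, j] > thr, 1, -1)
                        err = w[pred != y].sum()
                        if best is None or err < best[0]:
                            best = (err, j, thr, sign)
            err, j, thr, sign = best
            err = min(max(err, 1e-10), 1 - 1e-10)
            alpha = 0.5 * np.log((1 - err) / err)   # the usual choice of alpha
            pred = sign * np.where(X[:, j] > thr, 1, -1)
            w *= np.exp(-alpha * y * pred)          # upweight mistakes
            w /= w.sum()
            ensemble.append((alpha, j, thr, sign))
        return ensemble

    def predict(ensemble, X):
        return np.sign(sum(a * s * np.where(X[:, j] > t, 1, -1)
                           for a, j, t, s in ensemble))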
The Boost By Majority Approach
- Boosting a weak learning algorithm by majority. Freund 1990/5. The first part of section 2 (before 2.1) describes the "boost by majority game".
- Improved Boosting Algorithms Using Confidence-rated Predictions. Schapire and Singer 1999. Sections 1-4 are an absolute must read. Very concise and packed with useful interpretations.
- An adaptive version of the boost by majority algorithm. Freund 1999/2000. IMHO, the biggest jump in boosting since Schapire's original algorithm. There are two main parts to the paper: 1) infinitely many iterations with infinitely small movements and 2) setting alpha by satisfying the constraints: a) expected value of $h(x)y$ is 0 w.r.t. the new weighting and b) the average difference in potential. If you understand these two ideas and their motivations (from Schapire and Singer 99 and Freund 1990/95) the algorithm falls out naturally.
- Drifting games. Schapire 2000. A very nice generalization of the boost by majority paradigm to the more "natural" space. This interpretation is also useful for understanding BrownBoost.
- Continuous Drifting games. Freund and Opper 2002. An extension of Schapire's drifting games to continuous domains. Both BrownBoost and NormalBoost are natural consequences of this work.
The Bregman Divergences, Geometry, and Optimization Approach
- Boosting as Entropy Projection. Kivinen and Warmuth 1999. An interesting and very different approach that minimizes convex divergence measures.
- Logistic Regression, AdaBoost and Bregman Distances. Collins, Schapire, Singer 2000. Similar to Kivinen and Warmuth; however, it provides a few very practical results for converting AdaBoost to LogitBoost using an alternative loss function. Also ties in well with some of the statistical approaches to boosting.
- Linear Programming Boosting via Column Generation. Demiriz, Bennett, and Shawe-Taylor 2002. Introduction of LPBoost and general relationship to linear programming and boosting. A very complete discussion, showing how LPBoost can be used in many of the classic statistical settings.
- Totally Corrective Boosting Algorithms that Maximize the Margin. Warmuth, Liao, and Ratsch 2006. Puts LPBoost and the geometric approach (the Bregman divergences of Kivinen and Warmuth) together. Instead of minimizing the correlation of the hypothesis and the subsequent weighting of the training set, TotalBoost minimizes the correlations between all previous hypotheses and the next weighting. This leads to sparser sets of hypotheses and a very nifty iteration bound.
- Boosting Algorithms for Maximizing the Soft Margin. Warmuth, Glocer, and Ratsch 2007. Simple moral: TotalBoost overfits, SoftBoost is noise resistant. SoftBoost comes with an iteration bound and achieves noise resistance using slack variables, as in the SVM literature.
Statistical Approaches to Boosting
- Additive Logistic Regression: a Statistical View of Boosting. Friedman, Hastie, Tibshirani 1999. An interesting paper that casts boosting into the classic log-likelihood model that all of statistics follows. The resulting algorithmic contribution, LogitBoost, can be implemented in an AdaBoost framework (LogLossBoost in JBoost) using an alternative weighting. Be sure to read the rejoinders, as some of them gave me a chuckle.
- Friedman has a paper on \epsilon boosting which sounds very interesting, though I have yet to read it.
- Tong Zhang has a lot of very good papers. I'll list the highlights when I finish reading them... this may take a while.
- The question of consistency was a big deal in the statistics community and was approached by a large variety of people. They showed variants of boosting were consistent, but never AdaBoost. I believe Peter Bartlett gets credit for finally showing that AdaBoost is consistent.
- Evidence Contrary to the Statistical View of Boosting. Mease and Wyner 2008. This paper is extremely important to read if you plan on doing research in boosting for practical purposes. It is not bulletproof (I'm sure someone will be able to find issues with it), but it is a well thought out and executed idea that uses empirical evidence instead of theory to drive boosting research. Some of the details of the paper may be proven slightly incorrect, but I believe that the overall idea will stand up to scrutiny.
Misc Background Reading
- The strength of weak learnability. Schapire 1990. Cool algorithm, cool analysis, cool write up. Unfortunately rendered useless by majority vote methods.
- Any of the tutorials. There are a lot of really good boosting tutorials that cover the whole spectrum of applied to theoretical.
- The alternating decision tree learning algorithm. Freund and Mason 1999. An interesting way to use boosting with decision trees/stumps.
- Any of the empirical comparisons of different voting methods.
- The ideas of bagging and arcing are frequently mentioned in boosting papers. These are ideas of the extremely well respected statistician Leo Breiman. While they haven't caught on in the same way that boosting has (see "Boosting the margin..." for an explanation) they are interesting ideas and bagging (along with bootstrapping) is widely used for variance reduction of any estimator.
Boosting Software
In a non-accidental ordering:
- JBoost. Powerful, extendible, highly tunable, efficient, and integratable into your own software. A bit of a learning curve, but well worth it.
- BoosTexter (an implementation is also available in JBoost). Schapire's code for the text-categorization system BoosTexter.
- GBM for R by Greg Ridgeway. A powerful and standardized module used mostly by statisticians.
- WEKA. User friendly.
Wednesday, July 9, 2008
Wednesday, July 2, 2008
Useful Tools for NLP
Information Retrieval
- Lemur/Indri
The Lemur Toolkit for Language Modeling and Information Retrieval
http://www.lemurproject.org/
Indri: Lemur's latest search engine
- Lucene/Nutch
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.
http://lucene.apache.org/
http://www.nutch.org/
- WGet
GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely-used Internet protocols. It is a non-interactive command-line tool, so it may easily be called from scripts, cron jobs, terminals without X Window support, etc.
http://www.gnu.org/software/wget/wget.html
Natural Language Processing
- EGYPT: A Statistical Machine Translation Toolkit
http://www.clsp.jhu.edu/ws99/projects/mt/
- GIZA++ (Statistical Machine Translation)
http://www.fjoch.com/GIZA++.html
GIZA++ is an extension of the program GIZA (part of the SMT toolkit EGYPT), which was developed by the Statistical Machine Translation team during the 1999 summer workshop at the Center for Language and Speech Processing, Johns Hopkins University (CLSP/JHU). GIZA++ includes a lot of additional features. The extensions of GIZA++ were designed and written by Franz Josef Och.
- PHARAOH (Statistical Machine Translation)
http://www.isi.edu/licensed-sw/pharaoh/
A beam search decoder for phrase-based statistical machine translation models.
- OpenNLP
http://opennlp.sourceforge.net/
- MINIPAR by Dekang Lin (Univ. of Alberta, Canada)
MINIPAR is a broad-coverage parser for the English language. An evaluation with the SUSANNE corpus shows that MINIPAR achieves about 88% precision and 80% recall with respect to dependency relationships. MINIPAR is very efficient: on a Pentium II 300 with 128MB memory, it parses about 300 words per second.
http://www.cs.ualberta.ca/~lindek/minipar.htm
- WordNet
http://wordnet.princeton.edu/
WordNet is an online lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory. English nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing one underlying lexical concept. Different relations link the synonym sets.
WordNet was developed by the Cognitive Science Laboratory at Princeton University under the direction of Professor George A. Miller (Principal Investigator).
- HowNet
http://www.keenage.com/
HowNet is an on-line common-sense knowledge base unveiling inter-conceptual relations and inter-attribute relations of concepts as connoted in lexicons of Chinese and their English equivalents.
- Statistical Language Modeling Toolkit
http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html
The CMU-Cambridge Statistical Language Modeling toolkit is a suite of UNIX software tools to facilitate the construction and testing of statistical language models.
- SRI Language Modeling Toolkit
www.speech.sri.com/projects/srilm/
SRILM is a toolkit for building and applying statistical language models (LMs), primarily for use in speech recognition, statistical tagging and segmentation. It has been under development in the SRI Speech Technology and Research Laboratory since 1995.
- ReWrite Decoder
http://www.isi.edu/licensed-sw/rewrite-decoder/
The ISI ReWrite Decoder, Release 1.0.0a, by Daniel Marcu and Ulrich Germann. It is a program that translates from one natural language into another using statistical machine translation.
- GATE (General Architecture for Text Engineering)
http://gate.ac.uk/
A Java library for text engineering.
Machine Learning
- YASMET: Yet Another Small MaxEnt Toolkit (Statistical Machine Learning)
http://www.fjoch.com/YASMET.html
- LibSVM
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
LIBSVM is integrated software for support vector classification (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR) and distribution estimation (one-class SVM). It supports multi-class classification. (A usage sketch follows at the end of this Machine Learning section.)
- SVM Light
Developed by Thorsten Joachims (of Cornell) during his time at the University of Dortmund; it became the best-known SVM package after LibSVM. Open source, written in C, and also used for ranking problems.
http://svmlight.joachims.org/
- CLUTO
http://www-users.cs.umn.edu/~karypis/cluto/
A software package for clustering low- and high-dimensional datasets.
- CRF++
http://chasen.org/~taku/software/CRF++/
Yet Another CRF toolkit for segmenting/labelling sequential data.
CRFs (Conditional Random Fields) evolved from HMMs/MEMMs and are widely used in IE, IR, and NLP.
- SVM Struct
http://www.cs.cornell.edu/People/tj/svm_light/svm_struct.html
SVMstruct is a Support Vector Machine (SVM) algorithm for predicting multivariate outputs. It performs supervised learning by approximating a mapping
h: X --> Y
using labeled training examples (x1,y1), ..., (xn,yn).
Unlike regular SVMs, however, which consider only univariate predictions like in classification and regression, SVMstruct can predict complex objects y like trees, sequences, or sets. Examples of problems with complex outputs are natural language parsing, sequence alignment in protein homology detection, and Markov models for part-of-speech tagging.
SVMstruct can be thought of as an API for implementing different kinds of complex prediction algorithms. Currently, we have implemented the following learning tasks:
SVMmulticlass: Multi-class classification. Learns to predict one of k mutually exclusive classes. This is probably the simplest possible instance of SVMstruct and serves as a tutorial example of how to use the programming interface.
SVMcfg: Learns a weighted context free grammar from examples. Training examples (e.g. for natural language parsing) specify the sentence along with the correct parse tree. The goal is to predict the parse tree of new sentences.
SVMalign: Learning to align sequences. Given examples of how sequence pairs align, the goal is to learn the substitution matrix as well as the insertion and deletion costs of operations so that one can predict alignments of new sequences.
SVMhmm: Learns a Markov model from examples. Training examples (e.g. for part-of-speech tagging) specify the sequence of words along with the correct assignment of tags (i.e. states). The goal is to predict the tag sequences for new sentences.
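To give a flavor of how these packages are driven, here is a hedged usage sketch for LibSVM, assuming its bundled Python interface (svmutil) is on the path; the data is made up:

    # Toy LibSVM session via the svmutil module shipped with LibSVM.
    from svmutil import svm_train, svm_predict

    # Labels and sparse feature vectors ({feature_index: value}).
    y = [1, 1, -1, -1]
    x = [{1: 0.9, 2: 0.8}, {1: 0.7, 2: 0.9},
         {1: -0.8, 2: -0.7}, {1: -0.9, 2: -0.6}]

    # Train a C-SVC (-s 0) with an RBF kernel (-t 2) and C = 1 (-c 1).
    model = svm_train(y, x, '-s 0 -t 2 -c 1')

    # Predict (here on the training set); returns labels, accuracy, values.
    labels, acc, vals = svm_predict(y, x, model)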
Misc
- Notepad++
An open-source editor with keyword support for dozens of languages including C#, Perl and CSS; its features rival the latest UltraEdit and Visual Studio .NET.
http://notepad-plus.sourceforge.net
- WinMerge
For comparing text content, e.g. finding the differences between two versions of a program.
winmerge.sourceforge.net/
- OpenPerlIDE
An open-source Perl editor with built-in compilation and line-by-line debugging.
open-perl-ide.sourceforge.net/
PS: The best editor I have ever seen is still VS.NET: each function has a +/- marker for expand/collapse; it supports copy/cut/paste by region; Ctrl+C/Ctrl+X/Ctrl+V can operate on a whole line at a time; Ctrl+K+C/Ctrl+K+U comment/uncomment multiple lines; and more... Visual Studio .NET is really kool :D
- Berkeley DB
http://www.sleepycat.com/
Berkeley DB is not a relational database; it is what is called an embedded database: in the client/server model, its client and server share a single address space. Since the database originally grew out of the file system, it is more like a dictionary-style database of key-value pairs. Moreover, database files can be serialized to disk, so it is not limited by memory size. BDB has an offshoot, Berkeley DB XML, which is an XML database (storing data as XML files?). BDB has been embedded in their own products by companies including Microsoft, Google, HP, Ford, and Motorola.
Berkeley DB (libdb) is a programmatic toolkit that provides embedded database support for both traditional and client/server applications. It includes b+tree, queue, extended linear hashing, fixed, and variable-length record access methods, transactions, locking, logging, shared memory caching, database recovery, and replication for highly available systems. DB supports C, C++, Java, PHP, and Perl APIs.
It turns out that at a basic level Berkeley DB is just a very high performance, reliable way of persisting dictionary style data structures - anything where a piece of data can be stored and looked up using a unique key. The key and the value can each be up to 4 gigabytes in length and can consist of anything that can be crammed in to a string of bytes, so what you do with it is completely up to you. The only operations available are "store this value under this key", "check if this key exists" and "retrieve the value for this key" so conceptually it's pretty simple - the complicated stuff all happens under the hood.
case study:
Ask Jeeves uses Berkeley DB to provide an easy-to-use tool for searching the Internet.
Microsoft uses Berkeley DB for the Groove collaboration software
AOL uses Berkeley DB for search tool meta-data and other services.
Hitachi uses Berkeley DB in its directory services server product.
Ford uses Berkeley DB to authenticate partners who access Ford's Web applications.
Hewlett Packard uses Berkeley DB in several products, including storage, security and wireless software.
Google uses Berkeley DB High Availability for Google Accounts.
Motorola uses Berkeley DB to track mobile units in its wireless radio network products.
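To make the dictionary-style interface described above concrete, here is a small sketch using the third-party bsddb3 Python bindings for Berkeley DB (my illustration, not from the original material):

    # The three basic operations described above, via the bsddb3 bindings.
    from bsddb3 import db

    d = db.DB()
    d.open("example.db", None, db.DB_BTREE, db.DB_CREATE)  # b+tree access method

    d.put(b"color", b"blue")              # "store this value under this key"
    exists = d.get(b"color") is not None  # "check if this key exists"
    value = d.get(b"color")               # "retrieve the value for this key"

    d.close()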
- LaTeX
LATEX, written as LaTeX in plain text, is a document preparation system for the TeX typesetting program.
It offers programmable desktop publishing features and extensive facilities for automating most aspects of typesetting and desktop publishing, including numbering and cross-referencing, tables and figures, page layout, bibliographies, and much more. LaTeX was originally written in 1984 by Leslie Lamport and has become the dominant method for using TeX—few people write in plain TeX anymore. The current version is LaTeX2ε.
A Chinese distribution can be found at http://www.ctex.org
http://learn.tsinghua.edu.cn:8080/2001315450/comp.html by Wang Yin
- EditPlus
http://www.editplus.com/
EditPlus is an Internet-ready 32-bit text editor, HTML editor and programmers editor for Windows. While it can serve as a good replacement for Notepad, it also offers many powerful features for Web page authors and programmers.
The latest version of EditPlus is currently 2.21; the BrE and AmE spell checkers must be downloaded and installed as separate packages.
- GVim: Vi IMproved
http://www.vim.org/index.php
Vim is an advanced text editor that seeks to provide the power of the de-facto Unix editor 'Vi', with a more complete feature set. It's useful whether you're already using vi or using a different editor. Users of Vim 5 should consider upgrading to Vim 6, which is greatly enhanced since Vim 5. Vim is often called a "programmer's editor," and so useful for programming that many consider it an entire IDE. It's not just for programmers, though. Vim is perfect for all kinds of text editing, from composing email to editing configuration files.
Ordinary Windows users can download it from ftp://ftp.vim.org/pub/vim/pc/gvim64.exe
- Cygwin: GNU + Cygnus + Windows
http://www.cygwin.com/
Cygwin is a Linux-like environment for Windows. It consists of two parts: a DLL (cygwin1.dll), which acts as a Linux API emulation layer providing substantial Linux API functionality, and a collection of tools which provide the Linux look and feel.
- MinGW: Minimalistic GNU for Windows
http://www.mingw.org/
MinGW: A collection of freely available and freely distributable Windows specific header files and import libraries combined with GNU toolsets that allow one to produce native Windows programs that do not rely on any 3rd-party C runtime DLLs.
For compiling and porting Unix/Linux software under Windows. Cygwin essentially emulates a POSIX-compliant layer on top of the Windows system (the library file is cygwin1.dll), while MinGW uses Windows's own library (msvcrt.dll) to implement some functionality conforming to the POSIX spec and is not fully POSIX-compliant. MinGW is in fact a branch of Cygwin; since it does not implement an emulation layer for the Linux API, its overhead is somewhat lower than Cygwin's.
- CutePDF Writer
http://www.cutepdf.com
Portable Document format (PDF) is the de facto standard for the secure and reliable distribution and exchange of electronic documents and forms around the world. CutePDF Writer (formerly CutePDF Printer) is the free version of commercial PDF creation software. CutePDF Writer installs itself as a "printer subsystem". This enables virtually any Windows applications (must be able to print) to create professional quality PDF documents - with just a push of a button!
Compared with Acrobat, one big advantage is that it is free. Conversion of typical Word charts and formulas also works well: what you see is what you get, haha. You may need a ps2pdf converter; a download link is provided on the site.
- R
http://www.r-project.org/
R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.
R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.
One of R's strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.
R is available as Free Software under the terms of the Free Software Foundation's GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.
An algorithm for suffix stripping
Originally published in Program, 14 no. 3, pp 130-137, July 1980.
1. Introduction
Removing suffixes by automatic means is an operation which is especially useful in the field of information retrieval. In a typical IR environment, one has a collection of documents, each described by the words in the document title and possibly by words in the document abstract. Ignoring the issue of precisely where the words originate, we can say that a document is represented by a vector of words, or terms. Terms with a common stem will usually have similar meanings, for example:

CONNECT
CONNECTED
CONNECTING
CONNECTION
CONNECTIONS

Frequently, the performance of an IR system will be improved if term groups such as this are conflated into a single term. This may be done by removal of the various suffixes -ED, -ING, -ION, -IONS to leave the single term CONNECT. In addition, the suffix stripping process will reduce the total number of terms in the IR system, and hence reduce the size and complexity of the data in the system, which is always advantageous.
Many strategies for suffix stripping have been reported in the literature (e.g. 1-6). The nature of the task will vary considerably depending on whether a stem dictionary is being used, whether a suffix list is being used, and of course on the purpose for which the suffix stripping is being done. Assuming that one is not making use of a stem dictionary, and that the purpose of the task is to improve IR performance, the suffix stripping program will usually be given an explicit list of suffixes, and, with each suffix, the criterion under which it may be removed from a word to leave a valid stem. This is the approach adopted here. The main merits of the present program are that it is small (less than 400 lines of BCPL), fast (it will process a vocabulary of 10,000 different words in about 8.1 seconds on the IBM 370/165 at Cambridge University), and reasonably simple. At any rate, it is simple enough to be described in full as an algorithm in this paper. (The present version in BCPL is freely available from the author. BCPL is itself available on a wide range of different computers, but anyone wishing to use the program should have little difficulty in coding it up in other programming languages.) Given the speed of the program, it would be quite realistic to apply it to every word in a large file of continuous text, although for historical reasons we have found it convenient to apply it only to relatively small vocabulary lists derived from continuous text files.
In any suffix stripping program for IR work, two points must be borne in mind. Firstly, the suffixes are being removed simply to improve IR performance, and not as a linguistic exercise. This means that it would not be at all obvious under what circumstances a suffix should be removed, even if we could exactly determine the suffixes of a word by automatic means.
Perhaps the best criterion for removing suffixes from two words W1 and W2 to produce a single stem S, is to say that we do so if there appears to be no difference between the two statements `a document is about W1' and `a document is about W2'. So if W1=`CONNECTION' and W2=`CONNECTIONS' it seems very reasonable to conflate them to a single stem. But if W1=`RELATE' and W2=`RELATIVITY' it seems perhaps unreasonable, especially if the document collection is concerned with theoretical physics. (It should perhaps be added that RELATE and RELATIVITY are conflated together in the algorithm described here.) Between these two extremes there is a continuum of different cases, and given two terms W1 and W2, there will be some variation in opinion as to whether they should be conflated, just as there is with deciding the relevance of some document to a query. The evaluation of the worth of a suffix stripping system is correspondingly difficult.
The second point is that with the approach adopted here, i.e. the use of a suffix list with various rules, the success rate for the suffix stripping will be significantly less than 100% irrespective of how the process is evaluated. For example, if SAND and SANDER get conflated, so most probably will WAND and WANDER. The error here is that the -ER of WANDER has been treated as a suffix when in fact it is part of the stem. Equally, a suffix may completely alter the meaning of a word, in which case its removal is unhelpful. PROBE and PROBATE for example, have quite distinct meanings in modern English. (In fact these would not be conflated in our present algorithm.) There comes a stage in the development of a suffix stripping program where the addition of more rules to increase the performance in one area of the vocabulary causes an equal degradation of performance elsewhere. Unless this phenomenon is noticed in time, it is very easy for the program to become much more complex than is really necessary. It is also easy to give undue emphasis to cases which appear to be important, but which turn out to be rather rare. For example, cases in which the root of a word changes with the addition of a suffix, as in DECEIVE/DECEPTION, RESUME/RESUMPTION, INDEX/INDICES occur much more rarely in real vocabularies than one might at first suppose. In view of the error rate that must in any case be expected, it did not seem worthwhile to try and cope with these cases.
It is not obvious that the simplicity of the present program is any demerit. In a test on the well-known Cranfield 200 collection7 it gave an improvement in retrieval performance when compared with a very much more elaborate program which has been in use in IR research in Cambridge since 1971(2,6). The test was done as follows: the words of the titles and abstracts in the documents were passed through the earlier suffix stripping system, and the resulting stems were used to index the documents. The words of the queries were reduced to stems in the same way, and the documents were ranked for each query using term coordination matching of query against document. From these rankings, recall and precision values were obtained using the standard recall cutoff method. The entire process was then repeated using the suffix stripping system described in this paper, and the results were as follows:
earlier system          present system
--------------          --------------
precision  recall       precision  recall
    0      57.24            0      58.60
   10      56.85           10      58.13
   20      52.85           20      53.92
   30      42.61           30      43.51
   40      42.20           40      39.39
   50      39.06           50      38.85
   60      32.86           60      33.18
   70      31.64           70      31.19
   80      27.15           80      27.52
   90      24.59           90      25.85
  100      24.59          100      25.85

Clearly, the performance is not very different. The important point is that the earlier, more elaborate system certainly performs no better than the present, simple system.
(This test was done by prof. C.J. van Rijsbergen.)
2. The Algorithm
To present the suffix stripping algorithm in its entirety we will need a few definitions.

A consonant in a word is a letter other than A, E, I, O or U, and other than Y preceded by a consonant. (The fact that the term `consonant' is defined to some extent in terms of itself does not make it ambiguous.) So in TOY the consonants are T and Y, and in SYZYGY they are S, Z and G. If a letter is not a consonant it is a vowel.
A consonant will be denoted by c, a vowel by v. A list ccc... of length greater than 0 will be denoted by C, and a list vvv... of length greater than 0 will be denoted by V. Any word, or part of a word, therefore has one of the four forms:
CVCV ... C
CVCV ... V
VCVC ... C
VCVC ... V

These may all be represented by the single form

[C]VCVC ... [V]

where the square brackets denote arbitrary presence of their contents. Using (VC)m to denote VC repeated m times, this may again be written as

[C](VC)m[V].

m will be called the measure of any word or word part when represented in this form. The case m = 0 covers the null word. Here are some examples:
m=0    TR, EE, TREE, Y, BY.
m=1    TROUBLE, OATS, TREES, IVY.
m=2    TROUBLES, PRIVATE, OATEN, ORRERY.

The rules for removing a suffix will be given in the form
(condition) S1 -> S2

This means that if a word ends with the suffix S1, and the stem before S1 satisfies the given condition, S1 is replaced by S2. The condition is usually given in terms of m, e.g.

(m > 1) EMENT ->

Here S1 is `EMENT' and S2 is null. This would map REPLACEMENT to REPLAC, since REPLAC is a word part for which m = 2.
The `condition' part may also contain the following:
*S - the stem ends with S (and similarly for the other letters).
*v* - the stem contains a vowel.
*d - the stem ends with a double consonant (e.g. -TT, -SS).
*o - the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP).
And the condition part may also contain expressions with and, or and not, so that
(m>1 and (*S or *T))

tests for a stem with m>1 ending in S or T, while

(*d and not (*L or *S or *Z))

tests for a stem ending with a double consonant other than L, S or Z. Elaborate conditions like this are required only rarely.
In a set of rules written beneath each other, only one is obeyed, and this will be the one with the longest matching S1 for the given word. For example, with

SSES -> SS
IES  -> I
SS   -> SS
S    ->

(here the conditions are all null) CARESSES maps to CARESS since SSES is the longest match for S1. Equally CARESS maps to CARESS (S1=`SS') and CARES to CARE (S1=`S').
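(The following Python sketch is mine, not part of the original paper; it is one way to make the definitions above concrete.)

    # Consonants/vowels, the measure m, and the starred conditions from above.
    def is_consonant(word, i):
        ch = word[i]
        if ch in "aeiou":
            return False
        if ch == "y":   # Y preceded by a consonant counts as a vowel
            return i == 0 or not is_consonant(word, i - 1)
        return True

    def measure(stem):
        """m in the form [C](VC)m[V]: the number of v-to-c transitions."""
        forms = "".join("c" if is_consonant(stem, i) else "v"
                        for i in range(len(stem)))
        return forms.count("vc")

    def contains_vowel(stem):            # *v*
        return any(not is_consonant(stem, i) for i in range(len(stem)))

    def ends_double_consonant(stem):     # *d
        return (len(stem) >= 2 and stem[-1] == stem[-2]
                and is_consonant(stem, len(stem) - 1))

    def ends_cvc(stem):                  # *o (second c not W, X or Y)
        return (len(stem) >= 3
                and is_consonant(stem, len(stem) - 3)
                and not is_consonant(stem, len(stem) - 2)
                and is_consonant(stem, len(stem) - 1)
                and stem[-1] not in "wxy")

    assert measure("tree") == 0 and measure("trouble") == 1
    assert measure("private") == 2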
In the rules below, examples of their application, successful or otherwise, are given on the right in lower case. The algorithm now follows:
Step 1a
SSES -> SS         caresses -> caress
IES  -> I          ponies   -> poni
                   ties     -> ti
SS   -> SS         caress   -> caress
S    ->            cats     -> cat
Step 1b
(m>0) EED -> EE    feed      -> feed
                   agreed    -> agree
(*v*) ED  ->       plastered -> plaster
                   bled      -> bled
(*v*) ING ->       motoring  -> motor
                   sing      -> sing

If the second or third of the rules in Step 1b is successful, the following is done:
AT -> ATE          conflat(ed) -> conflate
BL -> BLE          troubl(ed)  -> trouble
IZ -> IZE          siz(ed)     -> size
(*d and not (*L or *S or *Z))
   -> single letter
                   hopp(ing)   -> hop
                   tann(ed)    -> tan
                   fall(ing)   -> fall
                   hiss(ing)   -> hiss
                   fizz(ed)    -> fizz
(m=1 and *o) -> E  fail(ing)   -> fail
                   fil(ing)    -> file

The rule to map to a single letter causes the removal of one of the double letter pair. The -E is put back on -AT, -BL and -IZ, so that the suffixes -ATE, -BLE and -IZE can be recognised later. This E may be removed in step 4.
Step 1c
(*v*) Y -> I       happy -> happi
                   sky   -> sky

Step 1 deals with plurals and past participles. The subsequent steps are much more straightforward.
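(Continuing the illustrative sketch from above, and again not part of the original paper, Steps 1a and 1c translate almost directly into code.)

    # Step 1a and Step 1c as code; contains_vowel() is from the earlier
    # sketch. Rules are tried longest-suffix-first, and only one applies.
    def step1a(word):
        if word.endswith("sses"):
            return word[:-2]             # SSES -> SS
        if word.endswith("ies"):
            return word[:-2]             # IES  -> I
        if word.endswith("ss"):
            return word                  # SS   -> SS
        if word.endswith("s"):
            return word[:-1]             # S    ->
        return word

    def step1c(word):
        if word.endswith("y") and contains_vowel(word[:-1]):
            return word[:-1] + "i"       # (*v*) Y -> I
        return word

    assert step1a("caresses") == "caress" and step1a("ponies") == "poni"
    assert step1c("happy") == "happi" and step1c("sky") == "sky"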
Step 2
(m>0) ATIONAL -> ATE   relational -> relate
(m>0) TIONAL -> TION conditional -> condition
rational -> rational
(m>0) ENCI -> ENCE valenci -> valence
(m>0) ANCI -> ANCE hesitanci -> hesitance
(m>0) IZER -> IZE digitizer -> digitize
(m>0) ABLI -> ABLE conformabli -> conformable
(m>0) ALLI -> AL radicalli -> radical
(m>0) ENTLI -> ENT differentli -> different
(m>0) ELI -> E vileli -> vile
(m>0) OUSLI -> OUS analogousli -> analogous
(m>0) IZATION -> IZE vietnamization -> vietnamize
(m>0) ATION -> ATE predication -> predicate
(m>0) ATOR -> ATE operator -> operate
(m>0) ALISM -> AL feudalism -> feudal
(m>0) IVENESS -> IVE decisiveness -> decisive
(m>0) FULNESS -> FUL hopefulness -> hopeful
(m>0) OUSNESS -> OUS callousness -> callous
(m>0) ALITI -> AL formaliti -> formal
(m>0) IVITI -> IVE sensitiviti -> sensitive
(m>0) BILITI -> BLE sensibiliti -> sensible

The test for the string S1 can be made fast by doing a program switch on the penultimate letter of the word being tested. This gives a fairly even breakdown of the possible values of the string S1. It will be seen in fact that the S1-strings in step 2 are presented here in the alphabetical order of their penultimate letter. Similar techniques may be applied in the other steps.
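(The switch on the penultimate letter can be pictured as a lookup table; the sketch below is mine, not the paper's. Within a group, longer suffixes come first so that the longest match wins.)

    # Step 2 rules grouped by the second-to-last letter of the word, so only
    # a few suffixes are checked per word; measure() is the earlier sketch.
    STEP2_RULES = {
        "a": [("ational", "ate"), ("tional", "tion")],
        "c": [("enci", "ence"), ("anci", "ance")],
        "e": [("izer", "ize")],
        "l": [("abli", "able"), ("alli", "al"), ("entli", "ent"),
              ("eli", "e"), ("ousli", "ous")],
        # ... remaining groups omitted; see the table above.
    }

    def step2(word):
        for suffix, repl in STEP2_RULES.get(word[-2:-1], []):
            if word.endswith(suffix):
                stem = word[: -len(suffix)]
                if measure(stem) > 0:
                    return stem + repl
                return word              # longest match found; rule not obeyed
        return word

    assert step2("relational") == "relate" and step2("rational") == "rational"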
Step 3
(m>0) ICATE -> IC triplicate -> triplic
(m>0) ATIVE -> formative -> form
(m>0) ALIZE -> AL formalize -> formal
(m>0) ICITI -> IC electriciti -> electric
(m>0) ICAL -> IC electrical -> electric
(m>0) FUL -> hopeful -> hope
(m>0) NESS -> goodness -> good
Step 4
(m>1) AL    ->             revival     -> reviv
(m>1) ANCE  ->             allowance   -> allow
(m>1) ENCE  ->             inference   -> infer
(m>1) ER    ->             airliner    -> airlin
(m>1) IC    ->             gyroscopic  -> gyroscop
(m>1) ABLE  ->             adjustable  -> adjust
(m>1) IBLE  ->             defensible  -> defens
(m>1) ANT   ->             irritant    -> irrit
(m>1) EMENT ->             replacement -> replac
(m>1) MENT  ->             adjustment  -> adjust
(m>1) ENT   ->             dependent   -> depend
(m>1 and (*S or *T)) ION -> adoption   -> adopt
(m>1) OU    ->             homologou   -> homolog
(m>1) ISM   ->             communism   -> commun
(m>1) ATE   ->             activate    -> activ
(m>1) ITI   ->             angulariti  -> angular
(m>1) OUS   ->             homologous  -> homolog
(m>1) IVE   ->             effective   -> effect
(m>1) IZE   ->             bowdlerize  -> bowdler

The suffixes are now removed. All that remains is a little tidying up.
Step 5a
(m>1) E ->             probate -> probat
                       rate    -> rate
(m=1 and not *o) E ->  cease   -> ceas
Step 5b
(m > 1 and *d and *L) -> single letter
                       controll -> control
                       roll     -> roll
The algorithm is careful not to remove a suffix when the stem is too short, the length of the stem being given by its measure, m. There is no linguistic basis for this approach. It was merely observed that m could be used quite effectively to help decide whether or not it was wise to take off a suffix. For example, in the following two lists:
   list A        list B
   ------        ------
   RELATE        DERIVATE
   PROBATE       ACTIVATE
   CONFLATE      DEMONSTRATE
   PIRATE        NECESSITATE
   PRELATE       RENOVATE

-ATE is removed from the list B words, but not from the list A words. This means that the pairs DERIVATE/DERIVE, ACTIVATE/ACTIVE, DEMONSTRATE/DEMONSTRABLE, NECESSITATE/NECESSITOUS, will conflate together. The fact that no attempt is made to identify prefixes can make the results look rather inconsistent. Thus PRELATE does not lose the -ATE, but ARCHPRELATE becomes ARCHPREL. In practice this does not matter too much, because the presence of the prefix decreases the probability of an erroneous conflation.
Complex suffixes are removed bit by bit in the different steps. Thus GENERALIZATIONS is stripped to GENERALIZATION (Step 1), then to GENERALIZE (Step 2), then to GENERAL (Step 3), and then to GENER (Step 4). OSCILLATORS is stripped to OSCILLATOR (Step 1), then to OSCILLATE (Step 2), then to OSCILL (Step 4), and then to OSCIL (Step 5). In a vocabulary of 10,000 words, the reduction in size of the stem was distributed among the steps as follows:
Suffix stripping of a vocabulary of 10,000 words
------------------------------------------------
Number of words reduced in step 1:  3597
                           "    2:   766
                           "    3:   327
                           "    4:  2424
                           "    5:  1373
Number of words not reduced:        3650

The resulting vocabulary of stems contained 6370 distinct entries. Thus the suffix stripping process reduced the size of the vocabulary by about one third.
References
1. LOVINS, J.B. Development of a Stemming Algorithm. Mechanical Translation and Computational Linguistics, 11 (1) March 1968, pp 23-31.
2. ANDREWS, K. The Development of a Fast Conflation Algorithm for English. Dissertation for the Diploma in Computer Science, Computer Laboratory, University of Cambridge, 1971.
3. PETRARCA, A.E. and LAY W.M. Use of an automatically generated authority list to eliminate scattering caused by some singular and plural main index terms. Proceedings of the American Society for Information Science, 6 1969 pp 277-282.
4. DATTOLA, Robert T. FIRST: Flexible Information Retrieval System for Text. Webster N.Y: Xerox Corporation, 12 Dec 1975.
5. COLOMBO, D.S. and NIEHOFF R.T. Final report on improved access to scientific and technical information through automated vocabulary switching. NSF Grant No. SIS75-12924 to the National Science Foundation.
6. DAWSON, J.L. Suffix Removal and Word Conflation. ALLC Bulletin, Michaelmas 1974, pp 33-46.
7. CLEVERDON, C.W., MILLS J. and KEEN M. Factors Determining the Performance of Indexing Systems 2 vols. College of Aeronautics, Cranfield 1966.
Tuesday, July 1, 2008
The Most Popular Programming Languages
The latest issue of Computer magazine (Feb. 2007, Vol. 40, No. 2) carries an article titled "Developers Shift to Dynamic Programming Languages". It briefly surveys the current development and use of dynamic languages, and cites a survey from Tiobe Software on the popularity of programming languages. According to that survey, the twenty most popular languages currently rank as follows:
1. Java
2. C
3. C++
4. Visual Basic
5. PHP
6. Perl
7. C#
8. Python
9. JavaScript
10. Ruby
11. SAS
12. Delphi
13. PL/SQL
14. D
15. ABAP
16. Lisp/Scheme
17. Ada
18. Cobol
19. Pascal
20. Transact/SQL
Of these, numbers 1-4, 7, 12, 14, 17, 19, and 20 are static languages; the rest are dynamic languages. Among the dynamic languages, PHP is currently the most widely used.
Another interesting part of the article is the view of a professor named Les Hatton:
Computing has proven to be a fashion industry with little or no relationship with engineering. Many new programming approaches are just something new to try before something newer comes along. Dynamic languages are just the current software-development fashion. They will appear, hang around for a while, and then disappear. This is what happens when fashion dictates progress rather than engineering concepts such as measurement, validation, root-cause analysis, and defect prevention.
Languages By Keyboard
While idly browsing the web just now, I came across this page: Languages By Keyboard.
It claims you can tell which programming language someone uses from the wear on their keyboard. Excerpted below:
What's your type, my friend?
- C Programmer: Their '*' and ';' keys are worn out.
- C++ Programmer: Their '>' and '<' keys are worn out.
- Lisp Programmer: Their '(' and ')' keys are worn out.
- OCaml Programmer: Their ';' key is worn out.
- ALGOL Programmer: Their ':' and '=' keys are worn out.
- Forth Programmer: Their ':' and ';' keys are worn out.
- x86 ASM Programmer: Their '%' key is worn out.
- Haskell Programmer: Their '-' and '>' keys are worn out.
- Ruby Programmer: Their 'e', 'n' and 'd' keys are worn out.
- Python Programmer: Their tab key is worn out.
- Smalltalk Programmer: Their ':' key is worn out.
- SQL Programmer: Their 's', 'e', 'l', 'c', and 't' keys are worn out. (Actually, 'a', 'n', 'd')
- Ada Programmer: Their 'i' and 's' keys are worn out.
- Java Programmer: Their 'p', 'u', 'b', 'l', 'i', and 'c' keys are worn out.
- Brainfuck Programmer: Their '>', '<' and '+' keys are worn out. The letter keys are untouched.
- Perl Programmer: Their punctuation keys (all of them) are worn out. And the letter keys are crisp and clean.
- COBOL Programmer: Their caps-lock key is worn out.
- VHDL Programmer: Their '<' and '=' keys are worn out.
- Fortran Programmer: Their shift keys and 'c' keys are worn out.
- Fortran 95 Programmer: Their shift keys and '1' keys are worn out.
- Erlang Programmer: Their '.', '-' and '>' keys are worn out.
- G-code Programmer: No keys are worn, because there's a rubber keyboard protector (with metal shavings embedded in it).
- XML Programmer: Their '>', '<', and '/' keys are worn out.
- sh Programmer: The "Ctrl" key is next to the 'a' key.
- Newbie Programmer: Their F1 key is worn out.
- APL Programmer: They have an APL keyboard, and their APL Selectric typewriter ball is worn out.
- PHP Programmer: The key mapped to '$' is worn out.
- Documentation Editor (using Word): The 'e', 'Ctrl', and 'Alt' keys are worn out.
- Experienced Documentation Editor (using Word): The 'Ctrl' and 's' keys are worn out.
- Documentation Editor (using LaTeX): The '\' key is completely worn out.
- Data-Entry Clerk: The entire numeric keypad is worn out.
- Unlucky Programmer: The 'm', 'o', 'n', 's', 't', 'e', 'r', '.', and 'c' keys are worn out.
- Slacking Programmer: The 'n' key is worn out.
- Slacking, Opinionated Programmer: The 'n' key and the '!' key are worn out.
- Slacking, Opinionated, Obnoxious Programmer: The 'n' key, the '!' key, and the caps-lock key are worn out.
- GWBASIC Programmer: The '?' key and all the number keys are worn out.
- Windows(tm) Programmer: The Ctrl, Alt and Delete keys are worn out.
- Unsure Programmer: The Ctrl + 'z' keys are worn out.