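Friday, January 7, 2011

Converting Between Full-Width and Half-Width Characters

On Windows, Chinese characters and full-width characters each occupy two bytes, and both bytes fall in the extended range, ASCII Chart 2 (codes 0x80–0xFF). We can rely on this to check, character by character, whether the user has typed Chinese or full-width characters. In fact, the first byte of a full-width character is always 0xA3, and the second byte is the code of the corresponding half-width character plus 0x80 (the space is an exception). For example, half-width A is 65, so full-width A is 0xA3 (first byte) followed by 193 (second byte, 0x80 + 65). For a Chinese character, the first byte is also greater than 0x80 (for example, '阿' is 176 162), and when we detect Chinese we simply do not convert it.
The full-width space is a special case: both bytes are the same, 0xA1 0xA1.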
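
The rule above can be sketched in a few lines of Python. This is only a minimal illustration, assuming the input is a GBK/GB2312 byte string as used on Chinese Windows; the function name `fullwidth_to_halfwidth` is hypothetical and not from any library.

```python
def fullwidth_to_halfwidth(data: bytes) -> bytes:
    """Convert full-width ASCII variants in GBK/GB2312 bytes to half-width.

    Assumption: `data` is a well-formed double-byte string. Chinese
    characters and any other two-byte codes are copied through unchanged.
    """
    out = bytearray()
    i = 0
    while i < len(data):
        first = data[i]
        # Single-byte (ASCII) character, or a dangling lead byte at the end: copy as-is.
        if first < 0x80 or i + 1 >= len(data):
            out.append(first)
            i += 1
            continue
        second = data[i + 1]
        if first == 0xA1 and second == 0xA1:
            # Full-width space is the special case 0xA1 0xA1.
            out.append(0x20)
        elif first == 0xA3 and 0xA1 <= second <= 0xFE:
            # Full-width character: second byte is the half-width code + 0x80.
            out.append(second - 0x80)
        else:
            # Chinese or other double-byte character: keep both bytes.
            out += data[i:i + 2]
        i += 2
    return bytes(out)


if __name__ == "__main__":
    # Full-width 'A', full-width 'b', full-width space, then the character 0xB0 0xA2, then '!'.
    sample = b"\xa3\xc1\xa3\xe2\xa1\xa1\xb0\xa2!"
    print(fullwidth_to_halfwidth(sample))
```

As far as I know, a few positions in the 0xA3 row of GB2312 (the currency sign, for instance) are not exact full-width copies of their ASCII counterparts, so a real converter may want to special-case them.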

Friday, November 19, 2010

A Few Good Papers on Query Analysis

"Learning Query Intent from Regularized Click Graphs"
"Understanding User Goals in Web Search"
"Functional Faceted Web Query Analysis"
"Coupling Feature Selection and Machine Learning Methods for Navigational Query Identification"
"Characterizing Query Intent From Sponsored Search Clickthrough Data"
"Survey and evaluation of query intent detection methods"

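Friday, March 20, 2009

[Repost] Zipf's Law, the Principle of Least Effort, and Internet Slang

There is a joke that goes like this. Winter was approaching, and a group of Native Americans asked their newly appointed chief whether the coming winter would be cold. The chief said yes, so they began frantically gathering firewood and food to prepare. Seeing this, the chief worried that a mild winter would ruin his authority, so he phoned the local weather station and asked a meteorologist whether the winter would be cold. The meteorologist said it would. Still uneasy, the chief asked how he could be so sure. The meteorologist answered: "Haven't you seen? The Native Americans are preparing for winter like mad."

Similarly, linguistics has a famous empirical rule. Mathematicians believe in it because they think it was established by linguists; linguists believe in it because they think it is a mathematical law proven by mathematicians. Of course, this law, proposed by the Harvard linguist Zipf and named after him, is far more reliable than the weather forecasting above. Zipf found that if all the words of a language are sorted by frequency from highest to lowest and their positions in the ranking are recorded, then a word's frequency f and its rank r approximately satisfy f*r = k, where k is a constant.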
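
To see what the relation f*r = k looks like in practice, here is a minimal Python sketch. The function name `rank_frequency` and the toy corpus are assumptions made for illustration; the corpus is constructed so the product is exactly constant, whereas real text only follows the relation approximately.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, frequency, frequency * rank) rows, most frequent first."""
    counts = Counter(tokens)
    # Sort by descending frequency; break ties alphabetically so ranks are deterministic.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, freq, freq * rank)
            for rank, (word, freq) in enumerate(ordered, start=1)]


if __name__ == "__main__":
    # Toy corpus (hypothetical): frequencies 6, 3, 2 at ranks 1, 2, 3.
    toy = ["the"] * 6 + ["of"] * 3 + ["zipf"] * 2
    for row in rank_frequency(toy):
        print(row)
```

On a real corpus, one would tokenize the text, pass the tokens in, and check whether the last column stays roughly flat across ranks.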

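The meaning behind the formula is that, for a given concept, the speaker would like to pick a word that occurs very frequently but is rather vague, while the listener would like to receive a word that occurs rarely and is correspondingly more precise. In the extreme, the speaker would love to express everything under the sun with a single word, while the listener would prefer exactly one word for each concept. In short, each side counts on the other to do the extra work and save itself the trouble. Zipf called this the Principle of Least Effort.

Zipf's law reflects the compromise reached after this bargaining between speaker and listener: only a small number of words can express many meanings and therefore occur very frequently, while the vast majority of words express specific meanings fairly precisely and therefore occur less often.

Has a similar description appeared somewhere else? Yes: the 80/20 rule, or the Pareto distribution. In the end, the Zipf distribution and the Pareto distribution are both power-law distributions.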
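
One way to make the connection concrete, as a sketch only: the symbols W (vocabulary size) and F (the frequency of a randomly chosen word) are assumptions added here and do not appear in the original post. A word of rank r is one for which r words occur at least as often as it does, so rewriting Zipf's relation in terms of frequency gives a Pareto-style tail with exponent 1:

```latex
f \cdot r = k
\;\Longrightarrow\;
r(f) = \frac{k}{f}
\;\Longrightarrow\;
P(F \ge f) = \frac{r(f)}{W} = \frac{k}{W}\, f^{-1}
```

Both statements therefore describe the same power law, read once along the rank axis and once along the frequency axis.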

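Starting from the Principle of Least Effort, let us look at internet slang, for example the online nicknames for the Ford Focus. What do we find?

Setting aside the brand name itself (福克斯, Focus), we can see that nicknames such as "小福" ("Little Fu") and "FKS" also get considerable exposure. It is easy to understand that, from the poster's point of view, these two words have fewer characters than the original brand name and are easier to type. But are they easy to recognize? As someone who is not a car enthusiast, I could not work out what they mean from the words alone. The key point is that, in an automotive context, they point to something quite precise. As supporting evidence, searching Google or Baidu for "FKS 车" ("FKS car") returns results that are overwhelmingly about the Ford Focus.

This small example suggests that the Principle of Least Effort still holds for internet slang, except that the listener is no longer the general audience (many people may find internet nicknames hard to understand) but the members of a particular community. As one of a community's hallmarks, its members, driven by a sense of community, gradually develop a common symbol system, and nicknames are one expression of that system.

Nicknames actually belong to a broader category within the common symbol system: jargon. Whether you say "天王盖地虎" ("the heavenly king covers the earthly tiger") and the other person answers "宝塔镇河妖" ("the pagoda subdues the river demon"), or someone posts "请各位福友帮忙" ("Focus friends, please help") and the reply comes back "你的小福怎么了" ("what's wrong with your Little Fu?"), you know at once that you have found your people.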

Tuesday, March 10, 2009

Books on Information Retrieval (General)

Modern Information Retrieval. R. Baeza-Yates, B. Ribeiro-Neto. Addison-Wesley, 1999. Currently the most widely used and cited.

Information Retrieval: Algorithms and Heuristics. D.A. Grossman, O. Frieder. Springer, 2004. Excellent textbook, #1 or #2 seller on Amazon.

Managing Gigabytes. I.H. Witten, A. Moffat, T.C. Bell. Morgan Kaufmann, 1999. The authority on index construction and compression.

Finding Out About. R. Belew. Cambridge UP, 2001. More suitable for undergraduate classes than other books listed here.

Information Retrieval: A Health and Biomedical Perspective. W.R. Hersh. Springer, 2002. As the title says: a health/biomedical perspective.

TREC: Experiment and Evaluation in Information Retrieval. E.M. Voorhees, D.K. Harman. MIT Press, 2005. A survey of recent research results.

Language Modeling for Information Retrieval. W.B. Croft, J. Lafferty. Springer, 2003. Language models are of increasing importance in IR.

Readings in Information Retrieval. K. Sparck Jones, P. Willett. Morgan Kaufmann, 1997. A collection of classical IR papers.

Recommended Reading for IR Research Students. A. Moffat, J. Zobel, D. Hawking. SIGIR Forum, 39(2), 2005. Not a book, but a collection of seminal papers, more up-to-date than Sparck Jones et al.

Information Storage and Retrieval Systems. G. Kowalski, M.T. Maybury. Springer, 2005. "... takes a system approach, discussing all aspects of an Information Retrieval System."

The Geometry of Information Retrieval. C.J. van Rijsbergen. Cambridge UP, 2004. An ambitious attempt to develop quantum mechanics as a new foundation for IR.

Introduction to Modern Information Retrieval. G.G. Chowdhury. Neal-Schuman, 2003. Intended for students of library and information studies.

Text Information Retrieval Systems. C. Meadow, B. Boyce, D. Kraft. Academic Press, 2000. Also takes a library/information science perspective.

Books on Web Information Retrieval

Mining the Web: Analysis of Hypertext and Semi Structured Data. S. Chakrabarti. Morgan Kaufmann, 2002. The best introduction for web-centric IR.

Google's PageRank and beyond: The science of Search Engine Rankings. Amy N. Langville, Carl D. Meyer. Princeton University Press, 2006. More focused on the algorithms of PageRank, but also covers general web IR.

Modeling the Internet and the Web: Probabilistic Methods and Algorithms. P. Baldi, P. Frasconi, P. Smyth. Wiley, 2003. A bit terse. Recommended for those who have a good foundation in probability theory, but are new to IR.

Online Books - Browsable

Introduction to Information Retrieval. C.D. Manning, P. Raghavan, H. Schütze. Cambridge UP, 2007. Draft. Focuses on algorithms and mathematical foundations without neglecting practical issues in building search systems. Equal coverage of classical IR and newer topics like XML, machine learning techniques and web search engines.

Finding Out About. R. Belew's book (w/o figures and equations), see above.

Information Retrieval. C. J. van Rijsbergen. Butterworths, 1979. The classic. Almost 40 years old, but still worth reading.

Information Retrieval. T. van der Weide. 2004. Introduction to IR and hypertext.

Online Books - PDF

Introduction to Information Retrieval. C.D. Manning, P. Raghavan, H. Schütze. Cambridge UP, 2007.

Information Retrieval in Practice. B. Croft, D. Metzler, T. Strohman. Pearson Education, 2009. (two chapters)

Information Retrieval. C. J. van Rijsbergen. Butterworths, 1979.

Information Retrieval Interaction. P. Ingwersen. Taylor Graham, 1992. Focuses on user interaction in IR.

Information Retrieval: A Survey. Ed Greengrass. 2000. Good survey of "classical" IR, but little or no coverage of recent work (e.g., language models, PageRank, SVMs).

Various tutorials at Mi Islita

Research Centers
CMU (LTI)
Dublin CU
Geneva (Viper)
Glasgow
Helsinki Institute for Information Technology
IBM
Illinois Institute of Technology
Information Retrieval Facility (IRF)
Microsoft Research
NIST
Peking
Pittsburgh
Queen Mary
Sheffield
UIUC
UMASS
U. of Washington

Courses
Berkeley (SIMS)
CMU
Cornell
DePaul
IIT
Johns Hopkins I
Johns Hopkins II
Maryland
MPI
Otago
Princeton
Stanford
Stuttgart
Texas
UMASS
U. of Sunderland


Multimedia Information Retrieval
U. of Stuttgart

Problem Sets / Assignments
Cornell
U. of Massachusetts

Bilkent
DePaul
Georgetown
Minas Gerais
North Texas
Stuttgart
Tennessee

Web Information Retrieval
webir.org
Search Engine Watch
Users' Guide to Web Searching
PageRank

Subareas, Applications, Methods
Chemistry
Information Retrieval & Extraction
Information Retrieval & Machine Learning
Text Mining & Web Mining
INEX: XML retrieval
Geographic Information Retrieval
Music Information Retrieval
Music Information Retrieval (2)
CLIR & Multilingual Information Retrieval
Cross-Language Information Retrieval (CLIR)
Cross-Language Information Retrieval (CLIR) Resources
N-Grams in Information Retrieval
Agent-based Information Retrieval
Audio Information Retrieval
Adversarial Information Retrieval

Conferences
TREC
Cross Language Evaluation Forum (CLEF)
SIGIR 2007 (last), SIGIR 2008 (next)
CIKM 2007, CIKM 2008
WWW 2008, WWW 2009
JCDL 2008, JCDL 2009
RIAO 2004, RIAO 2007
ECIR 2008, ECIR 2009
AIRS 2006, AIRS 2008
SPIRE 2007, SPIRE 2008
Norbert Fuhr's IR conference calendar

Journals
ACM Transactions on Information Systems (TOIS): dblp home
Information Processing and Management (IP&M): dblp home
Information Retrieval: dblp home
International Journal on Digital Libraries: dblp home
Journal of the American Society of Information Science and Technology (JASIST): dblp home
SIGIR Forum: dblp home
Journal of Documentation
D-Lib Magazine
Data & Knowledge Engineering: dblp home
Information Processing Letters: dblp home
Information Research
Information Systems: dblp home
Journal of Intelligent Information Systems: dblp home
Knowledge and Information Systems: dblp home
Foundations and Trends in Information Retrieval: home

Popular Articles
Wikipedia: Information Retrieval
A. Singhal: Modern Information Retrieval: A Brief Overview
S.E. Robertson, K. Sparck Jones: Simple, proven approaches to text retrieval
Bruce Croft: What Do People Want From IR
Information Retrieval on the World Wide Web
Michael Lesk: The Seven Ages of Information Retrieval
Marcia J. Bates: ... Getting Web Information Retrieval Right ...

Software
C. Middleton, R. Baeza-Yates: A Comparison of Open Source Search Engines (contains an up-to-date list of available search engine software)
Doug Oard's list of available text retrieval systems
Avi Rappoport: open source search engines
ht://Dig
MySQL full text search
Swish-e
Text to Matrix Generator, a MATLAB toolbox for indexing, retrieval and other text processing tasks

Collections
U. of Glasgow list of available text retrieval collections
NLP/IR corpus list at NUS
NLP/IR corpus list at Edinburgh
Internet archive (limited availability)
Linguistic Data Consortium

Professional Organizations
ACM SIGIR
BCS IRSG

Other Collections of Information Retrieval Links
ACM SIGIR
David Karger

Other Resources
Glossary (Modern Information Retrieval)
Information retrieval research links @ Search Tools
BUBL: Information Retrieval Links
LSU: Information Retrieval Systems
Open Directory: Information Retrieval Links
UBC: Indexing Resources
IR & Neural Networks, Symbolic Learning, Genetic Algorithms
A stop list (a list of stop words)
IR links (Syracuse): http://web.syr.edu/~diekemar/ir.html
IR links (U. of Tokushima): http://www-a2k.is.tokushima-u.ac.jp/member/kita/NLP/IR.html
IR resources (Mark Sanderson)
Open Directory: Information Retrieval
Chris Manning's NLP resources
Weiguo Patrick Fan's text mining links

Wednesday, December 31, 2008

Upcoming Conferences

A nice navigation page for upcoming conferences on IR, NLP, and other related fields.

http://www.cs.sfu.ca/~bzhou/personal/conference.html