March 20, 2006

Frederick Jelinek's Acceptance Speech

On November 22nd, 2001, Charles University in Prague presented an Honorary Doctorate to Prof. Frederick Jelinek. I would like to point you in the direction of his acceptance speech on that occasion (translated into English from the original Czech). Here's an excerpt to pique your interest:

By now it is certainly obvious that my life was full of detours and compromises. Opportunities for success were afforded me in the United States, the land of freedom, land of work, land of civilization, to quote an old song of the famous Czech clowns Voskovec and Werich.

Posted by anoop at 12:19 PM

October 30, 2005

Respect

That's what we need: some Respect!

David recently posted a link to Statistical Machine Translation Gains Respect, a Computer magazine article on the 2005 NIST Machine Translation evaluation results and the status of contemporary statistical machine translation.

The article does a good job of summarizing interviews with a representative from each of the teams that participated in NIST MTEval 2005. Google, IBM, UMD, NRC, CMU and Edinburgh get interviewed. ISI and Aachen are noticeably missing from the list of interviews.

There is the usual quota of amusing stuff that you find in popular science articles: watch out for the most awkward definition of n-gram yet, and don't miss a quote from a famous Edinburgh researcher saying that "translating to Chinese require systems to have high levels of linguistic knowledge". And breaking news: ``Google has several other [SMT] applications in mind, but won't comment further''.

Nobody else might find it interesting that the article's author, David Geer, lives in Ashtabula, Ohio.

Posted by anoop at 01:57 PM

June 15, 2005

The Embedding by Ian Watson

A true classic: although it was published in the 1970s, the writing is still fresh and the plot has not dated at all. This is a remarkable first novel in the long and distinguished career of Ian Watson.

This is one of a handful of sf novels that bases its speculations on the scientific study of human language, and it well might be the best of the bunch. Ian Watson takes the assumption that the internal structures of human languages reflect the human ability to observe and explain the physical universe, while at the same time he is sensitive to the notion of a specific computational mechanism in the brain that guides first language acquisition. The "Embedding" of the title is a reference to a particular aspect of this computational mechanism.

The protagonist of "The Embedding", Chris Sole, works in a British linguistics research facility where they conduct experiments (which would be illegal in a contemporary setting) on young children, altering their reality by isolating them in strange environments and altering their brains with drugs, evaluating the change in the linguistic structures they produce. Another thread of this novel follows the French anthropologist Pierre Darriand who lives with a tribe deep in the Amazonian forest called the Xemahoa, who distort their reality with ritual drug-taking and who produce, in a way similar to the children in the lab, highly embedded linguistic structures. There is a third thread about aliens who arrive on Earth hoping to trade some of their advanced knowledge for the knowledge collected by Earth linguists: a great premise although somewhat less compelling in execution.

The "Embedding" is a property of recursive rule systems used by linguists to describe certain aspects of natural language. Take, for example, the following noun phrase:

the shares that the broker recommended which were bought

Let's denote the noun phrase "the broker" by the abstract symbol N1, and its associated verb phrase "recommended" by the symbol V1. Similarly, "the shares" is called N2 and "were bought" is called V2:

N2 N1 V1 V2

We assume that we have to keep track of N2 (that is, we cannot discard it) until we see the verb associated with it, V2. This is called an embedding of size 2.

Convince yourself that in the following example, the embedding is of size 2 and not 3:

the mutual fund that had a 4-year term and the shares that the broker recommended which were bought
N3 V3 N2 N1 V1 V2

So can we make the embedding more complex? Here is another example where the embedding is of size 3:

the mutual fund containing the shares that the broker recommended which were bought that had a 4-year term
N3 N2 N1 V1 V2 V3

Can you construct or imagine accepting an example (in the language of your choice) with embedding of 4? How about 5? It is clear that humans have a definite limit on the number of center embeddings they can accept or produce. Ian Watson imagines a method that can modify brain structures to accept embeddings of increasing sizes that normal humans cannot process. Why is this interesting? Computationally a deeper embedding has implications for how certain kinds of recursive patterns can be processed, although going from there to the transcendence of human linguistic ability, including the perception of multiple spatial dimensions, is quite a stretch, and this is what Ian Watson speculates about in this novel.
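The bookkeeping in the examples above can be sketched with a stack. This is only a minimal Python illustration, assuming the N1/V1 labelling used in this post (each noun waits for the verb with the same index); the function name embedding_depth is made up here and is not taken from the novel or from any linguistics toolkit.

```python
def embedding_depth(symbols):
    """Return the largest number of nouns that are waiting for a verb at once."""
    open_nouns = []   # stack of noun indices still waiting for their verb
    deepest = 0
    for symbol in symbols.split():
        kind, index = symbol[0], symbol[1:]
        if kind == "N":
            open_nouns.append(index)
            deepest = max(deepest, len(open_nouns))
        elif kind == "V":
            # A verb must close the most recently opened noun.
            if not open_nouns or open_nouns[-1] != index:
                raise ValueError(f"{symbol} does not close the innermost noun")
            open_nouns.pop()
        else:
            raise ValueError(f"unexpected symbol {symbol}")
    if open_nouns:
        raise ValueError("some nouns never found their verb")
    return deepest


print(embedding_depth("N2 N1 V1 V2"))        # 2
print(embedding_depth("N3 V3 N2 N1 V1 V2"))  # 2
print(embedding_depth("N3 N2 N1 V1 V2 V3"))  # 3
```

The stack is the point: the deeper the embedding, the more unfinished nouns have to be remembered at the same time.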

The use of the Xemahoa's awareness of time is probably an allusion to the famous linguistic/anthropological "discoveries" about the tense system of Hopi. Another possible insider joke is the abduction of an Eskimo Inuit speaker as part of the plot towards the end of the novel (see "The Great Eskimo Vocabulary Hoax" by Geoffrey Pullum).

Unlike previous attempts in science-fiction to deal with the connections between language, thought and perception (see, for example, "The Languages of Pao" by Jack Vance), Ian Watson plays with the various linguistic hypotheses on this topic in his fictional framework by introducing artificial means for changing the brain itself while simultaneously changing the reality experienced by a language learner.

However, the overwhelming cynicism that pervades the book can get oppressive. This does not detract from the book and perhaps it is a function of the time when this book was written. The constant berating of the Americans is one-sided and repetitive, imho. The new face of English Socialism in science-fiction, e.g. the output of authors like Ken MacLeod, seems much gentler in comparison.

For a different take on this novel, read Pamela Sargent's Introduction to "The Embedding". Some other examples of linguistic speculations in sf novels are "Native Tongue" by Suzette Haden Elgin and "The Languages of Pao" by Jack Vance; there is a critical survey of such sf novels in "Aliens and Linguists: Language Study and Science Fiction" by Walter E. Meyers.

Update: Mark Liberman has instituted the Trent Reznor Prize for Tricky Embedding, thanks to the following quote by Reznor from an interview:

"When I look at people that I would like to feel have been a mentor or an inspiring kind of archetype of what I'd love to see my career eventually be mentioned as a footnote for in the same paragraph, it would be, like, Bowie."

Here is a brief unraveling/analysis of the Reznor embedding:

When I look at people (with some characteristics) it would be Bowie (who fits those characteristics)
>> [NP people [RC1 that I feel have been a [NP mentor NP] or [NP archetype of [NP what I would love to see my career mentioned as a footnote for (archetype) or in the same paragraph as (archetype) NP] NP] RC1] NP] <<

NP = noun phrase
RCn = relative clause of embedding n
A1 = the argument of mentioned

%T The Embedding
%A Ian Watson
%I New York: Carroll and Graf Publishing Inc.
%D 1973
%G ISBN: 088184554X (pb)
%P 217
%K science-fiction

Posted by anoop at 02:31 PM

April 15, 2005

The Viterbi Algorithm

G. David Forney Jr. recently posted on arXiv an extremely interesting article on the history and development of the Viterbi algorithm: The Viterbi Algorithm: A Personal History.

The Viterbi algorithm originated as a decoding algorithm for convolutional codes. Since then, however, its utility has been widespread. In particular, in speech recognition and computational linguistics, the Viterbi algorithm is used for decoding the most likely state sequence in Hidden Markov Models, and it is closely related to the Forward-Backward algorithm.

The interesting points from the above article:

  • Viterbi devised the algorithm to help him teach:
    the Viterbi algorithm for convolution codes ... came out of my teaching ... I found information theory difficult to teach, so I started developing some tools.
  • The Viterbi algorithm, when first published, was not known to be related to dynamic programming methods and also not known to provide the optimal or maximum likelihood solution. The original paper states that:
    this decoding algorithm is clearly suboptimal
    It was G. D. Forney, Jr. who later proved that the Viterbi algorithm was an exact recursive algorithm for the shortest path through a trellis diagram. The relationship to dynamic programming then became clear.
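The "best path through a trellis" view can be sketched in a few lines of Python. This is a toy illustration rather than code from Forney's article: the two states, the three words and all the probabilities are invented for the example, and a real decoder would work with log-probabilities to avoid underflow.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely state sequence."""
    # best[t][s] = (probability of the best path ending in state s at time t,
    #               the state that path came from)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Dynamic programming step: only the best way into s is kept.
            prob, prev = max((best[-1][r][0] * trans_p[r][s], r) for r in states)
            column[s] = (prob * emit_p[s][obs], prev)
        best.append(column)
    # Trace back from the best final state.
    prob, state = max((best[-1][s][0], s) for s in states)
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return prob, path


# Invented toy model: two tags and three words.
states = ("NN", "VB")
start_p = {"NN": 0.6, "VB": 0.4}
trans_p = {"NN": {"NN": 0.7, "VB": 0.3}, "VB": {"NN": 0.4, "VB": 0.6}}
emit_p = {
    "NN": {"word": 0.5, "post": 0.4, "bets": 0.1},
    "VB": {"word": 0.1, "post": 0.3, "bets": 0.6},
}

prob, path = viterbi(["word", "post", "bets"], states, start_p, trans_p, emit_p)
print(path)  # ['NN', 'NN', 'VB']
```

Each column of the trellis keeps one survivor per state, which is why the cost grows linearly with the length of the input instead of exponentially.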

The article also provides various places where the Viterbi algorithm has been used in practice, including the Galileo mission to Jupiter in 1992 (it was used to boost the transmission bandwidth when the primary antenna failed to deploy).

Of course, nowadays there are many applications of Viterbi in Computational Linguistics where it is used for many sequence learning tasks, from finding person names or gene names in text, to word segmentation in languages like Chinese, and in Biological Sequence Analysis where it is used to find exon or intron boundaries in DNA sequences.

The article also mentions various relationships between algorithms for "codes on graphs" and Pearl's belief propagation algorithm for Bayesian networks. The following paper is a good reference on this topic (this paper is cited in the above article, but was first pointed out to me by Hassan Ait-Kaci):

S. M. Aji and R. J. McEliece, "The generalized distributive law," IEEE Trans. Inform. Theory, vol. 46, pp. 325-343, Mar. 2000.

Posted by anoop at 02:05 PM

September 13, 2004

Graphviz Introduction

There is a good introduction to graphviz, the graph drawing tool from AT&T, available at the Linux Journal web site.

Graphviz provides a general tool to visualize objects that are otherwise hard to see. One example of how graphviz can be used is in visualizing a forest, which is a compact representation of a whole bunch of trees. It is compact because it does not duplicate common sub-trees. The figure below is one such forest that stores four simple trees.

It is a somewhat unorthodox view of a forest because entire (sub)trees are shown at each node instead of just non-terminals, so the forest as shown has some duplicated nodes (e.g. the four original trees) but it looks prettier. The figure above was produced by running some simple Perl code that I hacked together to convert a set of trees into a forest and store it in a format that can be read by graphviz tools.
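The Perl code itself is not reproduced here, but the general idea can be sketched in Python. This is a rough illustration under my own assumptions, not the script that produced the figure: trees are written as nested tuples, forest_to_dot is a made-up name, identical sub-trees are shared by giving them a single node, and labels are assumed to contain no double quotes. The output is plain text in the DOT language that the graphviz tools read.

```python
def forest_to_dot(trees):
    """Return DOT text for a graph in which identical sub-trees share one node."""
    ids = {}                       # sub-tree -> node id
    lines = ["digraph forest {"]

    def visit(tree):
        if tree in ids:            # sub-tree seen before: reuse its node
            return ids[tree]
        node_id = f"n{len(ids)}"
        ids[tree] = node_id
        label = tree if isinstance(tree, str) else tree[0]
        lines.append(f'  {node_id} [label="{label}"];')
        if not isinstance(tree, str):
            for child in tree[1:]:
                lines.append(f"  {node_id} -> {visit(child)};")
        return node_id

    for tree in trees:
        visit(tree)
    lines.append("}")
    return "\n".join(lines)


# Two hand-written toy trees that share the sub-trees (NP I) and (NP her).
t1 = ("S", ("NP", "I"), ("VP", ("V", "saw"), ("NP", "her")))
t2 = ("S", ("NP", "I"), ("VP", ("V", "heard"), ("NP", "her")))
dot_text = forest_to_dot([t1, t2])
print(dot_text)
```

Saving the printed text to a file such as forest.dot and running dot -Tpng forest.dot -o forest.png should render it.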

For the original source code and binary distributions go to the Official GraphViz Web Site and the GraphViz Development Web Site. There is a very sophisticated native port to MacOSX.

There is a convenient Perl interface to graphviz available from CPAN and there is a C++ STL-style interface to graphviz that is part of the Boost library. There's even a MATLAB interface to graphviz.

Posted by anoop at 10:56 AM

June 25, 2004

The Great Eskimo Vocabulary Hoax by Geoffrey Pullum

This is a collection of 23 essays written by Geoff Pullum which originally appeared in the `TOPIC ... COMMENT' column in the journal `Natural Language and Linguistic Theory'. Pullum's columns ran in NLLT for six years. If you have ever read a paper on linguistics (published after 1950) then this book is required reading.

Pullum's essays are organized into four broad sections:

  • `Fashions and Tendencies' which contains essays about the practice of linguistics, for example in "Formal Linguistics meets the Boojum" he parodies the strange retreat from formalisms in formal linguistics.
  • `Publication and Damnation' consists of essays about the day to day work of working linguists who spend their days "Stalking the perfect journal".
  • `Unscientific Behaviour' catalogs among other topics how the subject of whether natural language is contained within the set of context-free languages was explored by linguists (in "Footloose and Context-Free") and how certain myths about language have a life of their own (in the eponymous "The Great Eskimo Vocabulary Hoax").
  • Finally, `Linguistic Fantasies' contains a loosely connected set of essays including a list of science fiction books about linguistics (in "Some lists of things about books") and a fascinating fictional(?) tale of how one linguistics book was written (in "The incident of the node vortex problem").

The subtitle of the book promises "Irreverent Essays on the Study of Language" and it delivers. You can find current writings by Geoff Pullum on similar topics appearing on the language log.

%T The Great Eskimo Vocabulary Hoax
%T :And Other Irreverent Essays on the Study of Language
%A Geoffrey K. Pullum
%I The University of Chicago Press
%D 1991
%G ISBN: 0226685330 (hc)
%G ISBN: 0226685349 (pb)
%P 236
%K science, linguistics

Review written: 1999/08/02

Posted by anoop at 02:07 PM

May 17, 2004

Parsing 'A Verbless Post'

This is getting a bit ridiculous, but here goes:

A follow up to a previous post about Part-of-speech Tagging 'A Verbless Post' in which Geoff Pullum's post to the language log was analyzed for parts of speech. This post uses Eugene Charniak's statistical parser (parser03) to produce a syntactic analysis of the contents (in the Penn Treebank notation).

The first thing to notice in the parser output is that the recall for humorous points scored is substantially reduced due to the fact that no verb to Thaler is produced:

(S1 (S (CC And) (PP (IN in) (NP (DT that) (NN case))) (, ,) (NP (NP (DT a) (NN word)) (PP (IN of) (NP (NP (NN gratitude)) (PP (TO to) (NP (NNP Thaler)))))) (VP (PRN (-LRB- -LCB-) (ADVP (RB otherwise)) (NP (DT an) (JJ unimportant) (NN screwball)) (-RRB- -RCB-))) (. .)))

However, overall the poor parser is strained by the lack of verbs more than the tagger seemed to be, mainly due to the added pressure of producing legitimate syntactic structures. Because verb phrases occur frequently in the training data, the parser produces structures with spurious VPs in some unfamiliar contexts:

(S1 (S (NP (IN Except)) (VP (VBZ ..)) (. .)))

and:

(VP (VBZ nouns) (NP (, ,) (NNS pronouns) ... )

Our experience in trying to parse the output of a statistical machine translation system on the NIST 02/03 data for Chinese to English translation led to similar issues of hallucinated verb phrases for some of the ungrammatical English sentences output by the system. This behaviour is documented in this paper (from HLT-NAACL, 2004).

Understanding the notation of these parse trees is likely to be more challenging for the layperson (I would hope). For the intrepid reader, a good start would be the Penn Treebank manuals.
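For readers who would rather let a program do the bracket counting, here is a minimal Python sketch of a reader for this notation. It assumes well-formed input with one tree per string; read_tree, words and phrases are names invented for this post and are not part of the Charniak parser or the Penn Treebank tools.

```python
import re


def read_tree(text):
    """Read one bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                   # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:                  # a bare token is a word
                children.append(tokens[pos])
                pos += 1
        pos += 1                   # skip ")"
        return (label, children)

    return node()


def words(tree):
    """Return the words under a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [w for child in tree[1] for w in words(child)]


def phrases(tree, label):
    """Return the word string of every constituent with the given label."""
    if isinstance(tree, str):
        return []
    found = [" ".join(words(tree))] if tree[0] == label else []
    for child in tree[1]:
        found.extend(phrases(child, label))
    return found


tree = read_tree("(S1 (S (NP (IN Except)) (VP (VBZ ..)) (. .)))")
print(words(tree))          # ['Except', '..', '.']
print(phrases(tree, "VP"))  # ['..']
```

Asking for the VP constituents is a quick way to spot the hallucinated verb phrases discussed above.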

If you examine the full output of the Charniak parser on Geoff Pullum's post (shown below), there are some strange errors in punctuation, and the usual prepositional phrase (PP) and coordination (CC) attachment errors. But, overall, the performance is very good, especially for some useful constituents like noun phrases (NPs) or parentheticals (PRN).

(S1 (NP (DT A) (JJ verbless) (NN novel) (. ?)))
(S1 (FRAG (WRB Why) (. ?) (. ?)))
(S1 (NP (NP (WP What) (NN reason)) (PP (IN for) (NP (NP (DT the) (NN accomplishment)) (PP (IN by) (NP (NP (DT this) (JJ showy) (NN fool)) (PP (IN in) (NP (NP (NNP France)) (, ,) (NP (NNP Michel))))))))))
(S1 (FRAG (NP (NNP Thaler)) (, ,) (NP (NP (PRP$ his) (NN effort)) (PP (IN at) (NP (NP (DT an) (JJ entire) (NN novel)) (PP (IN with) (NP (DT no) (NNS verbs))))) (PRN (-LRB- -LCB-) (NP (RB perhaps) (RB not) (NP (DT a) (ADJP (JJ wise) (CC or) (JJ lucrative)) (NN publication) (NN venture)) (, ,) (VP (VBN given) (NP (NP (DT the) (RB not) (JJ total) (NN incorrectness)) (PP (IN of) (NP (PRP$ my) (NNS speculations)))))) (-RRB- -RCB-)) (ADJP (RB recently) (JJ evident))) (PP (IN amongst) (NP (NP (DT the) (JJ vast) (FW efflux)) (PP (IN of) (NP (NP (JJ absurd) (JJ literary) (NN pretense)) (PP (IN in) (NP (DT the) (JJ French) (NN language))))))) (. ?)))
(S1 (FRAG (INTJ (UH Well)) (, ,) (SBAR (WHNP (WDT whatever)) (S (NP (PRP$ his) (NNS reasons)) (, ,) (PP (IN in) (NP (NN response))) (, ,) (NP (PRP$ my) (JJ own) (NN contribution)) (: :) (NP (NP (DT a) (JJ verbless) (NN post)) (-LRB- -LCB-) (NP (NP (DT the) (JJ first)) (PP (IN on) (NP (NN Language) (NN Log)))) (-RRB- -RCB-)))) (. .)))
(S1 (S (NP (NP (DT No) (NNS verbs)) (PP (IN at) (NP (NP (DT all)) (PP (IN in) (NP (NP (DT this) (NN book)) (PP (IN of) (NP (NP (NNP Thaler) (POS 's)) (, ,) (ADVP (RB just))))))))) (VP (VBZ nouns) (NP (, ,) (NNS pronouns) (, ,) (NNS adjectives) (, ,) (NNS adverbs) (, ,) (NNS prepositions) (, ,) (NNS subordinators) (, ,) (NNS coordinators) (, ,) (CC and) (PRN (: --) (INTJ (UH oh) (. !)) (: --)) (NNS interjections))) (. .)))
(S1 (S (NP (PDT All) (DT those)) (PP (IN among) (NP (DT the) (JJ permissible) (PRN (-LRB- -LCB-) (CC and) (PP (IN for) (NP (PRP him))) (, ,)) (NN past))) (VP (VBZ participles) (ADVP (RB too)) (, ,) (PP (IN though) (NP (NP (DT no) (JJ participial) (NNS intrusions)) (PP (IN in) (NP (DT this) (NN post))))) (, ,) (NP (NP (NP (PDT such) (DT the) (JJ extreme) (NN character)) (PP (IN of) (NP (PRP$ my) (ADJP (JJ cruel) (CC and) (JJ unreasonable)) (JJ self-applicable) (NNS strictures) (-RRB- -RCB-)))) (, ,) (CC but) (RB never) (NP (CD one) (JJ single) (JJ solitary) (NN verb)))) (. .)))
(S1 (S (CC And) (, ,) (ADVP (RB fantastically)) (, ,) (NP (PDT all) (DT this)) (VP (NP (NP (NP (DT a) (NN vision)) (PP (IN of) (NP (NP (DT some) (NN liberation)) (PP (IN for) (NP (NNS authors)))))) (, ,) (RB not) (NP (NP (DT an) (JJ absurd) (JJ literary) (NN straitjacket)) (PP (IN with) (NP (DT the) (NN writer))))) (PRN (-LRB- -LCB-) (PP (IN albeit) (NP (RB willingly))) (-RRB- -RCB-)) (VP (VBN imprisoned) (PP (IN within) (NP (PRP it))))) (. .)))
(S1 (NP (NP (DT Some) (NN freedom)) (, ,) (NP (DT this)) (. .)))
(S1 (FRAG (NP (NNP Thaler)) (: :) (S (NP (NNS nuts) (, ,) (NNS bonkers) (, ,)) (VP (VBP round) (DT the) (VP (VB bend)))) (. .)))
(S1 (NP (NP (JJ Mad)) (PP (IN as) (NP (DT a) (NNP March) (NN hare))) (. .)))
(S1 (S (NP (DT The) (NNP Liberman) (NN conjecture)) (PRN (-LRB- -LCB-) (PP (IN about) (NP (NP (NN survival)) (PP (IN of) (NP (NP (JJ high) (NN school) (JJ literary) (NN experimentation)) (PP (IN into) (NP (NP (NN adulthood)) (PP (IN because) (IN of) (NP (DT a) (ADJP (JJ dysfunctional) (JJ authoritarian)) (JJ French) (JJ educational) (NN system))))))))) (-RRB- -RCB-)) (: :) (S (ADVP (RB probably)) (ADJP (JJ true))) (. .)))
(S1 (NP (NP (PRP$ My) (NN attitude)) (: :) (NP (NP (NN contempt)) (, ,) (ADVP (RB really))) (. .)))
(S1 (S (NP (IN Except)) (VP (VBZ ..)) (. .)))
(S1 (FRAG (PP (IN Unless) (NP (CD ..))) (. .)))
(S1 (S (ADVP (RB Just) (RB possibly)) (, ,) (NP (NP (DT an) (NN exercise)) (, ,) (PP (IN for) (NP (NP (DT the) (NNS undergraduates)) (PP (IN in) (NP (NP (PRP$ my) (NN course)) (PP (IN on) (NP (NNP English)))))))) (VP (NN grammar) (NP (DT this) (NN fall) (NN quarter))) (. .)))
(S1 (NP (NP (DT An) (NN effort)) (PP (IN at) (NP (NP (NN construction)) (PP (IN of) (NP (NP (JJ fifty) (NNS words)) (PP (IN of) (NP (NP (JJ coherent) (NN prose)) (PP (IN with) (NP (NP (ADVP (RB never)) (DT a) (NN verb)) (, ,) (PP (IN with) (NP (NP (RB only) (DT those)) (PP (IN in) (NP (NP (NN possession)) (PP (IN of) (NP (NP (JJ enough) (JJ grammatical) (NN knowledge)) (PP (IN for) (NP (NP (JJ verb) (NN identification)) (ADJP (JJ capable) (PP (IN of) (NP (NN success)))))))))))))))))))) (. .)))
(S1 (FRAG (ADJP (JJ Worth) (S (NP (DT a) (NN try)))) (, ,) (ADVP (RB perhaps)) (. .)))
(S1 (S (CC And) (PP (IN in) (NP (DT that) (NN case))) (, ,) (NP (NP (DT a) (NN word)) (PP (IN of) (NP (NP (NN gratitude)) (PP (TO to) (NP (NNP Thaler)))))) (VP (PRN (-LRB- -LCB-) (ADVP (RB otherwise)) (NP (DT an) (JJ unimportant) (NN screwball)) (-RRB- -RCB-))) (. .)))
(S1 (FRAG (NP (RB Always) (DT that) (JJ extra) (NN possibility)) (: :) (S (NP (DT the) (NN idea)) (VP (VBP justifiable) (PP (RB not) (PP (IN because) (IN of) (NP (PRP$ its) (NN implementation))) (, ,) (CC but) (PP (IN in) (NP (NP (NN virtue)) (PP (IN of) (NP (NP (DT a) (ADJP (JJ complementary) (CC or) (JJ counterposed)) (NN idea) (NN emergent)) (PP (IN in) (NP (NP (DT the) (NN mind)) (PP (IN of) (NP (NP (NN someone) (RB else)) (: --) (NP (NP (JJ serendipitous) (JJ bastard) (NN offspring)) (PP (IN of) (NP (DT a) (JJ deranged) (JJ cognitive) (NN parent))))))))))))))) (. .)))
(S1 (FRAG (RB So) (NP (NP (PRP$ my) (NN gratitude)) (PP (TO to) (NP (PRP you)))) (, ,) (NP (NNP Thaler)) (, ,) (NP (PRP you) (JJ pusillanimous) (NN poseur)) (, ,) (NP (PRP you) (JJ literary) (NN clown)) (. .)))
(S1 (NP (DT A) (JJ new) (NN idea) (. !)))
(S1 (FRAG (NP (PRP$ My) (NN idea)) (, ,) (NP (NP (DT all) (NN mine)) (PRN (-LRB- -LCB-) (NP (NP (ADJP (JJ accessible) (PP (ADVP (RB here)) (IN on) (NP (NN Language)))) (NN Log)) (PP (TO to) (NP (QP (RB just) (DT a) (JJ few) (CD thousand)) (JJ close) (NNS friends)))) (-RRB- -RCB-))) (. .)))
(S1 (FRAG (NP (NNP Ooh)) (, ,) (NP (CD one) (JJ other) (NN thought)) (, ,) (PP (IN for) (NP (JJ computational) (NNS linguists))) (: :) (SBAR (WHNP (WP What)) (S (VP (NNS bets) (PP (IN on) (NP (NP (DT the) (NN performance)) (PP (IN of) (NP (NP (JJ part-of-speech) (VBG tagging) (NNS algorithms)) (PP (IN on) (NP (NN prose))) (PP (JJ such) (IN as) (NP (DT this)))))))))) (. ?)))

Posted by anoop at 03:42 AM

May 12, 2004

Part-of-Speech Tagging 'A Verbless Post'

Geoffrey Pullum, in full TOPIC .. COMMENT form, has posted on the language log a reasoned critique, entitled A Verbless Post, of Michel Thaler's verbophobic novel. Pullum's post, of course, contains no verbs, but more to the point for this posting, has the following concluding statement:

Ooh, one other thought, for computational linguists: What bets on the performance of part-of-speech tagging algorithms on prose such as this?

I reached for Adwait Ratnaparkhi's aging but conveniently handy Maximum Entropy part-of-speech tagger and ran it on Pullum's post.

The first thing to notice about the output is the depressing amount of tokenization it takes to make sure that spurious errors do not arise.

Errors? Of course, there are some, but not as many as one would expect. More importantly, the tagger puts its own label biased tongue in its cheek and creates a new verb, to Thaler:

And_CC in_IN that_DT case_NN ,_, a_DT word_NN of_IN gratitude_NN to_TO Thaler_VB -LCB-_-LRB- otherwise_RB an_DT unimportant_JJ screwball_NN -RCB-_-RRB- ._.
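Hunting for verb tags in this word_TAG format is easy to automate. The following is a small Python sketch, assuming whitespace-separated tokens with the tag after the last underscore; read_tagged and verbs are names invented for this post, not part of Ratnaparkhi's tagger.

```python
def read_tagged(text):
    """Split 'word_TAG' tokens into (word, tag) pairs."""
    # Split on the last underscore, so tokens like "-LCB-_-LRB-" work too.
    return [tuple(token.rsplit("_", 1)) for token in text.split()]


def verbs(pairs):
    """Keep the pairs whose tag is one of the Penn Treebank verb tags (VB*)."""
    return [(word, tag) for word, tag in pairs if tag.startswith("VB")]


sentence = (
    "And_CC in_IN that_DT case_NN ,_, a_DT word_NN of_IN gratitude_NN "
    "to_TO Thaler_VB -LCB-_-LRB- otherwise_RB an_DT unimportant_JJ "
    "screwball_NN -RCB-_-RRB- ._."
)
print(verbs(read_tagged(sentence)))  # [('Thaler', 'VB')]
```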

Here is the entire output of the tagger on Pullum's post:

A_DT verbless_JJ novel_NN ?_. Why_WRB ?_. ?_. What_WP reason_NN for_IN the_DT accomplishment_NN by_IN this_DT showy_NN fool_NN in_IN France_NNP ,_, Michel_NNP Thaler_NNP ,_, his_PRP$ effort_NN at_IN an_DT entire_JJ novel_NN with_IN no_DT verbs_NNS -LCB-_-LRB- perhaps_RB not_RB a_DT wise_JJ or_CC lucrative_JJ publication_NN venture_NN ,_, given_VBN the_DT not_RB total_JJ incorrectness_NN of_IN my_PRP$ speculations_NNS -RCB-_-RRB- recently_RB evident_JJ amongst_IN the_DT vast_JJ efflux_NN of_IN absurd_JJ literary_JJ pretense_NN in_IN the_DT French_JJ language_NN ?_. Well_UH ,_, whatever_WDT his_PRP$ reasons_NNS ,_, in_IN response_NN ,_, my_PRP$ own_JJ contribution_NN :_: a_DT verbless_JJ post_NN -LCB-_-LRB- the_DT first_JJ on_IN Language_NNP Log_NNP -RCB-_-RRB- ._. No_DT verbs_NNS at_IN all_DT in_IN this_DT book_NN of_IN Thaler_NNP 's_POS ,_, just_RB nouns_NNS ,_, pronouns_NNS ,_, adjectives_NNS ,_, adverbs_NNS ,_, prepositions_NNS ,_, subordinators_NNS ,_, coordinators_NNS ,_, and_CC --_: oh_UH !_. --_: interjections_NNS ._. All_PDT those_DT among_IN the_DT permissible_JJ -LCB-_-LRB- and_CC for_IN him_PRP ,_, past_JJ participles_NNS too_RB ,_, though_IN no_DT participial_JJ intrusions_NNS in_IN this_DT post_NN ,_, such_PDT the_DT extreme_JJ character_NN of_IN my_PRP$ cruel_NN and_CC unreasonable_JJ self-applicable_JJ strictures_NNS -RCB-_-RRB- ,_, but_CC never_RB one_CD single_JJ solitary_JJ verb_NN ._. And_CC ,_, fantastically_RB ,_, all_PDT this_DT a_DT vision_NN of_IN some_DT liberation_NN for_IN authors_NNS ,_, not_RB an_DT absurd_JJ literary_JJ straitjacket_NN with_IN the_DT writer_NN -LCB-_-LRB- albeit_IN willingly_RB -RCB-_-RRB- imprisoned_VBN within_IN it_PRP ._. Some_DT freedom_NN ,_, this_DT ._. Thaler_NNP :_: nuts_NNS ,_, bonkers_NNS ,_, round_VBP the_DT bend_NN ._. Mad_NNP as_IN a_DT March_NNP hare_NN ._. 
The_DT Liberman_NNP conjecture_NN -LCB-_-LRB- about_IN survival_NN of_IN high_JJ school_NN literary_JJ experimentation_NN into_IN adulthood_NN because_IN of_IN a_DT dysfunctional_JJ authoritarian_JJ French_JJ educational_JJ system_NN -RCB-_-RRB- :_: probably_RB true_JJ ._. My_PRP$ attitude_NN :_: contempt,_NN really_RB ._. Except_IN ..._: Unless_IN ..._: Just_RB possibly_RB ,_, an_DT exercise_NN ,_, for_IN the_DT undergraduates_NN in_IN my_PRP$ course_NN on_IN English_JJ grammar_NN this_DT fall_NN quarter_NN ._. An_DT effort_NN at_IN construction_NN of_IN fifty_JJ words_NNS of_IN coherent_JJ prose_NN with_IN never_RB a_DT verb_NN ,_, with_IN only_RB those_DT in_IN possession_NN of_IN enough_JJ grammatical_JJ knowledge_NN for_IN verb_NN identification_NN capable_JJ of_IN success_NN ._. Worth_JJ a_DT try_NN ,_, perhaps_RB ._. And_CC in_IN that_DT case_NN ,_, a_DT word_NN of_IN gratitude_NN to_TO Thaler_VB -LCB-_-LRB- otherwise_RB an_DT unimportant_JJ screwball_NN -RCB-_-RRB- ._. Always_RB that_DT extra_JJ possibility_NN :_: the_DT idea_NN justifiable_JJ not_RB because_IN of_IN its_PRP$ implementation_NN ,_, but_CC in_IN virtue_NN of_IN a_DT complementary_JJ or_CC counterposed_JJ idea_NN emergent_NN in_IN the_DT mind_NN of_IN someone_NN else_RB --_: serendipitous_JJ bastard_NN offspring_NN of_IN a_DT deranged_VBN cognitive_JJ parent_NN ._. So_IN my_PRP$ gratitude_NN to_TO you_PRP ,_, Thaler_NNP ,_, you_PRP pusillanimous_JJ poseur_NN ,_, you_PRP literary_JJ clown_NN ._. A_DT new_JJ idea_NN !_. My_PRP$ idea_NN ,_, all_DT mine_NN -LCB-_-LRB- accessible_JJ here_RB on_IN Language_NNP Log_NNP to_TO just_RB a_DT few_JJ thousand_CD close_JJ friends_NNS -RCB-_-RRB- ._. Ooh_NNP ,_, one_CD other_JJ thought_NN ,_, for_IN computational_JJ linguists_NNS :_: What_WP bets_VBZ on_IN the_DT performance_NN of_IN part-of-speech_JJ tagging_VBG algorithms_NNS on_IN prose_NN such_JJ as_IN this_DT ?_.

Here is a quick cheat sheet for those who have not yet memorized the Penn Treebank tagset (shame on you!):

CC     Coordinating Conjunction
CD     Cardinal Number
DT     Determiner
IN     Preposition
JJ     Adjective
-LRB-  Left bracket
NN     Noun, singular
NNP    Proper Noun, singular
NNS    Noun, plural
PDT    Predeterminer
POS    Possessive ending
PRP    Personal pronoun
PRP$   Possessive pronoun
RB     Adverb
-RRB-  Right bracket
TO     to
UH     Interjection
VB     Verb, base form
VBG    Verb, gerund or present participle
VBN    Verb, past participle
VBP    Verb, non-3rd person singular present
VBZ    Verb, 3rd person singular present
WDT    Wh-determiner
WP     Wh-pronoun
WRB    Wh-adverb

Posted by anoop at 10:41 PM

January 22, 2004

Search engine for linguists

Phil Resnik has created a Search engine for linguists. It allows the user to search for particular sentence structures or parse trees. The parses are generated by running Eugene Charniak's statistical parser.

The idea, if I understand it correctly, is to search for structural matches rather than matches on the words. So, for example, if the user was interested in the class of sentences typified by the sentence:

"John ate the meat raw"

Then using the Query page of the Linguist search engine the user could search for the following parse tree (plug your Penn Treebank notation memory module into your brain first):

(VP (VBD ate)(S NP (ADJP JJ)))

According to Philip, in the forum post explaining this query, the first 20 hits include:

  • Just because they eat it raw doesn't mean that they don't want it fresh.
  • Partial decomposition would be a good alternative for seasoning and tenderising if you have to eat it raw.
  • Eat them broiled, grilled or blackened.
  • All the hypocrisy around me oh God don t let me fall,they might just eat me alive.
  • Eat them smoked, pickled, or cooked.
  • Then the baby Kangaroo's can eat them alive!
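To make the idea of a structural match concrete, here is a rough Python sketch. It is not how the Linguist's Search Engine works internally (I have not seen its code); it assumes that a pattern node matches a tree node with the same label, and that the pattern's children must match some of the tree node's children in the same order. The function names and the example parses are made up for this post. It also matches words exactly, so it would not find the "eat" hits above from a query on "ate".

```python
import re


def read(text):
    """Read one bracketed tree; bare tokens become (token, []) nodes."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":     # bare label or word
            pos += 1
            return (tokens[pos - 1], [])
        pos += 1                   # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            children.append(node())
        pos += 1                   # skip ")"
        return (label, children)

    return node()


def matches_here(pattern, tree):
    """True if the pattern matches with its root at this tree node."""
    if pattern[0] != tree[0]:
        return False
    i = 0
    for p_child in pattern[1]:
        # Pattern children must appear in order; extra tree children are allowed.
        while i < len(tree[1]) and not matches_here(p_child, tree[1][i]):
            i += 1
        if i == len(tree[1]):
            return False
        i += 1
    return True


def matches_anywhere(pattern, tree):
    """True if the pattern matches at this node or at any node below it."""
    return matches_here(pattern, tree) or any(
        matches_anywhere(pattern, child) for child in tree[1]
    )


query = read("(VP (VBD ate)(S NP (ADJP JJ)))")
# Hand-written parses for illustration, not actual parser output.
raw = read("(S (NP (NNP John)) (VP (VBD ate) (S (NP (DT the) (NN meat)) (ADJP (JJ raw)))))")
plain = read("(S (NP (NNP John)) (VP (VBD ate) (NP (DT the) (NN meat))))")
print(matches_anywhere(query, raw))    # True
print(matches_anywhere(query, plain))  # False
```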

Posted by anoop at 01:24 PM

January 21, 2004

gloof, spooce, gloof twain, spooce, gairk

Very cool piece of work brought to my attention by Mark Liberman on the languagelog:

ShortTalk is a speech interface for composing text. Think of it as a "little" programming language that is speech-based and which you can freely intersperse in between normal English speech with some guarantee that the speech recognition algorithms will not freak out on you.

Here's a brief clipping from the webpage that shows how to use ShortTalk to add some space around a "+"-sign

Before

z = x+y|

After

z = x + y|

ShortTalk solution

gloof, spooce, gloof twain, spooce, gairk

Posted by anoop at 04:23 PM

October 24, 2003

Voice Recognition Software Yelled At

From the Onion, vol 39, issue 41 America's Finest News Source(TM) 22 October 2003.

NEW YORK -- Fidelity Financial Services' Gwen Watson, 33, shouted angrily at her IBM ViaVoice Pro USB voice-recognition software, sources close to the human-resources administrator reported Monday. "No, not Gary Friedman! Barry Friedman, you stupid computer. BARRY!" Watson was heard to scream from her cubicle. "Jesus Christ, I could've typed it in a hundredth of the time." After another minute of yelling, Watson was further incensed upon looking at her screen, which read, "Barely Freedman you God ram plucking pizza ship."

Hmm. When good language models go bad ...

Posted by anoop at 01:01 PM

October 02, 2003

Famous researchers and "Work at google" ads

Not so long ago, if you searched for "machine learning" or "computational linguistics" in google, you would get an ad (so-called Sponsored Link) from google: the "Work at google" ad.

Not so well known was that due to the inherent (or explicit) clustering over queries, you would also get this ad if you searched for a particular name. The names were usually of famous computational linguists or machine learning people. For example, if you searched for fernando pereira you would get an ad asking whether you would like to "Work at google". Unfortunately, Fernando's name no longer triggers the ad.

But some other names still do. The following is only a partial list of names that trigger "Work at google" ads:

Work at google Ads

  • dekai wu
  • andrew mccallum
  • yoav freund
  • daniel marcu
  • vladimir vapnik
  • soumen chakrabarti
  • david yarowsky

A small variation is a more specific ad which targets NLP searchers:

Work on NLP at google (with the blurb "google is hiring experts in statistical natural language processing")

  • aravind joshi
  • fred jelinek
  • robert schapire
  • dan jurafsky
  • steven abney
  • stuart shieber

What is equally surprising is that other names that you might think of as being in this class do not trigger the same ad. So what is the key that distinguishes these people from other, arguably just-as-famous researchers?

Posted by anoop at 02:42 PM