March 20, 2006

Frederick Jelinek's Acceptance Speech

On November 22nd, 2001, Charles University in Prague presented an Honorary Doctorate to Prof. Frederick Jelinek. I would like to point you in the direction of his acceptance speech on this occasion (translated into English from the original Czech). Here's an excerpt to pique your interest:

By now it is certainly obvious that my life was full of detours and compromises. Opportunities for success were afforded me in the United States, the land of freedom, land of work, land of civilization, to quote an old song of the famous Czech clowns Voskovec and Werich.

Posted by anoop at 12:19 PM

February 01, 2006

Gung Haggis Fat Choy

gunghaggisfatchoy.jpg
(Image from roland's flickr set)

An honest-to-goodness Vancouver tradition that originated at Simon Fraser University: the annual celebration of Scottish and Chinese culture, Gung Haggis Fat Choy.

Taken from the official web site for Gung Haggis Fat Choy, here's a slightly edited version of the history of this annual holiday celebration (only celebrated in Vancouver, as far as I can tell):

It was on Burnaby Mountain, at Simon Fraser University that mild-mannered psychology student and SFU tour guide, Todd Wong was asked to help out with the University's annual Robbie Burns celebrations. ... Wong was befuddled with the idea of a Chinese guy (him) wearing a Scottish kilt and having to show his bare knees out in the snow.

... the Chinese Lunar New Year fell on January 27th only two days away from Robbie Burns Day, which is always January 25th in celebration of the Scottish Bard's birthday. "Gung Haggis Fat Choy!" said Wong, "I can celebrate two cultures at the same time."

Flash forward to 1998, and Wong was putting together a Chinese New Year Dinner party for about 12 friends. Lo and Behold, the Lunar New Year again fell two days away from January 25th, Robbie Burns Day. Dinner plans were quickly made to incorporate both Chinese New Year and Robbie Burns Day traditions as Wong scurried off to the Vancouver Public Library to research Robbie Burns Day and discover Scottish songs for himself to play on his accordion.

A dinner of 16 in a friend's living room was the setting for the first Robbie Burns Chinese New Year dinner hosted by Toddish McWong, along with co-host Gloria Smyth. Todd cooked and organized most of the dishes. Gloria hired the bagpiper. They invited their friends. Fiona brought a haggis. Margot toasted the lads and lassies.

Posted by anoop at 11:29 PM

November 14, 2005

Korean History Group Blog

A new group blog called 우물 안 개구리 (umul an geguri or Frog in a Well; frog=geguri, well=umul) explores Korean history and is being run by Konrad Lawson. The weblog's name is explained therein:

The weblog’s name 우물 안 개구리 is originally from a Chinese proverb that comes from the writings of Zhuangzi, one of the founders of what we now call Daoism (In the Burton Watson translation of his Basic Writings the story behind this proverb can be found in Section 17 “Autumn Floods” on pages 107-8). A frog tries to convince a turtle to join him in his wonderful well, of which he is a master. After trying to get in and getting stuck, the turtle withdraws and tells the frog instead of how deep and wide the sea is. The frog is left dumfounded. The proverb, which grew out of this Daoist fable, has come to represent a state of limited vision and even ignorance — of not being able to see outside one’s own immediate environment.

The name was also chosen to be the same as previous group blogs on Japanese history (井の中の蛙) and Chinese history (井底之蛙).

Posted by anoop at 02:45 PM

November 09, 2005

When software bugs attack

Wired News has collected a list of legendary software bugs (local copy).

Here is an important caveat from the article:

Many people believe the worst bugs are those that cause fatalities. To be sure, there haven't been many, but cases like the Therac-25 are widely seen as warnings against the widespread deployment of software in safety critical applications. Experts who study such systems, though, warn that even though the software might kill a few people, focusing on these fatalities risks inhibiting the migration of technology into areas where smarter processing is sorely needed. In the end, they say, the lack of software might kill more people than the inevitable bugs.

This is presented as an all-or-nothing argument. It is probably rare for smarter processing to be so crucial that the people using the technology should give up being skeptical and insisting on software they can really trust. This evokes Sean Eddy's note in PLoS Comp. Bio. about "inter-disciplinary" research, which should not be construed as gathering people from different disciplines, but as individual people learning to do research in multiple disciplines and eventually becoming confident at it.

Posted by anoop at 09:42 AM

October 30, 2005

Respect

That's what we need: some Respect!

David recently posted a link to a Computer magazine article, Statistical Machine Translation Gains Respect, on the 2005 NIST Machine Translation evaluation results and the status of contemporary statistical machine translation.

The article does a good job of summarizing interviews with a representative from each of the teams that participated in NIST MTEval 2005. Google, IBM, UMD, NRC, CMU and Edinburgh get interviewed. ISI and Aachen are noticeably missing from the list of interviews.

There is the usual quota of amusing stuff that you find in popular science articles: watch out for the most awkward definition of n-gram yet, and don't miss a quote from a famous Edinburgh researcher saying that "translating to Chinese require systems to have high levels of linguistic knowledge". And breaking news: "Google has several other [SMT] applications in mind, but won't comment further".

Nobody else might find it interesting that the article's author, David Geer, lives in Ashtabula, Ohio.

Posted by anoop at 01:57 PM

July 20, 2005

The Tragic Tale of a Genius by Freeman Dyson

500px-NorbertWiener.jpg
© Image courtesy of the Research Laboratory of Electronics at MIT.

In The Tragic Tale of a Genius Freeman Dyson (published in the New York Review of Books, Volume 52, Number 12, July 14, 2005) reviews Dark Hero of the Information Age: In Search of Norbert Wiener, the Father of Cybernetics by Flo Conway and Jim Siegelman (Basic Books). (temporary url)

His review also includes information from Norbert Wiener's two autobiographies: Ex-prodigy: My Childhood and Youth (Simon and Schuster, 1953) and I Am a Mathematician: The Later Life of a Prodigy (Doubleday, 1956).

In academic computer science departments there is often a science/engineering split, where one side of the split proves theorems, and the other side builds systems to solve 'real-world' problems. Some of the most admired computer scientists like Alan Turing, Don Knuth and Norbert Wiener (to name a few) teach us how to bridge this gap. From Dyson's review:

Wiener was unusual among mathematicians in being equally at home in pure and applied mathematics. He made his reputation as a pure mathematician by inventing concepts such as the "Wiener measure" that have passed into the mainstream of mathematics. Wiener measure gave mathematicians for the first time a rigorous way to talk about the collective behavior of wiggly curves or flexible surfaces. While continuing to publish papers in the abstract realms of mathematical logic and analysis, he loved to talk with the engineers and neurophysiologists who were his neighbors at MIT and Harvard. He became deeply immersed in their cultures, and enjoyed translating problems from the languages of engineering and neurophysiology into the language of mathematics.

Unlike most pure mathematicians, he did not consider it beneath his dignity to apply his skills to the messy practical problems of the real world. He understood, more clearly than anyone else, that the messiness of the real world was precisely the point at which his mathematics should be aimed.

Norbert Wiener is best known as the founder of cybernetics:

As an applied mathematician, he worked out a general theory of control systems and feedback mechanisms, a theory which he called "cybernetics." Cybernetics was a theory of messiness, a theory that allowed people to find an optimum way to deal with a world full of poorly known agents and unpredictable events.

In 1940 he wrote a memorandum explaining in detail why digital language would be preferable for the computers whose existence he already foresaw. But his own contributions to communication theory happened to be written in analog language, for four reasons. First, his work as a pure mathematician had mostly been in analysis. Second, his practical experience with antiaircraft prediction was concerned with analog measurements and analog feedback mechanisms. Third, his conversations with neurophysiologists had convinced him that the language of sensory-motor feedback signals in the brains of humans and animals is analog. Fourth, the transmission of signals by chemical hormones is evidence that the action of the brain is at least partly analog. For all these reasons, Wiener's book Cybernetics, which summarized his thinking in 1948, was written in analog language.

Meanwhile, also in 1948, Claude Shannon published his classic pair of papers with the title "A Mathematical Theory of Communication," ... [It] was mathematically elegant, clear, and easy to apply to practical problems of communication. It was far more user-friendly than cybernetics. It became the basis of a new discipline called "information theory." ... Electronic engineers learned information theory, the gospel according to Shannon, as part of their basic training, and cybernetics was forgotten.

But Wiener was not ignored everywhere. His theories had wide circulation in India and Russia, and he was welcomed personally by Nehru and other leaders in India. Wiener did advocate the founding of technical institutes and the encouragement of home-grown technical industries, but I find Dyson's claim that this is why India and, to some extent, Russia are now strong in information technology too simplistic.

Dyson also compares this new book about Wiener with two previous biographies: John von Neumann and Norbert Wiener: From Mathematics to the Technologies of Life and Death by Steve Heims (MIT Press, 1980) and Norbert Wiener, 1894–1964 by Pesi Masani (Birkhäuser, 1990). About the Heims book, Dyson says:

The Heims biography emphasizes politics. It is mainly concerned with Wiener's activities as a social critic in the last third of his life. It presents the parallel lives of von Neumann and Wiener as a simple struggle between black and white... In a review of the Heims book which I published in Technology Review in 1981, [February/March issue, pp. 17–19] I wrote:

If Heims had been willing [to stay in the background], to present his work as a historical narrative with the protagonists speaking for themselves, he would have made an important contribution to the understanding of the great moral dilemma of our age. Unfortunately, ... he stands at the front of the stage between his characters and the audience, making it difficult for us to hear their voices and to see the drama of their lives [in historical perspective].

And about Masani's book, Dyson writes:

Pesi Masani's biography is from a scholarly point of view the best of the three. Masani was a professional mathematician, born in India and settled in the United States. He collaborated with Wiener and published several substantial papers with him in the 1950s. After Wiener died, Masani edited his collected papers for publication. ... The Masani biography is the only one that portrays him as a working mathematician.

Masani explains Wiener's mathematical ideas with admirable clarity, and he has found and reproduced many historical documents that the other biographers have missed. One particularly illuminating document that Masani reproduces in full is a long and friendly letter from von Neumann to Wiener, written in November 1946, discussing the mysteries of the human brain and the various ways in which the mysteries might be explored. ... Von Neumann's letter shows how far he had come in foreshadowing the era of molecular biology that he never lived to see. The letter also shows how far Heims diverged from the truth when he portrayed von Neumann and Wiener as polar opposites. They shared a passionate interest in biology. Both of them saw a deeper understanding of biology as the ultimate goal of their explorations of the science of computing and information.

So after two biographies, why a new one? As Dyson says:

After Heims has described Wiener's politics and Masani has described his mathematics, what is there left for a third biography to do? This third biography gives us a new and intimate portrait of Wiener as a person, and describes his stormy relationships with his friends and family. ... Their aim is to explore the roots of Wiener's lifelong malaise and often weird behavior.

Wiener's personal life was marred by several problems, some of them perhaps because of his genius:

The drama of Wiener's personal life begins with his years as an infant prodigy, tormented by his brilliant but tyrannical father. Either as a result of his father's training or from genetic predisposition, he suffered from violent swings of mood that continued throughout his life. ...

Another major theme of this biography is Wiener's marriage. His wife, Margaret, was a student of his father, and the marriage was arranged by his parents. Margaret was chosen to take over from his parents the job of caring for him and organizing his life. ... She coped with his moods and raised his daughters.

But Margaret was in some respects even crazier than Wiener. She had emigrated from Germany to America at the age of fourteen. She was a fervent admirer of Adolf Hitler and kept two copies of Mein Kampf displayed prominently in her bedroom, one in German and one in English. She made no secret of her political views, to the intense annoyance of Wiener, who was himself Jewish and had many friends who were victims of Nazi persecution. When the daughters were teenagers and began to acquire boyfriends, she made their lives miserable by accusing them of nonexistent sexual delinquencies. ... As a result of her paranoid accusations, both daughters escaped from home as soon as they could and thereafter had little contact with her or with Wiener.

The most tragic episode of Wiener's life happened in 1951 when he was fifty-seven years old and passionately involved in a collaboration with his friend Warren McCulloch and a group of young colleagues that he called "the boys." ... Margaret was insanely jealous of McCulloch and his boys, and resolved to break up their friendship with Wiener ... she informed Wiener that McCulloch's boys had seduced his daughter Barbara when she was a teenager staying at McCulloch's house. This story had no basis in fact, but Wiener believed it ... and immediately wrote an angry letter to the president of MIT dissolving all connection between himself and the McCulloch team.

Dyson tries to find some balance in this story:

Margaret is now the one who is accused and will never have a chance to answer her accusers. She never spoke with the authors, and left no friend behind to speak for her. The evidence against her is well documented and seems convincing. And still, the reviewer wonders.

Wiener is in many ways a forgotten hero of computer science. I certainly have not read any of Wiener's books on cybernetics, and nobody in contemporary AI seems to bother to read them either. The digital-analog war is pretty much over, and there are no prizes for guessing which side won: Shannon's theories seem highly relevant for research in AI and machine learning today, while Wiener's theories are, for better or for worse, left behind.

If you've read this far, you might want to read my review of Steve Heims' biography of Norbert Wiener which touches on Wiener's ethical ideas on responsible behaviour as a researcher. That book was also reviewed by Rudolf Peierls in Odd Couple (New York Review of Books, Volume 29, Number 2, February 18, 1982).

Posted by anoop at 01:30 PM

July 05, 2005

Do the other things

kennedy_rice.png

If you have watched any documentary on the Apollo space program, you've heard (and seen) the following excerpt from John F. Kennedy's address delivered at Rice University, Sept 12, 1962.

We choose to go to the moon. We choose to go to the moon in this decade and do the other things, not because they are easy, but because they are hard, because that goal will serve to organize and measure the best of our energies and skills, because that challenge is one that we are willing to accept, one we are unwilling to postpone, and one which we intend to win, and the others, too.

I have always been bothered by this excerpt because the referent for "the other things" is almost never included in the video or audio clip (I can't think of a single documentary which includes the referent). So what does "the other things" refer to? Is it unambiguous?

The answer is clear from watching video footage of the entire speech. Here is the paragraph that precedes the previous paragraph in the transcript of Kennedy's speech:

There is no strife, no prejudice, no national conflict in outer space as yet. Its hazards are hostile to us all. Its conquest deserves the best of all mankind, and its opportunity for peaceful cooperation may never come again. But why, some say, the moon? Why choose this as our goal? And they may well ask why climb the highest mountain? Why, 35 years ago, fly the Atlantic? Why does Rice play Texas?

We choose to go to the moon. We choose to go to the moon in this decade and do the other things, not because they are easy, but because they are hard ...

So it seems that the referent for "the other things" is the set: { "climb the highest mountain", "fly the Atlantic 35 years ago", "Rice playing Texas" }. Apart from the last item, it is quite easy to grasp Kennedy's comparison. The same set is also presumably the referent for the second case of anaphora: "one which we intend to win, and the others, too". Climbing Everest and transatlantic flight are clear analogies for going to the moon, but the 'Rice playing Texas' item requires some further explanation.

Here is what Bill Little has to say about the Rice-Texas football rivalry in an article published on 9/24/2004:

It began 90 years ago, when Rice, playing in only its third football season, lost to a Texas team that included six players who would enter the Longhorn Hall of Honor after it was started more than 40 years later.

They were legendary names, folks like Louis Jordan, the team captain, and Gen. K. L. Berry, Pig Dittmar and Clyde Littlefield. And that was only the beginning.

A year later, Rice and Texas met on October 16, 1915, in the Longhorns' first game in a new league alignment called the Southwest Conference. For 82 years, from that beginning season in 1914 through 1995, the two schools played every year. In its time, it was longest continuous streak of any Longhorn opponent.

Texas controlled the series in the early years, but the fledgling Owls did post a notable win in 1924 under their new coach, a guy named John W. Heisman (for whom the famous trophy is named). But beginning in 1930, the series between the university on South Main in Houston and the guys from the Forty Acres in Austin was second only to Texas A&M as the Longhorns' biggest rivalry until the mid 1960s.

In 1937, Texas hired D. X. Bible, and Rice followed in 1940 with the hiring of Jess Neely. Heisman not withstanding, the two coaches brought credibility and respectability to both the game and the coaching profession that was unsurpassed.

From 1930 through Neely's final win over Texas in 1965, Rice actually held the edge in the series, 18-17-1. In 1957, Darrell Royal took the Texas job, and he would go on to become the fourth member of the prestigious College Football Hall of Fame to coach in the series.

Royal was the winningest coach in Southwest Conference history. Neely finished tied for second in a career that spanned 26 years.

For years, the Rice-Texas game was the social event of the football season, and when the Owls opened their state-of-the-art stadium in the mid-1950s, it was usually packed with 70,000 folks for the meeting with Texas.

The series also took on an unusual quality. From 1954 until the Longhorns snapped the string with a victory in Houston in 1964 and Rice returned the favor by winning in Austin in 1965, the home team won. The only exception was a 14-14 tie in 1962, when a heavy underdog Rice team knocked Texas from its spot as the No. 1 team in the nation. Otherwise, Rice won in Houston, and Texas won in Austin.

But beginning with Neely's final season of 1966, Texas reeled off 28 straight victories until Rice ended the streak on a rainy Sunday night in Houston in 1994.

Presumably, the comparison of the Apollo program to the Rice-Texas football rivalry was due to the Owls' unlikely result against the Longhorns in 1962, which the article quoted above records as a 14-14 tie that knocked Texas from its No. 1 ranking. If you watch the entire video footage closely, you will notice that Kennedy, every bit the accomplished public speaker, gets the loudest applause just after his line: "Why does Rice play Texas?". On the video footage, watch closely for the cigar-smoking man just to Kennedy's right for a good example of the crowd's reaction to Kennedy's comparison of the Moon missions with the Rice-Texas football rivalry.

Posted by anoop at 04:09 PM

June 23, 2005

Do Philip K. Dick Androids Dream of Electric Sheep?

PKD-A-sculpture-4-14-05.jpg

WIRED Magazine's NextFest at the Navy Pier in Chicago, IL (Jun 24-26) will feature a Philip K. Dick android.

The robot will portray Dick in both form and intellect through an artificial-intelligence-driven personality. The hardware will manipulate Hanson's proprietary lifelike skin material to affect extremely realistic expressions with very low power. Cameras in the eyes will allow the robot to perceive people's identity and behavior through advanced machine vision and biometric-identification software. The robot will track faces, perceive facial expressions, and recognize people from the crowd (family, friends, celebrities, etc). The visual data will be fused with some of the best speech recognition software, advanced natural language processing, and speech synthesis in the world. All of this will run in sync with Hanson Robotics' highly expressive robot face to emulate a full human-conversational system.

According to the Hanson Robotics web page, the speech recognition and advanced NLP promised in this description are licensed from Multimodal Technologies and the text to speech system is licensed from Acapela. Multimodal is a company which seems to have strong ties with the CMU NLP research groups (Alex Waibel is on the Board of Directors). I wonder if the Philip K. Dick android has a GLR parser in there somewhere.

Linked from Boing Boing.

Posted by anoop at 01:02 PM

April 15, 2005

The Viterbi Algorithm

G. David Forney Jr. recently posted on arXiv an extremely interesting article on the history and development of the Viterbi algorithm: The Viterbi Algorithm: A Personal History.

The Viterbi algorithm originated as a decoding algorithm for convolutional codes, but its utility has since proved far more widespread. In particular, in speech recognition and computational linguistics, the Viterbi algorithm is used to decode the most likely state sequence of a Hidden Markov Model, and it is closely related to the Forward-Backward algorithm, which runs a similar recursion over the same trellis with sums in place of maximizations.

The interesting points from the above article:

  • Viterbi devised the algorithm to help him teach:
    the Viterbi algorithm for convolution codes ... came out of my teaching ... I found information theory difficult to teach, so I started developing some tools.
  • The Viterbi algorithm, when first published, was not known to be related to dynamic programming methods and also not known to provide the optimal or maximum likelihood solution. The original paper states that:
    this decoding algorithm is clearly suboptimal
    It was G. D. Forney, Jr. who later proved that the Viterbi algorithm was an exact recursive algorithm for the shortest path through a trellis diagram. The relationship to dynamic programming then became clear.
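The shortest-path view in the last point can be made concrete with a small sketch. This is a generic textbook-style Viterbi decoder for a Hidden Markov Model, not code from Forney's article; the function name, the toy states and all the probabilities are made up for illustration. Working with log probabilities turns the product along a path into a sum, so the best path is the "shortest" one through the trellis.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (log probability, state sequence) of the best path through the trellis."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: log probability of the best path that ends in state s at time t
    best = [{s: log(start_p[s]) + log(emit_p[s][observations[0]]) for s in states}]
    back = []  # back[t - 1][s]: predecessor of s on that best path
    for obs in observations[1:]:
        column, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log(trans_p[p][s]))
            column[s] = best[-1][prev] + log(trans_p[prev][s]) + log(emit_p[s][obs])
            pointers[s] = prev
        best.append(column)
        back.append(pointers)

    # Follow the back pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[-1][last], path

# Hypothetical two-state model, purely for illustration.
STATES = ["Healthy", "Fever"]
START = {"Healthy": 0.6, "Fever": 0.4}
TRANS = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
         "Fever": {"Healthy": 0.4, "Fever": 0.6}}
EMIT = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
        "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}

log_prob, path = viterbi(["normal", "cold", "dizzy"], STATES, START, TRANS, EMIT)
print(path, math.exp(log_prob))
```

Each column of the trellis is filled in from the previous one only, which is what makes the recursion exact and linear in the length of the observation sequence.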

The article also provides various places where the Viterbi algorithm has been used in practice, including the Galileo mission to Jupiter in 1992 (it was used to boost the transmission bandwidth when the primary antenna failed to deploy).

Of course, nowadays there are many applications of Viterbi in Computational Linguistics where it is used for many sequence learning tasks, from finding person names or gene names in text, to word segmentation in languages like Chinese, and in Biological Sequence Analysis where it is used to find exon or intron boundaries in DNA sequences.

The article also mentions various relationships between algorithms for "codes on graphs" and Pearl's belief propagation algorithm for Bayesian networks. The following paper is a good reference on this topic (this paper is cited in the above article, but was first pointed out to me by Hassan Ait-Kaci):

S. M. Aji and R. J. McEliece, "The generalized distributive law," IEEE Trans. Inform. Theory, vol. 46, pp. 325-343, Mar. 2000.

Posted by anoop at 02:05 PM

November 24, 2004

Sine

From Passage to China by Amartya Sen published in the New York Review of Books, Volume 51, Number 19, December 2, 2004. Note that the link to the full article will probably disappear into a paid archive eventually.

Footnote [4]

An interesting example of the transmission of mathematical ideas and terms can be seen in the origin of the trigonometric term "sine." In his Sanskrit mathematical treatise completed in 499 AD, Aryabhata used jya-ardha (Sanskrit for "chord half"), shortened later into jya, for what we now call "sine." Arab mathematicians in the eighth century transliterated the Sanskrit word jya into the proximate sound of jiba and then later changed it to jaib (with the same consonants as jiba), which is a good Arabic word, meaning a bay or a cove, and it was this word that was later translated by Gherardo of Cremona (circa 1150) into its equivalent Latin word for a bay or a cove, viz., sinus, from which the modern term "sine" is derived. See Howard Eves, An Introduction to the History of Mathematics, (Saunders, sixth edition, 1990), p. 237. Aryabhata's jya was translated into Chinese as ming and was used in such tables as yue jianliang ming, literally "sine of lunar intervals." See Jean-Claude Martzloff, A History of Chinese Mathematics (Springer, 1997), p. 100.

Posted by anoop at 03:31 PM

November 12, 2004

Francis Crick on Philosophy of Science

crick200.jpg

From V.S. Ramachandran's eulogy, The Astonishing Francis Crick, by way of Kitabkhana.

He had very little patience with orthodox philosophers. He felt they became too prematurely trapped in matters of terminology. I am reminded of a seminar on consciousness he gave at the Salk in the eighties. A philosopher—whose name politeness forbids me from mentioning—raised his hand and said "But Dr Crick … you are attempting to solve the so-called problem of consciousness yet you haven't even bothered to define it...can you clearly define what you are talking about?" Crick's reply: "My dear chap, there was never a time in the pre-DNA era when a lot of us biologists sat around the table and said 'Let us first clearly define life before we explore it'. We just went out there, forged ahead and found out what it was. It's no doubt good to have a rough idea of what one is talking about but matters of terminology are best left to philosophers who spend most of their time on such things. Indeed clear definitions often emerge from empirical research. We now no longer quibble over questions like is a virus really alive". Semantic hygiene, Crick felt, was largely a waste of time. ...

Posted by anoop at 03:26 PM

September 13, 2004

Graphviz Introduction

There is a good introduction to graphviz, the graph drawing tool from AT&T, available at the Linux Journal web site.

Graphviz provides a general tool to visualize objects that are otherwise hard to see. One example of how graphviz can be used is in visualizing a forest which is a compact representation of a whole bunch of trees. It is compact because it does not duplicate common sub-trees. The figure below is one such forest that stores four simple trees (click on the figure to get a larger view).

It is a somewhat unorthodox view of a forest because entire (sub)trees are shown at each node instead of just non-terminals, so the forest as shown has some duplicated nodes (e.g. the four original trees) but it looks prettier. The figure above was produced by running some simple Perl code that I hacked together to convert a set of trees into a forest and store it in a format that can be read by graphviz tools.
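For readers who want to try something similar, here is a minimal sketch of the idea in Python. It is not the Perl code mentioned above: the tree encoding (nested tuples), the function name `forest_to_dot` and the example trees are all assumptions made for illustration. Identical subtrees are emitted once and shared, which is what makes the forest compact, and each node is labelled only with its category or word rather than with an entire subtree.

```python
def forest_to_dot(trees):
    """Return DOT source for a set of trees, emitting each distinct subtree once.

    A tree is a word (str) or a tuple (label, child, child, ...).
    Caveat of this sketch: identical words are shared too, even when they
    occur at different positions in the sentence.
    """
    ids = {}  # subtree -> graphviz node id
    lines = ["digraph forest {"]

    def visit(tree):
        if tree in ids:              # subtree already drawn: reuse its node
            return ids[tree]
        node_id = f"n{len(ids)}"
        ids[tree] = node_id
        label = tree if isinstance(tree, str) else tree[0]
        label = label.replace('"', '\\"')
        lines.append(f'  {node_id} [label="{label}"];')
        if not isinstance(tree, str):
            for child in tree[1:]:
                lines.append(f"  {node_id} -> {visit(child)};")
        return node_id

    for tree in trees:
        visit(tree)
    lines.append("}")
    return "\n".join(lines)

# Two hypothetical analyses of the same words that share most of their subtrees.
TREE_A = ("S", ("NP", "I"),
          ("VP", ("V", "saw"), ("NP", "stars"), ("PP", "tonight")))
TREE_B = ("S", ("NP", "I"),
          ("VP", ("V", "saw"), ("NP", ("NP", "stars"), ("PP", "tonight"))))

dot_source = forest_to_dot([TREE_A, TREE_B])
print(dot_source)
```

Saving the printed text to a file and running `dot -Tpng` on it should produce a picture of the shared structure. The two trees have 21 nodes between them, but the forest needs only 13.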

For the original source code and binary distributions go to the Official GraphViz Web Site and the GraphViz Development Web Site. There is a very sophisticated native port to MacOSX.

There is a convenient Perl interface to graphviz available from CPAN and there is a C++ STL-style interface to graphviz that is part of the Boost library. There's even a MATLAB interface to graphviz.

Posted by anoop at 10:56 AM

June 30, 2004

Axis of Evil

A gem of geek humor stolen from a post on Ernie's 3D Pancakes:

From "Shape Fitting with Outliers" by Sariel Har-Peled and Yusu Wang, SIAM J. Computing 33(2): 269–285, 2004.

DEFINITION 3.2. A set of hyperplanes I is a δ-sheaf if there exists a vertical segment s of length δ such that all the hyperplanes in I stab s. The vertical segment s is the axis of I. [Footnote: We will refer to it as the axis of evil when appropriate.]

So what does the axis of evil look like?

This shows a δ-sheaf in RxR with pq as the axis of evil.
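In the plane the definition is easy to check directly. The sketch below is only an illustration of the definition, not the algorithm from the paper, and it assumes each line is given as a pair (a, b) meaning y = a*x + b; the function names are made up. At a fixed x, the shortest vertical segment stabbed by every line has length equal to the spread of the heights a*x + b, and that spread is a convex piecewise-linear function of x, so it is enough to test the x-coordinates where two lines cross.

```python
from itertools import combinations

def vertical_spread(lines, x):
    """Length of the shortest vertical segment at abscissa x that every line crosses."""
    heights = [a * x + b for a, b in lines]
    return max(heights) - min(heights)

def is_delta_sheaf(lines, delta):
    """True if some vertical segment of length delta is stabbed by all the lines.

    lines: non-empty list of (a, b) pairs, each meaning the line y = a*x + b.
    """
    # The spread is convex and piecewise linear in x, with breakpoints where two
    # lines cross, so its minimum is attained at a crossing (or anywhere at all
    # when every line is parallel, in which case x = 0 will do).
    candidates = [0.0]
    for (a1, b1), (a2, b2) in combinations(lines, 2):
        if a1 != a2:
            candidates.append((b2 - b1) / (a1 - a2))
    return min(vertical_spread(lines, x) for x in candidates) <= delta

# Hypothetical example: y = x, y = -x and y = 0.5.
LINES = [(1.0, 0.0), (-1.0, 0.0), (0.0, 0.5)]
print(is_delta_sheaf(LINES, 0.5), is_delta_sheaf(LINES, 0.4))
```

For these three lines the narrowest spot is at x = 0, where the heights are 0, 0 and 0.5, so the axis of evil is a vertical segment of length 0.5 there. Exact floating-point comparison is used for brevity; real data would want a tolerance.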

Posted by anoop at 01:37 PM

May 17, 2004

Parsing 'A Verbless Post'

This is getting a bit ridiculous, but here goes:

This is a follow-up to a previous post, Part-of-speech Tagging 'A Verbless Post', in which Geoff Pullum's post to Language Log was analyzed for parts of speech. This post uses Eugene Charniak's statistical parser (parser03) to produce a syntactic analysis of the contents (in the Penn Treebank notation).

The first thing to notice in the parser output is that the recall for humorous points scored is substantially reduced due to the fact that no verb to Thaler is produced:

(S1 (S (CC And) (PP (IN in) (NP (DT that) (NN case))) (, ,) (NP (NP (DT a) (NN word)) (PP (IN of) (NP (NP (NN gratitude)) (PP (TO to) (NP (NNP Thaler)))))) (VP (PRN (-LRB- -LCB-) (ADVP (RB otherwise)) (NP (DT an) (JJ unimportant) (NN screwball)) (-RRB- -RCB-))) (. .)))

However, overall the poor parser is strained by the lack of verbs more than the tagger seemed to be, mainly due to the added pressure of producing legitimate syntactic structures. Because verb phrases occur frequently in the training data, the parser produces structures with spurious VPs in some unfamiliar contexts:

(S1 (S (NP (IN Except)) (VP (VBZ ..)) (. .)))

and:

(VP (VBZ nouns) (NP (, ,) (NNS pronouns) ... )

Our experience in trying to parse the output of a statistical machine translation system on the NIST 02/03 data for Chinese to English translation led to similar issues of hallucinated verb phrases for some of the ungrammatical English sentences output by the system. This behaviour is documented in this paper (from HLT-NAACL, 2004).

Understanding the notation of these parse trees is likely to be more challenging for the layperson (I would hope). For the intrepid reader, a good start would be the Penn Treebank manuals.
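The bracketing itself is simple to read mechanically: every "(" is followed by a label (a constituent such as NP or VP, or a part-of-speech tag such as VBZ) and then by that node's children or its word. As a rough sketch, and assuming every bracket carries a label as in the parser output on this page, a few lines of Python are enough to load a tree and pull out the verb phrases. The function names here are made up for illustration.

```python
import re

def read_tree(text):
    """Parse one bracketed tree into nested lists: [label, child, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        node = [tokens[pos]]              # constituent label or part-of-speech tag
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(parse())
            else:
                node.append(tokens[pos])  # a word
                pos += 1
        pos += 1                          # step past the closing bracket
        return node

    return parse()

def find(tree, label):
    """Yield every subtree with the given label, outermost first."""
    if isinstance(tree, list):
        if tree[0] == label:
            yield tree
        for child in tree[1:]:
            yield from find(child, label)

def words(tree):
    """Return the words under a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [w for child in tree[1:] for w in words(child)]

tree = read_tree("(S1 (S (NP (IN Except)) (VP (VBZ ..)) (. .)))")
spurious = [words(vp) for vp in find(tree, "VP")]
print(spurious)
```

On the short example above this picks out the hallucinated verb phrase whose only "verb" is the string of dots.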

If you examine the full output of the Charniak parser on Geoff Pullum's post (shown below), there are some strange errors in punctuation, and the usual prepositional phrase (PP) and coordination (CC) attachment errors. But, overall, the performance is very good, especially for some useful constituents like noun phrases (NPs) or parentheticals (PRN).

(S1 (NP (DT A) (JJ verbless) (NN novel) (. ?))) (S1 (FRAG (WRB Why) (. ?) (. ?))) (S1 (NP (NP (WP What) (NN reason)) (PP (IN for) (NP (NP (DT the) (NN accomplishment)) (PP (IN by) (NP (NP (DT this) (JJ showy) (NN fool)) (PP (IN in) (NP (NP (NNP France)) (, ,) (NP (NNP Michel)))))))))) (S1 (FRAG (NP (NNP Thaler)) (, ,) (NP (NP (PRP$ his) (NN effort)) (PP (IN at) (NP (NP (DT an) (JJ entire) (NN novel)) (PP (IN with) (NP (DT no) (NNS verbs))))) (PRN (-LRB- -LCB-) (NP (RB perhaps) (RB not) (NP (DT a) (ADJP (JJ wise) (CC or) (JJ lucrative)) (NN publication) (NN venture)) (, ,) (VP (VBN given) (NP (NP (DT the) (RB not) (JJ total) (NN incorrectness)) (PP (IN of) (NP (PRP$ my) (NNS speculations)))))) (-RRB- -RCB-)) (ADJP (RB recently) (JJ evident))) (PP (IN amongst) (NP (NP (DT the) (JJ vast) (FW efflux)) (PP (IN of) (NP (NP (JJ absurd) (JJ literary) (NN pretense)) (PP (IN in) (NP (DT the) (JJ French) (NN language))))))) (. ?))) (S1 (FRAG (INTJ (UH Well)) (, ,) (SBAR (WHNP (WDT whatever)) (S (NP (PRP$ his) (NNS reasons)) (, ,) (PP (IN in) (NP (NN response))) (, ,) (NP (PRP$ my) (JJ own) (NN contribution)) (: :) (NP (NP (DT a) (JJ verbless) (NN post)) (-LRB- -LCB-) (NP (NP (DT the) (JJ first)) (PP (IN on) (NP (NN Language) (NN Log)))) (-RRB- -RCB-)))) (. .))) (S1 (S (NP (NP (DT No) (NNS verbs)) (PP (IN at) (NP (NP (DT all)) (PP (IN in) (NP (NP (DT this) (NN book)) (PP (IN of) (NP (NP (NNP Thaler) (POS 's)) (, ,) (ADVP (RB just))))))))) (VP (VBZ nouns) (NP (, ,) (NNS pronouns) (, ,) (NNS adjectives) (, ,) (NNS adverbs) (, ,) (NNS prepositions) (, ,) (NNS subordinators) (, ,) (NNS coordinators) (, ,) (CC and) (PRN (: --) (INTJ (UH oh) (. !)) (: --)) (NNS interjections))) (. 
.))) (S1 (S (NP (PDT All) (DT those)) (PP (IN among) (NP (DT the) (JJ permissible) (PRN (-LRB- -LCB-) (CC and) (PP (IN for) (NP (PRP him))) (, ,)) (NN past))) (VP (VBZ participles) (ADVP (RB too)) (, ,) (PP (IN though) (NP (NP (DT no) (JJ participial) (NNS intrusions)) (PP (IN in) (NP (DT this) (NN post))))) (, ,) (NP (NP (NP (PDT such) (DT the) (JJ extreme) (NN character)) (PP (IN of) (NP (PRP$ my) (ADJP (JJ cruel) (CC and) (JJ unreasonable)) (JJ self-applicable) (NNS strictures) (-RRB- -RCB-)))) (, ,) (CC but) (RB never) (NP (CD one) (JJ single) (JJ solitary) (NN verb)))) (. .))) (S1 (S (CC And) (, ,) (ADVP (RB fantastically)) (, ,) (NP (PDT all) (DT this)) (VP (NP (NP (NP (DT a) (NN vision)) (PP (IN of) (NP (NP (DT some) (NN liberation)) (PP (IN for) (NP (NNS authors)))))) (, ,) (RB not) (NP (NP (DT an) (JJ absurd) (JJ literary) (NN straitjacket)) (PP (IN with) (NP (DT the) (NN writer))))) (PRN (-LRB- -LCB-) (PP (IN albeit) (NP (RB willingly))) (-RRB- -RCB-)) (VP (VBN imprisoned) (PP (IN within) (NP (PRP it))))) (. .))) (S1 (NP (NP (DT Some) (NN freedom)) (, ,) (NP (DT this)) (. .))) (S1 (FRAG (NP (NNP Thaler)) (: :) (S (NP (NNS nuts) (, ,) (NNS bonkers) (, ,)) (VP (VBP round) (DT the) (VP (VB bend)))) (. .))) (S1 (NP (NP (JJ Mad)) (PP (IN as) (NP (DT a) (NNP March) (NN hare))) (. .))) (S1 (S (NP (DT The) (NNP Liberman) (NN conjecture)) (PRN (-LRB- -LCB-) (PP (IN about) (NP (NP (NN survival)) (PP (IN of) (NP (NP (JJ high) (NN school) (JJ literary) (NN experimentation)) (PP (IN into) (NP (NP (NN adulthood)) (PP (IN because) (IN of) (NP (DT a) (ADJP (JJ dysfunctional) (JJ authoritarian)) (JJ French) (JJ educational) (NN system))))))))) (-RRB- -RCB-)) (: :) (S (ADVP (RB probably)) (ADJP (JJ true))) (. .))) (S1 (NP (NP (PRP$ My) (NN attitude)) (: :) (NP (NP (NN contempt)) (, ,) (ADVP (RB really))) (. .))) (S1 (S (NP (IN Except)) (VP (VBZ ..)) (. .))) (S1 (FRAG (PP (IN Unless) (NP (CD ..))) (. 
.))) (S1 (S (ADVP (RB Just) (RB possibly)) (, ,) (NP (NP (DT an) (NN exercise)) (, ,) (PP (IN for) (NP (NP (DT the) (NNS undergraduates)) (PP (IN in) (NP (NP (PRP$ my) (NN course)) (PP (IN on) (NP (NNP English)))))))) (VP (NN grammar) (NP (DT this) (NN fall) (NN quarter))) (. .))) (S1 (NP (NP (DT An) (NN effort)) (PP (IN at) (NP (NP (NN construction)) (PP (IN of) (NP (NP (JJ fifty) (NNS words)) (PP (IN of) (NP (NP (JJ coherent) (NN prose)) (PP (IN with) (NP (NP (ADVP (RB never)) (DT a) (NN verb)) (, ,) (PP (IN with) (NP (NP (RB only) (DT those)) (PP (IN in) (NP (NP (NN possession)) (PP (IN of) (NP (NP (JJ enough) (JJ grammatical) (NN knowledge)) (PP (IN for) (NP (NP (JJ verb) (NN identification)) (ADJP (JJ capable) (PP (IN of) (NP (NN success)))))))))))))))))))) (. .))) (S1 (FRAG (ADJP (JJ Worth) (S (NP (DT a) (NN try)))) (, ,) (ADVP (RB perhaps)) (. .))) (S1 (S (CC And) (PP (IN in) (NP (DT that) (NN case))) (, ,) (NP (NP (DT a) (NN word)) (PP (IN of) (NP (NP (NN gratitude)) (PP (TO to) (NP (NNP Thaler)))))) (VP (PRN (-LRB- -LCB-) (ADVP (RB otherwise)) (NP (DT an) (JJ unimportant) (NN screwball)) (-RRB- -RCB-))) (. .))) (S1 (FRAG (NP (RB Always) (DT that) (JJ extra) (NN possibility)) (: :) (S (NP (DT the) (NN idea)) (VP (VBP justifiable) (PP (RB not) (PP (IN because) (IN of) (NP (PRP$ its) (NN implementation))) (, ,) (CC but) (PP (IN in) (NP (NP (NN virtue)) (PP (IN of) (NP (NP (DT a) (ADJP (JJ complementary) (CC or) (JJ counterposed)) (NN idea) (NN emergent)) (PP (IN in) (NP (NP (DT the) (NN mind)) (PP (IN of) (NP (NP (NN someone) (RB else)) (: --) (NP (NP (JJ serendipitous) (JJ bastard) (NN offspring)) (PP (IN of) (NP (DT a) (JJ deranged) (JJ cognitive) (NN parent))))))))))))))) (. .))) (S1 (FRAG (RB So) (NP (NP (PRP$ my) (NN gratitude)) (PP (TO to) (NP (PRP you)))) (, ,) (NP (NNP Thaler)) (, ,) (NP (PRP you) (JJ pusillanimous) (NN poseur)) (, ,) (NP (PRP you) (JJ literary) (NN clown)) (. .))) (S1 (NP (DT A) (JJ new) (NN idea) (. 
!))) (S1 (FRAG (NP (PRP$ My) (NN idea)) (, ,) (NP (NP (DT all) (NN mine)) (PRN (-LRB- -LCB-) (NP (NP (ADJP (JJ accessible) (PP (ADVP (RB here)) (IN on) (NP (NN Language)))) (NN Log)) (PP (TO to) (NP (QP (RB just) (DT a) (JJ few) (CD thousand)) (JJ close) (NNS friends)))) (-RRB- -RCB-))) (. .))) (S1 (FRAG (NP (NNP Ooh)) (, ,) (NP (CD one) (JJ other) (NN thought)) (, ,) (PP (IN for) (NP (JJ computational) (NNS linguists))) (: :) (SBAR (WHNP (WP What)) (S (VP (NNS bets) (PP (IN on) (NP (NP (DT the) (NN performance)) (PP (IN of) (NP (NP (JJ part-of-speech) (VBG tagging) (NNS algorithms)) (PP (IN on) (NP (NN prose))) (PP (JJ such) (IN as) (NP (DT this)))))))))) (. ?)))

Posted by anoop at 03:42 AM

May 12, 2004

Part-of-Speech Tagging 'A Verbless Post'

Geoffrey Pullum, in full TOPIC .. COMMENT form, has posted on the language log a reasoned critique of Michel Thaler's verbophobic novel, entitled A Verbless Post. Pullum's post, of course, contains no verbs, but more to the point for this posting, it ends with the following statement:

Ooh, one other thought, for computational linguists: What bets on the performance of part-of-speech tagging algorithms on prose such as this?

I reached for Adwait Ratnaparkhi's aging but conveniently handy Maximum Entropy part-of-speech tagger and ran it on Pullum's post.

The first thing to notice about the output is the depressing amount of tokenization it takes to make sure that spurious errors do not arise.

Errors? Of course, there are some, but not as many as one would expect. More importantly, the tagger puts its own label biased tongue in its cheek and creates a new verb, to Thaler:

And_CC in_IN that_DT case_NN ,_, a_DT word_NN of_IN gratitude_NN to_TO Thaler_VB -LCB-_-LRB- otherwise_RB an_DT unimportant_JJ screwball_NN -RCB-_-RRB- ._.
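If you want to hunt for the tagger's invented verbs yourself, a few lines of Python will do. This is only a sketch and assumes the word_TAG format shown above, with the tag following the last underscore in each token; read_tagged is a name I made up and is not part of Ratnaparkhi's tagger.

```python
def read_tagged(line):
    """Split 'word_TAG word_TAG ...' into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the LAST underscore, so a word that itself
        # contains an underscore would still be handled.
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs


line = ("And_CC in_IN that_DT case_NN ,_, a_DT word_NN of_IN gratitude_NN "
        "to_TO Thaler_VB -LCB-_-LRB- otherwise_RB an_DT unimportant_JJ "
        "screwball_NN -RCB-_-RRB- ._.")

# Every Penn Treebank verb tag starts with VB (VB, VBD, VBG, VBN, VBP, VBZ).
verbs = [(word, tag) for word, tag in read_tagged(line) if tag.startswith("VB")]
print(verbs)  # [('Thaler', 'VB')]
```

Since the post contains no verbs, anything this filter returns on the full output is a candidate tagging error.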

Here is the entire output of the tagger on Pullum's post:

A_DT verbless_JJ novel_NN ?_. Why_WRB ?_. ?_. What_WP reason_NN for_IN the_DT accomplishment_NN by_IN this_DT showy_NN fool_NN in_IN France_NNP ,_, Michel_NNP Thaler_NNP ,_, his_PRP$ effort_NN at_IN an_DT entire_JJ novel_NN with_IN no_DT verbs_NNS -LCB-_-LRB- perhaps_RB not_RB a_DT wise_JJ or_CC lucrative_JJ publication_NN venture_NN ,_, given_VBN the_DT not_RB total_JJ incorrectness_NN of_IN my_PRP$ speculations_NNS -RCB-_-RRB- recently_RB evident_JJ amongst_IN the_DT vast_JJ efflux_NN of_IN absurd_JJ literary_JJ pretense_NN in_IN the_DT French_JJ language_NN ?_. Well_UH ,_, whatever_WDT his_PRP$ reasons_NNS ,_, in_IN response_NN ,_, my_PRP$ own_JJ contribution_NN :_: a_DT verbless_JJ post_NN -LCB-_-LRB- the_DT first_JJ on_IN Language_NNP Log_NNP -RCB-_-RRB- ._. No_DT verbs_NNS at_IN all_DT in_IN this_DT book_NN of_IN Thaler_NNP 's_POS ,_, just_RB nouns_NNS ,_, pronouns_NNS ,_, adjectives_NNS ,_, adverbs_NNS ,_, prepositions_NNS ,_, subordinators_NNS ,_, coordinators_NNS ,_, and_CC --_: oh_UH !_. --_: interjections_NNS ._. All_PDT those_DT among_IN the_DT permissible_JJ -LCB-_-LRB- and_CC for_IN him_PRP ,_, past_JJ participles_NNS too_RB ,_, though_IN no_DT participial_JJ intrusions_NNS in_IN this_DT post_NN ,_, such_PDT the_DT extreme_JJ character_NN of_IN my_PRP$ cruel_NN and_CC unreasonable_JJ self-applicable_JJ strictures_NNS -RCB-_-RRB- ,_, but_CC never_RB one_CD single_JJ solitary_JJ verb_NN ._. And_CC ,_, fantastically_RB ,_, all_PDT this_DT a_DT vision_NN of_IN some_DT liberation_NN for_IN authors_NNS ,_, not_RB an_DT absurd_JJ literary_JJ straitjacket_NN with_IN the_DT writer_NN -LCB-_-LRB- albeit_IN willingly_RB -RCB-_-RRB- imprisoned_VBN within_IN it_PRP ._. Some_DT freedom_NN ,_, this_DT ._. Thaler_NNP :_: nuts_NNS ,_, bonkers_NNS ,_, round_VBP the_DT bend_NN ._. Mad_NNP as_IN a_DT March_NNP hare_NN ._. 
The_DT Liberman_NNP conjecture_NN -LCB-_-LRB- about_IN survival_NN of_IN high_JJ school_NN literary_JJ experimentation_NN into_IN adulthood_NN because_IN of_IN a_DT dysfunctional_JJ authoritarian_JJ French_JJ educational_JJ system_NN -RCB-_-RRB- :_: probably_RB true_JJ ._. My_PRP$ attitude_NN :_: contempt,_NN really_RB ._. Except_IN ..._: Unless_IN ..._: Just_RB possibly_RB ,_, an_DT exercise_NN ,_, for_IN the_DT undergraduates_NN in_IN my_PRP$ course_NN on_IN English_JJ grammar_NN this_DT fall_NN quarter_NN ._. An_DT effort_NN at_IN construction_NN of_IN fifty_JJ words_NNS of_IN coherent_JJ prose_NN with_IN never_RB a_DT verb_NN ,_, with_IN only_RB those_DT in_IN possession_NN of_IN enough_JJ grammatical_JJ knowledge_NN for_IN verb_NN identification_NN capable_JJ of_IN success_NN ._. Worth_JJ a_DT try_NN ,_, perhaps_RB ._. And_CC in_IN that_DT case_NN ,_, a_DT word_NN of_IN gratitude_NN to_TO Thaler_VB -LCB-_-LRB- otherwise_RB an_DT unimportant_JJ screwball_NN -RCB-_-RRB- ._. Always_RB that_DT extra_JJ possibility_NN :_: the_DT idea_NN justifiable_JJ not_RB because_IN of_IN its_PRP$ implementation_NN ,_, but_CC in_IN virtue_NN of_IN a_DT complementary_JJ or_CC counterposed_JJ idea_NN emergent_NN in_IN the_DT mind_NN of_IN someone_NN else_RB --_: serendipitous_JJ bastard_NN offspring_NN of_IN a_DT deranged_VBN cognitive_JJ parent_NN ._. So_IN my_PRP$ gratitude_NN to_TO you_PRP ,_, Thaler_NNP ,_, you_PRP pusillanimous_JJ poseur_NN ,_, you_PRP literary_JJ clown_NN ._. A_DT new_JJ idea_NN !_. My_PRP$ idea_NN ,_, all_DT mine_NN -LCB-_-LRB- accessible_JJ here_RB on_IN Language_NNP Log_NNP to_TO just_RB a_DT few_JJ thousand_CD close_JJ friends_NNS -RCB-_-RRB- ._. Ooh_NNP ,_, one_CD other_JJ thought_NN ,_, for_IN computational_JJ linguists_NNS :_: What_WP bets_VBZ on_IN the_DT performance_NN of_IN part-of-speech_JJ tagging_VBG algorithms_NNS on_IN prose_NN such_JJ as_IN this_DT ?_.

Here is a quick cheat sheet for those who have not yet memorized the Penn Treebank tagset (shame on you!):

CC     Coordinating conjunction
CD     Cardinal number
DT     Determiner
IN     Preposition or subordinating conjunction
JJ     Adjective
-LRB-  Left bracket
NN     Noun, singular
NNP    Proper noun, singular
NNS    Noun, plural
PDT    Predeterminer
POS    Possessive ending
PRP    Personal pronoun
PRP$   Possessive pronoun
RB     Adverb
-RRB-  Right bracket
TO     to
UH     Interjection
VB     Verb, base form
VBG    Verb, gerund or present participle
VBN    Verb, past participle
VBP    Verb, non-3rd person singular present
VBZ    Verb, 3rd person singular present
WDT    Wh-determiner
WP     Wh-pronoun
WRB    Wh-adverb

Posted by anoop at 10:41 PM

April 25, 2004

Open book to page 23 meme

From Boosting the margin: A New Explanation for the Effectiveness of Voting Methods by Robert E. Schapire, Yoav Freund, Peter Bartlett and Wee Sun Lee. The Annals of Statistics, 26(5), 1998.

To the best of our knowledge, similar iterative schemes for combining functions have not been studied for the log loss.

Instructions:

  1. Grab the nearest book.

  2. Open the book to page 23.

  3. Find the fifth sentence.

  4. Post the text of the sentence in your journal along with these instructions.

(Via fresh tracks which comes via join-the-dots via Dennis and Scott).

Note that I have subverted the meme: the original instructions say the source should be a book, and I have changed it to a journal paper. This has two motivations: first, the nearest object of my attention was this paper and not a book; and second, since the meme is now in a second generation, some mutation is expected to occur.

Posted by anoop at 10:05 AM

February 05, 2004

Y-chromosome diversity and Central Asia

A PNAS paper on Y-chromosome diversity in Central Asian populations includes the following tree:

It's filled with curious facts. For instance, check out the cluster that contains Sourashtran and Yadhava (both Indian populations) and the Tajik/Samarkhand or Arab/Bukhara populations.

This blog post by John McWhorter cites this paper in support of a particular theory of historical change from Avestan (Old Persian) to Modern Persian.

Posted by anoop at 06:01 PM

February 04, 2004

1421: The Year China Discovered America by Gavin Menzies

In case you are planning on picking up this book: 1421: The Year China Discovered America by Gavin Menzies, you might want to read Bill Poser's blog post discussing 1421, in particular about some of the linguistic facts and dubious research methods used in this book.

In case you are too lazy to click on the above link, I am quoting below from Bill Poser's post about 1421 the parts I found to be the most entertaining (maybe reading this will encourage you to read the entire post that is linked above):

The first linguistic point raised in the book (p. 104) concerns an inscription found in the Cape Verde islands off the West coast of Africa, which Menzies attributes to Zheng He. Unable to identify the writing system, he wonders whether it is an Indian writing system and faxes a query to the Bank of India, which informs him that it is Malayalam. Unfamiliar with Malayalam, he asks where it was spoken and whether it was in use in the 15th century. According to Menzies, the Bank of India responded as follows:

Yes, it had been in common use since the ninth century. It has largely ceased to be spoken today, though it is still used in a few outlying coastal districts on the Malabar coast.

In fact, Malayalam is spoken by over 35 million people. It doesn't seem likely that the Bank of India was unaware of the principal language of Kerala State, one of the national languages specified in Schedule Eight of the Constitution of India. Maybe they were pulling Menzies' leg, or maybe he just can't get his facts straight.

Assuming that there is an inscription in Malayalam in the Cape Verde Islands, what does this tell us about Zheng He's voyage? Is there evidence that it dates to the 1420s? Whenever it was made, isn't the most likely hypothesis that an Indian made it? The content of the inscription might shed light on this, but although much is made of the writing system, we never find out what it says!

...

Menzies continues:

There is also linguistic evidence of Chinese visits to South America. A sailing ship is chamban in Colombia, sampan in China; a raft, balsa in South America and palso in China; a log raft, jangada in Brazil, ziangada in Tamil.

We aren't told which of the 98 languages of Colombia, the 234 languages of Brazil, or the roughly 700 of South America as a whole, these words come from. In any case, isolated similarities like these are meaningless; it is easy to find a few words similar in sound and meaning in any two languages. At least two of the three examples here are wrong. You'd think that a Royal Navy man would know that a sampan is not a sailing ship; it is a small boat usually propelled by two oars. There is no Chinese word palso meaning "raft"; no Chinese syllable ends in /l/. And even if the pair of words for "log raft" are correct and their resemblance is not accidental, how would this prove contact between China and Brazil? Menzies is apparently assuming that the only way a Tamil word could get to Brazil is via Zheng He's fleet, and that it is likely that Brazilians would borrow a word for something with which they were no doubt already familiar from the tiny minority of Tamil speakers who might have accompanied the Chinese fleet.

...

Menzies gives further evidence of contact between China and the New World on p. 414:

Like the Waldseemüller chart, another map of Vancouver Island, called `colonie chinois' by its Venetian cartographer, Antonio Zatta, was published before Vancouver or Cook `discovered' the island. The Squamish Indians there have more than forty words in common with Chinese, including tsil (wet), also tsil in Chinese; chi (wood), which is chin in Chinese; and tsu (grandmother), which is etsu.

Menzies does not give the other 37 putatively similar words in Chinese and Squamish, nor does he cite sources for the Chinese and Squamish words. The fact that he is wrong about where the Squamish live (their territory is on the mainland of British Columbia, just north of the city of Vancouver, not on Vancouver Island) does not give confidence in his data. In any case, the examples that he does provide are dubious. Not one of the three words claimed to be Chinese is identifiable as Chinese.

Good stuff. For more read Bill's original post.

Posted by anoop at 04:31 PM

January 27, 2004

The Razor Wire Looking Glass by Greg Egan

Australian science-fiction author, Greg Egan, has taken time off from his fiction writing to investigate the procedure of immigration detention in his country.

His essay on the topic is called The Razor Wire Looking Glass.

One sentence in this essay was particularly intriguing:

There are institutionalised flaws in the system, such as the language tests routinely used for validating people's nationality that have been discredited by professional linguists.

I wonder what kind of language test can prove that one is from a particular country. Kafka (if he used speech reco) might imagine the following scenario. Perhaps they ask people to talk into ViaVoice and measure the word error rate: "Edit distance of 24? You must be from Bhutan."
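For anyone unfamiliar with the measure in the joke: word error rate is usually computed as the word-level edit distance between what was said and what the recognizer produced, divided by the number of words actually said. Here is a minimal sketch in Python of that standard calculation. The two sentences are made up for illustration, and none of this is how ViaVoice (or any immigration department) actually scores anything.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two lists of tokens."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference word
                            curr[j - 1] + 1,      # insert a hypothesis word
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Edit distance over words, normalized by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


# Made-up example: the recognizer drops one word.
said = "the cat sat on the mat"
heard = "the cat sat on mat"
print(edit_distance(said.split(), heard.split()))  # 1
print(word_error_rate(said, heard))
```

Note that the rate can exceed 1.0 when the recognizer inserts many extra words, which is one more reason it makes a poor passport.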

Posted by anoop at 01:01 PM

January 22, 2004

Search engine for linguists

Phil Resnik has created a Search engine for linguists. It allows the user to search for particular sentence structures or parse trees. The parses are generated by running Eugene Charniak's statistical parser.

The idea, if I understand it correctly, is to search for structural matches rather than matches on the words. So, for example, if the user was interested in the class of sentences typified by the sentence:

"John ate the meat raw"

Then using the Query page of the Linguist search engine the user could search for the following parse tree (plug your Penn Treebank notation memory module into your brain first):

(VP (VBD ate)(S NP (ADJP JJ)))

According to Philip, in the forum post explaining this query, the first 20 hits include:

Just because they eat it raw doesn't mean that they don't want it fresh. Partial decomposition would be a good alternative for seasoning and tenderising if you have to eat it raw. Eat them broiled, grilled or blackened. All the hypocrisy around me oh God don t let me fall,they might just eat me alive. Eat them smoked, pickled, or cooked. Then the baby Kangaroo's can eat them alive!
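To make the idea of a structural match concrete, here is a small sketch in Python. It is not how the Linguist's Search Engine works internally. I am assuming simplified semantics: a bracketed pattern node matches a tree node with the same label, its children must match children of that node in order (extra children are allowed in between), and a bare symbol such as NP or JJ matches any subtree with that label. Words are compared by exact string match here, whereas the hits above (eat, eat them) suggest the real engine is more forgiving. The two parses are hand-written by me for illustration and are not parser output.

```python
def read_tree(text):
    """Parse a bracketed tree or pattern into nested lists."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        node = [tokens[pos]]
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(parse())
            else:
                node.append(tokens[pos])
                pos += 1
        pos += 1
        return node

    return parse()


def matches(pattern, node):
    """True if `pattern` matches at `node` (simplified semantics)."""
    if isinstance(pattern, str):
        # A bare symbol matches a word, or any subtree with that label.
        return pattern == (node if isinstance(node, str) else node[0])
    if isinstance(node, str) or pattern[0] != node[0]:
        return False
    # Each pattern child must match a later child of the node, in order.
    remaining = iter(node[1:])
    return all(any(matches(p, child) for child in remaining)
               for p in pattern[1:])


def search(pattern, node):
    """Yield every subtree of `node` at which the pattern matches."""
    if matches(pattern, node):
        yield node
    if isinstance(node, list):
        for child in node[1:]:
            yield from search(pattern, child)


pattern = read_tree("(VP (VBD ate)(S NP (ADJP JJ)))")
hit = read_tree("(S (NP (NNP John)) (VP (VBD ate) "
                "(S (NP (DT the) (NN meat)) (ADJP (JJ raw)))))")
miss = read_tree("(S (NP (NNP John)) (VP (VBD ate) (NP (DT the) (NN meat))))")

print(len(list(search(pattern, hit))))   # 1
print(len(list(search(pattern, miss))))  # 0
```

The second sentence contains the same verb but no small clause under the VP, so a search on words alone would return it while the structural search does not.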

Posted by anoop at 01:24 PM

January 21, 2004

Syntactic Processing in a Nonhuman Primate

Two posts by Mark Liberman on the languagelog discuss the Jan 16 Science magazine article by Tecumseh Fitch and Marc Hauser entitled Computational Constraints on Syntactic Processing in a Nonhuman Primate, and a "Perspective" piece by David Premack entitled Is Language the Key to Human Intelligence?

The comments by Mark Liberman:

Posted by anoop at 04:33 PM

gloof, spooce, gloof twain, spooce, gairk

Very cool piece of work brought to my attention by Mark Liberman on the languagelog:

ShortTalk is a speech interface for composing text. Think of it as a "little" programming language that is speech-based and which you can freely intersperse in between normal English speech with some guarantee that the speech recognition algorithms will not freak out on you.

Here's a brief clipping from the webpage that shows how to use ShortTalk to add some space around a "+" sign:

Before

z = x+y|

After

z = x + y|

ShortTalk solution

gloof, spooce, gloof twain, spooce, gairk

Posted by anoop at 04:23 PM

October 24, 2003

Voice Recognition Software Yelled At

From the Onion, vol 39, issue 41 America's Finest News Source(TM) 22 October 2003.

NEW YORK -- Fidelity Financial Services' Gwen Watson, 33, shouted angrily at her IBM ViaVoice Pro USB voice-recognition software, sources close to the human-resources administrator reported Monday. "No, not Gary Friedman! Barry Friedman, you stupid computer. BARRY!" Watson was heard to scream from her cubicle. "Jesus Christ, I could've typed it in a hundredth of the time." After another minute of yelling, Watson was further incensed upon looking at her screen, which read, "Barely Freedman you God ram plucking pizza ship."

Hmm. When good language models go bad ...

Posted by anoop at 01:01 PM

October 21, 2003

Critical Opalescence

From Clockwork Science by Freeman J. Dyson, a review of Einstein's Clocks, Poincaré's Maps: Empires of Time by Peter Galison (published in the New York Review of Books, Vol 50, Num 17, Nov 6, 2003).

Galison uses the phrase "critical opalescence" to sum up the story of what happened in 1905 when relativity was discovered. Critical opalescence is a strikingly beautiful effect that is seen when water is heated to a temperature of 374 degrees Celsius under high pressure. 374 degrees is called the critical temperature of water. It is the temperature at which water turns continuously into steam without boiling. At the critical temperature and pressure, water and steam are indistinguishable. They are a single fluid, unable to make up its mind whether to be a gas or a liquid. In that critical state, the fluid is continually fluctuating between gas and liquid, and the fluctuations are seen visually as a multicolored sparkling. The sparkling is called opalescence because it is also seen in opal jewels which have a similar multicolored radiance.

Galison uses critical opalescence as a metaphor for the merging of technology, science, and philosophy that happened in the minds of Poincaré and Einstein in the spring of 1905. Poincaré and Einstein were immersed in the technical tools of time signaling, but the tools by themselves did not lead them to their discoveries. They were immersed in the mathematical ideas of electrodynamics, but the ideas by themselves did not lead them to their discoveries.

...

The one question that Galison's metaphor of critical opalescence does not answer is why Einstein discovered the theory of relativity as we know it and Poincaré did not. The theories discovered by Poincaré and Einstein were operationally equivalent, with identical experimental consequences, but there was one crucial difference. The difference was the use of the word "ether."

...

The essential difference between Poincaré and Einstein was that Poincaré was by temperament conservative and Einstein was by temperament revolutionary. When Poincaré looked for a new theory of electromagnetism, he tried to preserve as much as he could of the old. He loved the ether and continued to believe in it, even when his own theory showed that it was unobservable. His version of relativity theory was a patchwork quilt. The new idea of local time, depending on the motion of the observer, was patched onto the old framework of absolute space and time defined by a rigid and immovable ether. Einstein, on the other hand, saw the old framework as cumbersome and unnecessary and was delighted to be rid of it.

...

Looking back upon this history, I disagree with Galison's conclusion. I do not see critical opalescence as a decisive factor in Einstein's victory. I see Poincaré and Einstein equal in their grasp of contemporary technology, equal in their love of philosophical speculation, unequal only in their receptiveness to new ideas. Ideas were the decisive factor. Einstein made the big jump into the world of relativity because he was eager to throw out old ideas and bring in new ones. Poincaré hesitated on the brink and never made the big jump. In this instance at least, Kuhn was right. The scientific revolution of 1905 was driven by ideas and not by tools.

Posted by anoop at 11:02 AM

October 02, 2003

Famous researchers and "Work at google" ads

Not so long ago, if you searched for "machine learning" or "computational linguistics" in google, you would get an ad (so-called Sponsored Link) from google: the "Work at google" ad.

Not so well known was that due to the inherent (or explicit) clustering over queries, you would also get this ad if you searched for a particular name. The names were usually of famous computational linguists or machine learning people. For example, if you searched for fernando pereira you would get an ad asking whether you would like to "Work at google". Unfortunately, Fernando's name no longer triggers the ad.

But some other names still do. The following is only a partial list of names that trigger "Work at google" ads:

Work at google Ads

  • dekai wu
  • andrew mccallum
  • yoav freund
  • daniel marcu
  • vladimir vapnik
  • soumen chakrabarti
  • david yarowsky

A small variation is a more specific ad which targets NLP searchers:

Work on NLP at google (with the blurb "google is hiring experts in statistical natural language processing")

  • aravind joshi
  • fred jelinek
  • robert schapire
  • dan jurafsky
  • steven abney
  • stuart shieber

What is equally surprising is that other names that you might think of as being in this class do not trigger the same ad. So what is the key that distinguishes these people from other, arguably just-as-famous researchers?

Posted by anoop at 02:42 PM