May 12, 2004

Part-of-Speech Tagging 'A Verbless Post'

Geoffrey Pullum, in full TOPIC .. COMMENT form, has posted on Language Log a reasoned critique, entitled A Verbless Post, of Michel Thaler's verbophobic novel. Pullum's post, of course, contains no verbs, but more to the point for this posting, it ends with the following statement:

Ooh, one other thought, for computational linguists: What bets on the performance of part-of-speech tagging algorithms on prose such as this?

I reached for Adwait Ratnaparkhi's aging but conveniently handy Maximum Entropy part-of-speech tagger and ran it on Pullum's post.

The first thing to notice about the output is the depressing amount of tokenization it takes to make sure that spurious errors do not arise.
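To give a sense of what that tokenization involves, here is a minimal sketch in Python. It is not the preprocessing actually used for this post; the `tokenize` function and its regular expression are hypothetical, and they only assume the usual Penn Treebank-style conventions of splitting punctuation and the possessive 's away from words while keeping hyphenated words whole.

```python
import re

# Hypothetical tokenizer sketch: NOT the preprocessing used for this post.
# Alternatives are tried left to right, so multi-character tokens come first.
TOKEN_RE = re.compile(r"""
    \.\.\.            # an ellipsis is kept as one token
  | --                # a double hyphen used as a dash
  | 's\b              # the possessive clitic, split from its noun
  | \w+(?:-\w+)*      # words, including hyphenated ones
  | [^\w\s]           # any other single punctuation mark
""", re.VERBOSE)

def tokenize(text):
    """Return the list of tokens found in text, dropping whitespace."""
    return TOKEN_RE.findall(text)

if __name__ == "__main__":
    # Print space-separated tokens, one sentence per line.
    print(" ".join(tokenize("My attitude: contempt, really.")))
```

A tagger of this kind treats whatever sits between two spaces as one word, so a comma left attached to a word becomes part of that word as far as the tagger is concerned.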

Errors? Of course, there are some, but not as many as one would expect. More importantly, the tagger puts its own label-biased tongue in its cheek and creates a new verb, to Thaler:

And_CC in_IN that_DT case_NN ,_, a_DT word_NN of_IN gratitude_NN to_TO Thaler_VB -LCB-_-LRB- otherwise_RB an_DT unimportant_JJ screwball_NN -RCB-_-RRB- ._.

Here is the entire output of the tagger on Pullum's post:

A_DT verbless_JJ novel_NN ?_. Why_WRB ?_. ?_. What_WP reason_NN for_IN the_DT accomplishment_NN by_IN this_DT showy_NN fool_NN in_IN France_NNP ,_, Michel_NNP Thaler_NNP ,_, his_PRP$ effort_NN at_IN an_DT entire_JJ novel_NN with_IN no_DT verbs_NNS -LCB-_-LRB- perhaps_RB not_RB a_DT wise_JJ or_CC lucrative_JJ publication_NN venture_NN ,_, given_VBN the_DT not_RB total_JJ incorrectness_NN of_IN my_PRP$ speculations_NNS -RCB-_-RRB- recently_RB evident_JJ amongst_IN the_DT vast_JJ efflux_NN of_IN absurd_JJ literary_JJ pretense_NN in_IN the_DT French_JJ language_NN ?_. Well_UH ,_, whatever_WDT his_PRP$ reasons_NNS ,_, in_IN response_NN ,_, my_PRP$ own_JJ contribution_NN :_: a_DT verbless_JJ post_NN -LCB-_-LRB- the_DT first_JJ on_IN Language_NNP Log_NNP -RCB-_-RRB- ._. No_DT verbs_NNS at_IN all_DT in_IN this_DT book_NN of_IN Thaler_NNP 's_POS ,_, just_RB nouns_NNS ,_, pronouns_NNS ,_, adjectives_NNS ,_, adverbs_NNS ,_, prepositions_NNS ,_, subordinators_NNS ,_, coordinators_NNS ,_, and_CC --_: oh_UH !_. --_: interjections_NNS ._. All_PDT those_DT among_IN the_DT permissible_JJ -LCB-_-LRB- and_CC for_IN him_PRP ,_, past_JJ participles_NNS too_RB ,_, though_IN no_DT participial_JJ intrusions_NNS in_IN this_DT post_NN ,_, such_PDT the_DT extreme_JJ character_NN of_IN my_PRP$ cruel_NN and_CC unreasonable_JJ self-applicable_JJ strictures_NNS -RCB-_-RRB- ,_, but_CC never_RB one_CD single_JJ solitary_JJ verb_NN ._. And_CC ,_, fantastically_RB ,_, all_PDT this_DT a_DT vision_NN of_IN some_DT liberation_NN for_IN authors_NNS ,_, not_RB an_DT absurd_JJ literary_JJ straitjacket_NN with_IN the_DT writer_NN -LCB-_-LRB- albeit_IN willingly_RB -RCB-_-RRB- imprisoned_VBN within_IN it_PRP ._. Some_DT freedom_NN ,_, this_DT ._. Thaler_NNP :_: nuts_NNS ,_, bonkers_NNS ,_, round_VBP the_DT bend_NN ._. Mad_NNP as_IN a_DT March_NNP hare_NN ._. 
The_DT Liberman_NNP conjecture_NN -LCB-_-LRB- about_IN survival_NN of_IN high_JJ school_NN literary_JJ experimentation_NN into_IN adulthood_NN because_IN of_IN a_DT dysfunctional_JJ authoritarian_JJ French_JJ educational_JJ system_NN -RCB-_-RRB- :_: probably_RB true_JJ ._. My_PRP$ attitude_NN :_: contempt,_NN really_RB ._. Except_IN ..._: Unless_IN ..._: Just_RB possibly_RB ,_, an_DT exercise_NN ,_, for_IN the_DT undergraduates_NN in_IN my_PRP$ course_NN on_IN English_JJ grammar_NN this_DT fall_NN quarter_NN ._. An_DT effort_NN at_IN construction_NN of_IN fifty_JJ words_NNS of_IN coherent_JJ prose_NN with_IN never_RB a_DT verb_NN ,_, with_IN only_RB those_DT in_IN possession_NN of_IN enough_JJ grammatical_JJ knowledge_NN for_IN verb_NN identification_NN capable_JJ of_IN success_NN ._. Worth_JJ a_DT try_NN ,_, perhaps_RB ._. And_CC in_IN that_DT case_NN ,_, a_DT word_NN of_IN gratitude_NN to_TO Thaler_VB -LCB-_-LRB- otherwise_RB an_DT unimportant_JJ screwball_NN -RCB-_-RRB- ._. Always_RB that_DT extra_JJ possibility_NN :_: the_DT idea_NN justifiable_JJ not_RB because_IN of_IN its_PRP$ implementation_NN ,_, but_CC in_IN virtue_NN of_IN a_DT complementary_JJ or_CC counterposed_JJ idea_NN emergent_NN in_IN the_DT mind_NN of_IN someone_NN else_RB --_: serendipitous_JJ bastard_NN offspring_NN of_IN a_DT deranged_VBN cognitive_JJ parent_NN ._. So_IN my_PRP$ gratitude_NN to_TO you_PRP ,_, Thaler_NNP ,_, you_PRP pusillanimous_JJ poseur_NN ,_, you_PRP literary_JJ clown_NN ._. A_DT new_JJ idea_NN !_. My_PRP$ idea_NN ,_, all_DT mine_NN -LCB-_-LRB- accessible_JJ here_RB on_IN Language_NNP Log_NNP to_TO just_RB a_DT few_JJ thousand_CD close_JJ friends_NNS -RCB-_-RRB- ._. Ooh_NNP ,_, one_CD other_JJ thought_NN ,_, for_IN computational_JJ linguists_NNS :_: What_WP bets_VBZ on_IN the_DT performance_NN of_IN part-of-speech_JJ tagging_VBG algorithms_NNS on_IN prose_NN such_JJ as_IN this_DT ?_.
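For anyone who wants to hunt for the "verbs" without reading every token, the word_TAG format above is easy to pick apart. The following is a small sketch, assuming the output is whitespace-separated and that the last underscore in each item separates the word from its tag; the function names `parse_tagged` and `verb_tagged` are made up for illustration.

```python
def parse_tagged(text):
    """Split whitespace-separated word_TAG items into (word, tag) pairs."""
    pairs = []
    for item in text.split():
        # Split on the LAST underscore so that tokens such as
        # -LCB-_-LRB- or ,_, come apart correctly.
        word, _, tag = item.rpartition("_")
        pairs.append((word, tag))
    return pairs

def verb_tagged(pairs):
    """Keep only the pairs whose tag is one of the Penn Treebank VB* tags."""
    return [(word, tag) for word, tag in pairs if tag.startswith("VB")]

if __name__ == "__main__":
    sample = ("a_DT word_NN of_IN gratitude_NN to_TO Thaler_VB "
              "-LCB-_-LRB- otherwise_RB an_DT unimportant_JJ screwball_NN")
    for word, tag in verb_tagged(parse_tagged(sample)):
        print(word, tag)
```

Run over the full output, this lists every token the tagger believes to be a verb, participles included.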

Here is a quick cheat sheet for those who have not yet memorized the Penn Treebank tagset (shame on you!):

CC     Coordinating conjunction
CD     Cardinal number
DT     Determiner
IN     Preposition or subordinating conjunction
JJ     Adjective
-LRB-  Left bracket
NN     Noun, singular
NNP    Proper noun, singular
NNS    Noun, plural
PDT    Predeterminer
POS    Possessive ending
PRP    Personal pronoun
PRP$   Possessive pronoun
RB     Adverb
-RRB-  Right bracket
TO     to
UH     Interjection
VB     Verb, base form
VBG    Verb, gerund or present participle
VBN    Verb, past participle
VBP    Verb, non-3rd person singular present
VBZ    Verb, 3rd person singular present
WDT    Wh-determiner
WP     Wh-pronoun
WRB    Wh-adverb

