NAACL HLT 2009: Student Research Workshop: List of Accepted Papers
Accepted for oral presentation
- Zheng Chen and Heng Ji:
Language Specific Issue and Feature Exploration in Chinese Event Extraction
We present an Information Extraction (IE) system that combines traditional IE techniques and cross-document event ranking. Our final goal is to provide the user with ranked events in response to a query. The first step is to set up a trainable event extraction engine to glean all possible events from multiple documents. The second step is to construct our data warehouse with refined events. The third step is to build an IR-like engine to produce ranked events in response to a user's query. As the first step, we developed a multilingual event extraction engine using a modularized approach. In this paper, we focus on Chinese event extraction. We point out a language-specific issue in Chinese trigger labeling, and then discuss the contributions of lexical, syntactic and semantic features applied in the trigger labeling and argument labeling tasks. As a result, we achieved high performance comparable to state-of-the-art English event extraction.
- Jian Huang:
Solving the Who's Mark Johnson Puzzle: Information Extraction Based Cross Document Coreference
Cross Document Coreference (CDC) is the problem of resolving the underlying identity of entities across multiple documents and is a major step for document understanding. We develop a framework to efficiently determine the identity of a person based on extracted information, which includes unary properties such as gender and title, as well as binary relationships with other named entities such as co-occurrence and geo-locations. At the heart of our approach is a suite of specialists for matching relationships and a relational density-based clustering algorithm that delineates name clusters based on the pairwise similarities. We demonstrate the effectiveness of our methods on the WEPS benchmark datasets and point out future research directions.
- Karthik Gali and Sriram Venkatapathy:
Sentence Realisation from Bag of Words with Dependency Constraints
In this paper, we present five models for sentence realisation from a bag-of-words containing minimal syntactic information. Sentence realisation has a large variety of applications ranging from Machine Translation to Dialogue systems. Our models employ simple and efficient techniques based on n-gram language modeling. We evaluated the models by comparing the synthesized sentences with reference sentences using the standard BLEU metric. We obtained higher results (BLEU score of 0.8125) than the state-of-the-art results. In the future, we plan to incorporate our sentence realiser in Machine Translation and observe its effect on translation accuracy.
- Adriane Boyd:
Pronunciation Modeling in Spelling Correction for Writers of English as a Foreign Language
We propose a method for modeling pronunciation variation in the context of spell checking for non-native writers of English. Spell checkers, typically developed for native speakers, fail to address many of the types of spelling errors peculiar to non-native speakers, especially those errors influenced by differences in phonology. Our model of pronunciation variation is used to extend a pronouncing dictionary for use in the spelling correction algorithm developed by Toutanova and Moore (2002), which includes models for both orthography and pronunciation. The pronunciation variation modeling is shown to improve performance for misspellings produced by Japanese writers of English.
- Ting Qian, Benjamin Van Durme and Lenhart Schubert:
Building a Semantic Lexicon of English Nouns via Bootstrapping
We describe the use of a weakly supervised bootstrapping algorithm in discovering contrasting semantic categories from a source lexicon with little training data. Our method primarily exploits the patterns in sentential contexts where different categories of words may appear. Experimental results are presented showing that such automatically categorized terms tend to agree with human judgements.
- Dmitriy Dligach and Martha Palmer:
Using Language Modeling to Select Useful Annotation Data
An annotation project typically has an abundant supply of unlabeled data that can be drawn from some corpus, but because the labeling process is expensive, it is helpful to pre-screen the pool of candidate instances based on some criterion of future usefulness. In many cases, that criterion is to improve the presence of the rare classes in the data to be annotated. We propose a novel method for solving this problem and show that it compares favorably to a random sampling baseline and a clustering algorithm.
- Manuel Kirschner and Raffaella Bernardi:
Exploring Topic Continuation Follow-up Questions using Machine Learning
Some of the follow-up questions that an Interactive Question Answering (IQA) system receives are not topic shifts, but rather continuations of the previous topic. In this paper, we propose an empirical framework to explore such questions, with two related goals in mind: (1) modeling the different relations that hold between the answer to a follow-up question and the previous dialogue, and (2) showing how this model can be used to identify the correct answer among several answer candidates. For both cases, we use Logistic Regression Models that we learn from real IQA data collected through a live system. We show that by adding features based on domain-specific actions that represent questions and answers, we obtain important additional predictors for the model, and improve the accuracy with which our system finds correct answers.
- Aditya Bhargava and Grzegorz Kondrak:
Multiple Word Alignment with Profile Hidden Markov Models
Profile hidden Markov models (Profile HMMs) are specific types of hidden Markov models used in biological sequence analysis. We propose the use of Profile HMMs for word-related tasks. We test their applicability to the task of multiple cognate alignment and cognate set matching, and find that they work well in general for both tasks. On the latter task, the Profile HMM method outperforms minimum and average edit distance. Given the success for these two tasks, we further discuss the potential applications of Profile HMMs to any task where consideration of a set of words is necessary.
- Smita Vemulapalli, Xiaoqiang Luo, John F. Pitrelli and Imed Zitouni:
Classifier Combination Techniques for Coreference Resolution: Bagging and Boosting
This paper explores the use of bagging and boosting as combination approaches for coreference resolution. To the best of our knowledge, this is the first effort that examines and evaluates the applicability of such techniques to coreference resolution. In particular, we (1) outline a scheme for adapting traditional bagging and boosting techniques to address issues, like entity alignment, that are specific to coreference resolution, (2) provide experimental evidence which indicates that the accuracy of the coreference engine can potentially be increased by use of bagging and boosting methods, without any additional features or training data, and (3) implement and evaluate combination techniques at the mention, entity and document level.
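As a rough illustration of the bagging idea named in the abstract above, the following minimal sketch trains several copies of a simple classifier on bootstrap resamples of the training data and combines them by majority vote. It is not the authors' coreference system: the one-dimensional threshold classifier, the function names (`train_stump`, `bagging_fit`, `bagging_predict`) and the toy data are all assumptions made for illustration, and the coreference-specific entity alignment step mentioned in the abstract is not modeled.

```python
import random
from collections import Counter


def train_stump(sample):
    """Fit a 1-D threshold classifier that predicts 1 when x >= threshold.

    Hypothetical base learner: it tries every observed x as a threshold
    and keeps the one with the fewest training errors.
    """
    xs = sorted({x for x, _ in sample})
    best_threshold, best_errors = xs[0], len(sample) + 1
    for t in xs:
        errors = sum((x >= t) != bool(y) for x, y in sample)
        if errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold


def bagging_fit(data, n_models=25, seed=0):
    """Train n_models stumps, each on a bootstrap resample of data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap: draw len(data) examples with replacement.
        sample = [rng.choice(data) for _ in data]
        models.append(train_stump(sample))
    return models


def bagging_predict(models, x):
    """Combine the individual predictions by majority vote."""
    votes = Counter(int(x >= t) for t in models)
    return votes.most_common(1)[0][0]


# Toy data (illustrative only): label 1 for larger values of x.
toy_data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.4, 0),
            (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1)]
ensemble = bagging_fit(toy_data)
print(bagging_predict(ensemble, 0.05), bagging_predict(ensemble, 0.95))
```

An odd number of models is used so that a two-class vote cannot tie.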
Accepted as posters
- Mahdy Khayyamian, Seyed Abolghasem Mirroshandel and Hassan Abolhassani:
Syntactic Tree-based Relation Extraction Using a Generalization of Collins and Duffy Convolution Tree Kernel
Relation extraction is a challenging task in natural language processing. Syntactic features have recently been shown to be quite effective for relation extraction. In this paper, we generalize the state-of-the-art syntactic convolution tree kernel introduced by Collins and Duffy. The proposed generalized kernel is more flexible and customizable, and can be conveniently used for systematic generation of more effective application-specific syntactic sub-kernels. Using the generalized kernel, we also propose a number of novel syntactic sub-kernels for relation extraction. These kernels show a remarkable performance improvement over the original Collins and Duffy kernel in the extraction of ACE-2005 relation types.
- Shilpa Arora:
Learning Opinions Interactively
In this work, we demonstrate that higher-order features are important for learning an opinion classifier. We capture the hidden language structure by using features selected from an annotation graph constructed from prior annotations. However, with an increase in the complexity of the feature space, feature selection techniques are less effective in recognizing the most relevant features. We therefore use additional information in the form of highlighted rationales to prune the feature space. We find that using higher-order features, with input from the user on the relevant spans as the user's rationales, helps boost performance by identifying the right set of features. With rationales, the same performance can be achieved with a smaller amount of annotated data.
- Taraka Rama, Anil Kumar Singh and Sudheer Kolachina:
Modeling Letter to Phoneme Conversion as a Phrase Based Statistical Machine Translation Problem with Minimum Error Rate Training
Letter-to-phoneme (L2P) conversion plays an important role in several applications. It can be a difficult task because the mapping from letters to phonemes can be many-to-many. We present a language-independent letter-to-phoneme conversion approach which is based on popular phrase-based Statistical Machine Translation techniques. The results of our experiments clearly demonstrate that such techniques can be used effectively for letter-to-phoneme conversion. Our results show an overall improvement of 5.8% over the baseline and are comparable to the state of the art. We also propose a measure to estimate the difficulty level of the L2P task for a language.
- Jaime Acosta:
Using Emotion to Gain Rapport in a Spoken Dialog System
This paper describes research that focuses on creating more effective spoken dialog systems. Specifically, this research focuses on rapport, exploiting emotional intelligence and more human-like responsive behaviors in voice. Based on a Persuasive Graduate Coordinator Corpus, emotions and their acoustic correlates will be extracted and used to implement an emotionally intelligent dialog system. Finally, this system will be comparatively evaluated using different configurations (e.g., without rapport-gaining features) through a user study.
- Kedar Bellare, Koby Crammer and Dayne Freitag:
Loss-Sensitive Discriminative Training of Machine Transliteration Models
In machine transliteration, we transcribe a token from one language to another while maintaining the phonetic information. In this paper, we present a novel sequence transduction algorithm for the problem of machine transliteration. Our model is discriminatively trained by the MIRA algorithm, which improves on traditional Perceptron training in three ways: (1) It allows us to consider k-best transliterations instead of the best one. (2) It is trained based on the ranking of these transliterations according to a user-specified loss function (Levenshtein edit distance). (3) It enables the user to tune a built-in parameter to cope with noisy non-separable data during training. On an Arabic-English name transliteration task, our model achieves better accuracy than a perceptron-based model with similar features, and a statistical machine translation model with more complex features.
- Elena Lloret, Alexandra Balahur, Manuel Palomar and Andrés Montoyo:
Towards Building a Competitive Opinion Summarization System: Challenges and Keys
This paper presents an overview of our participation in the TAC 2008 Opinion Pilot Summarization task, as well as the proposed and evaluated post-competition improvements. We first describe our opinion summarization system and the results obtained in the competition. We then identify the system's weak points and suggest several improvements, focused on both information content and linguistic and readability aspects. We obtain encouraging results, especially as far as F-measure is concerned, outperforming the competition results by approximately 80%.
- Nicole Novielli and Carlo Strapparava:
Towards Unsupervised Recognition of Dialogue Acts
When engaged in dialogues, people perform communicative actions to pursue specific communicative goals. Speech act recognition has long attracted attention in computational linguistics and could considerably impact a wide variety of application domains. We study the task of automatically labeling dialogues with the proper dialogue acts, relying on empirical methods and simply exploiting the lexical semantics of the utterances. In particular, we present some experiments in supervised and unsupervised frameworks on both an English and an Italian corpus of dialogue transcriptions. The evaluation displays encouraging results in both languages, especially in the unsupervised version of the methodology.
- Thade Nahnsen:
Domain-Independent Shallow Sentence Ordering
We present a shallow approach to the sentence ordering problem. The employed features are based on discourse entities, shallow syntactic analysis, and temporal precedence relations retrieved from VerbOcean. We show that these relatively simple features perform well in a machine learning algorithm on datasets containing sequences of events, and that the resulting models achieve optimal performance with small amounts of training data. The model does not yet perform well on datasets describing the consequences of events, such as the destruction after an earthquake.
- Stephen Tratz and Dirk Hovy:
Learning Preposition Disambiguation
In this paper, we present findings on preposition sense disambiguation using machine learning techniques. We make use of data from a SemEval task of 2007 and compare our findings to those of the systems competing in the workshop's competition. We extracted linguistically motivated features from the phrases surrounding the preposition. Testing with five different classifiers, we report an increased accuracy that outperforms the best system in the SemEval task.
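The Bellare, Crammer and Freitag abstract above names Levenshtein edit distance as the loss used to rank k-best transliterations. As a minimal sketch of that one ingredient only, the code below computes the edit distance and sorts a candidate list by its loss against a reference. The function names (`levenshtein`, `rank_by_loss`) and the example strings are assumptions for illustration; the MIRA update and the transduction model itself are not shown.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def rank_by_loss(reference, candidates):
    """Order candidates by edit distance to the reference (ties alphabetical)."""
    return sorted(candidates, key=lambda c: (levenshtein(c, reference), c))


# Hypothetical k-best list for one reference transliteration.
k_best = ["muhammad", "mohamed", "mohammed"]
print(rank_by_loss("mohammed", k_best))
```

In a loss-sensitive trainer, a ranking like this would tell the learner which candidates deserve a larger margin from the reference; how that margin enters the update is specific to the training algorithm and is not modeled here.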