## Announcements

- Style files for the project proposal and final project write-up: LaTeX style file, sample LaTeX file, bibliography style file, and the PDF file created from acl08.tex containing the instructions.
- Grading for the course:
  - Scribing and Class Participation: 30%
  - Project proposal writing and reviewing: 30%
  - Final paper and project: 40%
- Important Dates:
  - Mon, Jan 7: First day of class
  - Mar 19: Proposal for projects due date
  - Mar 28: Proposal review due date
  - Apr 19: Final project paper and implementation due date
  - Mon, Apr 7: Last day of class

## Assignments

**Note on assignments**: All homeworks are optional, so there is no deadline.
However, doing the homeworks will probably help you substantially in your project work
and in understanding the course material.

All materials for the homeworks will be available from `~anoop/cmpt825`.

## Scribes

- Scribe #1: Ajeet Grewal
- Scribe #2: Anton Venema
- Scribe #3: Milan Tofiloski
- Scribe #4: Javad Safaei
- Scribe #5: Mohsen Jamali
- Scribe #6: Steve Fagan
- Scribe #7: Winona Wu
- Scribe #8: Sankaran Baskaran
- Scribe #9: Mohammad Norouzi
- Scribe #10: Chris Nell
- Scribe #11: Louisa Harutyunyan

Scribes will take the lead in presenting the papers we are reading that week in the Wed/Fri classes. On Mondays, I will present an introductory class on the topic for that week. The discussion can be led using the blackboard or, in some cases (if the example is too long to draw on the board), you can use PowerPoint slides or equivalent. Please let me know if you will need the digital projector for any class.

Scribing instructions: you **must** use LaTeX to create your
scribe document. Use `scribe.sty` as the LaTeX style file. A sample
scribe document, `scribe_sample.tex`, is provided as an example.
On any of the CS/FAS Unix/Linux machines, use the command
`pdflatex scribe_sample.tex` to produce `scribe_sample.pdf`.

Scribe deadline: the scribe notes must be submitted by Wed of the next week following the week being scribed. This will allow discussion of the scribed notes in the Fri class.

## Syllabus and Readings

We will cover the following topics in this course. The weekly readings are listed below.

### Topics

- Text Mining
- Machine Translation

## Weekly Schedule and Readings

- Automata models of language: Finite-state transducers
- Slides #1
- Finite-state transducer toolkits:
- Readings: Jan 9
- Mehryar Mohri. Finite-State Transducers in Language and Speech Processing. Computational Linguistics, 23:2, 1997.
- Mehryar Mohri. Weighted Finite-State Transducer Algorithms: An Overview. In Carlos Martin-Vide, Victor Mitrana, and Gheorghe Paun, editors, Formal Languages and Applications. volume 148, VIII, 620 p., pages 551-564. Springer, Berlin, 2004.

- Readings: Jan 11
- NLP applications for FSTs (from Mohri, 1997)
- Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. The Design Principles of a Weighted Finite-State Transducer Library. Theoretical Computer Science, 231:17-32, January 2000.
- Cyril Allauzen, Michael Riley, Johan Schalkwyk, Wojciech Skut, and Mehryar Mohri. OpenFst: a general and efficient weighted finite-state transducer library. In Proceedings of the 12th International Conference on Implementation and Application of Automata (CIAA 2007). volume to appear of Lecture Notes in Computer Science, Prague, Czech Republic, July 2007. Springer-Verlag, Heidelberg, Germany. (slides)

- Text Mining with Hidden Markov Models
- Slides #2 (Viterbi demo and Forward-Backward demo).
- Slides #3.
- Yet Another Introduction to HMMs. Anoop Sarkar.
- Maximum a-posteriori Estimation for HMMs. Anoop Sarkar.
- Part of Speech Tagging Guidelines for the Penn Treebank.
- Toolkits:
- HTK Toolkit.
- Graphical Models Toolkit (GMTK) by Jeff Bilmes and Geoff Zweig
- UMDHMM software by Tapas Kanungo.
- HMMs and Factorial HMMs in matlab by Zoubin Ghahramani.
- HMM Toolbox for matlab by Kevin Murphy.
- Bayes net Toolbox for matlab by Kevin Murphy.

- Readings: Jan 16 to Jan 23
- Lawrence Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proc. of the IEEE. 1989.
- Scott Thede and Mary Harper. A Second-Order Hidden Markov Model for Part-of-Speech Tagging. Proc. of the 37th Annual Meeting of the Association for Computational Linguistics. 1999.

- Readings: Jan 25
- Bernard Merialdo. Tagging English Text with a Probabilistic Model. Computational Linguistics, Volume 20, Number 2, June 1994.
- David Elworthy. Does Baum-Welch Re-estimation Help Taggers? Fourth Conference on Applied Natural Language Processing. 1994.

- Readings: Jan 30
*Snow Day!*

- Readings: Feb 1
- Daniel M. Bikel, Richard Schwartz and Ralph M. Weischedel. An Algorithm that Learns What's in a Name. In the Machine Learning Journal Special Issue on Natural Language Learning. 1999.
- Trond Grenager, Dan Klein and Christopher Manning. Unsupervised Learning of Field Segmentation Models for Information Extraction. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-05), 2005.

- Readings: Feb 4
- Michele Banko and Robert C. Moore. Part-of-Speech Tagging in Context. 20th International Conference on Computational Linguistics (COLING), August 23-27, 2004.
- John Miller, Manabu Torii and K. Vijay-Shanker. Building Domain-Specific Taggers without Annotated (Domain) Data. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). 2007.
- *Extra Reading*: Qin Iris Wang and Dale Schuurmans. Improved Estimation for Unsupervised Part of Speech Tagging. In Proc. of the IEEE International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLP-KE). 2005.
- *Extra Reading*: Silviu Cucerzan and David Yarowsky. Language independent minimally supervised induction of lexical probabilities. Proceedings of ACL-2000, Hong Kong, pages 270-277. 2000.

- Readings: Feb 6
- Steven Abney and Marc Light. Hiding a Semantic Hierarchy in a Hidden Markov Model. In Proc. of the ACL-1999 Workshop on Unsupervised Learning in Natural Language Processing. 1999.
- Regina Barzilay and Lillian Lee. Catching the Drift: Probabilistic Content Models, with Applications to Generation and Summarization. In Proc. of HLT-NAACL 2004: Human Language Technology Conference and Meeting of the North American Chapter of the Association for Computational Linguistics. 2004.

- Readings: Feb 8
- Chapter 2 (Sections 2.4-2.5) and Chapter 3: Andreas Stolcke. Bayesian Learning of Probabilistic Language Models. Ph.D. thesis, University of California at Berkeley. 1994.

- Readings: Feb 11
- Introduction to Wordnet.
- Catch up with Stolcke, and Abney and Light.

- Readings: Feb 13
- Thorsten Brants. Estimating HMM Topologies. In J. Ginzburg, Z. Khasidashvili, C. Vogel, J.-J. Levy, E. Vallduvi (eds.), The Tbilisi Symposium on Logic, Language and Computation: Selected Papers. CSLI Publications, Stanford, California. 1998.
- Matthew Brand. Structure learning in conditional probability models via an entropic prior and parameter extinction. Neural Computation. Volume 11, Issue 5, July 1999.

- Readings: Feb 15
- Kevin Duh, Jointly Labeling Multiple Sequences: A Factorial HMM Approach. 43rd Annual Meeting of the Assoc. for Computational Linguistics (ACL 2005), Student Research Workshop, Ann Arbor, Michigan, USA, June 2005.
- Software: Graphical Models Toolkit (GMTK)
- *Extra Reading*: Yoshua Bengio and Paolo Frasconi. An Input-Output HMM Architecture. IEEE Transactions on Neural Networks, 7(5):1231-1249.
- *Extra Reading*: Zoubin Ghahramani and Michael Jordan. Factorial Hidden Markov Models. Machine Learning 29: 245-273, 1997. (Software)
- *Extra Reading*: Shai Fine, Yoram Singer and Naftali Tishby. The Hierarchical Hidden Markov Model. Machine Learning, 32, 1998.

- The EM algorithm
- Readings: Feb 20
- Michael Collins. The EM Algorithm. manuscript. 1997.
- Radford Neal and Geoffrey Hinton. A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants. In M. I. Jordan (editor) Learning in Graphical Models, pp. 355-368, Dordrecht: Kluwer Academic Publishers. 1998.

- Readings: Feb 22
- Noah A. Smith and Jason Eisner. Annealing Techniques For Unsupervised Statistical Language Learning. Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04). 2004.
- Bo Thiesson, Christopher Meek, and David Heckerman. Accelerating EM for large databases. Machine Learning, 45:279-299, 2001.

- Language Modeling
- Slides #4
- Readings: Feb 25-29
- Kevin Knight. Sections 1-14 from Statistical machine translation workbook. manuscript.
- Stanley Chen and Joshua Goodman. An Empirical Study of Smoothing Techniques for Language Modeling. Technical Report TR-10-98, Harvard University, Aug 1998.
- *Extra Reading*: On a Model-Robust Training Algorithm for Speech Recognition. A. Nadas, D. Nahamoo & M.A. Picheny. IEEE Trans. ASSP, Vol. 36, pp. 1432-1435. 1988.

- Machine Translation
- Readings: Mar 3-14
- Kevin Knight. Section 14 onwards from Statistical machine translation workbook. manuscript.
- The Mathematics of Statistical Machine Translation: Parameter Estimation. Peter F. Brown; Vincent J. Della Pietra; Stephen A. Della Pietra; Robert L. Mercer. Computational Linguistics, Volume 19, Number 2, June 1993.
- HMM-Based Word Alignment in Statistical Translation. Stephan Vogel; Hermann Ney; Christoph Tillmann. COLING 1996 Volume 2: The 16th International Conference on Computational Linguistics.
- BLEU: A Method for Automatic Evaluation of Machine Translation. Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. ACL-2002.
- Introduction to SMT and the Bleu metric. Kishore Papineni. Presentation Slides. (for description of Bleu, jump to pages 57-75)
- Philipp Koehn. Statistical Machine Translation: the basic, the novel, and the speculative. Tutorial at EACL 2006.

- Discriminative learning for HMMs
- Readings: Mar 17-28
**No Readings!** In-class lectures about Conditional Random Fields.
- An Introduction to Conditional Random Fields for Relational Learning. Charles Sutton and Andrew McCallum. In Lise Getoor and Ben Taskar, editors. Introduction to Statistical Relational Learning. MIT Press. 2007.
- Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. J. Lafferty, A. McCallum, and F. Pereira. Proceedings of ICML-01, pp. 282-289 (2001)

- Readings: Apr 2-7
- A Maximum Entropy Model for Part-Of-Speech Tagging. A. Ratnaparkhi. Proc. Conference on Empirical Methods in Natural Language Processing. EMNLP 1996.
- Maximum Entropy Markov Models for Information Extraction and Segmentation. A. McCallum, D. Freitag and F. Pereira. Proc. 17th ICML. 2000.
- Shallow Parsing with Conditional Random Fields. F. Sha and F. Pereira. Proceedings of HLT-NAACL 2003 213-220 Association for Computational Linguistics (2003)
- Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. Michael Collins. EMNLP 2002.
- Large Scale Discriminative Training for Speech Recognition. P. C. Woodland and D. Povey. In ISCA ITRW Automatic Speech Recognition: Challenges for the Millenium, pages 7-16, Paris, 2000.
- Contrastive Estimation: Training Log-Linear Models on Unlabeled Data. Noah A. Smith and Jason Eisner. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 354-362, Ann Arbor, MI, June 2005.
- Prototype-Driven Learning for Sequence Models. Aria Haghighi and Dan Klein. In Proceedings of HLT-NAACL 2006. (slides)


## Extra Papers

Papers that we did not have time to read in class but that may be useful in your project work. Most papers are available at the ACL Anthology.

- Finite-state transducers
- J. Oncina, P. Garcia and E. Vidal. Learning Subsequential Transducers for Pattern Recognition Interpretation Tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 448-458. 1993.
- Andre Kempe. Finite State Transducers Approximating Hidden Markov Models. 35th Annual Meeting of the Association for Computational Linguistics. 1997.

- HMM Tagging
- Hermann Ney, Ute Essen and Reinhard Kneser. On structuring probabilistic dependencies in stochastic language modeling. Computer Speech and Language 8:1-38. 1994.
- Sajib Dasgupta and Vincent Ng. Unsupervised Part-of-Speech Acquisition for Resource-Scarce Languages. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). 2007.
- Chris Biemann. Unsupervised Part-of-Speech Tagging Employing Efficient Graph Clustering. Proceedings of the COLING/ACL 2006 Student Research Workshop. 2006.
- Tetsuji Nakagawa and Yuji Matsumoto. Guessing Parts-of-Speech of Unknown Words Using Global Information. Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics: COLING/ACL 2006.
- Alexander Clark. Inducing syntactic categories by context distributional clustering. In Proceedings of CoNLL, pages 91-94. 2000.
- Alexander Clark. Combining distributional and morphological information for part of speech induction. In Proceedings of the EACL. 2003.
- Dayne Freitag. Toward unsupervised whole-corpus tagging. In Proceedings of COLING, pages 357-363. 2004.
- Andrei Mikheev. Automatic rule induction for unknown word-guessing. Computational Linguistics, 23(3):405-423. 1997.
- Hinrich Schutze. Distributional part-of-speech tagging. In Proceedings of the EACL, pages 141-148. 1995.

- Bayesian Inference
- Thomas L. Griffiths and Alan Yuille. A primer on probabilistic inference. Trends in Cognitive Sciences. Supplement to special issue on Probabilistic Models of Cognition (volume 10, issue 7). 2006.
- M. J. Beal and Z. Ghahramani. The Variational Bayesian EM Algorithm for Incomplete Data: with Application to Scoring Graphical Model Structures. 2002.
- Chapter 2 and 3: Sharon Goldwater. Nonparametric Bayesian Models of Lexical Acquisition. Unpublished doctoral dissertation, Brown University, 2006.
- Neal, R. M. Probabilistic Inference Using Markov Chain Monte Carlo Methods. Technical Report CRG-TR-93-1, Dept. of Computer Science, University of Toronto. 1993.
- Daniel J. Navarro, Thomas L. Griffiths, Mark Steyvers, and Michael D. Lee. Modeling individual differences using Dirichlet processes. Journal of Mathematical Psychology, 50, 101-122. 2006.
- Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson. Interpolating between Types and Tokens by Estimating Power-Law Generators. Advances in Neural Information Processing Systems 18, 2006
- Mark Johnson. Why Doesn't EM Find Good HMM POS-Taggers? Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2007.
- Sharon Goldwater and Tom Griffiths. A fully Bayesian approach to unsupervised part-of-speech tagging. Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, ACL 2007.

- Global label consistency
- Collective information extraction with relational Markov networks. R. Bunescu and R. J. Mooney. In Proc. of 42nd ACL. 2004.
- Named entity recognition: a maximum entropy approach using global information. H. L. Chieu and H. T. Ng. In Proc. of 19th COLING. 2002.
- Incorporating non-local information into information extraction systems using gibbs sampling. J. R. Finkel, T. Grenager and C. D. Manning. In Proc of 43rd ACL. 2005.
- V. Krishnan and C. D. Manning. An effective two-stage model for exploiting non-local dependencies in named entity recognition. In Proc of 44th ACL. 2006.
- Guessing parts-of-speech of unknown words using global information. T. Nakagawa and Y. Matsumoto. In Proc 44th ACL. 2006.
- Collective segmentation and labeling of distant entities in information extraction. C. Sutton and A. McCallum. In ICML workshop on statistical relational learning and its connections to other fields. 2004.

## Textbook and References

There is no formal textbook for this course. Most of the readings are research papers, which are posted along with the topics and are usually available online. However, if you would like to brush up on some of the basics, you should refer to the following books:

- Reference Books:

- Statistical Language Learning, Eugene Charniak, MIT Press, 1996
- Foundations of Statistical Natural Language Processing, Manning and Schuetze, MIT Press, 1999
- Speech and Language Processing, Jurafsky and Martin, Prentice Hall, 2000
- Fundamentals of Speech Recognition, Rabiner and Juang, Prentice Hall, 1993
- Machine Learning, Tom Mitchell, McGraw Hill, 1997
- Lectures on Contemporary Syntactic Theories, Peter Sells, CSLI Lecture Notes No. 3, 1985
- The Language Instinct, Steven Pinker, William Morrow, 1994