CMPT 825 - Spring 2006 - Natural Language Processing

Natural Language Processing (NLP) is the automatic analysis of human language by computer algorithms. These algorithms can be used to convert unstructured data, such as transcribed speech or large text collections (like the web), into structured forms. NLP algorithms allow us to take abstract information embedded in language and produce informative labels (e.g. identifying a sequence of words as the name of a protein or a company) using very little specialized information. The course will mainly cover statistical machine learning methods for NLP. (This course is in Area 3.)
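As an illustrative sketch of the idea of producing structured labels from raw text (not a method taught in the course), a toy tagger that marks company names against a small hand-built word list might look like this; the function and variable names are hypothetical:

```python
# Toy illustration only: label company names in raw text using a small
# hand-built gazetteer. Statistical NLP systems learn such labels from
# data rather than relying on fixed word lists like this one.

def tag_entities(tokens, gazetteer):
    """Return (token, label) pairs; label is 'COMPANY' or 'O' (outside)."""
    return [(t, "COMPANY" if t in gazetteer else "O") for t in tokens]

companies = {"Google", "Microsoft"}
print(tag_entities("Google hired more researchers".split(), companies))
# -> [('Google', 'COMPANY'), ('hired', 'O'), ('more', 'O'), ('researchers', 'O')]
```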

The first half of the course will focus on statistical machine translation and parsing. These two areas of NLP will be used to motivate and describe various computational and statistical models of language, with the goal of getting you started on your project well before the half-way point of the course. The project will be fairly writing-intensive, involving several written submissions. In addition, students will have to scribe concise summaries of the various topics covered in class. The second half of the course will cover a selection of formal models in computational linguistics and NLP tasks, providing a broader perspective on contemporary CL/NLP research.

Announcements

Assignments

  1. Homework #1
  2. Homework #2
  3. Homework #3
  4. Homework #4
  5. Homework #5. Location of files: /cs/natlang-a/data/zipf-expt/
  6. Homework #6
  7. Homework #7

Note on assignments: All homeworks are optional, so there are no deadlines. However, doing the homeworks will likely help you substantially with your project work and with understanding the course material.

Scribes

  1. Lecture #1 by Gholamreza Haffari
  2. Lecture #2 by D. Song
  3. Lectures #3 and #4 by Gholamreza Haffari
  4. Lecture #5 by Maxim Roy
  5. Lecture #6 by Maxim Roy
  6. Lecture #7 by Akshay Gattani
  7. Lecture #10 by F. Hormozdiari
  8. Lecture #11 by Mehdi M. Kashani
  9. Lecture #16 and #17 by Javier Thaine
  10. Lecture #22 by Javier Thaine

Scribing instructions: you must use LaTeX to create your scribe document, with scribe.sty as the LaTeX style file. A sample document, scribe_sample.tex, is provided as an example. On any of the CS/FAS Unix/Linux machines, use the command pdflatex scribe_sample.tex to produce scribe_sample.pdf.

Syllabus and Readings

  1. Statistical Machine Translation, Basics and Evaluation Methods
  2. SMT, IBM word-based models
  3. SMT, Phrase-based models
  4. SMT, Decoding
  5. SMT, Word-based Language Modelling and Smoothing
  6. Statistical Parsing, Basics and Evaluation Methods
  7. Parsing, Bi-lexical generative models, The EM algorithm
  8. Parsing, Discriminative models (log-linear models, history-based models), Global linear models
  9. Parsing, Language Modelling, Semantic parsing and other applications
  10. SMT, Syntax-based models
  11. Text segmentation
  12. Text coherence and Co-reference
  13. Text summarization
  14. Natural Languages, Formal Languages and Complexity: from regular to context-sensitive
  15. Finite-state transducers: computational phonology and text-to-speech
  16. Tree automata, Tree transducers: parsing and SMT
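To give a flavour of topic 5 (word-based language modelling and smoothing), here is a minimal sketch of an add-one (Laplace) smoothed bigram model; the corpus and function names are illustrative, not taken from the course materials:

```python
from collections import Counter

def bigram_prob(corpus, w1, w2, alpha=1.0):
    """Estimate P(w2 | w1) with add-alpha smoothing over the corpus vocabulary."""
    tokens = corpus.split()
    vocab = set(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens)
    # Add alpha to every bigram count; normalize by the unigram count of w1
    # plus alpha times the vocabulary size, so probabilities still sum to 1.
    return (bigrams[(w1, w2)] + alpha) / (unigrams[w1] + alpha * len(vocab))

corpus = "the cat sat on the mat"
# Unsmoothed P(cat | the) would be 1/2; add-one smoothing redistributes some
# probability mass to unseen bigrams: (1 + 1) / (2 + 5) = 2/7.
print(bigram_prob(corpus, "the", "cat"))
```

Smoothing matters because a maximum-likelihood estimate assigns zero probability to any bigram unseen in training, which makes whole sentences impossible under the model.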

Textbook and References

There is no formal textbook for this course. Most of the readings are research papers, posted along with the topics above, and are usually available online. However, if you would like to brush up on some of the basics, you should refer to the following books:

Web Links


anoop at cs.sfu.ca