Source Code
These are some small scripts stored here for my own personal
convenience. If any of these code snacks are useful for you, take
them. However, do not expect any guidance on their use. If you cannot
figure out how to use them, you probably don't need to use them.
Contents
- filter-mbox.py - filter your
email messages in an mbox file by year.
- count-mbox.py - count the total
number of messages in an mbox file.
- reduceFileSize.pl - reduce
file size by randomly picking one out of n lines.
- fourperpage1 and fourperpage2 - print four-per-page
PostScript properly using pstops (each script handles a different
orientation).
- Chinese viewtree - modified
viewtree (a Tcl/Tk script to display parse trees, written by Adwait Ratnaparkhi)
to accommodate Chinese characters in the leaves.
- CSV to Vcard - converts a
comma-separated file containing phone numbers into the vCard
format.
- Perl interpreter - using
the GNU readline package, useful for interactive demo sessions or
Perl tutorials.
- Levenshtein edit
distance - in Perl. Also, edit-distance-tex
produces LaTeX output using tree-dvips
(sample PDF output).
- test_fsm_input - simple
script for testing acceptance with the AT&T fsm toolkit.
- concordance - provides
keyword in context (one per line).
- indentrees - indents trees
from the Penn Treebank (sensitive to page widths).
- multiple_anchors -
prints out examples of multiple anchors from the XTAG word
database.
- wsj_enum - cheap enumeration
of s-expressions from selected sections of the Penn Treebank CD-ROM,
suitable for output to viewtree.
- chnames - small script to rename
many files in a directory with a new prefix or suffix.
- corpus2forest -
Conversion from Penn Treebank LEXTRACT output to LEM derivation
forest.
- ppattach - Simple
unsupervised PP attachment.
- ckytig - Simple Perl
implementation of a CKY recognizer and parser for probabilistic
Tree Insertion Grammars (TIGs).
- ckycfg - Simple Perl
implementation of a CKY recognizer and parser for probabilistic
Context Free Grammars (CFGs).
- searchmail - Simple shell
script to search for a particular string in all nested MH
folders.
- holes - finds quantifier scope
assignments from an underspecified representation.
- find_texfiles -
recursively searches directories for ps files generated from
tex files.
- printparens - fancy regexp
handling to read constituents from bracketed text (interesting
comparison between Perl, Python, and Java implementations over
the entire Treebank).
- autoindex - automate input to
makeindex for LaTeX files.
- mixture - implementation of linear
interpolation between a word language model and a POS language
model using the EM algorithm.
- consist - creates a stochastic
expectation matrix in MATLAB syntax from an input probabilistic
CFG.
- trigen - implementation of the
Shannon game: includes a trigram generator.
- cfggen - sentence generator using a
probabilistic CFG (non-lexicalized).
- CFG extraction script - tgrep
script to extract a CFG (with counts) from the Penn
Treebank.
- logit - code for regression using
the logistic transform.
- ppgen - Perl poetry
generator (sample output). Won Honorable Mention at the First
Annual Perl Poetry Contest.
- treebank-reader
- lex and yacc code to produce grammars from the Penn
Treebank.
- convert_morph.pl - converts
the XTAG morphological file into the Xerox finite state
transducer input format.
- simple_chunk.pl - very
simple noun phrase chunker for sentences tagged with Penn
Treebank POS tags.
- susanne2sexp.pl - converts
Susanne corpus markup to sexp format (Penn Treebank
format).
- genrand.pl - shuffles sequence
[1 .. n] into a random order.
- invert.pl - prints words that
look the same when upside-down.
- bibquery.pl - Perl code to search
BibTeX files for various fields and display the result
in HTML or BibTeX.
- indexpage.py - Python code to
create an alphabetically sorted list of links in an HTML file.
- unread.pl - tiny Perl script to
check for unread mail in MH folders. Useful if you're using
slocal or procmail to sort mail into folders.
- duff.c - dissection of the infamous Duff's
device.
- context-sensitive.txt -
context-sensitive grammar for crossing dependencies.
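
For readers curious what the Levenshtein edit distance entry in the list computes, here is a minimal Python sketch of the standard dynamic-programming algorithm. It is not the Perl script listed here and makes no claim about how that script works; the function name `levenshtein` and the unit cost for every insertion, deletion, and substitution are assumptions made for this example.

```python
def levenshtein(source, target):
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn source into target.

    Assumption for this sketch: every operation costs 1.
    """
    # previous[j] is the distance between the source prefix handled so far
    # (initially empty) and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        # Turning the first i source characters into an empty target
        # takes i deletions.
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))  # prints 3
```

Only two rows of the table are kept at a time, so memory grows with the length of `target` rather than with the product of both lengths.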