Lex and Yacc code for reading the Treebank and extracting CFG rules
Download: treebank-reader.uu
To unpack: uudecode treebank-reader.uu; tar zxvf treebank-reader.tar.gz
To compile: make
To list the options available for PCFG extraction:
treebank-reader -h
translate-rules.pl -h
Example run:
cat ../treebank-cdrom3/parsed/mrg/wsj/[0-2][0-5]/wsj_*.mrg | ./treebank-reader -r | ./translate-rules.pl -c=0 -g=1 > out.gram
../cfggen/cfggen.pl out.gram 20
Section 00: 1921 sentences
Section 00-09: 18318 sentences
Check output:
When using a complex set of options, it is a good idea to pass the -g=2 option to translate-rules.pl and to double-check that each PCFG rule is unique:
cat ../treebank-cdrom3/parsed/mrg/wsj/0[1-3]/*.mrg | ./treebank-reader -r -p | ./translate-rules.pl -c=0 -t=1 -g=2 -f=0 > out.gram
cat out.gram | perl -ane 'print join(" ", @F[1..$#F]), "\n"' | sort | uniq -c | sort -nr
Files: