Laboration 2:
Basic Grammars and Parsing

Tuesday 26 February 2013

12:15-13:00 F3

Lars Bungum

Björn Gambäck

 

All assignments must be well documented in written lab reports, which will be graded.
(Lab 2 can give a maximum of 10 points.)
The reports may be written in any Scandinavian language or English.
Send in the solution to a specific lab exercise no later than one month after the day the assignment was handed out (NB! points will be deducted for lab reports delivered after the deadline).
Thus the deadline for lab 2 is Tuesday March 26.

The assignments will be handed out and explained at the scheduled time.
Attendance at the lecture about the assignment is recommended but not obligatory.

The lab consists of two parts, where the first concerns basic grammars and the second parsers in Prolog.

For students attending LING2206, parts 1C2 and 2F are not obligatory
(and the lab report will not be graded).


This lab exercise is carried out in Python, using the Natural Language
Toolkit (NLTK) for illustration and boilerplate code. NLTK is available
from its homepage; please consult it for installation instructions.
NLTK can be imported as a Python module using the following command:

import nltk

If you have problems installing the toolkit on your computer, please get in touch.


Part 1: Context-Free Grammars and Grammar Engineering

Download the context-free grammar found in cfg.py. Have a look at it and load it into Python.

from cfg import mygrammar
grammarobj = nltk.parse_cfg(mygrammar)
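If you want to experiment before downloading, an NLTK-style CFG is just a string of rewrite rules. A hypothetical toy fragment (this is NOT the contents of cfg.py, which is larger and may differ) looks like:

```python
# Hypothetical toy CFG in NLTK's grammar notation.
# Illustrates only the rule format, not the actual cfg.py grammar.
toy_grammar = """
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'a' | 'the'
N -> 'man' | 'cat'
V -> 'saw'
"""
```

Such a string can be turned into a grammar object in the same way as above, with nltk.parse_cfg.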


Exercises

  1. Try the grammar out for parsing. To parse a sentence like a man saw a cat, type:

    sent = "a man saw a cat".split()
    rd_parser = nltk.RecursiveDescentParser(grammarobj)
    for tree in rd_parser.nbest_parse(sent):
         print tree

    This example parses the sentence with the grammar using a recursive-descent
    parser implemented in NLTK. The parser builds the parse tree top-down,
    recursively expanding rules and matching their predictions against the input sentence.
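To make the strategy concrete, here is a minimal recursive-descent recognizer in plain Python. This is only a sketch, independent of NLTK, and the toy grammar below is hypothetical and much smaller than the one in cfg.py:

```python
# A minimal recursive-descent recognizer (sketch, not NLTK's implementation).
# The toy grammar maps each nonterminal to a list of right-hand sides.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"]],
    "VP":  [["V", "NP"]],
    "Det": [["a"], ["the"]],
    "N":   [["man"], ["cat"]],
    "V":   [["saw"]],
}

def parse(symbols, tokens):
    """Try to derive exactly `tokens` from the sequence `symbols`."""
    if not symbols:
        return not tokens          # success iff all input is consumed
    first, rest = symbols[0], symbols[1:]
    if first in GRAMMAR:           # nonterminal: try each expansion in turn
        return any(parse(rhs + rest, tokens) for rhs in GRAMMAR[first])
    # terminal: must match the next input token
    return bool(tokens) and tokens[0] == first and parse(rest, tokens[1:])

print(parse(["S"], "a man saw a cat".split()))   # True
print(parse(["S"], "a saw man".split()))         # False
```

Note how the recognizer descends from S, trying each right-hand side in order and backtracking (via the failed recursive calls) when an expansion does not match the input.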

  2. The provided grammar does not allow for a Noun Phrase (NP) to be adjoined by a
    Prepositional Phrase (PP). We want the grammar to be able to handle sentences
    such as The man in the house saw Bob with a cat

    Extend the grammar by adding a prepositional rule for NPs:

    NP -> NP PP

    What happens if you introduce this rule directly? And how can you
    mitigate the problem? Rewrite your context-free grammar so that it can deal with these types of rules.

  3. Unfortunately, the simple context-free grammar allows for a lot of different types of ungrammatical sentences. We would, for example, like it to only allow for noun phrases such as a dog, the dog, and the dogs (e.g., in sentences like John sees a dogs). However, the current grammar also allows for the NP *a dogs.
    1. The grammar can be changed in at least two ways so that the sentence *John sees a dogs does not get a parse. Which ones?
      That is, give a description (in words) of the two ways in which a grammar can be changed/extended to reduce "leakage" (i.e., to block ungrammatical constructions).
    2. Implement one of them.

Part 2: Parsing

The lectures introduced several parsing strategies. In this second part of the lab we will look at some of them and compare their performance.
In NLTK, several parsers are implemented. We have already taken a look at
the recursive-descent parser; in addition, a bottom-up shift-reduce parser
is implemented, alongside several chart parsers.
Some example parsers from NLTK are:
nltk.RecursiveDescentParser()
nltk.ShiftReduceParser()
nltk.EarleyChartParser()
nltk.BottomUpChartParser()

Exercises

  1. Parser testing

    Try out each of the parsers on some simple sentences and familiarise yourself with how they work.
    To analyse, for example, the sentence a man saw a man with a cat
    with the grammar you have crafted and the recursive-descent parser, try:

    sent = "a man saw a man with a cat".split()
    
    rd_parser = nltk.RecursiveDescentParser(grammar2,trace=2)
    for tree in rd_parser.nbest_parse(sent):
        print tree
    
    
    Use different sentences and different versions of your grammar depending on
    what problems you get with your parsers and grammars. Test them out and get a
    feel of how they work.
    You don't need to report anything from this part of the lab.

  2. Parser visualization

    NLTK offers a visualization application for the recursive-descent parser.
    Open this application by typing:

    nltk.app.rdparser()

    and paste the grammar you have crafted into the application via the Edit
    menu. Click Autostep and watch as it parses your input sentences. Apart
    from not being able to use grammars containing rules like the ones you
    introduced above, this parser has another major weakness. What is
    this weakness?

  3. Top-down parsing

    The recursive-descent parser in the previous exercises builds the trees top-down,
    recursively trying to expand new rules that match the input. Use the
    visualization tool to step through how the parser parses an input
    sentence and in what order it expands the rules.

    Write a description of this process and show some steps (3-4) of the way from the
    visualization to illustrate your explanation.

  4. Bottom-up, "Dream trees"

    If you instead try the simple shift-reduce parser introduced at the top of
    Part 2, you will notice that it has problems finding a parse if the grammar
    contains empty rules. Explain why this happens.

    Note that this shift-reduce parser is a very simple implementation. It does
    not backtrack, and is therefore not guaranteed to find a parse even for a
    valid sentence according to your grammar.

    Illustrate your answers with visualizations from the parser tool
    nltk.app.srparser().
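The shift-reduce idea itself can be sketched in a few lines of plain Python. This greedy recognizer (hypothetical toy grammar, no backtracking, in the spirit of NLTK's simple implementation) either reduces the top of the stack with a grammar rule or shifts the next token:

```python
# Greedy shift-reduce recognizer sketch (no backtracking).
RULES = [  # (lhs, rhs) pairs, tried in this order at each step
    ("S",   ["NP", "VP"]),
    ("VP",  ["V", "NP"]),
    ("NP",  ["Det", "N"]),
    ("Det", ["a"]), ("Det", ["the"]),
    ("N",   ["man"]), ("N", ["cat"]),
    ("V",   ["saw"]),
]

def shift_reduce(tokens):
    stack, buf = [], list(tokens)
    while True:
        # reduce while some rule's rhs matches the top of the stack
        for lhs, rhs in RULES:
            if stack[-len(rhs):] == rhs:
                stack[-len(rhs):] = [lhs]
                break
        else:
            if buf:
                stack.append(buf.pop(0))   # shift the next token
            else:
                return stack == ["S"]      # done: did we reduce to S?

print(shift_reduce("a man saw a cat".split()))  # True
```

Because the sketch never undoes a reduction, it shares the weakness discussed above: a wrong greedy choice cannot be revised, so some grammatical sentences may be rejected.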

  5. Top-down breadth-first parsing

    If the rules in our recursive-descent parser were to be expanded in a breadth-first
    fashion as opposed to depth-first, what would the consequences be for the parser?
    Briefly describe what the difference between depth-first and breadth-first parsing
    is, and when it matters.
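The difference in expansion order can be illustrated with a small agenda sketch (hypothetical toy grammar, not a full parser): treating the agenda as a stack gives depth-first behaviour, treating it as a queue gives breadth-first behaviour.

```python
from collections import deque

# Toy grammar: S has two alternatives, each expanding further.
GRAMMAR = {
    "S": [["A"], ["B"]],
    "A": [["A1"], ["A2"]],
    "B": [["B1"]],
}

def expansion_order(depth_first=True):
    """Return the order in which nonterminals are expanded (sketch)."""
    agenda = deque(["S"])
    order = []
    while agenda:
        # stack (pop from the right) = depth-first; queue = breadth-first
        sym = agenda.pop() if depth_first else agenda.popleft()
        if sym in GRAMMAR:
            order.append(sym)
            for rhs in GRAMMAR[sym]:
                agenda.extend(rhs)
        # terminals (A1, A2, B1) are simply consumed
    return order

print(expansion_order(depth_first=True))   # ['S', 'B', 'A']
print(expansion_order(depth_first=False))  # ['S', 'A', 'B']
```

Depth-first finishes one branch completely before starting the next; breadth-first visits all alternatives at one level before going deeper, which changes both memory use and which analysis is found first.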

  6. Chart parsing

    As you have seen from the visualization of the top-down recursive-descent parser,
    it has to redo a lot of work as it tries to expand new rules to match the
    input. Explain how chart parsing mitigates this.

    A bottom-up version of a chart parser is the Cocke-Kasami-Younger (CKY) algorithm.
    Implement it and test it on a grammar in Chomsky Normal Form (CNF) (J&M exercise 13.3).
    Why does the grammar have to be in CNF?

    You are free to use a programming language of your choice, but the implementations in
    NLTK should give you hints, and can serve to verify that your parser works correctly.
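As a starting point, the table-filling core of CKY can be sketched as a recognizer. Note that the exercise asks for a full parser that also builds trees; this sketch, with a hypothetical toy CNF grammar, only decides membership:

```python
# Minimal CKY recognizer for a toy grammar in Chomsky Normal Form (sketch).
# In CNF every rule is binary, A -> B C, or lexical, A -> 'word'.
BINARY = {                     # (B, C) -> set of A with A -> B C
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"):  {"VP"},
}
LEXICON = {                    # word -> set of A with A -> 'word'
    "a": {"Det"}, "the": {"Det"},
    "man": {"N"}, "cat": {"N"},
    "saw": {"V"},
}

def cky_recognize(tokens, start="S"):
    n = len(tokens)
    # table[i][j] holds the nonterminals deriving tokens[i:j]
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(tokens):
        table[i][i + 1] = set(LEXICON.get(word, ()))
    for span in range(2, n + 1):              # widths 2..n, bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):         # every split point
                for b in table[i][k]:
                    for c in table[k][j]:
                        table[i][j] |= BINARY.get((b, c), set())
    return start in table[0][n]

print(cky_recognize("a man saw the cat".split()))  # True
print(cky_recognize("saw the man a".split()))      # False
```

Because each table cell is filled only once and reused for every larger span, the algorithm avoids the repeated work of recursive descent; thinking about why the binary/lexical rule shape makes this table scheme work should also help with the CNF question above.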

  7. Parser evaluation

    Create a corpus of test sentences (at least 10-12, the more the better)
    and run the list through each of the parsers in turn, measuring the
    processing time and the number of processing steps needed for each parser
    (number of rule calls and lexicon look-ups).
    To evaluate the running time of Python functions, the timeit module
    is convenient. A Timer takes two parts: a statement to be timed repeatedly,
    and a setup statement that is executed once. The running time of one
    of the examples above could be measured like this:

    from timeit import Timer
    
    # The first argument is the code to be timed; the second "setup" argument is
    # only run once and is not included in the execution time.
    t = Timer("""rd_parser.nbest_parse(sent)""",
              setup="""import nltk;
    from cfg import grammar1;
    grammar2=nltk.parse_cfg(grammar1);
    rd_parser=nltk.RecursiveDescentParser(grammar2);
    sent="the man saw a man".split() 
    """)
    
    print t.timeit(1000) # repeat 1000 times instead of the default 1million
    
    Compare the recursive-descent parser to a chart parser of your choice. Which parser
    do you think will be the most efficient one? Explain the results.
    1. Create the corpus.
    2. Which of the parsers performs best on your test corpus?
    3. According to which measure?
    4. Try to elaborate on why this was the case.
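One possible shape for the corpus timing harness, using only the standard library. Here parse_fn stands in for whichever NLTK parser method you are measuring, and dummy_parse is a placeholder, not a real parser:

```python
from timeit import Timer

def time_parser(parse_fn, sentences, repeats=100):
    """Total seconds for parsing every sentence `repeats` times (sketch)."""
    t = Timer(lambda: [parse_fn(s) for s in sentences])
    return t.timeit(repeats)

# Dummy stand-in for a real parser, only to show the harness shape;
# replace with e.g. rd_parser.nbest_parse in your own experiments.
def dummy_parse(tokens):
    return list(tokens)

corpus = [s.split() for s in [
    "a man saw a cat",
    "the man saw a man with a cat",
]]
seconds = time_parser(dummy_parse, corpus, repeats=10)
print(seconds)  # total seconds; the value depends on your machine
```

Running the same harness over each parser with the same corpus gives comparable timings; counting rule calls and lexicon look-ups has to be done separately, e.g. via the parsers' trace output.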