Hello World!


Chap08 Syntax and Grammars


  • represent certain knowledges of what we know about a language
  • General criteria
    • linguistic naturalness
    • mathematical power
    • computational effectiveness

Context-free grammars(CFG)

  • Alias
    • Phrase structure grammars
    • Backus-Naur form
  • More powerful than finite state machine
  • Rules
    • Terminals
      • words
    • Non-terminals
      • Noun phrase, Verb phrase …
    • Generate strings in the language
    • Reject strings not in the language
    • LHS → RHS
      • LHS: Non-terminals
      • RHS: Non-terminals or Terminals
    • Context Free
      • probability of a subtree does not depend on words not dominated by the subtree(subtree出現的機率和上下文無關)


  1. Does it undergenerate?
    • rules cannot completely explain language
    • not a problem if the goal is to produce a language
  2. Does it overgenerate?
    • rules overly explain the language
    • not a problem if the goal is to recognize or understand well-formed(correct) language
  3. Does it assign appropriate structures to the strings that it generates?


  • take a string and a grammar
  • assigning trees that covers all and only words in input strings
  • return parse tree for that string

English Grammar Fragment

  • Sentences
  • Noun Phrases
    • Ex. NP → det Nominal
    • head: central criticial noun in NP
      • important in statistical parsing
      • after det(冠詞), before pp(介系詞片語)
    • Agreement
      • a part of overgenerate
      • This flight(○)
      • This flights(×)
  • Verb Phrases
    • head verb with arguments
    • Subcategorization(分類)
      • categorize according to VP rules
      • a part of overgenerate
      • Prefer
        • I prefer [to leave earlier]TO-VP
      • Told
        • I was told [United has a flight]S

Overgenerate Solution
- transform into multiple rules
- NP → Single_Det Single_Nominal
- NP → 複數_Det 複數_Nominal
- Will generate a lot of rules!
- out of CFG framework
- Chomsky hierarchy of languages

Indexed Grammar

  • Indexed grammars and languages problem
  • recognized by nested stack automata

Treebanks and headfinding

critical to the development of statistical parsers


  • corpora with parse trees
    • created by automatic parser and human annotators
  • Ex. Penn Treebank
  • Grammar
    • Treebanks implicitly define a grammar
      • Simply make all subtrees fit the rules
    • parse tree tend to be very flat to avoid recursion
      • Penn Treebank has ~4500 different rules for VPs

Head Finding

  • use tree traversal rules specific to each non-terminal in the grammar
  • 先向右再向左

Dependency Grammars

  • every possible parse is a tree
    • every node is a word
    • every link is dependency relations between words
  • Advantage
    • Deals well with long-distance word order languages
      • structure is flexible
    • Parsing is much faster than CFG
    • Tree can be used by later applications
  • Approaches
    • Optimization-based approaches
      • search for the tree that matches some criteria the best
      • spanning tree algorithms
    • Shift-reduce approaches
      • greedily take actions based on the current word and state


  • Constituency(顧客, words that behave as a single unit) is a key phenomena easily captured with CFG rules
    • But agreement and subcategorization make problems

Chap09 Syntactic Parsing

  • Top-Down Search
    • Search trees among possible answers
  • Bottom-Up Parsing
    • Only forms trees that can fit the words
  • Mixed parsing strategy
    • looks like Binomial Search
    • The number of rules tried at each deicision of the analysis (branching factor)
      • top-down parsing: categories of LHS(Left Hand Side) word
      • bottom-up parsing: categories of left most RHS(Right Hand Side) word
        • 倒推:從最左邊可以倒推的字開始

backtracking methods are doomed because of two inter-related problems

  • (1)Structural and lexical ambiguity
    • PP(介系詞片語) attachment
      • PP can attach to [sentences, verb phrases, noun phrases, and adjectival phrases]
    • coordination
      • P and Q or R
        • P and (Q or R)
        • (P and Q) or R
    • noun-noun compounding
      • town widget hammer
        • ((town widget) hammer)
        • (town (widget hammer))
    • Solution
      • how to determine the intended structure?
      • how to store the partial structures?
  • (2)Shared subproblems
    • naïve backtracking will lead to duplicated work(不一定會對,所以會一直backtrack…)

Dynamic Programming

  • Avoid doing repeated work
  • Solve in polynomial time
  • approaches
    • CKY
    • Earley


  • transform rules into Chomsky-Normal Form
  • build a table
    • A spanning from i to j in the input is in [i,j]
    • A → BC == [i,j] → [i,k] [k,j]
    • entire string = [0, n]
      • expected to be S
  • iterate all possible k
    • [i,j] = [i,i+1]x[i+1, j], [i,i+2]x[i+2,j] ……
  • fill the table a column at a time, from left to right, bottom to top
    • Ex. [1,3] = [1,2]Det + [2,3] Nomimal, Noun = NP
    • Ex.
  • Algorithm
  • Performance
    • a lot of elements unrelated to the answer
    • can use online search to fill table (from left to right)


  • parser that exploits chart as data structure
  • node = vertex
  • arc = edge
    • active edge: (a) and (b)
    • inactive edge: (c)

decorated grammar

  • (a) “•” in the first
    • • NP VP
    • A hypothesis has been made, but has not been verified yet
  • (b) “•” in the middle
    • NP • VP
    • A hypothesis has been partially confirmed
  • (c) “•” in the last
    • NP VP •
    • A hypothesis has been wholly confirmed
  • representation of edge
  • initialization
    • for each rule A → W, if A is a category that can span a chart (typically S), add <0, 0, A → •W>
      • A implies •W from position 0 to 0
  • Housekeeping
    • prevent duplicate rules

Fundamental rule

  • If the chart contains <i, j, A → W1 •B W2> and <j, k, B → W3 •>, then add edge <i, k, A → W1 B •W2> to the chart
  • Notes
    1. New edge may be either active or inactive
    2. does not remove the active edge that has succeeded

Bottom-up rule

  • if adding edge <i, j, C → W1 •> to the chart, then for every rule that has the form B → C W2, add <i, i, B → • C W2>

Top-down rule

  • If adding edge <i, j, C → W1 •B W2> to the chart, then for each rule B → W, add < j, j, B →•W>

Full Syntactic Parsing

  • necessary for deep semantic analysis of texts
  • not practical for many applications (given typical resources)
    • O(n^3) for straight parsing
    • O(n^5) for probabilistic versions
    • Too slow for real time applications (search engines)
  • Two Alternatives
    • Dependency parsing
      • Change the underlying grammar formalism
      • can get a lot done with just binary relations among the words
      • 詳見Chap08 dependency grammar
    • Partial parsing
      • Approximate phrase-structure parsing with finite-state and statistical approaches
    • Both of these approaches give up something (syntactic, structure) in return for more robust and efficient parsing

Partial parsing

  • For many applications you don’t really need full parse
  • For example, if you’re interested in locating all the people, places and organizations
    • base-NP chunking
      • [NP The morning flight] from [NP Denvar] has arrived
  • Two approaches
    • Rule-based (hierarchical) transduction(轉導)
      • Restrict recursive rules (make the rules flat)
        • like NP → NP VP
      • Group the rules so that RHS of the rules can refer to non-terminals introduced in earlier transducers, but not later ones
      • Combine the rules in a group in the same way we did with the rules for spelling changes
      • Combine the groups into a cascade
        • can be used to find the sequence of flat chunks you’re interested in
        • or approximate hierarchical trees you get from full parsing with a CFG
      • Typical Architecture ![](/img/NLP/Cascaded Transducers.png)
        • Phase 1: Part of speech tags
        • Phase 2: Base syntactic phrases
        • Phase 3: Larger verb and noun groups
        • Phase 4: Sentential level rules
    • Statistical sequence labeling
      • HMMs
      • MEMMs

Chap10 Statistical Parsing


  • N-gram models and HMM Tagging only allowed us to process sentences linearly
  • Probabilistic Context Free Grammars(PCFG)
    • alias: Stochastic context-free grammar(SCFG)
    • simplest and most natural probabilistic model for tree structures
    • closely related to those for HMMs
    • 為每一個CFG的規則標示其發生的可能性


  • reduce “right” parse to “most probable parse”
    • Argmax P(Parse|Sentence)

A PCFG consists of

  • set of terminals, {wk}
  • set of nonterminals, {Ni}
  • start symbol N1
  • set of rules
    • {Ni –> ξj}(ξj is a sequence of terminals and nonterminals)
  • probabilities of rules
    • total probability of imply Ni to other sequence ξj is 1
    • ∀i Σj P(Ni → ξj) = 1
  • Probability of sentence according to grammar G
    • P($w_{1m}$) = sum of P($w_{1m}$, t) for every possible tree t
  • Nj dominates the words wa … wb
    • Nj → wa … wb

Assumptions of the Model

  • Place Invariance
    • probability of a subtree does not depend on its position in the string
    • similar to time invariance in HMMs
  • Ancestor Free
    • probability of a subtree does not depend on nodes in the derivation outside the subtree(subtree的機率只和subtree內的node有關)
    • can simplify probability calculation

Questions of PCFGs(similar to three questions of HMM)

  • Assign probabilities to parse trees
    • What is the probability of a sentence $w_{1m}$ according to a grammar G
      • P(w1m|G)
  • Parsing with probabilities(Decoding)
    • What is the most likely parse for a sentence
      • argmax_t P(t|w1m,G)
      • How to efficiently find the best (or N best) trees
  • Training the model (Learning)
    • How to set rule probabilities(parameter of grammar model) that maximize the probability of a sentence
      • argmax_G P(w1m|G)

Simple Probability Model

  • probability of a tree is the product of the probabilities of rules in derivation
  • Rule Probabilities
    • S → NP
    • P(NP | S)
  • Training the Model
    • estimate probability from data
    • P(α → β | α) = Count(α→β) / Count(α) = Count(α→β) / Σγ Count(α→γ)
  • Parsing (Decoding)
    • trees with highest probability in the model
  • Example: Book the dinner flight
    • too slow!

Dynamic Programming again

  • use CKY and Earley to parse
  • Viterbi and HMMs to get the best parse
  • Parameters of a PCFG in Chomsky Normal Form
    • P(Nj→NrNs | G) , $n^3$ matrix of parameters
    • P(Nj→wk | G), $nV$ parameters
    • n is the number of nonterminals
    • V is the number of terminals
  • Σr,s P(Nj→NrNs) + ΣkP(Nj→wk) = 1
    • 所有由Nj導出的rule,機率總和必為1

Probabilistic Regular Grammars (PRG)

  • start state N1
  • rules
    • Ni → wjNk
    • Ni → wj
  • PRG is a HMM with [start state] and [finish(sink) state]

Inside and Outside probability

  • Forward(Outside) probability
    • $ α_i(t) = P(w_{1(t-1)}, X_t = i)$
    • everything above a certain node(include the node)
  • Backward(Inside) probability
    • $ β_i(t, T) = P(w_{tT} | X_t = i)$
    • everything below a certain node
    • total probability of generating words $w_t \cdots w_T$, given the root nonterminal $N^i$ and a grammar G

Inside Algorithm (bottom-up)

  • $P(w_{1m} | G) = P(N_1 → w_{1m} | G) = P(w_{1m} | N^1_{1m}, G) = B_1(1,m)$
    • $B_1(1,m)$ is Inside probability
      • P(w1~wm are below N1(start symbol))
  • base rule
    • $ B_j(k, k) = P(w_k | N^j_{kk}, G) = P(N^j → w_k | G)$
  • $ B_j(p, q) = P(w_{pq} | N^j_{pq}, G) = $
    • try every possible rules to split Nj, product of **rule probabilty and segments’ inside probabilities **
  • use grid to solve again
      • X軸代表起始座標,Y軸代表長度
        • (2,3) → flies like ants

Outside Algorithm (top-down)

  • $ P(w_{1m} | G) = Σ_j α_j(k, k)P(N^j → w_k$
    • outside probability of wk x (inside) probability of wk of every Nj
  • basecase
    • $ α_1(1, m) = 1, α_j(1,m) = $
    • P(N1) = 1, P(Nj outside w1 to wm) = 0
  • 自己的outside probability 等於
    • 爸爸的outside probability 乘以 爸爸的inside probability 除以 自己的inside probability
      • inside x outside 是固定值?
    • 爸爸的inside probabiliity 除以 自己的inside probability 就是其兄弟的inside probability
    • 使用此公式計算
  • $ α_j(p, q)β_j(p, q) = P(w_{1m}, N^j_{pq} | G) $
    • 某個點的inside 乘 outside = 在某grammar中,出現此句子,且包含此點的機率
    • 所有點的總和:在某grammar下,某parse tree(包含所有node)的機率
  • Outside example: 這些數字理論上算起來會一樣…

Finding the Most Likely Parse for a Sentence

  • δi(p,q)= the highest inside probability parse of a subtree $N_{pq}^i$
  • Initialization
    • δi(p,p)= P(Ni → wp)
  • Induction and Store backtrace
    • δi(p,q)= $argmax(j,k,r)P(Ni→NjNk)δj(p,r)δk(r+1,q)$
    • 找所有可能的切法
  • Termination
    • answer = δ1(1,m)

Training a PCFG

  • find the optimal probabilities among grammar rules
  • use EM Training Algorithm to seek the grammar that maximizes the likelihood of the training data
    • Inside-Outside Algorithm
  • 將產生句子的機率視為π,為Nj產生pq的機率
  • Nj被使用的機率
  • Nj被使用,且Nj→NrNs的機率
  • Nj→NrNs這條rule被使用的機率=前兩式相除
  • Nj→wk
    • 僅分子差異

Problems with the Inside-Outside Algorithm

  • Extremely Slow
    • For each sentence, each iteration of training is $O(m^3n^3)$
  • Local Maxima
  • Satisfactory learning requires many more nonterminals than are theoretically needed to describe the language
  • There is no guarantee that the learned nonterminals will be linguistically motivated

Chap11 Dependency Parsing

COLING-ACL 2006, Dependency Parsing, by Joachim Nivre and Sandra Kuebler
NAACL 2010, Recent Advances in Dependency Parsing, by Qin Iris. Wang and YueZhang
IJCNLP 2013, Dependency Parsing: Past, Present, and Future, by Zhenghua Li, Wenliang Chen, Min Zhang

Dependency Structure vs. Constituency Structure
Parsing is one way to deal with the ambiguity problem in
natural language
dependency syntax is syntactic relations (dependencies)

Constraint: between word pairs
Projective: No crossing links(a word and its dependents form a contiguous substring of the sentence)
An arc (wi , r ,wj ) ∈ A is projective iff wi →∗ wk for all:
i < k < j when i < j
j < k < i when j < i

Non-projective Dependency Trees

  • Long-distance dependencies
  • With crossing links
  • Not so frequent in English
    • All the dependency trees from Penn Treebank are projective
  • Common in other languages with free word order
    • Prague(23%) and Czech, German and Dutch

Data Driven Dependency Parsing

  • Data-driven parsing
    • No grammar / rules needed
    • Parsing decisions are made based on learned models
    • deal with ambiguities well
  • Three approaches
    • Graph-based models
    • Transition-based models(good in practice)
      • Define a transition system for mapping a sentence to its dependency tree
      • Predefine some transition actions
      • Learning: predicting the next state transition, by transition history
      • Parsing: construct the optimal transition sequence
      • Greedy search / beam search
      • Features are defined over a richer parsing history
    • Hybrid models


  • Graph-based models
    • Find the optimal tree from all the possible ones
    • Global, exhaustive
  • Transition-based models
    • Predefine some actions (shift and reduce)
    • use stack to hold partially built parses
    • Find the optimal action sequence
    • Local, Greedy or beam search
  • The two models produce different types of errors

Hybrid Models

  • Three integration methods
    • Ensemble approach: parsing time integration (Sagae & Lavie 2006)
    • Feature-based integration (Nivre & Mcdonald 2008)
    • Single model combination (Zhang & Clark 2008)
  • Gain benefits from both models

Graph-based dependency parsing models

  • Search for a tree with the highest score
  • Define search space
    • Exhaustive search
    • Features are defined over a limited parsing history
  • The score is linear combination of features
    • What features we can use? (later)
    • What learning approaches can lead us to find the best tree with the highest score (later)
  • Applicable to both probabilistic and nonprobabilistic models


  • dynamic features
    • Take into account the link labels of the surrounding word-pairs when predicting the label of current pair
    • Commonly used in sequential labeling
    • A word’s children are generated first(先生child, 再找parent), before it modifies another word

Learning Approaches

  • Local learning approaches
    • Learn a local link classifier given of features defined on training data
    • example
      • 3-class classification: No link, left link or right link
      • Efficient O(n) local training
    • local training and parsing
    • Learn the weights of features
      • Maximum entropy models (Ratnaparkhi 99, Charniak 00)
      • Support vector machines (Yamada & Matsumoto 03)
      • Use a richer feature set!
  • Global learning approaches
  • Unsupervised/Semi-supervised learning approaches
    • Use both annotated training data and un-annotated raw text

Transition-based model

  • Stack holds partially built parses
  • Queue holds unprocessed words
  • Actions
    • use input words to build output parse

parsing processes

Arc-eager parser

  • 4 tranition actions
    • SHIFT: push stack
    • REDUCE: pop stack
    • ARC-LEFT: pop stack and add link
    • ARC-RIGHT: push stack and add link
  • Time complexity: linear
    • every word will be pushed once and popped once(except root)
  • parse
    • by actions: arcleft → arclect subject, noun, …

Arc-standard parser

  • 3 actions
    • SHIFT: push
    • LEFT: pop leftmost stack element and add
    • RIGHT: pop rightmost stack element and add
  • Also linear time


  • neither of parser can solve it
    • online reorder
      • add extra action: swap
      • not linear: $N^2$, but expect to belinear

Decoding algorithms

search action sequence to build the parse
scoring action given context
Candidate item <S, G, Q>

  • greedy local search
    • initialize: Q = input
    • goal: S=[root], G=tree, Q=[]
  • problem: one error leads to incorrect parse
    • Beam search: keep N highest partial states
      • use total score of all actions to rank a parse

Score Models

  • linear model
  • SVM

Chap12 Semantic Representation and Computational Semantics

Semantic aren’t primarily descriptions of inputs

Semantic Processing

  • reason about the truth
  • answer questions based on content
    • Touchstone application is often question answering
  • inference to determine the truth that isn’t actually know


  • principled, theoretically motivated approach
    • Computational/Compositional Semantics
  • limited, practical approaches that have some hope of being useful
    • Information extraction

Information Extraction

Information Extraction = segmentation + classification + association + clustering

  • superficial analysis
    • pulls out only the entities, relations and roles related to consuming application
  • Similar to chunking

Compositional Semantics

  • Use First-Order Logic(FOL) representation that accounts for all the entities, roles and relations present in a sentence
  • Similar to our approach to full parsing
  • Compositional: The meaning of a whole is derived from the meanings of the parts(syntatic)
  • Syntax-Driven Semantic Analysis
    • The composition of meaning representations is guided by the syntactic components and relations provided by the grammars


  • allow to answer yes/no questions
  • allow variable
  • allow inference

Events, actions and relationships can be captured with representations that consist of predicates with arguments

  • Predicates
    • Primarily Verbs, VPs, Sentences
    • Verbs introduce/refer to events and processes
  • Arguments
    • Primarily Nouns, Nominals, NPs, PPs
    • Nouns introduce the things that play roles in those events
  • Example: Mary gave a list to John
    • Giving(Mary, John, List)
    • Gave: Predicate
    • Mary, John, List: Argument
    • better representation
  • Lambda Forms
    • Allow variables to be bound
    • λxP(x)(Sally) = P(Sally)


  • mismatch between syntax and semantics
    • displaced arguments
    • complex NPs with quantifiers
      • A menu
      • Every restaurant
      • Not every waiter
      • Most restaurants
    • still preserving strict compositionality
  • Two (syntax) rules to revise
    • The S rule
      • S → NP VP, NP.Sem(VP.Sem)
      • NP and VP swapped, because S is NP
    • Simple NP’s like proper nouns
      • λx.Franco(x)
  • Store and Retrieve
    • Retrieving the quantifiers one at a time and placing them in front
    • The order determines the meaning
    • retrieve

Set-Based Models

  • domain: the set of elements
  • entity: elements of domain
  • Properties of the elements: sets of elements from the domain
  • Relations: sets of tuples of elements from the domain
  • FOL
    • FOL Terms → elements of the domain
      • Med -> “f”
    • FOL atomic formula → sets, or sets of tuples
      • Noisy(Med) is true if “f is in the set of elements that corresponds to the noisy relation
      • Near(Med, Rio) is true if “the tuple <f,g> is in the set of tuples that corresponds to “Near” in the interpretation
  • Example: Everyone likes a noisy restaurant
    • There is a particular restaurant out there; it’s a noisy place; everybody likes it 有一家吵雜的餐廳大家都喜歡
    • Everybody has at least one noisy restaurant that they like 大家都喜歡一家吵雜的餐廳
    • Everybody likes noisy restaurants (i.e., there is no noisy restaurant out there that is disliked by anyone) 大家都喜歡吵雜的餐廳
    • Using predicates to create categories of concepts
      • people and restaurants
      • basis for OWL (Web Ontology Language)網絡本體語言
    • before
    • after

Chap13 Lexical Semantics

we didn’t do word meaning in compositional semantics


  • meaning and relationship about words
    • hypernym(上位詞)
      • breakfast → meal
    • hierarchies

In our semantics examples, we used various FOL predicates to capture various aspects of events, including the notion of roles
Havers, takers, givers, servers, etc.

Thematic roles(語義關係)

  • semantic generalizations over the specific roles that occur with specific verbs
    • provide a shallow level of semantic analysis
    • tied to syntactic analysis
  • i.e. Takers, givers, eaters, makers, doers, killers
    • They’re all the agents of the actions
  • AGENTS are often subjects
  • In a VP->V NP rule, the NP is often a THEME

2 major English resources using thematic data

  • PropBank
    • Layered on the Penn TreeBank
    • Small number (25ish) labels
  • FrameNet
    • Based on frame semantics
    • Large number of frame-specific labels


  • [McAdams and crew] covered [the floors] with [checked linoleum].格子花紋油毯
    • Arg0 (agent: the causer of the smearing)
    • Arg1 (theme: “thing covered”)
    • Arg2 (covering: “stuff being smeared”)
  • including agent and theme, remaining args are verb specific

Logical Statements

  • Example: EAT – Eating(e) ^Agent(e,x)^ Theme(e,y)^Food(y)
    • (adding in all the right quantifiers and lambdas)
  • Use WordNet to encode the selection restrictions
  • Unfortunately, language is creative
    • … ate glass on an empty stomach accompanied only by water and tea
    • you can’t eat gold for lunch if you’re hungry

can we discover a verb’s restrictions by using a corpus and WordNet?

  1. Parse sentences and find heads
  2. Label the thematic roles
  3. Collect statistics on the co-occurrence of particular headwords with particular thematic roles
  4. Use the WordNet hypernym structure to find the most meaningful level to use as a restriction


Word sense disambiguation

  • select right sense for a word
  • Semantic selection restrictions can be used to disambiguate
    • Ambiguous arguments to unambiguous predicates
    • Ambiguous predicates with unambiguous arguments
  • Ambiguous arguments
    • Prepare a dish(菜餚)
    • Wash a dish(盤子)
  • Ambiguous predicates
    • Serve (任職/服務) Denver
    • Serve (供應) breakfast


  • Supervised Disambiguation
    • based on a labeled training set
  • Dictionary-Based Disambiguation
    • based on lexical resource like dictionaries
  • Unsupervised Disambiguation
    • label training data is expensive
    • based on unlabeled corpora
  • Upper(human) and Lower(simple model) Bounds
  • Pseudoword
    • Generate artificial evaluation data for comparison and improvement of text processing algorithms

Supervised ML Approaches

  • What’s a tag?
    • In WordNet, “bass” in a text has 8 possible tags or labels (bass1 through bass8)
  • require very simple representation for training data
    • Vectors of sets of feature/value pairs
    • need to extract training data by characterization of text surrounding the target
  • If you decide to use features that require more analysis (say parse trees) then the ML part may be doing less work (relatively) if these features are truly informative
  • Classification
    • Naïve Bayes (the right thing to try first)
    • Decision lists
    • Decision trees
    • MaxEnt
    • Support vector machines
    • Nearest neighbor methods…
    • choice of technique depends on features that have been used
  • Bootstrapping
    • Use when don’t have enough data to train a system…
    • 集中有放回的均勻抽樣

Naive Bayes

  • Argmax P(sense|feature vector)
  • find maximum probabilty of words given possible sk
  • assumption
    • bag of words model
      • structure and order of words is ignored
      • each pair of words in the bag is independent
  • 73% correct

Dictionary-Based Disambiguation

  1. Disambiguation based on sense definitions
  2. Thesaurus-Based Disambiguation
  3. Disambiguation based on translations in a second-language corpus

sense definition

  • find keywords in definition of a word
    • cone
      • … pollen-bearing scales or bracts in trees
      • shape for holding ice cream
    • 50%~70% accuracies
    • Alternatives
      • Several iterations to determine correct sense
      • Combine the dictionary-based and thesaurus-based disambiguation

Thesaurus-Based(索引典) Disambiguation

  • Category can determine which word senses are used
  • Each word is assigned one or more subject codes which correspond to its different meanings
    • select the most often subject code
    • 考慮w的context,有多少words的senses與w相同
  • Walker’s Algorithm
    • 50% accuracy for “interest, point, power, state, and terms”
  • Problems
    • general topic categorization, e.g., mouse in computer
    • coverage, e.g., Navratilova
  • Yarowsky’s Algorithm
      1. categorize sentences
      1. categorize words
      1. disambiguate by decision rule for Naïve Bayes
    • result

Disambiguation based on translations in a second-language corpus

  • the word “interest” has two translations in German
    • “Beteiligung” (legal share–50% a interest in the company)
    • “Interesse” (attention, concern–her interest in Mathematics)
  • Example: … showed interest …
    • Look up English-German dictionary, show → zeigen
    • Compute R(Interesse, zeigen) and R(Beteiligung, zeigen)
    • R(Interesse, zeigen) > R(Beteiligung, zeigen)

Unsupervised Disambiguation

P(vj|sk) are estimated using the EM algorithm

  1. Random initialization of P(vj|sk)(word)
  2. For each context ci of w, compute P(ci|sk)(sentence)
  3. Use P(ci|sk) as training data
  4. Reestimate P(vj|sk)(word)

Surface Representations(features)

  • Collocational
    • words that appear in specific positions to the right and left of the target word
    • limited to the words themselves as well as part of speech
    • Example: guitar and bassplayer stand
      • [guitar, NN, and, CJC, player, NN, stand, VVB]
      • In other words, a vector consisting of [position n word, position n part-of-speech…]
  • Co-occurrence
    • words that occur regardless of position
    • Typically limited to frequency counts
    • Assume we’ve settled on a possible vocabulary of 12 words that includes guitarand playerbut not andand stand
    • Example: guitar and bassplayer stand
      • Assume a 12-word sentence includes guitar and player but not “and” and stand
      • [0,0,0,1,0,0,0,0,0,1,0,0]


  • tagging
    • translation
    • information retrieval

different label

  • Generic thematic roles (aka case roles)
    • Agent, instrument, source, goal, etc.
  • Propbank labels
    • Common set of labels ARG0-ARG4, ARGM
    • specific to verb semantics
  • FrameNet frame elements
    • Conceptual and frame-specific
  • Example: [Ochocinco] bought [Burke] [a diamond ring]
    • generic: Agent, Goal, Theme
    • propbank: ARG0, ARG2, ARG1
    • framenet: Customer, Recipe, Goods

Semantic Role Labeling

  • automatically identify and label thematic roles
    • For each verb in a sentence
      • For each constituent
        • Decide if it is an argument to that verb
        • if it is an argument, determine what kind
  • feature
    • from parse and lexical item
    • “path”

Lexical Acquisition

  • Verb Subcategorization
    • the syntactic means by which verbs express their arguments
  • Attachment Ambiguity
    • The children ate the cake with their hands
    • The children ate the cake with blue icing
  • SelectionalPreferences
    • The semantic categorization of a verb’s arguments
  • Semantic Similarity (refer to IR course)
    • Semantic similarity between words

Verb Subcategorization

a particular set of syntactic categories that a verb can appear with is called a subcategorization frame

Brent’s subcategorization frame learner

  1. Cues: Define a regular pattern of words and syntactic categories
    1. ε: error rate of assigning frame f to verb v based on cue cj
  2. Hypothesis Testing: Define null hypothesis H0: “the frame is not appropriate for the verb”
    1. Reject this hypothesis if the cue cj indicates with high probability that our H0 is wrong


  • regular pattern for subcategorization frame “NP NP”

    • (OBJ | SUBJ_OBJ | CAP) (PUNC |CC)
      Null hypothesis testing
  • Verb vi occurs a total of n times in the corpus and there are m < n occurrences with a cue for frame fj

  • Reject the null hypothesis H0 that vi does not accept fj with the following probability of error

  • Brent’s system does well at precision, but not well at recall

  • Manning’s system

    • solve this problem by using a tagger and running the cue detection on the output of the tagger
    • learn a lot of subcategorization frames, even those it is low-reliability
    • still low performance
    • improve : use prior knowledge

PCFG prefers to parse common construction

  • P(A|prep, verb, np1, np2, w) ~= P(A|prep, verb, np1, np2)
    • Do not count the word outside of frame
    • w: words outside of “verb np1(prep np2)”
    • A: random variable representing attachment decision
    • V(A): verb or np1
    • Counter example
      • Fred saw a movie with Arnold Schwarzenegger
  • P(A|prep, verb, np1, np2, noun1, noun2) ~= P(A|prep, verb, noun1, noun2)
    • noun1 = head of np1, noun2 = head of np2
    • total parameters: $10^{13}$ = #(prep) x #(verb) x #(noun) x #(noun)
  • P(A= noun | prep, verb, noun1) vs. P(A= verb | prep, verb, noun1)
    • compare probability to be verb and probability to be noun

Technique: Alternative to reduce parameters

  • Condition probabilities on fewer things
  • Condition probabilities on more general things

The model asks the following questions

  • VAp: Is there a PP headed by p and following the verb v which attaches to v(VAp=1) or not (VAp=0)?
  • NAp: Is there a PP headed by p and following the noun n which attaches to n (NAp=1) or not (NAp=0)?
  • (1) Determine the attachment of a PP that is immediately following an object noun, i.e. compute the probability of NAp=1
  • In order for the first PP headed by the preposition p to attach to the verb, both VAp=1 and NAp=0
    • calculate likelihood ratio between V and N
    • maximum estimation
      • P(VA = 1 | v) = C(v, p) / C(v)
      • P(NA = 1 | n) = C(n, p) / C(n)
  • Estimation of PP attachment counts
    • Sure Noun Attach
      • If a noun is followed by a PP but no preceding verb, increment C(prep attached to noun)
    • Sure Verb Attach
      • if a passive verb is followed by a PP other than a “by” phrase, increment C(prep attached to verb)
      • if a PP follows both a noun phrase and a verb but the noun phrase is a pronoun, increment C(prep attached to verb)
    • Ambiguous Attach
      • if a PP follows both a noun and a verb, see if the probabilities based on the attachment decided by previous way
      • otherwise increment both attachment counters by 0.5
    • Sparse data is a major cause of the difference between the human and program performance(attachment indeterminacy不確定性)

Using Semantic Information

  • condition on semantic tags of verb & noun
    • Sue bought a plant with Jane(human)
    • Sue bought a plant with yellow leaves(object)

The noun phrase serves as the subject of the relative clause

  • collect “ subject-verb” and “verb-object” pairs.(training part)
  • compute t-score (testing part)
    • t-score > 0.10 (significant)

P (relative clause attaches to x | main verb of clause =v) > P (relative clause attaches to y | main verb of clause=v)
↔ P (x= subject/object | v) > P (y= subject/ object|v)

Selectional Preferences

  • Most verbs prefer particular type of arguments
    • eat → object (food item)
    • think → subject (people)
    • bark → subject (dog)
  • Aspects of meaning of a word can be inferred
    • Susan had never eaten a fresh durian before (food item)
  • Selectional preferences can be used to rank different parses of a sentence
  • Selectional preference strength
    • how strongly the verb constrains its direct object
    • KL divergence between the prior distribution of direct objects of general verb and the distribution of direct objects of specific verb
    • 2 assumptions
      • only the head noun of the object is considered
      • rather than dealing with individual nouns, we look at classes of nouns
  • Selectional association
    • Selectional Association between a verb and a class is this class’s contribution to S(v) / the overall preference strength S(v)
    • There is also a rule for assigning association strengths to nouns instead of noun classes
      • If noun belongs to several classes, then its choose the highest association strength among all classes
    • estimating the probability that a direct object in noun class c occurs given a verb v
      • A(interrupt, chair) = max(A(interrupt, people), A(interrupt, furniture)) = A(interrupt, people)
  • Example
    • eat prefers fooditem
      • A(eat, food)=1.08 → very specific
    • seehas a uniform distribution
      • A(see, people)=A(see, furniture)=A(see, food)=A(see, action)=0 → no selectional preference
    • find disprefers action item
      • A(find, action)=-0.13 → less specific

Semantic Similarity

  • assessing semantic similarity between a new word and other already known words
  • Vector Space vs Probabilistic
  • Vector Space
    • Words can be expressed in different spaces: document space, word spaceand modifier space
    • Similarity measures for binary vectors: matching coefficient, Dice coefficient, Jaccard(or Tanimoto) coefficient, Overlap coefficientand cosine
    • Similarity measures for the real-valued vector space: cosine, Euclidean Distance, normalized correlation coefficient
      • cosine assumes a Euclidean space which is not well-motivated when dealing with word counts
  • Probabilistic Measures
    • viewing word counts by representing them as probability distributions
    • compare two probability distributions using
      • KL Divergence
      • Information Radius(Irad)
      • L1Norm

Chap14 Computational Discourse

Level Well-formedness constraints Types of ambiguity
Lexical Rules of inflection and derivation
structural, morpheme boundaries, morpheme identity
Syntactic Grammar rules structural, POS
Semantic Selection restrictions word sense, quantifier scope
Pragmatic conversation principles pragmatic function

Computational Discourse

  • Discourse(語篇)
    • A group of sentences with the same coherence relation
  • Coherence relation
    • the 2nd sentence offers the reader an explaination or cause for the 1st sentence
  • Entity-based Coherence
    • relationships with the entities, introducing them and following them in a focused way
    • Discourse Segmentation
      • Divide a document into a linear sequence of multiparagraph passages
      • Academic article
        • Abstract
        • Introduction
        • Methodology
        • Results
        • Conclusion
      • Inverted Pyramid
      • Applications
        • News
        • Summarize different segments of a document
        • Extract information from inside a single discourse segment

TextTiling (Hearst,1997)

  • Tokenization
    • Each space-delimited word in the input is converted to lower-case
    • Words in a stop list of function words are thrown out
    • The remaining words are morphologically stemmed
    • The stemmed words are grouped into pseudo-sentencesof length w = 20
  • Lexical score determination
    • compute a lexical cohesion(結合) score between pseudo-sentences
      • score: average similarity of words in the pseudo-sentences before gap to pseudo-sentences after the gap(??)
  • Boundary identification
    • Compute a depth score for each gap
    • Boundaries are assigned at any valley which is deeper than a cutoff

Coherence Relations

  • Result
    • The Tin Woodman was caught in the rain. His joints rusted
  • Explanation
    • John hid Bill’s car keys. He was drunk
  • Parallel
    • The Scarecrow wanted some brains
    • The Tin Woodman wanted a heart
  • Elaboration(詳細論述)
    • Dorothy was from Kansas
    • She lived in the midst of the great Kansas prairies
  • Occasion(起因)
    • Dorothy picked up the oil-can
    • She oiled the Tin Woodman’s joints

Coherence Relation Assignment

  • Discourse parsing
  • Open problems

Cue-Phrase-Based Algorithm

  • Using cue phrases

    • Segment the text into discourse segments
    • Classify the relationship between each consecutive discourse
  • Cue phrase

    • connectives, which are often conjunctions or adverbs
      • because, although, but, for example, yet, with, and
  • discourse uses vs. sentential uses

    • With its distant orbit, Mars exhibits frigid weather conditions. (因為長距離的運行軌道,火星天氣酷寒)
    • We can see Mars with an ordinary telescope
  • Temporal Relation

    • ordered in time (Asynchronous)
      • before, after …
    • overlapped (Synchronous)
      • at the same time
  • Contingency Relation

    • 因果關係,附帶條件
  • Comparison Relation

    • difference between two arguments
  • Expansion Relation

    • expands the information for one argument in the other one or continues the narrative flow
  • Implicit Relation

    • Discourse marker is absent
    • 颱風來襲,學校停止上課
  • Chinese Relation Words

    • Ambiguous Discourse Markers
      • 而:而且, 然而, 因而

Reference Resolution

  • Evoke
    • When a referent is first mentioned in a discourse, we say that a representation for it is evoked into the model
  • Access
    • Upon subsequent mention, this representation is accessed from the model

Five Types of Referring Expressions

  • Indefinite Noun Phrases(不定名詞)
    • marked with the determiner a, some, this …
    • Create a new internal symbol and add to the current world model
      • Mayumi has bought a new automobile
      • automobile(g123)
      • new(g123)
      • owns(mayumi, g123)
    • non-specific sense to describe an object
      • Mayumi wantsto buy a new XJE
    • whole classes of objects
      • A new automobiletypically requires repair twice in the first 12 months
    • collect one or more properties
      • The Macho GTE XL is a new automobile
    • Question and commands
      • Is her automobile in a parking placenear the exit?
      • Put her automobile into a parking placenear the exit!
  • Definite Noun Phrases(定名詞)
    • simple referential and generic uses(the same as indefinite)
    • indicate an individual by description that they satisfy
      • The manufacturer of this automobile should be indicted
  • Pronouns(代名詞)
    • reference backs to entities that have been introduced by previous nounphrases in a discourse
    • non-referential noun phrase
      • non-exist object
    • logical variable
      • No male driveradmits that heis incompetent
    • something that is available from the context of utterance, but has not been explicitly mentioned before
      • Here they come, late again!
      • Can’t easily know who are “they”
    • Anaphora
      • Number Agreements
        • John has a Ford Falcon. It is red
        • John has three Ford Falcons. They are red
      • Person Agreement(人稱)
      • Gender Agreement
      • Selection Restrictions
        • verb and its arguments
  • Demonstratives (指示詞)
    • this, that
  • Names
    • Full name > long definite description > short definite description > last name> first name > distal demonstrative > proximate demonstrative > NP > stressed pronoun > unstressed pronoun

Information Status

  • Referential forms used to provide new or old information
  • givenness hierarchy
  • Definite-indefinite is a clue to given-new status
    • The sales managere(given) employed a foreign distributor(new)
  • If there are ambiguous noun phrases in a sentence, then it extracts the presuppositions to provide extra constraints
  • When some new information is added to knowledge base, check if it is consistent with what we already know

Active model of understanding

  • Given a text, build up predictions or expectations about new information and actively compare these with successive input to resolve ambiguities
  • Construct a proof of the information provided in a sentence from the existing world knowledge and plausible inference rules illustrated
  • the inference are not sensitive to the order
    • if the proposition that the disc is heavy is inferred, then it is not changed after the discourse has finished
    • Solution: describe the propositions in temporal order
  • Script: encapsulate a sequence of actions that belong together into a script
    <{customer(C), automobile(A), dealer(D), garage(G)},
    goes(C, G),
    test_drives(C, A),
    orders(C, A, D),
    delivers(D, A, C),
    drives(C, A)


  • HHChen 課堂講義