Monday, August 16th: Jan Hajic, Charles University, Prague, Czech Republic
Reliving the History: The Beginnings of Statistical Machine Translation and Languages with Rich Morphology
In this two-for-one talk, I will first present some difficult issues in the morphology of inflective languages. Then, to lighten up this linguistically and computationally heavy topic, I will recount a half-forgotten history of statistical machine translation and contrast it with the current state of the art (in a rather non-technical way).
Computational morphology has moved in and out of the focus of computational linguistics. Only a few of us probably remember the times when developing proper formalisms was such a focus; a history poll might still find people who remember DATR-II or other heavy-duty formalisms for dealing with the (virtually finite) world of words and their forms. Even unification formalisms were called to duty (and the author himself admits to developing one). However, it is not morphology itself (not even for inflective or agglutinative languages) that causes the headache: with today's cheap storage and computing power, simply listing all thinkable forms in an appropriately hashed list is feasible. The real problem is disambiguation, which is apparently more difficult for morphologically rich languages (perhaps surprisingly, more for the inflective ones than the agglutinative ones) than for analytical ones. Since Ken Church's PARTS tagger, statistical methods of all sorts have been tried, and the accuracy of taggers for most languages is deemed pretty good today, even though not quite perfect yet.
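The contrast drawn above can be sketched in a few lines: analysis by full-form lookup is a constant-time operation, while the ambiguity it returns is what a tagger must resolve. The lexicon entries and tag strings below are toy, hand-made illustrations (a simplified Czech example), not a real morphological lexicon.

```python
# Toy full-form lexicon: each surface form maps to ALL of its analyses
# (lemma, simplified morphological tag). The Czech form "ženu" is a
# classic ambiguity: accusative of "žena" (woman) or 1sg of "hnát" (to drive).
FULL_FORM_LEXICON = {
    "ženu": [("žena", "NOUN Fem Sing Acc"), ("hnát", "VERB Pres 1 Sing")],
    "ženy": [("žena", "NOUN Fem Sing Gen"),
             ("žena", "NOUN Fem Plur Nom"),
             ("žena", "NOUN Fem Plur Acc")],
}

def analyze(form):
    """Morphological analysis reduces to a hash lookup; the hard part,
    disambiguation, is choosing one analysis from the returned list."""
    return FULL_FORM_LEXICON.get(form, [])

for form in ("ženu", "ženy"):
    print(form, "->", analyze(form))
```

Listing the forms is the easy half; a statistical tagger then has to pick the contextually correct analysis from each returned list.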
However, the current results of machine translation are even farther from perfect (not just because of morphology, of course). The current revival of machine translation research will no doubt bring more progress. In the talk, I will try to recall the "good old days" of the original statistical machine translation system, Candide, which was developed at IBM Research from the late 1980s, and show that, as the patents filed then gradually fade and expire, there are several directions, tweaks, and twists that were used then but are largely ignored by the most advanced systems today (including, but not limited to, morphology and tagging, noun phrase chunking, word sense disambiguation, named entity recognition, preferred form selection, etc.). I hope that this will not only shed some light on the early developments in the field of SMT and correct some misconceptions about the original IBM system, often wrongly labeled "word-based", but perhaps also inspire new developments in this area for the future - not only from the point of view of morphologically rich languages.
Wednesday, August 18th: Christiane D. Fellbaum, Princeton University, Princeton, USA
Harmonizing WordNet and FrameNet
Lexical semantic resources are a key component of many NLP systems, whose performance continues to be limited by the "lexical bottleneck." Two large hand-constructed resources, WordNet and FrameNet, differ in their theoretical foundations and their approaches to the representation of word meaning. A core question that both resources address is, how can regularities in the lexicon be discovered and encoded in a way that allows both human annotators and machines to better discriminate and interpret word meanings?
WordNet organizes the bulk of the English lexicon into a network (an acyclic graph) of word form-meaning pairs that are interconnected via directed arcs expressing paradigmatic semantic relations. This classification largely disregards syntagmatic properties such as argument selection for verbs. However, a comparison with a syntax-based approach like Levin (1993) reveals some overlap as well as systematic divergences that can be straightforwardly ascribed to the different classification principles. FrameNet's units are cognitive schemas (Frames), each characterized by a set of lexemes from different parts of speech with Frame-specific meanings (lexical units) and roles (Frame Elements). FrameNet also encodes cross-frame relations that parallel the relations among WordNet's synsets.
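The organization described above - synsets as nodes, directed arcs for paradigmatic relations such as hypernymy - can be sketched with a few toy entries. The synset identifiers below mimic WordNet's naming style, but the data is a hand-made miniature, not the real database.

```python
# Toy fragment of a WordNet-style graph (not the actual WordNet data).
# Each synset groups word form-meaning pairs; directed arcs encode
# the paradigmatic relation of hypernymy.
SYNSETS = {
    "dog.n.01": ["dog", "domestic dog"],
    "canine.n.01": ["canine"],
    "carnivore.n.01": ["carnivore"],
}

# directed hypernym arcs: child synset -> parent synset
HYPERNYM = {"dog.n.01": "canine.n.01", "canine.n.01": "carnivore.n.01"}

def hypernym_chain(synset):
    """Follow the directed arcs upward; the graph is acyclic, so this terminates."""
    chain = []
    while synset in HYPERNYM:
        synset = HYPERNYM[synset]
        chain.append(synset)
    return chain

print(hypernym_chain("dog.n.01"))  # ['canine.n.01', 'carnivore.n.01']
```

Note that nothing in this structure records syntagmatic facts such as which arguments a verb selects, which is exactly the gap FrameNet's Frames and Frame Elements fill.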
Given the somewhat complementary nature of the two resources, an alignment would have at least the following potential advantages: (1) both sense inventories are checked and corrected where necessary, and (2) FrameNet's coverage (lexical units per Frame) can be increased by taking advantage of WordNet's class-based organization. A number of automatic alignments have been attempted, with variations on a few intuitively plausible algorithms. The results are often limited, as implicit assumptions concerning the systematicity of WordNet's encoding or the semantic correspondences across the resources are not fully warranted. Thus, not all members of a synonym set or a subsumption tree are necessarily Frame mates.
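One of the intuitively plausible heuristics mentioned above can be sketched as follows: map a synset to the Frame whose lexical units overlap it most. The frame names and member sets below are toy data for illustration; as the abstract notes, the heuristic's core assumption (that synonym-set members are Frame mates) is exactly what fails in practice.

```python
# Hedged sketch of a naive synset-to-Frame alignment heuristic (toy data).
SYNSET_MEMBERS = {"buy.v.01": {"buy", "purchase"}}

FRAME_LEXICAL_UNITS = {
    "Commerce_buy": {"buy", "purchase", "acquire"},
    "Getting": {"get", "acquire", "obtain"},
}

def align(members):
    """Pick the Frame sharing the most lexical units with the synset."""
    scores = {frame: len(members & lus)
              for frame, lus in FRAME_LEXICAL_UNITS.items()}
    return max(scores, key=scores.get)

print(align(SYNSET_MEMBERS["buy.v.01"]))  # Commerce_buy
```

The heuristic works on this hand-picked example, but whenever a synset's members are scattered across Frames (or a subsumption tree crosses a Frame boundary), overlap counting gives a wrong or arbitrary answer.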
We carry out a manual alignment of selected word forms against tokens in the American National Corpus that can serve as a basis for semi-automatic alignment. This work addresses a persistent, unresolved question: to what extent can humans select, and agree on, the context-appropriate meaning of a word with respect to a lexical resource? We discuss representative cases, the challenges they pose, and our solutions for alignment, as well as initial steps toward semi-automatic alignment.
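The agreement question raised above is commonly quantified with a chance-corrected statistic such as Cohen's kappa over the sense labels two annotators assign to the same tokens. The sense labels below are hypothetical examples, not data from the study.

```python
# Cohen's kappa for two annotators' sense choices (toy labels).
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two parallel label sequences."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[lab] * cb[lab] for lab in ca) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["bank.n.01", "bank.n.01", "bank.n.02", "bank.n.01"]
b = ["bank.n.01", "bank.n.02", "bank.n.02", "bank.n.01"]
print(round(cohen_kappa(a, b), 3))  # 0.5
```

Low kappa on a sense-selection task signals either genuinely hard tokens or a sense inventory whose distinctions annotators cannot reliably apply, which is one motivation for checking both inventories during alignment.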
(Joint work with Collin Baker and Nancy Ide)