Sunday, April 11, 2010

lin1999training Training a selection function for extraction

Lin, C.-Y. (1999). Training a selection function for extraction. In Proceedings of CIKM '99, pages 55-62, New York, NY, USA.

ABSTRACT
In this paper we compare performance of several heuristics in generating informative generic/query-oriented extracts for newspaper articles in order to learn how topic prominence affects the performance of each heuristic. We study how different query types can affect the performance of each heuristic and discuss the possibility of using machine learning algorithms to automatically learn good combination functions to combine several heuristics. We also briefly describe the design, implementation, and performance of a multilingual text summarization system, SUMMARIST.

= = = = = = = = = =
[das2007survey]
In later work, Lin (1999) broke away from the assumption that features are independent of each other and modeled the problem of sentence extraction with decision trees instead of a naive-Bayes classifier. He examined many features and their effect on sentence extraction.

The data used in this work is a publicly available collection of texts, classified into various topics, provided by the TIPSTER-SUMMAC evaluations and targeted towards information retrieval systems. The dataset contains essential text fragments (phrases, clauses, and sentences) which must be included in summaries to answer some TREC topics. These fragments were each evaluated by a human judge. The experiments described in the paper use the SUMMARIST system developed at the University of Southern California. The system extracted sentences from the documents, and, as in most early work on extractive summarization, these were matched against human extracts.

Some novel features were the query signature (a normalized score reflecting how many query words a sentence contains), the IR signature (the m most salient words in the corpus, similar to the signature words of Aone et al. (1999)), numerical data (boolean value 1 for sentences containing a number), proper name (boolean value 1 for sentences containing a proper name), pronoun or adjective (boolean value 1 for sentences containing a pronoun or adjective), and weekday or month and quotation (boolean features analogous to the previous ones). It is worth noting that some features, like the query signature, are query-oriented because of the setting of the evaluation, unlike in a generalized summarization framework.
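To make the feature set concrete, here is a minimal sketch of how such per-sentence features might be computed. The function name, the crude proper-name heuristic, and the exact normalization are illustrative assumptions, not Lin's implementation.

```python
import re

# Illustrative cue lists; Lin's actual lexicons are not specified here.
WEEKDAYS_MONTHS = {"monday", "tuesday", "wednesday", "thursday", "friday",
                   "saturday", "sunday", "january", "february", "march",
                   "april", "may", "june", "july", "august", "september",
                   "october", "november", "december"}

def sentence_features(sentence, query_words):
    """Hypothetical per-sentence features in the spirit of Lin (1999)."""
    tokens = sentence.split()
    lowered = [t.strip(".,;:!?\"'").lower() for t in tokens]
    overlap = sum(1 for t in lowered if t in query_words)
    return {
        # query signature: query-word overlap, normalized by query length
        "query_signature": overlap / max(len(query_words), 1),
        # boolean features: 1 if the cue appears anywhere in the sentence
        "numerical_data": int(any(re.search(r"\d", t) for t in tokens)),
        # crude proper-name proxy: a capitalized word not at sentence start
        "proper_name": int(any(t[:1].isupper() for t in tokens[1:])),
        "weekday_or_month": int(any(t in WEEKDAYS_MONTHS for t in lowered)),
        "quotation": int('"' in sentence),
    }

feats = sentence_features('The court met on Monday, June 5.', {"court", "ruling"})
```

Each sentence is thus reduced to a small feature vector, which is what a learned combination function operates on.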

The author experimented with various baselines, such as using only the positional feature, or combining all features naively by summing their values. When evaluated by matching machine-extracted against human-extracted sentences, the decision tree classifier was clearly the winner on the whole dataset, but on three topics the naive combination of features beat it. Lin conjectured that this happened because some of the features were independent of each other. Feature analysis suggested that the IR signature was a valuable feature, corroborating the early findings of Luhn (1958).
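The idea of learning a combination function rather than summing feature values can be sketched as follows. The toy feature vectors, labels, and the scikit-learn classifier are illustrative assumptions, not the original experimental setup.

```python
# Learn a combination of sentence features with a decision tree,
# in the spirit of Lin (1999); data below is a toy example.
from sklearn.tree import DecisionTreeClassifier

# Each row: [position_score, query_signature, numerical_data, proper_name]
X = [
    [1.0, 0.5, 1, 1],   # lead sentence, high query overlap
    [0.9, 0.0, 0, 1],   # early sentence, no query overlap
    [0.2, 0.0, 0, 0],   # late, featureless
    [0.1, 0.5, 1, 0],   # late but query-relevant
    [0.3, 0.0, 0, 0],   # late, featureless
]
y = [1, 1, 0, 1, 0]     # 1 = sentence belongs in the extract

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
pred = clf.predict([[0.95, 0.4, 1, 1]])  # score a new candidate sentence
```

Unlike a naive sum, the tree can learn interactions, e.g. that query overlap matters mainly for sentences outside the lead position.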
