Unpublished paper
A Survey on Automatic Text Summarization
Dipanjan Das and André F.T. Martins
Literature Survey for the Language and Statistics II course at Carnegie Mellon University, 2007
This survey reviews the literature on automatic text summarization, giving special emphasis to empirical methods and extractive techniques.
Some promising approaches that concentrate on specific details of the summarization problem are also discussed. Special attention is devoted to automatic evaluation of summarization systems, as future research on summarization is strongly dependent on progress in this area.
1 Introduction
Several operations are commonly distinguished (Radev et al., 2002):
- extraction is the procedure of identifying important sections of the text and producing them verbatim;
- abstraction aims to produce the important material in a new way;
- fusion combines extracted parts coherently;
- compression aims to throw out unimportant sections of the text.
The earliest work on summarization dealt with scientific documents, extracting salient sentences using features like
- word and phrase frequency (Luhn, 1958),
- position in the text (Baxendale, 1958), and
- key phrases (Edmundson, 1969).
Extractive summarization is mainly concerned with determining what the summary content should be; abstractive summarization additionally aims at producing a grammatical summary, which requires advanced language generation techniques.
A crucial issue that will certainly drive future research on summarization is evaluation.
In this survey, we focus on how empirical methods have been used to build summarization systems.
2 Single-Document Summarization
This section describes some prominent extractive techniques.
We look at:
- early work from the 1950s and 60s that kicked off research on summarization,
- approaches involving machine learning techniques published from the 1990s to today, and
- some techniques that use more complex natural language analysis to tackle the problem.
2.1 Early Work
Most early work focused on technical documents. Perhaps the most cited work is that of Luhn (1958).
"In his work, Luhn proposed that the frequency of a particular word in an article provides an useful measure of its significance. There are several key ideas put forward in this paper that have assumed importance in later work on summarization. As a first step, words were stemmed to their root forms, and stop words were deleted. Luhn then compiled a list of content words sorted by decreasing frequency, the index providing a signi cance measure of the word.
On a sentence level, a signi cance factor was derived that reflects the number of occurrences of significant words within a sentence, and the linear distance between them due to the intervention of non-signi cant words. All sentences are ranked in order of their significance factor, and the top ranking sentences are finally selected to form the auto-abstract.
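As an illustration, here is a minimal sketch of Luhn-style sentence scoring in Python. It is a simplification of the 1958 procedure: the stop-word list and the cutoff for "significant" words are placeholder choices, and the significance factor is approximated as the squared count of significant words in a sentence divided by its length, without Luhn's bracketing of clusters of significant words.

    from collections import Counter

    STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are"}  # toy stop list

    def luhn_extract(sentences, num_sentences=2, top_k_words=10):
        # Simplification without stemming: collect content words over the whole
        # document and treat the most frequent ones as "significant".
        words = [w.lower() for s in sentences for w in s.split()
                 if w.lower() not in STOP_WORDS]
        significant = {w for w, _ in Counter(words).most_common(top_k_words)}

        def significance_factor(sentence):
            tokens = [w.lower() for w in sentence.split()]
            hits = sum(1 for t in tokens if t in significant)
            return (hits ** 2) / len(tokens) if tokens else 0.0

        # Rank sentences by their significance factor and keep the top ones,
        # restored to document order to form the auto-abstract.
        top = sorted(sentences, key=significance_factor, reverse=True)[:num_sentences]
        return [s for s in sentences if s in top]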
Edmundson (1969) describes a system that produces document extracts. His primary contribution was the development of a typical structure for an extractive summarization experiment. First, the author developed a protocol for creating manual extracts, which was applied to a set of 400 technical documents. The two features of word frequency and positional importance were incorporated from the previous two works, and two further features were used: the presence of cue words (words like "significant" or "hardly") and the skeleton of the document (whether the sentence is a title or heading).
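A schematic sketch of Edmundson-style scoring, assuming each of the four sources of evidence has already been turned into a numeric score per sentence; the weights below are placeholders (Edmundson tuned the combination against the manual extracts).

    def edmundson_score(cue, key, title, location, weights=(1.0, 1.0, 1.0, 1.0)):
        # Weighted linear combination of the four kinds of evidence:
        # cue words, key (frequency) words, title/heading words, and sentence location.
        a, b, c, d = weights
        return a * cue + b * key + c * title + d * location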
2.2 Machine Learning Methods
While initially most systems assumed feature independence and relied on naive-Bayes methods, others have focused on the choice of appropriate features and on learning algorithms that make no independence assumptions. Other significant approaches involved hidden Markov models and log-linear models to improve extractive summarization. A very recent paper, in contrast, used neural networks and third party features (like common words in search engine queries) to improve purely extractive single-document summarization.
Kupiec et al. (1995) describe a method derived from Edmundson (1969) that is able to learn from data. The classification function categorizes each sentence as worthy of extraction or not, using a naive-Bayes classifier.
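Assuming independence of the features, the classification function reduces to the familiar naive-Bayes form (writing s ∈ S for "sentence s belongs to the summary" and F_1, ..., F_k for the feature values of s):

    P(s \in S \mid F_1, \ldots, F_k) = \frac{P(s \in S)\,\prod_{j=1}^{k} P(F_j \mid s \in S)}{\prod_{j=1}^{k} P(F_j)}

Sentences are then ranked by this score and the top-scoring ones are extracted.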
Aone et al. (1999) also incorporated a naive-Bayes classifier, but with richer features. They describe a system called DimSum that made use of features like term frequency (tf ) and inverse document frequency (idf) to derive signature words.
Statistically derived two-noun collocations were used as counting units, along with single words. A named-entity tagger was used, and each entity was treated as a single token. They also employed some shallow discourse analysis, such as tracking references to the same entities in the text to maintain cohesion.
Synonyms and morphological variants were also merged while considering lexical terms.
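A minimal Python sketch of deriving tf.idf signature words in the spirit of DimSum: words that are frequent in the document but rare in a background corpus. The smoothing and the cutoff are illustrative choices, not those of the original system.

    import math
    from collections import Counter

    def signature_words(doc_tokens, background_docs, top_k=10):
        # doc_tokens: list of tokens in the document; background_docs: list of token sets.
        tf = Counter(doc_tokens)
        n_docs = len(background_docs)

        def idf(word):
            df = sum(1 for d in background_docs if word in d)
            return math.log((n_docs + 1) / (df + 1))  # add-one smoothed idf

        scored = {w: tf[w] * idf(w) for w in tf}
        return sorted(scored, key=scored.get, reverse=True)[:top_k]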
Lin and Hovy (1997) studied the importance of a single feature, sentence position. Weighting a sentence by its position in the text, which the authors term the "position method", arises from the idea that texts generally follow a predictable discourse structure, and that sentences of greater topic centrality tend to occur in certain specifiable locations (e.g., the title or the abstract).
However, since discourse structure varies significantly across domains, the position method cannot be defined as naively as in (Baxendale, 1958).
The paper makes an important contribution by investigating how the position method can be tailored towards optimality over a genre and how its effectiveness can be evaluated.
A newswire corpus was used: the collection of Ziff-Davis texts produced for the TIPSTER program, consisting of texts about computers and related hardware, each accompanied by a set of key topic words and a small abstract of six sentences. For each document in the corpus, the authors measured the yield of each sentence position against the topic keywords. They then ranked the sentence positions by their average yield to produce the Optimal Position Policy (OPP) for the genre.
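A hedged sketch of how such a policy could be computed; the windowing and normalization details of the original paper are omitted.

    from collections import defaultdict

    def optimal_position_policy(corpus):
        # corpus: list of (sentences, topic_keywords) pairs, where each sentence is a
        # list of tokens and topic_keywords is a set of lowercase keywords.
        position_yields = defaultdict(list)
        for sentences, keywords in corpus:
            for position, tokens in enumerate(sentences):
                hits = sum(1 for t in tokens if t.lower() in keywords)
                position_yields[position].append(hits)
        # Rank sentence positions by their average keyword yield across the corpus.
        average = {p: sum(v) / len(v) for p, v in position_yields.items()}
        return sorted(average, key=average.get, reverse=True)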
Two kinds of evaluation were performed. In the first, previously unseen text was used to test whether the same procedure would work in a different domain; the resulting position-yield contours closely matched those of the training documents. In the second, word overlap between manual abstracts and the extracted sentences was measured: windows in the abstracts were compared with windows over the selected sentences, and the corresponding precision and recall values were computed. The high degree of coverage indicated the effectiveness of the position method.
In later work, Lin (1999) broke away from the assumption that features are independent of each other and tried to model the problem of sentence extraction using decision trees, instead of a naive-Bayes classifier.
Some novel features were the query signature (a normalized score given to a sentence depending on the number of query words it contains) and the IR signature.
Hidden Markov Models
"In contrast with previous approaches, that were mostly feature-based and nonsequential, Conroy and O'leary (2001) modeled the problem of extracting a sentence from a document using a hidden Markov model (HMM). The basic motivation for using a sequential model is to account for local dependencies between sentences. Only three features were used: position of the sentence in the document (built into
the state structure of the HMM), number of terms in the sentence, and likeliness of the sentence terms given the document terms.
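To make the sequential idea concrete, here is a highly simplified Viterbi decoder over a two-state (summary / non-summary) chain. Conroy and O'Leary's actual model has a richer state structure that encodes sentence position; the transition and emission scores below are placeholders supplied by the caller.

    def viterbi_extract(emission_logprobs, log_trans):
        # emission_logprobs: one dict per sentence, {state: log P(features | state)}
        # log_trans: {(previous_state, state): log transition probability}
        states = ("summary", "other")
        best = [{s: emission_logprobs[0][s] for s in states}]
        back = []
        for obs in emission_logprobs[1:]:
            scores, pointers = {}, {}
            for s in states:
                prev = max(states, key=lambda p: best[-1][p] + log_trans[(p, s)])
                scores[s] = best[-1][prev] + log_trans[(prev, s)] + obs[s]
                pointers[s] = prev
            best.append(scores)
            back.append(pointers)
        # Trace back the most likely state sequence; "summary" states give the extract.
        last = max(states, key=lambda s: best[-1][s])
        path = [last]
        for pointers in reversed(back):
            path.append(pointers[path[-1]])
        return list(reversed(path))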
Log-Linear Models
"Osborne (2002) claims that existing approaches to summarization have always assumed feature independence. The author used log-linear models to obviate this assumption and showed empirically that the system produced better extracts than a naive-Bayes model, with a prior appended to both models.
Neural Networks and Third Party Features
The best results were reported with this approach, significantly outperforming the baseline.
2.3 Deep Natural Language Analysis Methods
None of these papers solves the problem using machine learning; instead, they use a set of heuristics to create document extracts. Most of these techniques try to model the text's discourse structure.
3 Multi-Document Summarization
Multi-document summarization has been studied since the mid-1990s, mostly on news articles; systems such as Google News, NewsBlaster, and News In Essence reflect this interest.
These systems must deal with multiple sources of information that overlap and supplement each other, and that are at times contradictory.
The key tasks are thus not only identifying and coping with redundancy across documents, but also recognizing novelty and ensuring that the final summary is both coherent and complete.
The area was pioneered at Columbia University with the SUMMONS system (McKeown and Radev, 1995).
Extractive techniques in this setting typically make use of similarity measures between pairs of sentences.
Approaches vary on how these similarities are used; some
- identify common themes through clustering and then select one sentence to represent each cluster,
- generate a composite sentence from each cluster, or
- build the summary dynamically, including each candidate passage only if it is considered novel with respect to the previously included passages, via maximal marginal relevance (MMR); a sketch of MMR selection follows this list.
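This is a minimal sketch of greedy MMR selection, assuming a generic similarity function sim (for instance, cosine similarity over tf.idf vectors); lambda_ trades off relevance to the query against novelty with respect to already selected passages.

    def mmr_select(candidates, query, sim, k=5, lambda_=0.7):
        # Greedy MMR: at each step pick the candidate that best balances relevance
        # to the query against redundancy with what has already been selected.
        selected = []
        remaining = list(candidates)
        while remaining and len(selected) < k:
            def mmr_score(c):
                redundancy = max((sim(c, s) for s in selected), default=0.0)
                return lambda_ * sim(c, query) - (1 - lambda_) * redundancy
            best = max(remaining, key=mmr_score)
            selected.append(best)
            remaining.remove(best)
        return selected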
Some recent work extends multi-document summarization to multilingual environments.
3.1 Abstraction and Information Fusion
3.2 Topic-driven Summarization and MMR
3.3 Graph Spreading Activation
3.4 Centroid-based Summarization
3.5 Multilingual Multi-document Summarization
4 Other Approaches to Summarization
Some unconventional approaches investigate particular details that underlie the summarization process; these have a role to play in future research in this field.
4.1 Short Summaries (headline)
Purely extractive summarization is not very powerful for producing very short summaries such as headlines.
Witbrock and Mittal (1999) used statistical models to capture both the order and the likelihood of the appearance of tokens in the target documents; the two models were used to co-constrain each other.
Content selection was treated like a translation model between a document and its summary (Brown et al., 1993): a mapping between a word in the document and the likelihood of some word appearing in the summary.
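The following is an illustrative sketch, not the original system: a content-selection table (how likely a document word is to appear in the headline) and a bigram language model jointly score candidate words, and a greedy search stands in for the Viterbi-style search such models typically use.

    import math

    def generate_headline(doc_words, p_in_headline, bigram_lm, length=6):
        # p_in_headline: {word: probability the word appears in a headline, given it
        #                 appears in the document} (content-selection model)
        # bigram_lm:     {(prev_word, word): probability} (ordering model)
        headline, prev = [], "<s>"
        candidates = set(doc_words)
        for _ in range(length):
            if not candidates:
                break
            def score(w):
                return (math.log(p_in_headline.get(w, 1e-9))
                        + math.log(bigram_lm.get((prev, w), 1e-9)))
            best = max(candidates, key=score)
            headline.append(best)
            candidates.discard(best)
            prev = best
        return " ".join(headline)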
Evaluation: compared against the actual headlines for a set of input newswire stories. Since
phrasing could not be compared, they compared the generated headlines against the actual headlines, as well as the top ranked summary sentence of the story.
4.2 Sentence Compression
Sentence compression can be viewed as a step toward later tackling the more complex problems of single- and multi-document summarization.
Techniques applied include the noisy-channel model and decision trees; a noisy-channel formulation is sketched below.
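In a noisy-channel formulation of compression, the goal is to find, for a long sentence l, the short sentence s maximizing

    \hat{s} = \arg\max_{s} P(s \mid l) = \arg\max_{s} P(s)\, P(l \mid s)

where P(s) is a source (language) model over well-formed short sentences and P(l | s) is a channel model of how the short sentence is expanded into the observed long one.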
4.3 Sequential document representation
This line of work concerns document representation, with applications in summarization:
- the bag-of-words representation,
- the vector space model (Salton et al., 1975), and
- simplicial curves (Lebanon et al., 2007).
A small sketch of the first two views follows.
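This is a minimal illustration, not tied to any particular paper: each document becomes a term-count vector, and similarity between documents is measured by cosine.

    import math
    from collections import Counter

    def bow_vector(tokens):
        # Bag-of-words: a document reduced to term counts, ignoring word order.
        return Counter(tokens)

    def cosine(u, v):
        # Cosine similarity between two sparse count vectors.
        dot = sum(u[t] * v[t] for t in u if t in v)
        norm = (math.sqrt(sum(c * c for c in u.values()))
                * math.sqrt(sum(c * c for c in v.values())))
        return dot / norm if norm else 0.0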
5 Evaluation
Evaluating a summary is a difficult task because there does not exist an ideal summary for a given document or set of documents.
Agreement between human summarizers is quite low, both for evaluating and generating summaries.
Another difficulty is the widespread use of disparate metrics, which complicates comparison across systems.
5.1 Human and Automatic Evaluation
A widely used tool for human evaluation is ISI's SEE (Summary Evaluation Environment).
5.2 ROUGE
ROUGE measures n-gram overlap between a candidate summary and a set of reference summaries, and was inspired by the BLEU metric used in machine translation.
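The recall-oriented ROUGE-N score, for example, can be written as

    \mathrm{ROUGE\text{-}N} = \frac{\sum_{S \in \mathrm{References}} \sum_{\mathrm{gram}_n \in S} \mathrm{Count}_{\mathrm{match}}(\mathrm{gram}_n)}{\sum_{S \in \mathrm{References}} \sum_{\mathrm{gram}_n \in S} \mathrm{Count}(\mathrm{gram}_n)}

where Count_match(gram_n) is the maximum number of times the n-gram co-occurs in the candidate summary and a reference summary.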
5.3 Information-theoretic Evaluation of Summaries
The reported performance of this approach is better for multi-document summarization.
6 Conclusion
There is still a long way to go in this field.
Attention has drifted from summarizing scientific articles to news articles, electronic mail messages, advertisements, and blogs.
Both abstractive and extractive approaches have been attempted. Abstractive summarization requires heavy machinery for language generation and is difficult to extend to broader domains. In contrast, the extraction of sentences has produced satisfactory results in large-scale applications, especially in multi-document summarization.
This survey emphasizes extractive approaches to summarization using statistical methods.
A lot of interesting work is being done away from the mainstream research in this field; such approaches are relevant to future research even if they focus only on small details of the general summarization process.
Finally, some recent trends in the automatic evaluation of summarization systems and methods were discussed.