McKeown, K. R. and Radev, D. R. (1995). Generating summaries of multiple news articles. In Proceedings of SIGIR '95, pages 74-82, Seattle, Washington.
Radev, D. R. and McKeown, K. (1998). Generating natural language summaries from multiple on-line sources. Computational Linguistics, 24(3):469-500.
We present a methodology for summarization of news about current events in the form of briefings that include appropriate background (historical) information. The system that we developed, SUMMONS, uses the output of systems developed for the DARPA Message Understanding Conferences (MUC) to generate summaries of multiple documents on the same or related events, presenting similarities and differences, contradictions, and generalizations among sources of information. We describe the various components of the system, showing how information from multiple articles is combined, organized into a paragraph, and finally, realized as English sentences. A feature of our work is the extraction of descriptions of entities such as people and places for reuse to enhance a briefing.
= = = = = = = = = =
[das2007survey]
As far as we know, SUMMONS (McKeown and Radev, 1995; Radev and McKeown, 1998) is the first historical example of a multi-document summarization system. It tackles single events in a narrow domain (news articles about terrorism) and produces a briefing merging relevant information about each event and about how reports by different news agencies have evolved over time.
The whole thread of reports is then presented, as illustrated in the following example of a "good" summary:
"In the afternoon of February 26, 1993, Reuters reported that a suspect bomb killed at least five people in the World Trade Center. However, Associated Press announced that exactly five people were killed in the blast. Finally, Associated Press announced that Arab terrorists were possibly responsible for the terrorist act."
Rather than working with raw text, SUMMONS reads a database previously built by a template-based message understanding system. A full multi-document summarizer is built by concatenating the two systems, first processing full text as input and filling template slots, and then synthesizing a summary from the extracted information. The architecture of SUMMONS consists of two major components: a content planner that selects the information to include in the summary through combination of the input templates, and a linguistic generator that selects the right words to express the information in grammatical and coherent text.
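To make the input concrete, here is a minimal sketch of the kind of filled template SUMMONS consumes instead of raw text; the slot names are illustrative, not the actual MUC slot inventory.

# Hypothetical MUC-style template for one article (illustrative slot names).
template_reuters = {
    "message_id": "TST-REU-0001",        # made-up identifier
    "source": "Reuters",
    "date": "1993-02-26",
    "incident_type": "bombing",
    "incident_location": "World Trade Center",
    "perpetrator": None,                 # not reported in this article
    "number_of_victims": "at least five",
}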
The linguistic generator was devised by adapting existing language generation tools, namely the FUF/SURGE system.
Content planning is performed through summary operators, a set of heuristic rules that carry out operations such as "change of perspective," "contradiction," "refinement," etc. Some of these operations require resolving conflicts, i.e., contradictory information among different sources or time instants; others complete pieces of information that are included in some articles and not in others, combining them into a single template. At the end, the linguistic generator gathers all the combined information and uses connective phrases to synthesize a summary.
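As a rough illustration of how such an operator might look, the sketch below hand-codes a "contradiction" rule over two templates of the kind shown earlier; it is a simplification, not the system's actual rule set.

# Toy "contradiction" operator: fires when two sources describe the same
# incident but disagree on the victim count, and returns a content unit
# for the linguistic generator to realize with a connective like "however".
def contradiction_operator(t1, t2):
    same_incident = (t1["incident_type"] == t2["incident_type"]
                     and t1["incident_location"] == t2["incident_location"])
    if same_incident and t1["number_of_victims"] != t2["number_of_victims"]:
        return {"operator": "contradiction",
                "connective": "however",
                "units": [t1, t2]}
    return None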
While this framework seems promising when the domain is narrow enough so that the templates can be designed by hand, a generalization for broader domains would be problematic. This was improved later by McKeown et al. (1999) and Barzilay et al. (1999), where the input is now a set of related documents in raw text, like those retrieved by a standard search engine in response to a query.
The system starts by identifying themes, i.e., sets of similar text units (usually paragraphs). This is formulated as a clustering problem. To compute a similarity measure between text units, these are mapped to vectors of features, which include single words weighted by their TF-IDF scores, noun phrases, proper nouns, synsets from the WordNet database, and a database of semantic classes of verbs. For each pair of paragraphs, a vector is computed that represents matches on the different features. Decision rules that were learned from data are then used to classify each pair of text units as either similar or dissimilar; this in turn feeds a subsequent algorithm that places the most related paragraphs in the same theme.
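A toy approximation of this pairwise step, with a much smaller feature set than the real system and a hand-written threshold rule standing in for the learned decision rules:

def word_overlap(a, b):
    # Jaccard overlap of lower-cased word sets (the real system weights words by TF-IDF)
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def proper_noun_overlap(a, b):
    # crude proxy for proper nouns: capitalized tokens, punctuation stripped
    caps = lambda s: {w.strip(".,;:") for w in s.split() if w[:1].isupper()}
    pa, pb = caps(a), caps(b)
    return len(pa & pb) / max(1, len(pa | pb))

def same_theme(par_a, par_b, w_thresh=0.2, p_thresh=0.3):
    # stand-in for the decision rules learned from data
    return (word_overlap(par_a, par_b) > w_thresh
            or proper_noun_overlap(par_a, par_b) > p_thresh)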
Once themes are identified, the system enters its second stage: information fusion. The goal is to decide which sentences of a theme should be included in the summary. Rather than just picking a sentence that is a group representative, the authors propose an algorithm that compares and intersects the predicate-argument structures of the phrases within each theme to determine which are repeated often enough to be included in the summary. This is done as follows: first, sentences are parsed with Collins' statistical parser (Collins, 1999) and converted into dependency trees, which allows capturing the predicate-argument structure and identifying functional roles. Determiners and auxiliaries are dropped; Fig. 3 shows a sentence representation.
The comparison algorithm then traverses these dependency trees recursively, adding identical nodes to the output tree. Once full phrases (a verb with at least two constituents) are found, they are marked to be included in the summary. If two phrases rooted at some node are not identical but similar, the hypothesis that they are paraphrases of each other is considered; to take this into account, corpus-driven paraphrasing rules are written to allow paraphrase intersection. Once the summary content (represented as predicate-argument structures) is decided, a grammatical text is generated by translating those structures into the arguments expected by the FUF/SURGE language generation system.
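A small sketch of the recursive-intersection idea, with trees represented as nested dictionaries (a simplification of the parser output described above):

def intersect(node_a, node_b):
    # Keep a node only if it appears in both trees; recurse into children
    # that share the same functional role.
    if node_a is None or node_b is None or node_a["word"] != node_b["word"]:
        return None
    common = {"word": node_a["word"], "children": {}}
    for role, child_a in node_a["children"].items():
        merged = intersect(child_a, node_b["children"].get(role))
        if merged is not None:
            common["children"][role] = merged
    return common

# "A bomb killed five people" vs. "A bomb killed five people in the center"
tree_a = {"word": "kill", "children": {
    "subject": {"word": "bomb", "children": {}},
    "object": {"word": "people", "children": {
        "modifier": {"word": "five", "children": {}}}}}}
tree_b = {"word": "kill", "children": {
    "subject": {"word": "bomb", "children": {}},
    "object": {"word": "people", "children": {
        "modifier": {"word": "five", "children": {}}}},
    "location": {"word": "center", "children": {}}}}
print(intersect(tree_a, tree_b))  # only the structure shared by both trees survives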
= = = = = = = = = =
1. Introduction
Some characteristics that distinguish a briefing from the general concept of a summary are:
- ... keep a person up to date on a certain event ...
- ... focus on certain types of information ... more user-centered than general summaries
novel techniques:
• It briefs the user on information of interest using tools related to information extraction, conceptual combination, and text generation.
• It combines information from multiple news articles into a coherent summary using symbolic techniques.
• It augments the resulting summaries using descriptions of entities obtained from on-line sources.
In order to extract information of interest to the user, SUMMONS makes use of components from several MUC systems
The right side of the figure shows how proper nouns and their descriptions are extracted from past news. An entity extractor identifies proper nouns in the past newswire archives, along with descriptions. Descriptions are then categorized using the WordNet hierarchy. Finally, an FD, or functional description (Elhadad 1993), for the description is generated so that it can be reused in fluent ways in the final summary. FDs mix functional, semantic, syntactic, and lexical information in a recursive attribute-value format that serves as the basic data structure for all information within FUF/SURGE.
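For concreteness, a rough Python rendering of the recursive attribute-value idea; the real FUF input uses a Lisp-like notation, and the feature names here are only illustrative, not the exact SURGE feature set.

# Sketch of an FD for an entity description, as nested attribute-value pairs.
fd_description = {
    "cat": "np",                        # syntactic category: noun phrase
    "definite": True,
    "describer": {"lex": "Pakistani"},  # nationality description
    "classifier": {"lex": "defense"},
    "head": {"lex": "minister"},        # profession description
}
# SURGE would unify an FD like this with its grammar to realize a phrase
# such as "the Pakistani defense minister".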
2. Related Work
3. System Overview
The full content is then passed through a sentence generator, implemented using the FUF/SURGE language generation system (Elhadad 1993; Robin 1994). FUF is a functional unification formalism that uses a large systemic grammar of English, called SURGE, to fill in syntactic constraints, build a syntactic tree, choose closed-class words, and eventually linearize the tree as a sentence.
4. Generating the Summary
eight different planning operators,
4.1 Overview of the Summarization Component
two main components:
- content planner: selects the information from an underlying knowledge base to include in a text; produces a conceptual representation of text meaning (e.g., a frame, a logical form, or an internal representation of text) and typically does not include any linguistic information.
- linguistic component: selects words to refer to the concepts contained in the selected information and arranges those words, appropriately inflecting them, to form an English sentence; uses a lexicon and a grammar of English.
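A crude stand-in for the two components above, reusing the contradiction unit sketched earlier and a plain string template in place of a real lexicon and grammar (FUF/SURGE in the actual system):

# Toy "linguistic component": realizes a contradiction content unit as text.
def realize_contradiction(unit):
    t1, t2 = unit["units"]
    return (f"{t1['source']} reported that a bomb killed "
            f"{t1['number_of_victims']} people in the {t1['incident_location']}. "
            f"{unit['connective'].capitalize()}, {t2['source']} announced that "
            f"{t2['number_of_victims']} people were killed in the blast.")

unit = {"operator": "contradiction", "connective": "however",
        "units": [{"source": "Reuters", "number_of_victims": "at least five",
                   "incident_location": "World Trade Center"},
                  {"source": "Associated Press", "number_of_victims": "exactly five",
                   "incident_location": "World Trade Center"}]}
print(realize_contradiction(unit))
# -> "Reuters reported that a bomb killed at least five people in the
#     World Trade Center. However, Associated Press announced that exactly
#     five people were killed in the blast."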
4.2 Methodology: Collecting and Using a Summary Corpus
used available online corpora ... then extracted manually, after careful investigation, several hundred language constructions that we found relevant to the types of summaries we want to produce.
cue phrases collected from the corpus,
4.3 Summary Operators for Content Planning
developed a set of heuristics derived from the corpora that decide what types of simple sentences constitute a summary, in what order they need to be listed, as well as the ways in which simple sentences are combined into more complex ones.
4.4 Algorithm
4.4.1 Input.
4.4.2 Preprocessing.
4.4.3 Heuristic Combination.
4.4.4 Discourse Planning.
4.4.5 Ordering of Templates and Linguistic Generation.
4.5 An Example of System Operation
Article 1:
Article 2:
Article 3:
Article 4:
Figure 7 Template for article one.
Figure 8 Template for article two.
Figure 9 Template for article three.
Figure 10 Template for article four.
templates are generated manually from the input newswire texts.
Figure 11.
The first two sentences are generated from template one. The subsequent sentences are generated using different operators that are triggered according to changing values for certain attributes in the three remaining templates.
5. Generating Descriptions
profile manager
5.1 Creation of a Database of Profiles
5.1.1 Extraction of Entity Names from Old Newswire.
5.1.2 Extraction of Descriptions.
5.1.3 Categorization of Descriptions.
WordNet is used to group extracted descriptions into categories, e.g., "profession," "nationality," "organization." Each of these concepts is triggered by one or more words (which we call trigger terms) in the description.
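A hedged sketch of this lookup using NLTK's WordNet interface; the anchor concepts chosen here ("person", "organization") are my own illustration, not the exact trigger-term categories used by SUMMONS.

from nltk.corpus import wordnet as wn   # requires the NLTK WordNet data

# Map a category name to the WordNet synsets that anchor it (illustrative).
CATEGORY_ANCHORS = {
    "person": set(wn.synsets("person", pos=wn.NOUN)),
    "organization": set(wn.synsets("organization", pos=wn.NOUN)),
}

def categorize(trigger_term):
    # File the description under a category if any noun sense of the trigger
    # term has one of the anchor synsets among its hypernyms.
    for sense in wn.synsets(trigger_term, pos=wn.NOUN):
        hypernyms = set(sense.closure(lambda s: s.hypernyms()))
        for category, anchors in CATEGORY_ANCHORS.items():
            if hypernyms & anchors:
                return category
    return "unknown"

print(categorize("minister"))   # "person" under these anchors
print(categorize("agency"))     # "organization" under these anchors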
5.1.4 Organization of Descriptions in a Database of Profiles.
5.2 Generation of Descriptions
improved summary by merging information extracted from the input articles with information from the other sources (Radev and McKeown 1997).
5.2.1 Transformation of Descriptions into Functional Descriptions.
5.2.2 Regenerating Descriptions.
6. System Status
6.1 Summary Generation
6.2 The Description Generator
6.3 Portability
6.4 Suggested Evaluation
7. Future Work
8. Conclusion
Our prototype system demonstrates the feasibility of generating briefings from a series of domain-specific news articles on the same event, highlighting changes over time as well as similarities and differences among sources, and including some historical information about the participants. The ability to automatically summarize heterogeneous material will be critical for using the Internet effectively without information overload. We show how planning operators can be used to synthesize summary content from a set of templates, each representing a single article. These planning operators are empirically based, derived from an analysis of existing summaries, and allow for the generation of concise briefings. Our framework allows for experimentation with summaries of different lengths and for the combination of multiple, independent summary operators to produce more complex summaries with added descriptions.