Automatic summarization is the process of shortening a text document with
software to create a summary that retains the major points of the original
document. Technologies that can produce a coherent summary take into account
variables such as length, writing style and syntax.
Automatic data summarization is part of machine learning and data mining. The
main idea of summarization is to find a subset of the data that contains the
"information" of the entire set. Such techniques are widely used in industry
today. Search engines are one example; others include summarization of
documents, image collections and videos. Document summarization tries to create
a representative summary or abstract of the entire document by finding the most
informative sentences, while in image summarization the system finds the most
representative and important (i.e. salient) images. For surveillance videos,
one might want to extract the important events from the uneventful context.
There are two general approaches to automatic summarization: extraction and
abstraction. Extractive methods work by selecting a subset of existing words,
phrases, or sentences in the original text to form the summary. In contrast,
abstractive methods build an internal semantic representation and then use
natural language generation techniques to create a summary that is closer to
what a human might write. Such a summary might contain wording that does not
appear in the source. Research to date has focused primarily on extractive
methods, which are also appropriate for image collection summarization and
video summarization.
How to Summarize Text
There are two main approaches to summarizing text documents:
Extraction-based summarization
In this summarization task, the automatic system extracts objects from the
entire collection without modifying the objects themselves. Examples include
key phrase extraction, where the goal is to select individual words or phrases
to "tag" a document, and document summarization, where the goal is to select
whole sentences (without modifying them) to create a short paragraph summary.
Similarly, in image collection summarization, the system extracts images from
the collection without modifying the images themselves.
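
The sentence-selection idea described above can be sketched in a few lines of
Python. This is a minimal illustration, not the method of any particular
system: the scoring rule (average word frequency), the tiny stopword list, the
naive sentence splitter and the function names are all assumptions made for
the example.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are",
             "it", "on", "for", "with", "that", "this", "was"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]

def tokenize(sentence):
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOPWORDS]

def extractive_summary(text, max_sentences=2):
    """Return the highest-scoring sentences, unmodified and in source order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        # Average frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])  # restore original order
    return [sentences[i] for i in chosen]
```

Calling `extractive_summary(document_text, max_sentences=3)` returns a list of
sentences copied verbatim from the input, which is the defining property of an
extractive summary. A practical system would replace the frequency score with
a stronger relevance measure.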
Abstraction-based summarization
Extraction techniques merely copy the information deemed most important by the
system to the summary (for example, key clauses, sentences or paragraphs),
while abstraction involves paraphrasing sections of the source document. In
general, abstraction can condense a text more strongly than extraction, but
programs that can do this are harder to develop because they require natural
language generation technology, which is itself a growing field.
While some work has been done in abstractive summarization (creating an
abstract synopsis like that of a human), the majority of summarization systems
are extractive (selecting a subset of sentences to place in a summary).
Aided summarization
Machine learning techniques from closely related fields such as information
retrieval or text mining have been successfully adapted to help automatic
summarization.
Apart from Fully Automated Summarizers (FAS), there are systems that aid users
with the task of summarization (MAHS, Machine Aided Human Summarization), for
example by highlighting candidate passages to be included in the summary, and
there are systems that depend on post-processing by a human (HAMS, Human Aided
Machine Summarization).
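
As a rough illustration of the machine-aided case, the sketch below marks
candidate sentences instead of removing anything, leaving the final choice to
a human editor. The function name `highlight_candidates`, the `">> "` marker
and the frequency-based score are hypothetical choices for this example, not
part of any specific MAHS tool.

```python
import re
from collections import Counter

def highlight_candidates(text, top_n=1, marker=">> "):
    """Return every sentence, prefixing the top_n highest-scoring with a marker.

    Nothing is deleted: the human editor decides what enters the summary.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    freq = Counter(w for ws in words for w in ws)
    # Score: summed document frequency of each sentence's distinct words.
    # This simple heuristic favors longer sentences (an accepted limitation).
    scores = [sum(freq[w] for w in set(ws)) for ws in words]
    ranked = sorted(range(len(sentences)),
                    key=lambda i: scores[i], reverse=True)
    top = set(ranked[:top_n])
    return [marker + s if i in top else s for i, s in enumerate(sentences)]
```

The output keeps the full text in its original order, so the editor can read
the marked passages in context and accept or reject each suggestion.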