|
On Techniques and Strategies of Text Matching
The Victorian Laptop poses an important question about the nature
of the interaction between a user and a text retrieval system. Unlike traditional
information retrieval, where the user's need is specifically defined, the Victorian
Laptop's text matching system attempts to go beyond the traditional measures of
recall and precision to produce what is "relevant."
The difficulty of this task can be imagined merely from the subjectivity
of the term "relevant." What is the user looking for exactly, if anything?
What will stimulate him or her to write more? These are questions that can
only be answered through experimentation. Furthermore, the texts
being matched differ significantly in genre from the texts in conventional
story matching problems: they are narratives, whereas most text matching
systems handle news text.
The Victorian Laptop currently extracts two types of information from
user texts.
- Proper Nouns: Matching on proper nouns extracted from the corpus of
stories is usually enough to map the general subjects of two texts together.
Proper nouns consisting of more than one word (such as "Boston Common")
are scored more highly than those with only one word (such as "Boston").
The appearance of proper nouns, most often names of locations and people,
is frequently enough to establish a general link between user input and
reference text.
- Keywords: Using third-party text analysis tools, keywords that carry the
meaning of the text are extracted. The keywords are collated into groups based
on topic (determined by experimentation). The matcher attempts to assign a
topic to the user text based on the topic keywords found, and then returns as
a match the story from the corpus that is under the same topic and contains
the most keywords found in the user story.
- Dates: Another strategy employed by the Victorian Laptop is text
matching based on dates. Experiments show that a temporal link between
user and text is often very important.
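The proper-noun and keyword strategies described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Victorian Laptop's actual code: the function names, the topic groups, the corpus layout, and the choice of word count as the proper-noun weight are all hypothetical, and the proper nouns and keywords are assumed to have been extracted already by some other tool. The date strategy is left out because the text does not say how temporal links are scored.

```python
# Hypothetical topic groups. The text says the real topics were
# determined by experimentation; these are placeholders.
TOPIC_KEYWORDS = {
    "travel": {"train", "harbor", "journey", "carriage"},
    "family": {"mother", "sister", "wedding", "letter"},
}


def proper_noun_score(user_nouns, story_nouns):
    """Score the proper nouns shared by a user text and a story.

    Assumption: each shared noun is weighted by its word count, so
    "Boston Common" (2) counts for more than "Boston" (1).
    """
    shared = set(user_nouns) & set(story_nouns)
    return sum(len(noun.split()) for noun in shared)


def assign_topic(keywords, topics=TOPIC_KEYWORDS):
    """Return the topic whose keyword group overlaps the text most,
    or None if no topic keyword appears at all."""
    found = set(keywords)
    counts = {topic: len(group & found) for topic, group in topics.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


def match_by_keywords(user_keywords, corpus):
    """Pick the story under the user's topic that shares the most keywords.

    corpus is assumed to be a list of dicts with the keys
    "title", "topic", and "keywords".
    """
    topic = assign_topic(user_keywords)
    candidates = [story for story in corpus if story["topic"] == topic]
    if not candidates:
        return None
    found = set(user_keywords)
    return max(candidates, key=lambda story: len(set(story["keywords"]) & found))
```

A caller would extract nouns and keywords from the user text first, then pass them in, for example `proper_noun_score(["Boston Common"], ["Boston Common", "Boston"])`. How the resulting scores are combined with each other and with the date strategy is a separate decision that this sketch does not attempt.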
The Victorian Laptop's text matching and retrieval system uses a blend
of the above three techniques to produce what is most "relevant." |