8.2 - Term Extraction Software

Reading Activity

The ultimate goal of term extraction is to reduce noise (candidates that are not real terms) while at the same time preventing silence (real terms that are missed). Both of these undesirable outcomes may be tackled with a so-called STOPLIST (a text file containing words that belong to general language and are unlikely to be part of multi-word terms).

In computing, a STOPLIST consists of words which are filtered out before or after the processing of natural language data (text). A stoplist is compiled and maintained by a human, in this case the translator. There is no single definitive list of STOP WORDS used by all term extraction tools, and some tools do not use one at all.

In fact, a stoplist for a given purpose may be created from any group of words. For most research projects, these are some of the most common, short function words, such as the, is, at, which and on, which are usually not important. However, translators should be aware that such stop words can cause problems when searching for phrases that include them, particularly in names such as "The Who" or "Take That". For some purposes it might also be useful to remove some of the most common lexical words, such as "want", in order to narrow down the term extraction output.
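To make the idea concrete, here is a minimal Python sketch of stoplist filtering. The stoplist contents, the function names and the sample sentence are assumptions chosen for illustration. They are not taken from any particular term extraction tool.

```python
import re

# Hypothetical stoplist: a handful of common function words.
STOPLIST = {"the", "is", "at", "which", "on"}

def tokenize(text):
    """Split text into lowercase words, dropping punctuation."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens, stoplist=STOPLIST):
    """Keep only the tokens that are not on the stoplist."""
    return [t for t in tokens if t not in stoplist]

tokens = tokenize("The Who is on tour, which is great news")
print(remove_stop_words(tokens))
```

Note that the band name "The Who" is reduced to the single word "who" after filtering, which illustrates why stop words can cause problems with names and fixed phrases.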

__________________

Reflection #1: What is a stoplist and what is it good for?

 

The following two applications are capable of term extraction. Read the text below and do the eXercise for this section.

ExPhrJ - Java Phrase extraction

ExPhrJ is a software tool developed by Prof. Tim Craven of the University of Western Ontario. The sole purpose of this application is to extract term candidates.

The software looks for every word and every phrase up to a certain length that occurs at least a minimum number of times in the source text and that does not start or end with a stopword. Both the maximum phrase length and the minimum frequency may be set by the user.
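The selection rule described above (phrases up to a maximum length, occurring at least a minimum number of times, and not starting or ending with a stopword) can be sketched in Python as follows. This is only an illustration of the general technique, not ExPhrJ's actual implementation. The function name, parameter names, default values and stoplist are all assumptions, and for simplicity the sketch lets phrases run across sentence boundaries.

```python
import re
from collections import Counter

# Hypothetical stoplist; a real one would be much longer.
STOPLIST = {"the", "is", "at", "which", "on", "of", "a", "and", "in", "to"}

def extract_candidates(text, max_len=3, min_freq=2, stoplist=STOPLIST):
    """Return {phrase: count} for phrases of 1..max_len words that occur
    at least min_freq times and do not start or end with a stopword."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            phrase = tokens[i:i + n]
            # Skip phrases that begin or end with a stopword.
            if phrase[0] in stoplist or phrase[-1] in stoplist:
                continue
            counts[" ".join(phrase)] += 1
    return {p: c for p, c in counts.items() if c >= min_freq}

sample = ("Term extraction reduces noise. Term extraction of the source text "
          "prevents silence in the source text.")
for phrase, count in sorted(extract_candidates(sample).items()):
    print(count, phrase)
```

Raising min_freq or lowering max_len shortens the candidate list (less noise), while the opposite settings return more candidates (less silence).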

 

FiveFilters

This is a free software project that enables easy term extraction through a web service. Given some text, it returns a list of terms with (hopefully) the most relevant first. It is being developed as part of the Five Filters project to promote alternative media.

__________________

Reflection #2: What is the main difference between the two term extraction applications?

 

