8.1 - Terminology Extraction

Term Extraction
TERM EXTRACTION (Figure 1) is sometimes referred to as term recognition or term identification. Most software tools with this functionality are monolingual: they analyze source texts in order to identify TERM CANDIDATES (i.e. words that have a high potential for being terms in a certain specialized field).
There are also some bilingual tools, either ready for use or under development. These analyze existing source texts along with their translations in an attempt to identify potential terms and their equivalents. This process can help a translator build a termbase more quickly. It must be remembered, though, that the computer only performs the initial extraction: the output still has to be verified by a human translator. The main difference from building a (key)word list is that term extraction tries to find multi-word units.
__________________
Reflection #1: Look at Figure 1 on the right and try to answer the question: If terms end up in the bucket at the bottom, what comes out of the upper drain?
There are two approaches to term extraction: the linguistic and the statistical. The linguistic approach identifies word combinations that match certain part-of-speech patterns. For example, in English, many terms consist of a NOUN+NOUN or ADJECTIVE+NOUN combination. The statistical approach, on the other hand, looks for repetitions. The so-called FREQUENCY THRESHOLD (the number of times that a series of words must be repeated to count as a term candidate) can often be specified by the user. The statistical approach is usually based on the mutual information (MI) score, which operates on the following hypothesis:
"if two lexical items appear together more often than they appear separately, the multi-word unit may be a term candidate". (Bowker 2008, 83)
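The statistical approach can be illustrated with a minimal Python sketch. It is not how any particular tool works: the function name bigram_candidates, the sample sentence, and the choice of the pointwise formulation of mutual information, log2(P(xy) / (P(x) P(y))), are all assumptions made for illustration. The sketch counts adjacent word pairs, drops those below the frequency threshold, and scores the rest.

```python
import math
from collections import Counter


def bigram_candidates(tokens, frequency_threshold=2):
    """Return {(w1, w2): (count, mi)} for adjacent word pairs.

    Pairs repeated fewer than `frequency_threshold` times are dropped.
    The score is pointwise mutual information (an assumed formulation).
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = max(n_tokens - 1, 1)

    candidates = {}
    for (w1, w2), count in bigrams.items():
        if count < frequency_threshold:
            continue  # not repeated often enough
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_tokens
        p_w2 = unigrams[w2] / n_tokens
        mi = math.log2(p_pair / (p_w1 * p_w2))
        candidates[(w1, w2)] = (count, mi)
    return candidates


# Hypothetical sample text, lowercased and split on whitespace.
text = ("the translation memory stores segments and the translation memory "
        "returns matches while the termbase stores terms")
tokens = text.split()

result = bigram_candidates(tokens, frequency_threshold=2)
for pair, (count, mi) in sorted(result.items(), key=lambda kv: -kv[1][1]):
    print(" ".join(pair), count, round(mi, 2))
```

With a threshold of 2, both "translation memory" and "the translation" pass the frequency test, but "translation memory" receives the higher score because its two words rarely occur apart. Raising the threshold to 3 returns nothing at all for this short text.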
Both approaches can produce SILENCE (terms are not found, either because the search criteria omit many cases or because not all of the terms are repeated frequently enough) and NOISE (the search criteria are so broad that too many useless term candidates are produced).
(Bowker 2008, 82-86)
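The linguistic approach, and the noise it can produce, can be sketched in a similar way. The example below assumes the text has already been tagged for part of speech; the tags are written by hand here, and the tag names (NOUN, ADJ, DET, and so on), the function name pattern_candidates, and the sample sentence are hypothetical. A real tool would rely on an automatic tagger and a richer set of patterns.

```python
# Part-of-speech patterns that often signal a two-word term in English.
PATTERNS = {("NOUN", "NOUN"), ("ADJ", "NOUN")}


def pattern_candidates(tagged_tokens, patterns=PATTERNS):
    """Return adjacent word pairs whose tags match one of the patterns."""
    candidates = []
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (t1, t2) in patterns:
            candidates.append(f"{w1} {w2}")
    return candidates


# Hand-tagged sample sentence (an assumption, not tagger output).
tagged = [
    ("the", "DET"), ("translation", "NOUN"), ("memory", "NOUN"),
    ("returns", "VERB"), ("fuzzy", "ADJ"), ("matches", "NOUN"),
    ("for", "ADP"), ("every", "DET"), ("new", "ADJ"), ("segment", "NOUN"),
]

found = pattern_candidates(tagged)
print(found)  # ['translation memory', 'fuzzy matches', 'new segment']
```

Here "translation memory" and "fuzzy matches" are plausible term candidates, while "new segment" matches the ADJECTIVE+NOUN pattern without being a term, which is an instance of noise. Conversely, a three-word term would be missed by these two-word patterns, which is an instance of silence.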
__________________
Reflection #2: Suppose you have a 20-page document that has to be translated. What frequency threshold will you set to extract most of the terms in that text?