ARCHIVED CONTENT
You are viewing ARCHIVED CONTENT released online between 1 April 2010 and 24 August 2018 or content that has been selectively archived and is no longer active. Content in this archive is NOT UPDATED, and links may not function.
By John Tredennick
Many industry professionals doubted that technology-assisted review (TAR) would work on non-English documents. They reasoned that the TAR process was about “understanding” the meaning of documents. It followed that unless the system could understand the documents (and presumably computers understand English), the process wouldn’t be effective.
The doubters were wrong. Computers don’t actually understand documents; they simply catalog the words in documents. More accurately, we call what they recognize “tokens,” because often the fragments (numbers, misspellings, acronyms and simple gibberish) are not even words. The question, then, is whether computers can recognize tokens (words or otherwise) when they appear in other languages.
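To make the idea of cataloging tokens concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not how any particular TAR product works: the function name `catalog_tokens`, the simple `\w+` splitting rule, and the sample sentences are all hypothetical. It shows that counting tokens requires no knowledge of the language being counted.

```python
import re
from collections import Counter


def catalog_tokens(text):
    """Count every token in a text without interpreting its meaning.

    Hypothetical helper for illustration only. A token here is any run
    of letters, digits, or underscores, so numbers, acronyms, and
    gibberish are counted just like ordinary words.
    """
    # In Python 3, \w matches Unicode word characters, so accented
    # letters such as "ü" stay inside their token.
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)


# Made-up sample sentences, one English and one German.
english = "The contract was signed. The contract is valid."
german = "Der Vertrag wurde unterschrieben. Der Vertrag ist gültig."

english_counts = catalog_tokens(english)
german_counts = catalog_tokens(german)

# The same code catalogs both texts; it never needs to know the language.
print(english_counts.most_common(2))
print(german_counts.most_common(2))
```

The splitting rule is deliberately naive. It works here because both sample languages separate words with spaces and punctuation; the sketch makes no claim about text that is written without such separators.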