Sifting out the meaning
When machines translate texts, the results are often risible rather than right. Computational linguist Alexander Fraser seeks to enable computers to select the most likely meaning of a word or phrase on the basis of its context in the sentence.
If you are feeling glum, try asking Google Translate for an English version of the German sentence: “Die Bank nahe der Bank hat geschlossen”. To the bilingual reader this admittedly somewhat cryptic-looking sentence presents no problems: it says that the bank near the bench has closed. But the first thing the program makes of it is: “The bench near the bank has closed” – which raises the question of how one might go about closing a bench. Ask for an alternative and you get an equally intriguing, and equally false, version: “The bench near the river bank has closed.”
Alexander Fraser is a specialist in the field of automated translation, and he enjoys such semantic blunders as much as the next man, because they illuminate the challenges that his area of research must confront. “In German, as in English, the term ‘bank’ has several different meanings,” he points out, “and this makes it difficult for computer programs to translate it correctly.” Google Translate is one of the most popular programs of its kind. It is based on a procedure referred to as ‘statistical machine translation’ or SMT, which Fraser also works with. SMT employs statistically derived rules that capture how frequently items of vocabulary in the source language correspond to items in the target language. It is quite astonishing how much these programs can do, although many of the resulting translations sound very peculiar, even when they are not downright wrong. Word-for-word translations, which inevitably fail to capture all the semantic nuances of a text, often result in versions that make a native speaker chuckle.
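The core statistical idea can be illustrated in a few lines. The sketch below is not Google Translate’s or Fraser’s actual system; it is a toy relative-frequency estimate of how likely a target word is, given a source word, computed from an invented handful of aligned word pairs of the kind real SMT systems extract from millions of existing translations.

```python
from collections import Counter, defaultdict

# Toy "parallel data": aligned German/English word pairs. All pairs and
# counts here are invented for illustration only.
aligned_pairs = [
    ("Bank", "bank"), ("Bank", "bank"), ("Bank", "bank"),
    ("Bank", "bench"), ("Bank", "bench"),
    ("Haus", "house"), ("Haus", "house"), ("Haus", "home"),
]

# Count how often each source word was translated by each target word.
counts = defaultdict(Counter)
for src, tgt in aligned_pairs:
    counts[src][tgt] += 1

def translation_prob(src, tgt):
    """Relative-frequency estimate of p(target | source)."""
    total = sum(counts[src].values())
    return counts[src][tgt] / total if total else 0.0

def best_translation(src):
    """Pick the target word seen most often with the source word."""
    return counts[src].most_common(1)[0][0]

print(translation_prob("Bank", "bank"))   # 0.6 (3 of 5 occurrences)
print(best_translation("Bank"))           # "bank"
```

Note that the most frequent rendering wins regardless of context, which is exactly why a plain frequency model turns the bench near the bank into something that can “close”.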
For the past two years, Fraser has headed a research group at LMU’s Center for Information and Speech Processing (CIS), and was recently awarded one of the European Research Council’s coveted Starting Grants to pursue a new project. “We are working on ways to improve the quality of machine translations,” he says. To do so, Fraser draws on a linguistically richer database than Google Translate does, in order to cope with ambiguities that can only be resolved by unraveling the context in which the equivocal term appears.
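What “unraveling the context” means can be sketched with a toy word-sense chooser: score the words surrounding an ambiguous term against small lists of context cues, one list per sense. The sense inventories below are invented for illustration, not drawn from any real lexical resource or from Fraser’s system.

```python
# Hand-made context cues for two senses of German "Bank" (invented lists).
SENSE_CONTEXTS = {
    "bank":  {"geld", "konto", "kredit", "filiale"},
    "bench": {"park", "sitzen", "holz", "garten"},
}

def choose_sense(context_words):
    """Return the sense whose cue list overlaps most with the context."""
    context = {w.lower() for w in context_words}
    scores = {sense: len(context & cues)
              for sense, cues in SENSE_CONTEXTS.items()}
    return max(scores, key=scores.get)

print(choose_sense(["Geld", "Konto", "geschlossen"]))  # "bank"
print(choose_sense(["Park", "sitzen"]))                # "bench"
```

The hard cases are sentences like “Die Bank nahe der Bank hat geschlossen”, where both occurrences share the same bag of surrounding words, so a model needs richer cues (syntax, word order, world knowledge) than simple overlap counting.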
Problems with ‘smog’ and ‘Grexit’
Fraser subjects his texts to a language-specific analysis, which explicitly takes the particular character and idiosyncrasies of the relevant language into account. Google, in contrast, applies one and the same analytical system to hundreds of different languages; Fraser adapts his program to each source-target pair. He believes that this extra linguistic investment is worthwhile, particularly in the case of morphologically rich (that is, structurally complicated) languages such as German. Indeed, “German is one of the most difficult target languages to get right,” he asserts.
In addition to his mother tongue, Fraser himself speaks fluent German, French, Spanish and Arabic. So he is aware of the importance of including all possible types of sentence construction found in natural languages – not just the classical word order Subject-Verb-Object (SVO), but also alternatives like Object-Verb-Subject (OVS), which is actually very common in German usage. For a trained translator, these different possibilities are quite easy to handle, but computer programs are often stymied by such variations in word order. Other phenomena are equally treacherous: negations, definite or indefinite pronouns, and compound words or so-called portmanteau nouns like ‘smog’ or ‘Grexit’, which are made up of fragments of two words that are themselves not immediately identifiable. These are particularly prone to derail digital translation programs, often causing them to mislay verbs or omit them altogether.
All this explains why automated translation systems must undergo intensive training, which in turn requires enormous reserves of computing power. To enhance the quality of such programs, computational linguists are developing new tools that are trained on existing translations and specialized to cope with particularly difficult issues. These programs must not only store a very large vocabulary; they must also be taught to recognize context-sensitive content in sentences.