RBMT – SMT – Hybrid Engines Compared

“Being technology agnostic means using the very best technology for the task, without being bound by a supplier monopoly” — John Papaioannou, CEO of Lexcelera-LexWorks.

Here’s what I would add to that: In order for machine translation to make any sense at all, it has to yield the highest quality that is ‘machinely’ possible.

This is why our approach to machine translation is not tied to any particular engine. Years of working in a variety of environments have taught us that MT’s benefits depend on coaxing high performance out of your engines. And one size doesn’t fit all! A big part of succeeding with MT is having an open mind and relying on objective measures to match the right engine to the right content.

Being technology agnostic means not promoting just one engine or just one approach. Since LexWorks doesn’t sell any particular technology, we can be totally objective in choosing the best-of-breed solution for the particular content type. Rules-based (RBMT), statistical (SMT) or hybrid (HMT): each system has advantages and disadvantages and will perform better in certain situations. Language pair, content type and the corpus available for training all affect engine suitability. The only way to be sure you have the best-of-breed is to benchmark all three approaches. And yes, that means before starting any large-scale, long-term project, we build three test engines. Once the best-performing engine has been selected, we measure and improve on an ongoing basis. In a nutshell, that’s the secret of our success.
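To make the benchmarking step concrete, here is a minimal sketch in Python of how three test engines could be compared on the same test set. It is an illustration under assumptions, not a description of our actual tooling: the metric (word-level edit distance against a reference translation, lower is better) is only one possible objective measure, the function names are hypothetical, and the sample sentences are invented, so the outcome says nothing about how real engines rank.

```python
from typing import Dict, List


def word_edit_distance(hyp: List[str], ref: List[str]) -> int:
    """Levenshtein distance counted in words (insert, delete, substitute)."""
    prev = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        curr = [i]
        for j, ref_word in enumerate(ref, start=1):
            cost = 0 if hyp_word == ref_word else 1
            curr.append(min(prev[j] + 1,          # drop a hypothesis word
                            curr[j - 1] + 1,      # add a reference word
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


def error_rate(hyps: List[str], refs: List[str]) -> float:
    """Total word edits divided by total reference words (lower is better)."""
    edits = sum(word_edit_distance(h.split(), r.split())
                for h, r in zip(hyps, refs))
    ref_words = sum(len(r.split()) for r in refs)
    return edits / ref_words


def pick_best_engine(outputs: Dict[str, List[str]], refs: List[str]) -> str:
    """Return the engine name with the lowest error rate on the test set."""
    scores = {name: error_rate(hyps, refs) for name, hyps in outputs.items()}
    return min(scores, key=scores.get)


# Invented toy data: two reference translations and three engine outputs.
REFERENCES = [
    "press the power button to restart the device",
    "the update was installed successfully",
]
ENGINE_OUTPUTS = {
    "rbmt": ["press the power button for restarting the device",
             "the update was installed with success"],
    "smt": ["push the on button to restart the device",
            "the update installed successfully"],
    "hybrid": ["press the power button to restart the device",
               "the update was successfully installed"],
}

if __name__ == "__main__":
    for name, hyps in ENGINE_OUTPUTS.items():
        print(name, round(error_rate(hyps, REFERENCES), 3))
    print("selected:", pick_best_engine(ENGINE_OUTPUTS, REFERENCES))
```

In practice the test set would be a representative sample of the client's own content, and the same scoring would be repeated after each improvement cycle to track progress.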

Below is a brief discussion of the three engine types.

RBMT, SMT and Hybrid Engines Compared

Rules-based (RBMT) systems come “off the shelf” with grammatical rules hard-coded for the source and target languages, so customization of RBMT systems aims to embed specific terminology through user dictionaries. Linguistic skill is required to tune RBMT systems. RBMT can be tuned to perform best in narrow (e.g. product-level) domains with set terminology. RBMT systems respond particularly well to post-editing because the errors are predictable. Since the terms in the user dictionary will always prevail over any other terminology, post-editing RBMT focuses on improving sentence structure. Significant productivity gains are possible when controlled language is applied. Improvement cycles in RBMT can be implemented weekly, and even daily, as corrections from the post-editors are fed back into the system in near real time.

Statistical (SMT) systems are particularly well suited to languages not covered by a rules-based engine because SMT systems are trained on a language pair and domain at the same time. Engineers are mainly responsible for tuning SMT systems. Because SMT relies on algorithms that parse millions of segments of bilingual and monolingual text to find the most probable translations, it is less predictable in the terminology it will deliver, and thus in the kinds of errors that will result, which makes it harder to post-edit. However, SMT sentences tend to be more fluid than RBMT sentences. A big advantage of SMT for user-generated content, including FAQs, forums and so on, is that spelling and syntax errors don’t throw SMT off. In fact, if it has been well trained with sufficient in-domain and out-of-domain data, SMT outperforms RBMT for uses such as online customer support, which tends to rely on informal language. SMT improvement cycles tend to be infrequent (once or twice a year), as a large amount of data is needed to (re)tune the system.
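The idea of “most probable translation” can be illustrated with a toy sketch in Python. It assumes the bilingual data has already been aligned into phrase pairs, and it uses only relative frequency as the probability estimate; real SMT systems combine such estimates with a target-language model and other features. The function names and the sample pairs are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Tuple


def train_phrase_table(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count how often each source phrase was translated as each target phrase."""
    table: Dict[str, Counter] = defaultdict(Counter)
    for source, target in pairs:
        table[source][target] += 1
    return table


def most_probable(table: Dict[str, Counter],
                  source: str) -> Optional[Tuple[str, float]]:
    """Return the most frequent translation and its relative frequency."""
    options = table.get(source)
    if not options:
        return None  # phrase never seen in training data
    target, count = options.most_common(1)[0]
    return target, count / sum(options.values())


# Invented aligned phrase pairs standing in for a training corpus.
TRAINING_PAIRS = [
    ("mot de passe", "password"),
    ("mot de passe", "password"),
    ("mot de passe", "password"),
    ("mot de passe", "passcode"),
]
PHRASE_TABLE = train_phrase_table(TRAINING_PAIRS)

if __name__ == "__main__":
    print(most_probable(PHRASE_TABLE, "mot de passe"))
```

Because the chosen term depends on what dominates the training data, the output terminology shifts with the corpus, which is one way to see why SMT terminology is less predictable than a user dictionary.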

Hybrid (HMT) systems tend to combine the best of both approaches. Terminology is predictable and sentences are more fluid. Training a hybrid engine involves both customizing terminology and processing large quantities of training data. For optimal hybrid quality, training calls for two skill sets: linguistic and engineering. Another advantage is that hybrid engines can be improved frequently, with no need to wait for extensive new data sets before retraining.

I hope this is helpful. Contact Lori Thicke if you have any questions.