Software Engineering Challenge from DARPA: build the ultimate speech translation machine

New research in breakthrough technologies is, more often than not, built to cater to the needs of the military; the internet is a Hall of Fame example. With the current war in Iraq going on, there are only so many translators available to serve on the ground, so the military is looking for new and improved ways for soldiers to communicate with the locals in their native language. Besides, the military would love software that can listen to, say, Al-Jazeera or any other Arabic or Chinese broadcast and translate it into English.

The main components of language learning, and of such a speech translation system, are reading, writing, listening and speaking. Speech-to-speech translation is a challenging problem because of the poor sentence planning typically associated with spontaneous speech, as well as the errors introduced by automatic speech recognition. Despite the significant progress made in recent years, the performance of text and speech translation technology is still far from satisfactory.

So, what would a system developed for the military look like? An automatic speech translation system would consist of a speech recognizer, a machine translator, a speech synthesizer and an overall system controlling part.

Block diagram of ETRI's automatic speech translation system
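
To make the four-part layout a bit more concrete, here is a minimal Python sketch of how the controlling part might chain the recognizer, translator and synthesizer together. It is only an illustration under loud assumptions: the class name, the function names and the tiny lexicon are all made up for this post, the "audio" is just text encoded as bytes, and none of it reflects how ETRI's system or any DARPA project is actually built.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-ins for the three engines; real ones would wrap
# actual recognition, translation and synthesis models.
Recognizer = Callable[[bytes], str]    # audio -> source-language text
Translator = Callable[[str], str]      # source text -> target-language text
Synthesizer = Callable[[str], bytes]   # target text -> audio


@dataclass
class SpeechTranslationController:
    """The 'overall system controlling part': runs the stages in order."""
    recognize: Recognizer
    translate: Translator
    synthesize: Synthesizer

    def run(self, audio: bytes) -> bytes:
        source_text = self.recognize(audio)
        target_text = self.translate(source_text)
        return self.synthesize(target_text)


# Toy stubs (assumptions, not real engines) so the sketch runs end to end.
def toy_recognizer(audio: bytes) -> str:
    return audio.decode("utf-8")       # pretend the audio is already text


# Made-up two-word lexicon, for illustration only.
TOY_LEXICON = {"hello": "marhaba", "thanks": "shukran"}


def toy_translator(text: str) -> str:
    # Word-by-word lookup; unknown words pass through unchanged.
    return " ".join(TOY_LEXICON.get(word, word) for word in text.lower().split())


def toy_synthesizer(text: str) -> bytes:
    return text.encode("utf-8")        # pretend these bytes are audio


controller = SpeechTranslationController(toy_recognizer, toy_translator, toy_synthesizer)
print(controller.run(b"Hello thanks"))
```

Keeping each stage behind a plain function signature means any one of them could be swapped out without touching the controller. It also shows why recognition errors hurt so much: whatever text the recognizer produces is passed straight on to the translator.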

Language translation is known to be a notoriously hard problem for computer scientists, but the DARPA-sponsored Global Autonomous Language Exploitation (GALE) project may make a dent in it, given the enormous financial backing involved: IBM with its $6 billion annual research budget, and SRI and BBN Technologies with ‘modest’ budgets of a few hundred million each.

Most interestingly, though, speech researchers at Carnegie Mellon University have recently demoed a prototype of a device capable of automatically translating spoken conversations. The device is dubbed the “Tower of Babel”. Currently it can handle a small vocabulary of 100-200 words at about 80% accuracy, and accuracy drops off significantly beyond that vocabulary. Unlike traditional approaches that involve sound processing, the device hooks up electrodes to the speaker's facial muscles in order to recognize speech from facial movements.

As you can see, all these projects are a far cry from what DARPA wants. But given time and money, something more advanced will surely come out, and it will eventually be available for civilian use as well.

Credits:

1. Software and web development blog from Max Fomitchev

2. ETRI automatic speech translation system for traveler aid (PDF)

3. Excellent paper: High quality speech translation for language learning from MIT CS and AI labs