First things first, here’s the demo: http://ibex.clearfly.net:8080/recognition/Recognition.swf. Now, on with the blog…
Java and Flex are a natural fit for the next generation of the web (I dare not say Web 3.0), where face recognition sign-ons, video chatting, and speech-enabled navigation reign supreme. Hmmmmm, you might say, or maybe you’re thinking about a computer named HAL 9000 from the Space Odyssey saga (fun little fact for my readers: the letter after H is I, the letter after A is B, and the letter after L is M; IBM). Well, I know my imagination was running wild when I finally put my speech-activated navigation system together.
It all began about a month ago, when I was asked to assess the viability of using a speech recognizer on the web, and in particular to consider a system that could help people learn how to read online. With that, I set off on a journey to see what was out there in the open source world to help me along in my endeavors.
The first stop on the road to finding all the pieces of my online speech recognition system was OBVIOUSLY a piece of speech recognition software. It took me about a week to find the recognizer I wanted to use. During that week I researched a number of speech recognition products, as well as speech recognition in general. Speech recognition is a very complex matter, with lots of concepts and vocabulary that “speech experts” use when discussing the topic. I felt it was critical to understand terms like Hidden Markov Models (HMM), Utterances, Speech Models, Trained Speech Models, Acoustic ranges, Terrace searching, and a number of other related concepts.
The tools I evaluated included Sphinx, Nuance, VoiceBox, and the Microsoft speech server. First and foremost I was looking for the “best of breed” speech recognizer, the one(s) most commonly used. I found that Nuance had the most widely used speech recognizer; however, it only had a .NET API and the product is commercial, and I’m a Java guy on a tight budget. In reviewing the open source products (targeting Java, of course), I found that most of them simply incorporated “CMU Sphinx” as their speech recognizer. So began the process of evaluating and understanding Sphinx.
Sphinx is quite nice, and I think you’d be very impressed with what it gives you right out of the box. So there I have it, the first piece of my online speech recognition system. Next stop: feeding the recognition system from a web browser. Hmmmmmm.
Hold the press!!!
BlazeDS doesn’t support the Real Time Messaging Protocol, RTMP (it’s OK, I didn’t know what it was either till I needed to use it). RTMP is how Flash “publishes” streaming video and audio to the server, and since we’re using BlazeDS as our remoting endpoint, he was the guy I was looking to to handle this. Here’s something to keep in mind: RTMP is now open source (THANKS ADOBE!), but BlazeDS doesn’t support it yet! On a side note, it used to be that you had to buy the Flash Media Server to interact with video and audio media types, but not anymore: there’s a very well known open source media server called Red5. Thank goodness, because the Flash Media Server has a nice price tag (nice if you’re Adobe, that is). So, now I’ve really got it. I have the plan of attack, and it goes like this:
Step 1: Publish “utterances” to the Red5 server.
Step 2: Tell our Spring service layer to process this (right now we take the *.flv that’s published and pipe it through ffmpeg to output .wav for Sphinx).
Step 3: Send the response back to the client.
Step 4: Let the client take actions based on what it understood (e.g. open a tab or something).
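To make Step 2 a bit more concrete, here is a minimal sketch of how the service layer might shell out to ffmpeg to turn a published *.flv into a .wav. This is not the actual code behind the demo: the class name `UtteranceConverter`, its method names, and the file names are hypothetical, and the audio settings (16 kHz, 16-bit, mono PCM) are my assumption about what the Sphinx acoustic models in use expect, so adjust them to match your own models. It also assumes `ffmpeg` is on the server’s PATH.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: converts a recorded FLV utterance to WAV via ffmpeg.
final class UtteranceConverter {

    // Builds the ffmpeg argument list. Kept separate from convert() so the
    // command can be inspected or logged without actually running ffmpeg.
    static List<String> buildCommand(Path flv, Path wav) {
        return Arrays.asList(
            "ffmpeg",
            "-y",                   // overwrite the output file if it exists
            "-i", flv.toString(),   // input: the published FLV
            "-vn",                  // drop any video stream
            "-acodec", "pcm_s16le", // 16-bit little-endian PCM
            "-ar", "16000",         // assumed sample rate: 16 kHz
            "-ac", "1",             // assumed channel count: mono
            wav.toString());        // output: WAV to hand to the recognizer
    }

    // Runs ffmpeg and blocks until it finishes; throws if it exits non-zero.
    static void convert(Path flv, Path wav) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(flv, wav))
            .inheritIO() // forward ffmpeg's console output to our own
            .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("ffmpeg exited with code " + exitCode);
        }
    }
}
```

A Spring service method could call something like `UtteranceConverter.convert(flvPath, wavPath)` once Red5 has finished writing the recording, then hand the resulting .wav to Sphinx. Shelling out keeps things simple, at the cost of starting a new process for every utterance.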
Wanna see it?
Here you go: http://ibex.clearfly.net:8080/recognition/Recognition.swf (once you’re on the site, there are some videos that show you how to use it).
Enjoy the demo and let me know what you think!