My gut says the same thing, but I've been quite amazed at what audio and video sensing systems are increasingly capable of.
In theory, since we humans can do it, a sophisticated enough algorithm should be able to do as well as we do, or better.
Remember how bad voice recognition systems were 10 years ago? They required training, could only understand the speaker who trained the system, etc.
Now, my new Ford Focus ST (YAY!!! Picked it up Friday...) can understand a great deal of what I, my wife, kids, etc. tell it without any training, and with surprisingly few errors.
In fact, the female voice for MyFord Touch is so real and sexy, we're having an affair during my commute.