You're not alone in your observations here, Dave. Speech to Text still seems more of a gimmick than a genuinely useful "feature".
I just used it on a project, and it worked pretty well. I had clean audio from XDCAM EX interview footage, audio from some WMV files, and audio from a QuickTime DV file. I'd guess it got about 70% right...maybe a little less. Not sure what I did to get better results...not that this should matter, but where did you do the transcription - AME or Soundbooth? I did it in Soundbooth...
Funny thing is, the speech to text on my Droid X phone is very, very good...I use it for text messages all the time and I very rarely have to make corrections.
I did it within Premiere which I think took me to AME.
I wonder if using Soundbooth yields better results? I can't see how it would - I think they use the same technology. Might be time to run a test....
That sums up my experience also. It's a good idea, but unfortunately it's not Dragon Dictate. Here are some thoughts, based on my experience:
- It will work better if there is zero background noise and you have only one speaker who speaks clearly and in a virtual monotone.
- Idiomatic phrases, rare words and jargon will cause the engine to stumble, as its dictionary (at least the English version) seems somewhat limited.
- You must take care when editing the text metadata, as even minimal spelling corrections and edits can throw off the sync of the text with the clip.
- Make sure you have a very fast computer (or a lot of spare time), if you want to convert a long clip.
All in all, speech to text is a modestly helpful tool that could be really useful if it is eventually upgraded.
The conversion of speech to text isn't good enough to serve as a transcript on its own. But it is, in my experience, good enough to serve as a useful basis for editing based on the words spoken in a clip. I think some people have made assumptions about this feature that we don't make.
There are a couple of demonstrations linked to from near the top of this page that show that with a few manual corrections you can use this feature to place timecode-accurate markers for words spoken in a clip so that you can trigger certain actions at those times. Similarly, you can navigate to words in a sequence to find the places for edits.
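To make the word-navigation idea concrete, here is a rough Python sketch. It is not how Premiere stores or exposes its speech metadata; the `Word` record, the `find_word_times` and `to_timecode` helpers, and the 30 fps non-drop-frame timecode are all assumptions made for illustration. It only shows the general idea of looking up a spoken word and getting back the times where a marker or edit point could go.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    """One recognized word and when it starts (hypothetical format)."""
    text: str
    start: float  # seconds from the start of the clip


def find_word_times(transcript: List[Word], query: str) -> List[float]:
    """Return the start time of every occurrence of `query`.

    Matching ignores case and trailing punctuation, since recognized
    text often carries both.
    """
    q = query.lower()
    return [w.start for w in transcript if w.text.lower().strip(".,!?") == q]


def to_timecode(seconds: float, fps: int = 30) -> str:
    """Format seconds as HH:MM:SS:FF, assuming non-drop-frame timecode."""
    total_frames = round(seconds * fps)
    frames = total_frames % fps
    total_seconds = total_frames // fps
    hours, rem = divmod(total_seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}:{frames:02d}"


if __name__ == "__main__":
    # Made-up sample words, standing in for a manually corrected transcript.
    transcript = [Word("Budget", 1.5), Word("was", 2.0), Word("budget.", 65.0)]
    for t in find_word_times(transcript, "budget"):
        print(to_timecode(t))
```

Because the lookup depends on the word text being right, this also shows why a few manual corrections matter: a misrecognized word simply won't be found.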
There are recommendations here for how to improve the results.