As a follow-up to my earlier thread about the apparently reworked or, at least, enhanced speech transcription capability in PPro CS5, I just conducted a few tests to see if it is indeed any better than what we had come to expect of it from CS4. The results are, well, encouraging...
QUICK BACKGROUND: I have an approximately two-minute long voiceover, verbatim from a script, professionally recorded, that was originally delivered to me as a 192kbps, 44.1kHz MP3. Since transcription won't work on MPEG assets (at least, not in the trial which is what I'm using--the files import, but the transcription "analyze" button is grayed out), I used Soundbooth to convert it to a 48kHz, 32-bit floating point WAV. I created a couple of copies, so that each one would have its own transcription metadata. I always used the High quality (slower) option.
I ran the test four times, as follows:
1. Just the WAV, without a reference script
2. The WAV with a verbatim reference script as a text document (.txt), with "Script Text Matches Recorded Dialogue" option checked
3. The WAV with a verbatim reference script as an Adobe Story script document (.astx), with "Script Text Matches Recorded Dialogue" option checked
4. The original MP3 in Soundbooth CS4
I've linked to a text document with the results of these tests (no editing), along with the original text from the script, if you're interested in seeing how they stack up: right-click and save as.
As one would imagine, #1 (no script) was... not perfect. However, I think it did a decent enough job for most purposes. Granted, this was a professionally-recorded voiceover, with no competing background noise, but it was still marginally useful. What's funny to note is that the transcription process seems to trip up every now and again and then it seems to lose its momentum for a while. It'll bobble a "difficult" word, and then for half-a-dozen words following, it's all over the place. Once it regains its footing, though, it'll be pretty much spot on for a sentence or so. Unfortunately, this repeats, but again I think you could make use of it.
For #2, the transcription was LEAGUES better... but still not perfect. That part is bewildering to me. The transcription generated a couple of weird passages--for example "Enjoy" became "and Gillian"--which I could understand if there were NO reference script. However, my impression of how the reference script should work--particularly if the "Script Text Matches Recorded Dialogue" option is activated--is that the transcription should basically copy and paste the reference script into the metadata, NOT invent words and phrases that don't exist in either the script or the dialogue/speech. This process actually preserved capitalization (though punctuation was discarded), so that means it is definitely using that reference script; there is no way to discern capitalization from the spoken word.
I thought that maybe, just maybe, I could eliminate the anomalies by using Adobe's own script format, generated by the new Adobe Story--hence test #3. I copied and pasted the text into a new Story script, downloaded the ASTX file, and used that as a reference script as above. Unfortunately, it fared no better than the text document (I suppose this makes sense, but I was hoping...), but it was still about 98+% accurate. I'm reasonably satisfied.
For giggles, I tried test #4, where I used Soundbooth CS4 to transcribe the original MP3 file. Since CS4 doesn't have the reference script capability, this would more or less pit the transcription engine in CS4 against that in CS5. As expected, it turned the dialogue into textual chop suey, BUT... it actually did BETTER than CS5 sans reference script did on some phrases! Observe:
- Original script: Our Club is YOUR CLUB.
- CS5, no script: our love is your laugh
- CS4, no script: our club is your class
Mish-mash that only Mad Libs could love! For the record, the tests that used the reference scripts were spot-on, capitalization included.
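If you want to put a rough number on comparisons like the one above, here is a minimal Python sketch that scores a transcript against the script, word by word. This is not anything Premiere or Soundbooth does internally; the function names (words, word_accuracy) are my own, and the score is simply the fraction of script words that show up in the transcript in the right order, ignoring case and punctuation.

```python
import re
from difflib import SequenceMatcher


def words(text):
    """Lowercase the text and drop punctuation so only the words are compared."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_accuracy(reference, transcript):
    """Fraction of reference words that the transcript matched, in order."""
    ref, hyp = words(reference), words(transcript)
    if not ref:
        return 0.0
    # autojunk=False keeps the matcher from discarding frequent words
    # (like "the") on long transcripts.
    matcher = SequenceMatcher(None, ref, hyp, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref)


script = "Our Club is YOUR CLUB."
print(word_accuracy(script, "our love is your laugh"))  # CS5, no script -> 0.6
print(word_accuracy(script, "our club is your class"))  # CS4, no script -> 0.8
```

By this crude measure CS4 got 4 of 5 words on that phrase and CS5 got 3 of 5. It is only a sanity check on short snippets, not a proper word error rate, since it does not penalize extra words the engine invents.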
So, I'm feeling better about using this for the massive documentary and archival project I'm about to embark upon. I have human-typed transcripts for most of the interviews in this project--over 100 interviews, constituting some 200-plus hours--and keeping that text with the footage is going to be a great thing. There are some flaky things--I wish the punctuation could be preserved--but all in all, the reference script addition seems to make this feature much more plausible in a working production environment.