You might be mixing up two different workflows:
- If you attach an Adobe Story script to the video and then trigger the speech-to-text, a script alignment workflow takes place. The speech-to-text output is only used to find time codes, which are then attached to the Story script. In this case the workflow works because the output is simply the original Story script plus timecodes.
- If you create a custom language model file from a transcript in the “Analyze Content” dialog and then run the speech-to-text, you get the raw speech-to-text output as a result. This raw output does not contain punctuation because the engine does not support it; note that such a feature is quite hard to support for a speech-to-text engine across different languages.
The general advice would be: if you have a Story script, or are able to create one from your material, you should always go this way and follow workflow 1 above. It is always much more accurate than workflow 2, where the script information is not aligned but is only used to create a language model. That model makes the speech-to-text process more accurate, but not as accurate as in the alignment case.
I'm a colleague of Anna. I'm not sure I understood you correctly.
I developed a tool and implemented the following C++ code:
cout << "Retrieving metadata..." << endl;
cout << "Injecting transcription into metadata..." << endl;
meta.AppendArrayItem(kXMP_NS_Script, "dialogSequence", kXMP_PropValueIsArray, NULL, kXMP_PropValueIsStruct);
meta.SetProperty(kXMP_NS_Script, "dialogSequence/xmpScript:character", "SPRECHER", NULL);
meta.SetProperty(kXMP_NS_Script, "dialogSequence/xmpScript:dialog", transcription, NULL);
cout << "Writing metadata..." << endl;
This is how we attach the text to the video file. Is this what you call a "custom language model" (workflow 2)?
Thank you for your answer. However, we are using the first workflow you described.
When the transcript is added to the video manually via Adobe Story and Adobe OnLocation, the speech recognition works fine (output: a time stamp for every spoken word, and no punctuation is lost).
To create a more automated workflow, we want to skip the steps "creating an Adobe Story script via Adobe Story" and "linking the video to the Adobe Story script via OnLocation".
Therefore we want to add the transcript to the video's XMP metadata with a script before importing it into Premiere and starting the speech recognition. To get the same result, we looked at the XMP data of a video that had the transcript included manually (the workflow described above) and added the transcript at the same position (see the code snippet in Floh's answer).
When the video is imported into Premiere afterwards and the speech analysis is started, we get the following result: timestamps for every spoken word and no words are lost, but the punctuation marks are lost.
When starting the speech analysis we use the same settings: Speech (checked), Language (English), Quality (High (slower)), Reference Script: None.
Do you have an idea why the punctuation marks are lost? Are there more parameters we have to consider when inserting a transcription into the XMP of a video?
Thank you for your help.
We found out why the punctuation was lost. After I replaced '\n' with ' ' in the transcription, no punctuation is lost any more and we get the output we need.
Just out of curiosity: why doesn't the analysis work with newlines?