Word Error Rate (WER) is a common metric for measuring the speech-to-text accuracy of automatic speech recognition (ASR) systems. Microsoft claims to have a word error rate of 5.1%. For comparison, human transcriptionists average a word error rate of 4%. So when comparing conversational AI solutions that automate interactions over telephony, is WER a good metric to gauge how well the virtual agent will understand you?

Word Error Rate is a straightforward concept and simple to calculate – it's basically the number of errors divided by the total number of words:

Word Error Rate = (Substitutions + Insertions + Deletions) / Number of Words Spoken

- Substitution: when a word is replaced (for example, "shipping" is transcribed as "sipping").
- Insertion: when a word is added that wasn't said (for example, "hostess" is transcribed as "host is").
- Deletion: when a word is omitted from the transcript (for example, "get it done" is transcribed as "get done").

(A short code sketch of this calculation appears at the end of this post.)

The issue with WER is that it does not account for the variables that impact speech recognition. For humans, distinguishing between speech and background noise is fairly easy – if someone calls me from a concert, I can differentiate the speaker's voice from the music that's playing. But for machines, separating speech from background noise – even if it is music – is difficult to do.

Even in the absence of background noise, other factors significantly impact a machine's ability to transcribe speech:

Whether you realize it or not, you have an accent. The way we speak varies tremendously, even if we are native speakers of the same language. For example, I pronounce "aunt" like "ant," the insect. The American Heritage Dictionary also recognizes "aunt" – rhyming with "daunt" – as a correct pronunciation. Understanding different accents and disambiguating homophones go beyond the capabilities of most, if not all, ASR systems. Without contextual training or an NLU engine to correct the error, the sentence "I have ants in the kitchen" can be transcribed as "I have aunts in the kitchen."

When two people speak over each other, it's not too difficult to follow the conversation if you're in the same room as they are.
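To make the WER formula above concrete, here is a minimal sketch of the calculation in Python. It uses a standard word-level Levenshtein alignment to count substitutions, insertions, and deletions; the function name and example sentences are my own illustration, not taken from any particular ASR toolkit.

```python
# Minimal WER sketch: dynamic-programming (Levenshtein) alignment
# at the word level. Illustrative only, not a production implementation.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + insertions + deletions) / words spoken."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # d[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # i deletions to reach an empty hypothesis
    for j in range(len(hyp) + 1):
        d[0][j] = j          # j insertions to build hyp from nothing

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                d[i][j] = d[i - 1][j - 1]      # match, no cost
            else:
                d[i][j] = 1 + min(
                    d[i - 1][j - 1],           # substitution
                    d[i][j - 1],               # insertion
                    d[i - 1][j],               # deletion
                )
    return d[len(ref)][len(hyp)] / len(ref)

# The three error types from the list above:
print(word_error_rate("get it done", "get done"))     # deletion -> 1/3
print(word_error_rate("the hostess", "the host is"))  # substitution + insertion -> 2/2
print(word_error_rate("shipping", "sipping"))         # substitution -> 1/1
```

Note that a hypothesis can score a WER of 100% or more while still being partly intelligible, which hints at the limitation discussed above: the metric counts errors but says nothing about which errors matter.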