Humans remain crucial for accessible AI-driven tech through captioning.

The Importance of Human Oversight in AI Captioning Services

The case for human oversight of artificial intelligence (AI) services continues to grow, with the closely intertwined fields of audio transcription, captioning, and automatic speech recognition (ASR) joining the call for applications that complement, rather than replace, human input.

The Rise of Captions and Subtitles

Captions and subtitles play a vital role in providing media and information access to viewers who are deaf or hard of hearing, and their popularity has soared in recent years. Disability advocates have long pushed for better captioning options, a need that has become increasingly relevant with the proliferation of on-demand streaming services. Recognizing the potential of AI in the field, video-based platforms such as YouTube and TikTok have begun exploring AI features such as video summarization and chatbots.

However, incorporating AI tools into automatic captioning is not a straightforward solution, as highlighted in 3Play Media’s recently published 2023 State of Automatic Speech Recognition report. The report emphasizes that users must consider more than just accuracy when utilizing new AI services that are rapidly advancing.

Evaluating the Accuracy of Automatic Speech Recognition

3Play Media’s report analyzed the word error rate and the formatted error rate of different ASR engines, the AI-powered caption generators used across various industries. The findings revealed that even the best engines achieved only around 90% accuracy on words alone and around 80% accuracy on words and formatting combined, falling short of the 99% accuracy industry standard required for accessibility compliance.
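Word error rate, the first metric the report analyzes, is conventionally computed as the word-level edit distance (substitutions, insertions, and deletions) between a reference transcript and the ASR output, divided by the length of the reference. A minimal sketch of that standard calculation follows; it is an illustration of the general metric, not 3Play Media’s exact methodology:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution/match
    return dp[len(ref)][len(hyp)] / len(ref)

# One wrong word in a ten-word reference -> 10% WER, i.e. 90% accuracy.
ref = "the quick brown fox jumps over the lazy sleeping dog"
hyp = "the quick brown fox jumps over the lazy sleeping log"
print(word_error_rate(ref, hyp))  # 0.1
```

A formatted error rate extends the same idea by also counting mistakes in punctuation, capitalization, and speaker labels, which is why it runs higher than plain word accuracy.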

Legal requirements, such as the Americans with Disabilities Act (ADA), demand accurate and properly placed captions for television and other public services. Caption accuracy varies significantly across markets and use cases, with news, cinematic, and sports content posing the greatest transcription challenges for ASR. These markets often feature background music, overlapping speech, and otherwise difficult audio, resulting in high word and formatted error rates.

Although performance has improved since 3Play Media’s 2022 report, error rates remain high across all tested markets, making collaboration with human editors necessary.

The Importance of Human-in-the-Loop Systems

Transcription services, from consumer apps to industry tools, have long incorporated AI-generated audio captioning. Many platforms use human-in-the-loop systems, which combine ASR tools with human editors. Companies like Rev emphasize the significance of human editors in audio-visual syncing, screen formatting, and other essential steps to ensure fully accessible visual media.

Human-in-the-loop (HITL) models have gained prominence in generative AI development to address implicit bias and guide AI with human decision-making. Such models align with the World Wide Web Consortium’s Web Accessibility Initiative’s stance on human oversight, emphasizing that automatically-generated captions require significant editing to meet user needs and accessibility requirements.

3Play Media also acknowledges the limitations of AI in contextualization, highlighting the possibility of errors when words are misheard. AI lacks contextual understanding, which can lead to substituted or omitted words that undermine caption accuracy. Consequently, the most effective methods for live captioning combine AI with human captioners to deliver an experience that comes closest to 100% accuracy.

Flagging AI Hallucinations

In addition to lower accuracy numbers, there is growing concern about AI “hallucinations,” which include factual inaccuracies and the inclusion of completely fabricated sentences. AI-generated text has been criticized for its ease in generating misleading claims and spreading misinformation. Instances such as ChatGPT providing erroneous facts and misleading information highlight the risks associated with relying on AI alone.

Recognizing the risks posed by hallucinations, AI leaders have engaged in continued training and development to mitigate these issues. However, those in need of accessible services cannot afford to wait for developers to perfect their AI systems. The false portrayal of accuracy for deaf and hard-of-hearing viewers poses a significant problem, demonstrating the indispensability of human editors in producing high-quality captions.

The road to incorporating AI into various technologies, including captioning services, requires human oversight to ensure accuracy, accessibility, and trust. As industry leaders continue their efforts to improve AI systems, the collaboration between AI and human editors remains essential in delivering captions accessible to people who are deaf and hard of hearing.
