Investigation of the Effects of Automatic Scoring Technology on Human Raters' Performances in L2 Speech Proficiency Assessment
- Submitted by: Dean Luo
- Last updated: 16 October 2016 - 11:17pm
- Document Type: Presentation Slides
- Document Year: 2016
- Presenters: Dean Luo
- Paper Code: O5-1
This study investigates how automatic scoring based on speech technology can affect human raters' judgment of students' oral language proficiency in L2 speaking tests. ASR-based automatic scoring is widely used in non-critical speaking tests and practice, and relatively high correlations between machine scores and human scores have been reported. In high-stakes speaking tests, however, many teachers remain skeptical about the fairness of scores assigned by machines, even with the most advanced scoring methods. In this paper, we first evaluate ASR-based scoring on students' recordings from real tests. We then propose a radar-chart-based scoring method to assist human raters and analyze the effects of automatic scores on human raters' performance. Instead of providing an overall machine score for each utterance or speaker, we present 10 scores as a radar chart representing different aspects of phonemic- and prosodic-level proficiency, and leave the final judgment to human raters. Experimental results show that automatic scores can significantly affect human raters' judgment. With sufficient training samples, the scores given by non-experts can be comparable to experts' ratings in reliability.
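To make the presentation of per-aspect scores concrete, below is a minimal sketch of how 10 automatic sub-scores for one utterance or speaker might be rendered as a radar chart for a human rater. The aspect names and score values are hypothetical illustrations only; the paper does not specify these exact categories, and the plot is meant purely to show the idea of giving raters a multi-dimensional view instead of a single overall machine score.

```python
# Minimal sketch (assumed aspect names and values): render 10 per-aspect
# automatic scores as a radar chart so a human rater can make the final call.
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical phonemic- and prosodic-level aspects (10 dimensions).
aspects = [
    "Vowel accuracy", "Consonant accuracy", "Insertions",
    "Deletions", "Substitutions", "Speech rate",
    "Pausing", "Rhythm", "Intonation", "Stress",
]
# Hypothetical normalized scores (0-1) for one utterance/speaker.
scores = np.array([0.8, 0.7, 0.9, 0.85, 0.6, 0.75, 0.65, 0.7, 0.8, 0.72])

# Compute one angle per aspect, then repeat the first point to close the polygon.
angles = np.linspace(0, 2 * np.pi, len(aspects), endpoint=False)
angles = np.concatenate([angles, angles[:1]])
values = np.concatenate([scores, scores[:1]])

fig, ax = plt.subplots(subplot_kw={"polar": True})
ax.plot(angles, values, linewidth=1.5)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(aspects, fontsize=8)
ax.set_ylim(0, 1)
ax.set_title("Per-aspect automatic scores shown to a human rater")
plt.tight_layout()
plt.savefig("radar_chart.png", dpi=150)
```

In such a setup, the rater sees the shape of the polygon rather than a single number, which is the design choice the abstract describes: the machine supplies diagnostic evidence, while the overall proficiency judgment remains with the human.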