PRIORITIZING DATA ACQUISITION FOR END-TO-END SPEECH MODEL IMPROVEMENT

As speech processing moves toward more data-hungry models, data selection and acquisition become crucial to building better systems. Recent efforts have championed quantity over quality, following the mantra "The more data, the better."
However, not all data bring the same benefit. This paper proposes a data acquisition solution that yields better models with less data and at lower cost.
Given a model, a task, and an objective to maximize, we propose a process with three steps. First, we assess the model's baseline performance on the task.
Second, we use efficient mining techniques to identify the subgroups that, if acquired first as new samples, maximize the target objective. Because the subgroups are interpretable, we can determine which samples to acquire. Third, we run incremental training, sampling from those subgroups. Experiments with two state-of-the-art speech models for Intent Classification across two datasets in English and Italian show that our method is significantly better than random acquisition, complete acquisition, and clustering-based techniques.
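
To make the second step more concrete, here is a minimal sketch of subgroup ranking under simplifying assumptions. It is not the paper's mining procedure: it brute-forces attribute-value conjunctions of length one or two instead of using an efficient miner, and it assumes the objective is accuracy, so the subgroups worth acquiring first are those where the baseline model falls furthest below its overall accuracy. The function name `rank_subgroups`, the metadata attributes (`gender`, `noise`), and the toy data are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def rank_subgroups(samples, correct, min_support=2, max_len=2):
    """Rank interpretable subgroups by their accuracy gap to the overall accuracy.

    samples: list of dicts mapping a metadata attribute to its value
             (hypothetical schema, one value per attribute per sample)
    correct: list of bools, True where the baseline model was right
    Returns (subgroup, support, divergence) tuples, worst-performing first.
    """
    overall = sum(correct) / len(correct)
    stats = defaultdict(lambda: [0, 0])  # subgroup -> [support, n_correct]
    for meta, ok in zip(samples, correct):
        items = sorted(meta.items())
        # Enumerate every conjunction of up to max_len attribute-value pairs.
        for k in range(1, max_len + 1):
            for combo in combinations(items, k):
                stats[combo][0] += 1
                stats[combo][1] += int(ok)
    ranked = [
        (combo, n, hits / n - overall)
        for combo, (n, hits) in stats.items()
        if n >= min_support  # drop subgroups too small to trust
    ]
    # Most negative divergence first; ties broken by larger support.
    ranked.sort(key=lambda r: (r[2], -r[1]))
    return ranked


# Toy validation set: metadata plus baseline correctness (invented values).
samples = [
    {"gender": "f", "noise": "high"},
    {"gender": "f", "noise": "low"},
    {"gender": "m", "noise": "high"},
    {"gender": "m", "noise": "low"},
    {"gender": "f", "noise": "high"},
    {"gender": "m", "noise": "low"},
]
correct = [False, True, False, True, False, True]

ranked = rank_subgroups(samples, correct)
for subgroup, support, divergence in ranked[:3]:
    print(subgroup, support, round(divergence, 3))
```

In this toy setup the top-ranked subgroup is the high-noise one, which would then be the first pool to sample from during incremental training. Because each subgroup is a readable conjunction of metadata values, it translates directly into an acquisition request.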
