My research so far has focused on evaluating the utility of user simulation. We divided the evaluation problem into two parts: the first is to estimate how human-like the simulated corpora are, and the second is to measure how useful the user simulations are for a particular spoken dialogue system design task, since we believe usefulness is a task-dependent question.
These are two separate questions, since a human-like corpus is not necessarily the best for strategy learning. For the first question, our experiments have shown that some of the previously used measures do not provide enough information to explain why two corpora differ or to what extent they differ. We observe that two real corpora can be very different when measured by these evaluation measures. Thus, even if these measures show that a simulated corpus differs from a real corpus, we cannot conclude that the simulated corpus is not realistic enough. Therefore, new evaluation metrics need to be proposed for further comparisons between real and simulated corpora.
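The reasoning above can be illustrated with a minimal sketch. It assumes a hypothetical high-level measure (mean number of turns per dialogue) and toy corpora invented purely for illustration; none of the names or numbers come from our experiments. The idea is that a gap between a simulated and a real corpus is only informative if it exceeds the gap already observed between two real corpora.

```python
from statistics import mean

def mean_turns(corpus):
    """Hypothetical measure: average number of turns per dialogue.
    A corpus is a list of dialogues; a dialogue is a list of turns."""
    return mean(len(dialogue) for dialogue in corpus)

def measure_gap(corpus_a, corpus_b, measure=mean_turns):
    """Absolute difference between two corpora under one measure."""
    return abs(measure(corpus_a) - measure(corpus_b))

def is_distinguishable(simulated, real_a, real_b, measure=mean_turns):
    """True only if the simulated corpus is farther from both real corpora
    than the two real corpora are from each other."""
    baseline = measure_gap(real_a, real_b, measure)
    sim_gap = min(measure_gap(simulated, real_a, measure),
                  measure_gap(simulated, real_b, measure))
    return sim_gap > baseline

# Toy data (assumed, not real results): each "t" stands for one turn.
real_a = [["t"] * 4, ["t"] * 6]      # mean 5 turns
real_b = [["t"] * 8, ["t"] * 10]     # mean 9 turns
simulated = [["t"] * 6, ["t"] * 8]   # mean 7 turns

print(measure_gap(real_a, real_b))                    # real-vs-real gap
print(is_distinguishable(simulated, real_a, real_b))  # does the measure separate them?
```

In this toy case the two real corpora differ more from each other than the simulated corpus differs from either, so the measure alone gives no grounds for calling the simulated corpus unrealistic.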
To answer the second question, we are currently conducting an experiment to compare the dialogue policies generated from different simulated corpora to see what kind of corpus gives the best policy. Our preliminary results suggest that the quality of the generated corpus depends on both the user simulation and the machine learning algorithm. Thus, different types of simulation may be needed for different learning algorithms or different learning configurations.
I plan to continue my current research in two directions: 1) simulate more human-like behaviors (especially user emotions) and propose evaluation metrics to measure their realism; 2) find a learning algorithm and a matching simulation model that work best for the dialogue system I am currently working on, learn a dialogue strategy automatically and implement it in the system, and then evaluate the new system with human users to validate the whole learning process.