Rescue Conversations from Dead-ends: Efficient Exploration for Task-oriented Dialogue Policy Optimization

Yangyang Zhao; Mehdi Dastani; Jinchuan Long; Zhenyu Wang; Shihan Wang

Vol. 12 (2024)

TACL approved

Published 2025-12-23

Yangyang Zhao
Changsha University of Science & Technology, Utrecht University, South China University of Technology

Mehdi Dastani
Utrecht University


Abstract

Training a task-oriented dialogue policy using deep reinforcement learning is promising but requires extensive environment exploration. Much of this exploration is invalid and therefore wasted, which makes policy learning inefficient. In this paper, we define dead-end states and argue that they are an important cause of invalid exploration. When a conversation enters a dead-end state, regardless of the actions taken afterward, it will continue in a dead-end trajectory until the agent reaches a termination state or the maximum number of turns. We propose a Dead-end Detection and Resurrection (DDR) method that efficiently detects dead-end states and provides a rescue action to guide and correct the exploration direction. To prevent dialogue policies from repeating errors, DDR also performs dialogue data augmentation by adding relevant experiences that include dead-end states and penalties to the experience pool. We first validate the reliability of dead-end detection and then demonstrate the effectiveness and generality of the method across various domains through experiments on four public dialogue datasets.
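
The detect, rescue, and augment loop described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy chain environment, the `is_dead_end` and `rescue_action` helpers, the action names, and the penalty value are hypothetical, and the abstract does not specify how DDR actually detects dead-ends or selects rescue actions.

```python
GOAL = 3          # hypothetical success state of the toy dialogue
DEAD_END = -1     # hypothetical absorbing state with no path to GOAL


def step(state, action):
    """Toy transition: 'advance' moves toward GOAL, anything else derails."""
    if state == DEAD_END:
        return DEAD_END  # no action can leave a dead end
    return state + 1 if action == "advance" else DEAD_END


def is_dead_end(state):
    """Hypothetical detector; the real detection criterion is not given here."""
    return state == DEAD_END


def rescue_action(state):
    """Hypothetical rescue: an action known to keep the dialogue alive."""
    return "advance"


def run_episode(policy, replay, max_turns=10, penalty=-1.0):
    """Run one episode with dead-end detection, rescue, and augmentation."""
    state, turns = 0, 0
    while turns < max_turns and state != GOAL:
        action = policy(state)
        next_state = step(state, action)
        if is_dead_end(next_state):
            # Augmentation: store the failed step with a penalty so the
            # policy can learn to avoid it.
            replay.append((state, action, penalty, next_state, True))
            # Resurrection: return to the pre-dead-end state and take the
            # rescue action instead.
            action = rescue_action(state)
            next_state = step(state, action)
        done = next_state == GOAL
        reward = 1.0 if done else 0.0
        replay.append((state, action, reward, next_state, done))
        state, turns = next_state, turns + 1
    return state == GOAL
```

In this sketch, even a policy that always picks the derailing action completes the toy task, because each dead-end step is replaced by the rescue action while the penalized transition is kept in the experience pool.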

Presented at EMNLP 2024

Article at MIT Press

Author Biography

Yangyang Zhao

Yangyang Zhao received a Ph.D. degree from South China University of Technology, Guangzhou, China, in 2022. She is currently a Lecturer at the Department of Computer and Communication Engineering, Changsha University of Science and Technology, China. Her research interests include Task-oriented Dialogue Systems, Deep Reinforcement Learning, and Dialogue Policy Learning.

Mehdi Dastani

Mehdi Dastani obtained master's degrees in computer science (1991) and philosophy (1992) at the University of Amsterdam, and a Ph.D. degree at the University of Amsterdam (ILLC) with a thesis titled Languages of Perception. He is now Professor and chair of the Intelligent Systems group of the Department of Information and Computing Sciences at Utrecht University and program leader of the Master's programme Artificial Intelligence. His research focuses on formal and computational models in artificial intelligence. Inspired by knowledge and insights from other scientific disciplines, he investigates and develops computer models for autonomous agents whose behaviors are decided based on reasoning about social and cognitive concepts.

Zhenyu Wang

Zhenyu Wang received the B.S. degree from Xiamen University, Xiamen, China, in 1987, and the M.S. and Ph.D. degrees from Harbin Institute of Technology, Harbin, China, in 1990 and 1993, respectively. From February 1994 to July 2001, he was with the School of Information Engineering, Shenzhen University, Shenzhen, where he was promoted to Associate Professor in December 1996. In 1998, he was a Visiting Scholar with Soka University, Tokyo, and was appointed as the Director of Technical Researcher by FOURSIS Group, Japan. From July 2014 to December 2021, he served as the Dean of the School of Software Engineering at South China University of Technology, Guangzhou, where he is currently a Professor. His research interests include Cloud Computing and Software Service Engineering, Natural Language Processing, and Blockchain.

Shihan Wang

Shihan Wang obtained a B.S. degree from Northeastern University, China, in 2011, an M.S. degree from the University of Edinburgh, the United Kingdom, in 2012, and a Ph.D. degree from Tokyo Institute of Technology, Japan, in 2017. She is currently an Assistant Professor in the Intelligent Systems group at Utrecht University, the Netherlands. Her current research focuses on intelligent and interactive systems, in particular single-agent and multi-agent reinforcement learning.