Test-Time Scaling Strategies for Generative Retrieval in Multimodal Conversational Recommendations
Authors: Hung-Chun Hsu, Yuan-Ching Kuo, Chao-Han Huck Yang, Szu-Wei Fu, Hanrong Ye, Hongxu Yin, Yu-Chiang Frank Wang, Ming-Feng Tsai, Chuan-Ju Wang
arXiv preprint arXiv:2508.18132, 2025

TL;DR
We bring test-time scaling to conversational product search for the first time. Existing multimodal retrievers work well for single queries but struggle with the iterative, evolving nature of real shopping conversations. Our key contribution is a test-time reranking (TTR) mechanism that continuously refines retrieval results as user intent evolves over the dialogue. Across multiple benchmarks, TTR delivers consistent gains of 14.5 points in MRR and 10.6 points in nDCG@1, showing that inference-time computation can significantly improve conversational product retrieval with minimal overhead. We are releasing enhanced datasets to accelerate future research in this important but underexplored area.
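For readers unfamiliar with the two reported metrics, here is a minimal sketch of how MRR and nDCG@1 are commonly computed. It assumes one ground-truth item per dialogue turn and binary relevance, which may differ from the paper's exact evaluation protocol. The function names and toy rankings are hypothetical and are not taken from the paper or its datasets.

```python
from typing import List, Sequence, Tuple


def reciprocal_rank(ranked_ids: Sequence[str], target_id: str) -> float:
    """Return 1 / (1-based rank of the target), or 0.0 if it is absent."""
    for position, item_id in enumerate(ranked_ids, start=1):
        if item_id == target_id:
            return 1.0 / position
    return 0.0


def ndcg_at_1(ranked_ids: Sequence[str], target_id: str) -> float:
    """nDCG@1 under binary relevance with a single relevant item.

    DCG@1 is the relevance of the top item and the ideal DCG@1 is 1,
    so the metric is 1.0 when the top item is the target, else 0.0.
    """
    return 1.0 if ranked_ids and ranked_ids[0] == target_id else 0.0


def mean_metrics(turns: List[Tuple[Sequence[str], str]]) -> Tuple[float, float]:
    """Average both metrics over (ranking, target) pairs, scaled to points (x100)."""
    if not turns:
        return 0.0, 0.0
    mrr = sum(reciprocal_rank(r, t) for r, t in turns) / len(turns)
    ndcg = sum(ndcg_at_1(r, t) for r, t in turns) / len(turns)
    return 100.0 * mrr, 100.0 * ndcg


# Hypothetical toy rankings for three dialogue turns (illustrative only).
toy_turns = [
    (["b", "a", "c"], "a"),  # target at rank 2
    (["a", "b", "c"], "a"),  # target at rank 1
    (["c", "b", "a"], "a"),  # target at rank 3
]
mrr_points, ndcg1_points = mean_metrics(toy_turns)
print(f"MRR: {mrr_points:.1f}  nDCG@1: {ndcg1_points:.1f}")
```

Under this convention, a gain stated in "points" is a difference between two such scores on the 0 to 100 scale, for example between rankings before and after reranking.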
For further details, please refer to our paper!
