Uncertainty-quantified Rollout Policy Adaptation for Unlabelled Cross-domain Video Temporal Grounding

Jian Hu 1
Zixu Cheng 1
Shaogang Gong 1
Isabel Guan 2
Jianye Hao 3
Jun Wang 4
Kun Shao 3
1Queen Mary University of London, 2Hong Kong University of Science and Technology, 3Huawei Noah's Ark Lab, 4University College London
{jian.hu, zixu.cheng, s.gong}@qmul.ac.uk, {jianye.hao,shaokun2}@huawei.com, jun.wang@cs.ucl.ac.uk
Code [GitHub]
NeurIPS 2025 [Paper]


Motivation: (a) A comparison between full-data and data-efficient adaptation in Cross-domain Temporal Grounding (CTG). Existing methods adapt on thousands of unlabelled target videos (grey), which is slow and resource-heavy. We propose a data-efficient CTG setting that uses only 100 or 200 randomly selected target videos. Despite the limited data, our method matches or exceeds full-data adaptation performance on the TACoS $\rightarrow$ ActivityNet task. (b) Conceptual comparison between MC Dropout and GRPO rollouts. MC Dropout samples subnetworks via stochastic neuron dropout and estimates uncertainty from the diversity of their outputs. GRPO rollouts similarly sample structurally diverse output sequences from the policy. URPA leverages this property to generate averaged pseudo labels and to estimate uncertainty from the rollout standard deviation, enabling uncertainty-quantified adaptation without ground-truth labels.


Abstract

Video Temporal Grounding (TG) aims to temporally locate video segments matching a natural language description (a query) in a long video. While Vision-Language Models (VLMs) are effective at holistic semantic matching, they often struggle with fine-grained temporal localisation. Recently, Group Relative Policy Optimisation (GRPO) has reformulated the inference process as a reinforcement learning task, enabling fine-grained grounding and achieving strong in-domain performance. However, GRPO relies on labelled data, making it unsuitable for unlabelled target domains. Moreover, because videos are large and expensive to store and process, full-scale adaptation introduces prohibitive latency and computational overhead, making it impractical for real-time deployment. To overcome both problems, we introduce a Data-Efficient Unlabelled Cross-domain Temporal Grounding method, in which a model is first trained on a labelled source domain and then adapted to a target domain using only a small number of unlabelled target videos. This approach eliminates the need for target annotation and keeps both computational and storage overhead low enough to run in real time. Specifically, we introduce Uncertainty-quantified Rollout Policy Adaptation (URPA), which transfers cross-domain knowledge for video temporal grounding without target labels. URPA generates multiple candidate predictions using GRPO rollouts, averages them to form a pseudo label, and estimates confidence from the variance across these rollouts. This confidence then weights the training rewards, guiding the model to focus on reliable supervision. Experiments on three datasets across six cross-domain settings show that URPA generalises well using only a few unlabelled target videos.
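To make the reward structure concrete, here is a minimal Python sketch of the two signals this kind of GRPO training for temporal grounding combines (named \( R_{\text{format}} \) and \( R_{\text{tiou}} \) in the Framework section below). The function names, the think/answer template pattern, and the exact reward shaping are illustrative assumptions, not the paper's implementation.

```python
import re

def temporal_iou(pred, gt):
    """Temporal IoU between two (start_sec, end_sec) segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def accuracy_reward(pred_interval, gt_interval):
    """R_tiou: overlap between the predicted segment and the (relaxed) ground truth."""
    return temporal_iou(pred_interval, gt_interval)

def format_reward(response):
    """R_format: 1 if the rollout follows the 'think first, then answer' template, else 0."""
    pattern = r"<think>.*</think>\s*<answer>.*</answer>"
    return 1.0 if re.fullmatch(pattern, response.strip(), flags=re.DOTALL) else 0.0
```

During source training these rewards are scored against labelled ground truth; during target adaptation the same accuracy term is scored against rollout-averaged pseudo labels, as sketched after the Framework description below.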



Framework

Uncertainty-quantified Rollout Policy Adaptation (URPA): During source model training, we perform supervised GRPO training using labelled videos \( V_s \). Specifically, the format reward \( R_{\text{format}} \) encourages the model to “think first and then answer,” while the accuracy reward \( R_{\text{tiou}} \) aligns the predicted temporal grounding \( I_s^{\text{pred}} \) with the relaxed ground truth \( \tilde{I}_s^{\text{gt}} \) for supervised learning. During target model knowledge adaptation, we adapt the model using \( K \) unlabelled target videos. For each video \( V_t \), we first average the outputs of \( G \) rollouts to obtain a pseudo label \( \hat{I}_t^{\text{gt}} \). We then compute the standard deviation across these rollouts and transform it into a confidence score \( c \) that quantifies uncertainty in the pseudo label; this confidence weights the corresponding pseudo labels when constructing the reward function for test-time target model adaptation.
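A minimal sketch of this target-side step, assuming each of the \( G \) rollouts can be parsed into a (start, end) prediction in seconds. The exponential mapping from rollout standard deviation to the confidence score \( c \), and all names below, are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def temporal_iou(pred, gt):
    """Temporal IoU between two (start_sec, end_sec) segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def pseudo_label_and_confidence(rollout_intervals):
    """rollout_intervals: (G, 2) array of (start, end) predictions from G GRPO
    rollouts on one unlabelled target video/query pair."""
    rollouts = np.asarray(rollout_intervals, dtype=float)
    pseudo_label = rollouts.mean(axis=0)      # averaged pseudo label \hat{I}_t^{gt}
    spread = rollouts.std(axis=0).mean()      # rollout standard deviation (uncertainty)
    confidence = float(np.exp(-spread))       # illustrative std -> confidence mapping, in (0, 1]
    return pseudo_label, confidence

def weighted_accuracy_reward(pred_interval, pseudo_label, confidence):
    """Confidence-weighted accuracy reward: pseudo labels with low rollout
    variance (high confidence) contribute more to the policy update."""
    return confidence * temporal_iou(pred_interval, pseudo_label)

# Usage: five rollouts that roughly agree yield a high-confidence pseudo label.
rollouts = [(12.1, 25.0), (11.8, 24.6), (12.4, 25.3), (12.0, 24.9), (11.9, 25.1)]
pseudo, c = pseudo_label_and_confidence(rollouts)
print(pseudo, c, weighted_accuracy_reward((12.3, 24.8), pseudo, c))
```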

Experiments


Qualitative Evaluation


Qualitative Analysis on Charades → ActivityNet.


BibTeX

@article{hu2025cos,
  title={Uncertainty-quantified Rollout Policy Adaptation for Unlabelled Cross-domain Video Temporal Grounding},
  author={Hu, Jian and Cheng, Zixu and Gong, Shaogang and Guan, Isabel and Hao, Jianye and Wang, Jun and Shao, Kun},
  journal={arXiv preprint arXiv:2508.06317},
  year={2025}
}