'yes' or 'no'),
thereby identifying sparsely distributed task-related shots to
achieve pseudo temporal grounding. Given this binary video
summary, task-related positive shots \( S^p \) and irrelevant negative shots \( S^n \)
are generated and represented by binary codes. \( S^p \), \( S^n \) and the original frame sequence \( X \)
sampled from the original video \( V \) are then fed into the MLLM for co-reasoning,
minimising interference of irrelevant video content.
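
The mapping from the binary video summary to the two shot sets can be illustrated with a minimal sketch. It assumes one 'yes'/'no' answer per shot and that \( S^n \) is simply the complement of \( S^p \); the function names `summary_to_codes` and `select_shots` are hypothetical and not taken from the paper, and the MLLM call itself is omitted.

```python
from typing import List, Sequence, Tuple


def summary_to_codes(answers: Sequence[str]) -> Tuple[List[int], List[int]]:
    """Turn per-shot 'yes'/'no' answers into binary codes.

    Returns (positive_code, negative_code), where a 1 in positive_code marks
    a task-related shot and a 1 in negative_code marks an irrelevant shot.
    Assumption: any answer other than 'yes' is treated as 'no'.
    """
    positive = [1 if a.strip().lower() == "yes" else 0 for a in answers]
    negative = [1 - p for p in positive]  # assumed complement of the positive code
    return positive, negative


def select_shots(shots: Sequence[str], code: Sequence[int]) -> List[str]:
    """Keep only the shots whose code bit is 1."""
    return [shot for shot, bit in zip(shots, code) if bit == 1]


# Illustrative data: five shots with made-up identifiers and answers.
shots = ["shot0", "shot1", "shot2", "shot3", "shot4"]
answers = ["no", "yes", "no", "no", "yes"]

s_p_code, s_n_code = summary_to_codes(answers)
positive_shots = select_shots(shots, s_p_code)
negative_shots = select_shots(shots, s_n_code)
```

In this sketch the positive and negative shots, together with the full frame sequence \( X \), would be the three inputs passed to the MLLM for co-reasoning.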