FVP
4D Visual Pre-training for Robot Learning
Chengkai Hou1, Yanjie Ze3, Yankai Fu1, Zeyu Gao4, Yue Yu2, Songbo Hu2, Shanghang Zhang1, Huazhe Xu2,3,5
1Peking University  2Tsinghua University  3Shanghai Qizhi Institute  4CASIA  5Shanghai AI Lab
Accepted to ICCV 2025

FVP is a novel 3D point cloud representation learning pipeline for robotic manipulation.

Abstract.

General visual representations learned from web-scale datasets have achieved great success in robotics in recent years, enabling data-efficient robot learning on manipulation tasks; yet these pre-trained representations are mostly learned from 2D images, neglecting the inherent 3D nature of the world. However, due to the scarcity of large-scale 3D data, it remains difficult to extract a universal 3D representation from web datasets. As an alternative, we seek a general visual pre-training framework that can improve any 3D representation. Our framework, called FVP, is a novel 4D Visual Pre-training framework for real-world robot learning. FVP frames the visual pre-training objective as a next-point-cloud-prediction problem, models the prediction model as a diffusion model, and pre-trains the model directly on in-domain task datasets.

Across twelve real-world manipulation tasks, FVP boosts the average success rate of 3D Diffusion Policy (DP3) by 28%. DP3 pre-trained with FVP achieves state-of-the-art performance among imitation learning methods. Moreover, FVP remains effective across various point cloud encoders and datasets. Finally, we apply FVP to RDT-1B, a larger Vision-Language-Action (VLA) robotic model, enhancing its performance on various robot tasks.

Unlike prior works based on contrastive learning or masked signal modeling, FVP trains 3D visual representations by leveraging the point cloud of the preceding frame and employing a diffusion model to predict the point cloud of the current frame.
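
To make this objective concrete, below is a minimal sketch of one pre-training step, assuming a PyTorch-style encoder and noise-prediction denoiser. The function name, the cosine noise schedule, and the conditioning scheme are illustrative assumptions, not the paper's exact implementation.

# A minimal sketch of the FVP pre-training step (illustrative; the paper's
# exact encoder, denoiser architecture, and noise schedule may differ).
import torch
import torch.nn.functional as F

def fvp_pretrain_loss(encoder, denoiser, pc_t, pc_t1, action, T=1000):
    """DDPM-style next-point-cloud prediction loss.

    encoder:  3D visual encoder, (B, N, 3) point cloud -> (B, D) latent.
    denoiser: noise-prediction network conditioned on latent and action.
    pc_t:     point cloud at time t,   (B, N, 3).
    pc_t1:    point cloud at time t+1, (B, N, 3) -- the prediction target.
    action:   robot action at time t,  (B, A).
    """
    B = pc_t.shape[0]
    z = encoder(pc_t)                                  # latent of the preceding frame
    k = torch.randint(0, T, (B,), device=pc_t.device)  # random diffusion step
    noise = torch.randn_like(pc_t1)

    # Forward diffusion: corrupt the target frame (cosine schedule assumed).
    alpha_bar = torch.cos(0.5 * torch.pi * k / T).pow(2).view(B, 1, 1)
    noisy = alpha_bar.sqrt() * pc_t1 + (1.0 - alpha_bar).sqrt() * noise

    # Predict the injected noise given the corrupted target, the step index,
    # and the conditioning (frame-t latent concatenated with the action).
    pred = denoiser(noisy, k, cond=torch.cat([z, action], dim=-1))
    return F.mse_loss(pred, noise)

Minimizing this loss trains the denoiser and, through the conditioning latent z, the 3D encoder itself; the encoder is the representation later reused by the policy.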

Real-World Demo.
Approach.

Model Overview

FVP mainly consists of two parts: a 3D visual encoder and a point cloud diffusion model. The 3D visual encoder transforms the point cloud at time step t into a latent visual representation, and the diffusion model uses this latent representation together with the robot's actions to predict the point cloud at time step t+1. During the policy learning stage, we train the pre-trained visual model and the policy backbone jointly.
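
A hypothetical sketch of this policy-learning stage is shown below: the FVP pre-trained encoder is dropped into a simple behavior-cloning policy and fine-tuned jointly. The stand-in encoder, head, and loss are placeholders; DP3's actual backbone and diffusion policy head are more elaborate.

import torch
import torch.nn as nn

class PolicyWithFVPEncoder(nn.Module):
    """Policy head on top of an FVP pre-trained 3D encoder (sketch only)."""
    def __init__(self, encoder, latent_dim=256, action_dim=7):
        super().__init__()
        self.encoder = encoder                      # weights from FVP pre-training
        self.head = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, pc):                          # pc: (B, N, 3)
        return self.head(self.encoder(pc))

# Stand-in encoder so the sketch runs; in practice this is the pre-trained
# point cloud encoder (e.g., DP3's), initialized from the FVP stage.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(1024 * 3, 256))
policy = PolicyWithFVPEncoder(encoder)

# Joint optimization: the encoder's parameters are included, not frozen,
# matching the overview's statement that both parts are trained jointly.
opt = torch.optim.AdamW(policy.parameters(), lr=1e-4)
pc = torch.randn(8, 1024, 3)                        # dummy observation batch
expert = torch.randn(8, 7)                          # dummy expert actions
loss = nn.functional.mse_loss(policy(pc), expert)   # behavior-cloning loss
loss.backward()
opt.step()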

Real-World Results.

"DP3+FVP" and "RISE+FVP" denote the application of FVP to pretrain the visual models from DP3 and RISE, respectively. "DP3" indicates that the visual model within DP3 has not undergone pretraining. "DP3+PointMAE", "DP3+STRL", and "DP3+C2P" signify the utilization of PointMAE, STRL, and C2P to pre-train the visual model from DP3. The numbers before the comma represent the performance using in-domain datasets for pre-training, while the numbers after the comma represent the performance using out-of-domain datasets (RoboMind) for pre-training.

Visualizing the Diffusion Process on Dexterous Hand Tasks.
The Enhancing Effect of FVP on VLA Model Tasks.
BibTeX
@inproceedings{cheng2025fvp,
  author    = {Chengkai Hou and Yanjie Ze and Yankai Fu and Zeyu Gao and Yue Yu and Songbo Hu and Shanghang Zhang and Huazhe Xu},
  title     = {FVP: 4D Visual Pre-training for Robot Learning},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year      = {2025},
}