FVP
4D Visual Pre-training for Robot Learning
Chengkai Hou1, Yanjie Ze3, Yankai Fu1, Zeyu Gao4, Yue Yu2, Songbo Hu2, Shanghang Zhang1, Huazhe Xu2,3,5
1Peking University  2Tsinghua University  3Shanghai Qizhi Institute  4CASIA  5Shanghai AI Lab
Accepted to ICCV 2025

FVP is a novel 3D point cloud representation learning pipeline for robotic manipulation.

Abstract.

General visual representations learned from web-scale datasets have achieved great success in robotics in recent years, enabling data-efficient robot learning on manipulation tasks. Yet these pre-trained representations are mostly learned from 2D images, neglecting the inherent 3D nature of the world. Due to the scarcity of large-scale 3D data, it remains hard to extract a universal 3D representation from web datasets. Instead, we seek a general visual pre-training framework that can improve any 3D representation as an alternative. Our framework, called FVP, is a novel 4D Visual Pre-training framework for real-world robot learning. FVP frames the visual pre-training objective as a next-point-cloud-prediction problem, instantiates the predictor as a diffusion model, and pre-trains the model directly on in-domain task datasets.

Across twelve real-world manipulation tasks, FVP boosts the average success rate of 3D Diffusion Policy (DP3) by 28%. DP3 pre-trained with FVP achieves state-of-the-art performance among imitation learning methods. Moreover, FVP remains effective across various point cloud encoders and datasets. Finally, we apply FVP to RDT-1B, a larger Vision-Language-Action robotic model, enhancing its performance on various robot tasks.

Different from prior works based on contrastive learning or masked signal modeling, FVP trains 3D visual representations by conditioning on the point cloud of the preceding frame and employing a diffusion model to predict the point cloud of the current frame.
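As a concrete sketch of this objective (the notation below is ours, not taken from the paper): write $f_\phi$ for the 3D encoder, $P_t$ for the frame-$t$ point cloud, and $a_t$ for the robot action. A standard DDPM denoising loss conditioned on the previous-frame latent $z_t = f_\phi(P_t)$ would read

$$\mathcal{L}(\phi, \theta) = \mathbb{E}_{k,\;\epsilon \sim \mathcal{N}(0, I)} \Big[ \big\| \epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_k}\, P_{t+1} + \sqrt{1-\bar\alpha_k}\,\epsilon,\; k,\; z_t,\; a_t\big) \big\|^2 \Big],$$

where $k$ is the diffusion timestep and $\bar\alpha_k$ the usual cumulative noise schedule. Because the loss backpropagates through $z_t$, the encoder $f_\phi$ must capture the scene dynamics needed to predict the next frame, which is what makes this a representation-learning objective rather than a purely generative one.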

Real-world Demo.
Approach.

Model Overview

FVP mainly consists of two parts: a 3D visual encoder and a point cloud diffusion model. The 3D visual encoder transforms the point cloud at time step t into a latent visual representation, and the diffusion model uses this latent representation together with the robot actions to predict the point cloud at time step t+1. During the policy learning stage, we train the pre-trained visual encoder and the policy backbone jointly.
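To make the overview concrete, here is a minimal PyTorch sketch of one FVP-style pre-training step. It is written under our own assumptions: the module names, network sizes, and 7-DoF action are illustrative, and the paper's actual encoder and point cloud diffusion backbone may differ.

import torch
import torch.nn as nn

class PointCloudEncoder(nn.Module):
    """Toy PointNet-style encoder: (B, N, 3) point cloud -> (B, D) latent."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 128), nn.ReLU(),
                                 nn.Linear(128, latent_dim))
    def forward(self, pc):
        return self.mlp(pc).max(dim=1).values  # max-pool over points

class NoisePredictor(nn.Module):
    """Predicts the noise added to the next-frame point cloud, conditioned on
    the current latent, the robot action, and the diffusion timestep."""
    def __init__(self, latent_dim=256, action_dim=7, n_steps=100):
        super().__init__()
        self.step_emb = nn.Embedding(n_steps, 32)
        self.mlp = nn.Sequential(
            nn.Linear(3 + latent_dim + action_dim + 32, 256), nn.ReLU(),
            nn.Linear(256, 3))
    def forward(self, noisy_pc, k, latent, action):
        B, N, _ = noisy_pc.shape
        cond = torch.cat([latent, action, self.step_emb(k)], dim=-1)
        cond = cond.unsqueeze(1).expand(B, N, cond.shape[-1])
        return self.mlp(torch.cat([noisy_pc, cond], dim=-1))

def fvp_pretrain_step(encoder, denoiser, betas, pc_t, action_t, pc_next):
    """One DDPM-style step: noise the t+1 point cloud, predict the noise."""
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)
    k = torch.randint(0, len(betas), (pc_t.shape[0],))
    eps = torch.randn_like(pc_next)
    ab = alphas_bar[k].view(-1, 1, 1)
    noisy = ab.sqrt() * pc_next + (1 - ab).sqrt() * eps
    latent = encoder(pc_t)  # gradients reach the encoder: this is what trains the representation
    pred = denoiser(noisy, k, latent, action_t)
    return nn.functional.mse_loss(pred, eps)

# Example usage (toy shapes: batch 8, 1024 points, 7-DoF actions):
# enc, den = PointCloudEncoder(), NoisePredictor()
# betas = torch.linspace(1e-4, 0.02, 100)
# loss = fvp_pretrain_step(enc, den, betas,
#                          torch.randn(8, 1024, 3), torch.randn(8, 7),
#                          torch.randn(8, 1024, 3))
# loss.backward()  # updates both the denoiser and the encoder

After this pre-training stage, the encoder would be fine-tuned jointly with the policy backbone, as described above.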

Real-World Results.

"DP3+FVP" and "RISE+FVP" denote the application of FVP to pretrain the visual models from DP3 and RISE, respectively. "DP3" indicates that the visual model within DP3 has not undergone pretraining. "DP3+PointMAE", "DP3+STRL", and "DP3+C2P" signify the utilization of PointMAE, STRL, and C2P to pre-train the visual model from DP3. The numbers before the comma represent the performance using in-domain datasets for pre-training, while the numbers after the comma represent the performance using out-of-domain datasets (RoboMind) for pre-training.

Visualizing the Diffusion Process on Dexterous Hand Tasks.
The Enhancing Effect of FVP on VLA Model Tasks.
BibTeX
@inproceedings{cheng2025fvp,
  author    = {Chengkai Hou and Yanjie Ze and Yankai Fu and Zeyu Gao and Yue Yu and Songbo Hu and Shanghang Zhang and Huazhe Xu},
  title     = {FVP: 4D Visual Pre-training for Robot Learning},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year      = {2025},
}