Abstract
EchoJEPA is a state-of-the-art foundation model for echocardiography, pretrained on 18 million videos from 300K patients, the largest echocardiography pretraining corpus to date. It outperforms prior methods even when only 1% of labeled data is available for downstream training.
State-of-the-Art
~20% better LVEF estimation via latent prediction
Multi-View Framework
Integrates views without view-specific components
Unified Evaluation
Frozen backbones for fair comparison
Robust to Degradation
86% less sensitivity to acoustic artifacts
Methods
Latent Predictive Pretraining
Mask & Encode
Partition videos into spatio-temporal tubelets, mask, and encode visible context
Predict in Latent Space
Predictor infers embeddings of masked regions from visible patches
EMA Target
L₁ loss against an exponential-moving-average target encoder suppresses speckle noise

EchoJEPA architecture. Encoder processes visible patches; predictor infers masked embeddings.
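The three-step recipe above can be sketched end to end. The following is a minimal NumPy illustration with toy dimensions, where plain linear maps stand in for the ViT context encoder, predictor, and target encoder; all names and sizes here are illustrative, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (illustrative): a video is split into N spatio-temporal
# tubelet tokens of dimension D.
N, D = 16, 8
tokens = rng.normal(size=(N, D))

# 1) Mask & Encode: hide half the tubelets; the context encoder
#    (a linear map here, a video ViT in practice) sees only the rest.
mask = np.zeros(N, dtype=bool)
mask[rng.choice(N, N // 2, replace=False)] = True   # True = to be predicted
W_ctx = rng.normal(size=(D, D)) * 0.1               # context-encoder "weights"
context = tokens[~mask] @ W_ctx                     # visible-patch embeddings

# 2) Predict in latent space: a predictor infers an embedding for each
#    masked position from the visible context (pooled here for brevity).
W_pred = rng.normal(size=(D, D)) * 0.1
pred = np.tile(context.mean(axis=0) @ W_pred, (mask.sum(), 1))

# 3) EMA target: regression targets come from a target encoder whose
#    weights are an exponential moving average of the context encoder's.
W_tgt = np.copy(W_ctx)                # initialized from the context encoder
W_tgt = 0.99 * W_tgt + 0.01 * W_ctx   # EMA update after each optimizer step

# L1 loss in latent space: predicting embeddings rather than pixels means
# speckle noise never has to be reconstructed.
target = tokens[mask] @ W_tgt
loss = np.abs(pred - target).mean()
```

Regressing embeddings instead of pixels is the design choice behind the robustness results: the loss never rewards modeling acoustic texture.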
Multi-View Attentive Probing
Aggregates information across multiple echocardiographic views for study-level predictions.
Factorized Embeddings
Learnable view + clip embeddings encode study position
Attention Masking
Ignores tokens from missing views during cross-attention
View Dropout
10% random masking for robustness to incomplete studies

Multi-view probing. Frozen encoder extracts embeddings; factorized stream embeddings encode position; view dropout improves robustness.
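The probing head described above can be sketched as single-query cross-attention with masking. A minimal NumPy illustration follows; toy sizes, random weights, and a hand-fixed dropout draw stand in for the learned components:

```python
import numpy as np

rng = np.random.default_rng(0)

V, C, D = 4, 3, 8                        # views/study, clips/view, dim (toy)
clip_emb = rng.normal(size=(V, C, D))    # frozen-backbone clip embeddings

# Factorized embeddings: learnable view + clip-position embeddings mark
# where each clip sits within the study.
view_emb = rng.normal(size=(V, 1, D)) * 0.1
clip_pos = rng.normal(size=(1, C, D)) * 0.1
tokens = (clip_emb + view_emb + clip_pos).reshape(V * C, D)

# Attention masking + view dropout: tokens from views absent in the study
# are ignored; during training, present views are also dropped with p=0.10.
present = np.array([True, True, True, False])    # this study lacks view 4
dropout = np.array([False, True, False, False])  # Bernoulli(0.10) draw, fixed here
keep = np.repeat(present & ~dropout, C)

# Attentive probe: one learnable query cross-attends over the kept tokens
# to form a study-level representation.
query = rng.normal(size=D) * 0.1
scores = tokens @ query / np.sqrt(D)
scores[~keep] = -np.inf                  # masked tokens receive zero attention
attn = np.exp(scores - scores[keep].max())
attn /= attn.sum()
study_repr = attn @ tokens               # aggregated study-level embedding
```

Because masked positions are set to negative infinity before the softmax, missing or dropped views contribute exactly zero weight, so the same head handles complete and incomplete studies.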
Results
Comprehensive evaluation across multiple downstream tasks and clinical sites.
- ~20% lower LVEF estimation error vs. baselines
- 78.6% view-classification accuracy with 1% of labels, vs. 42.1% for the best baseline trained on 100%
- 86% less performance degradation under acoustic artifacts than the next best model
- 4.32 pediatric zero-shot error, vs. 5.10 for the next best model

Downstream evaluation. Right Ventricular Systolic Pressure (RVSP), Left Ventricular Ejection Fraction (LVEF), and view classification.
| Model | Backbone | Pretraining Videos |
|---|---|---|
| VideoMAE-L† | ViT-L (300M) | 525K |
| EchoJEPA-L | ViT-L (300M) | 525K |
| EchoJEPA-G | ViT-G (1.1B) | 18.1M |
†Compute-matched baseline
Comparison Tables
Head-to-head performance against state-of-the-art baselines.
| Model | LVEF MAE | View Acc |
|---|---|---|
| EchoMAE-L | 8.15 | 40.4% |
| EchoJEPA-L | 5.97 | 85.5% |
| Improvement | -26.7% | +45.1% |
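Note that the "Improvement" row mixes two conventions: the MAE entry is a relative reduction, while the accuracy entry is an absolute percentage-point gain. Both can be checked directly from the table:

```python
# Values from the compute-matched comparison table above.
mae_base, mae_jepa = 8.15, 5.97
acc_base, acc_jepa = 40.4, 85.5

mae_delta = (mae_jepa - mae_base) / mae_base  # relative change (lower MAE is better)
acc_delta = acc_jepa - acc_base               # absolute gain in percentage points

print(f"LVEF MAE: {mae_delta:+.1%}")    # -26.7%
print(f"View Acc: {acc_delta:+.1f} pp")  # +45.1 pp
```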
LVEF MAE by site:
| Model | Toronto | Chicago | Stanford† |
|---|---|---|---|
| EchoPrime | 5.33 | 6.71 | 4.87 |
| PanEcho | 5.43 | 6.52 | 5.45 |
| EchoJEPA-G | 4.26 | 5.44 | 3.97 |
†EchoNet-Dynamic
View classification accuracy (%) by fraction of labeled training data:
| Model | 1% | 10% | 100% |
|---|---|---|---|
| EchoPrime | 21.6 | 32.1 | 42.1 |
| PanEcho | 21.5 | 30.6 | 41.9 |
| EchoJEPA-G | 78.6 | 84.4 | 87.4 |
RVSP estimation MAE by site:
| Model | Toronto | Chicago |
|---|---|---|
| EchoPrime | 5.65 | 5.29 |
| PanEcho | 5.49 | 5.26 |
| EchoJEPA-G | 4.54 | 4.91 |
Pediatric cohort MAE, zero-shot vs. fine-tuned:
| Model | Zero-Shot | Fine-Tuned |
|---|---|---|
| EchoPrime | 5.10 | 4.53 |
| PanEcho | 5.66 | 5.34 |
| EchoJEPA-G | 4.32 | 3.88 |
Robustness to acoustic degradation (MAE; three severity levels per corruption):
| Model | Clean | Depth Atten. 1 | Depth Atten. 2 | Depth Atten. 3 | Shadow 1 | Shadow 2 | Shadow 3 | Avg Deg |
|---|---|---|---|---|---|---|---|---|
| EchoPrime | 4.87 | 5.58 | 5.71 | 5.91 | 5.55 | 5.61 | 5.78 | +16.8% |
| PanEcho | 5.10 | 5.10 | 5.39 | 5.46 | 5.19 | 5.21 | 5.38 | +3.7% |
| EchoJEPA-G | 3.97 | 4.01 | 4.07 | 4.17 | 4.02 | 4.04 | 4.07 | +2.3% |
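The "Avg Deg" column is the mean MAE over the six corrupted conditions relative to the clean MAE. It can be re-derived from the row values; the last digit may differ slightly because the published per-column numbers are rounded:

```python
# Per-row values from the degradation table above.
clean = {"EchoPrime": 4.87, "PanEcho": 5.10, "EchoJEPA-G": 3.97}
degraded = {
    "EchoPrime":  [5.58, 5.71, 5.91, 5.55, 5.61, 5.78],
    "PanEcho":    [5.10, 5.39, 5.46, 5.19, 5.21, 5.38],
    "EchoJEPA-G": [4.01, 4.07, 4.17, 4.02, 4.04, 4.07],
}

# Average degradation = mean(corrupted MAE) / clean MAE - 1.
avg_deg = {m: sum(v) / len(v) / clean[m] - 1.0 for m, v in degraded.items()}
for m, d in avg_deg.items():
    print(f"{m}: {d:+.1%}")
```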
Attention Visualization
V-JEPA attends to anatomy; VideoMAE tracks imaging artifacts

Attention maps. Fine-tuned V-JEPA demonstrates precise anatomical localization synchronized to cardiac motion.
Latent Space Organization
EchoJEPA forms distinct anatomical clusters

UMAP of frozen representations. Baselines show diffuse distributions; EchoJEPA separates views cleanly.