EchoJEPA

A state-of-the-art foundation model for echocardiography trained on 18M videos across 300K patients.

Attention maps show EchoJEPA learns cardiac anatomy and motion

Abstract

EchoJEPA is a state-of-the-art foundation model for echocardiography trained on 18 million videos across 300K patients, the largest pretraining corpus to date. It outperforms prior methods even when trained on only 1% of the labeled data.

- State-of-the-Art: ~20% better LVEF estimation via latent prediction
- Multi-View Framework: integrates views without view-specific components
- Unified Evaluation: frozen backbones for fair comparison
- Robust to Degradation: 86% less sensitivity to acoustic artifacts

Methods

Latent Predictive Pretraining

1. Mask & Encode: partition videos into spatio-temporal tubelets, mask a subset, and encode the visible context.
2. Predict in Latent Space: a predictor infers the embeddings of masked regions from the visible patches.
3. EMA Target: an L₁ loss against an exponential-moving-average target encoder suppresses speckle noise.
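The three steps above can be sketched in a few lines of numpy. This is a minimal illustration, not the released implementation: the linear "encoders," the pooled predictor, the dimensions, and the masking ratio are all stand-in assumptions (the real model uses ViT backbones and backpropagates through the predictor, which is elided here).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; the real model tokenizes echo videos into spatio-temporal
# tubelets and encodes them with a ViT (all names/dims here are illustrative).
N_TUBELETS, DIM = 16, 8
MASK_RATIO, MOMENTUM = 0.5, 0.99

W_ctx = rng.normal(size=(DIM, DIM))   # context-encoder stand-in
W_tgt = W_ctx.copy()                  # EMA target encoder starts as a copy
W_pred = rng.normal(size=(DIM, DIM))  # predictor stand-in

def train_step(tubelets):
    """One latent-predictive step: mask, encode context, predict masked
    latents, take an L1 loss against EMA targets, then update the EMA."""
    global W_tgt

    # 1) Mask: split tubelet tokens into visible context and hidden targets.
    idx = rng.permutation(N_TUBELETS)
    n_mask = int(MASK_RATIO * N_TUBELETS)
    masked, visible = idx[:n_mask], idx[n_mask:]

    # 2) Encode only the visible tokens; predict latents for the masked ones
    #    (a crude pooled predictor stands in for cross-attention).
    ctx = tubelets[visible] @ W_ctx
    pred = np.repeat(ctx.mean(axis=0, keepdims=True) @ W_pred, n_mask, axis=0)

    # 3) Targets come from the EMA encoder; L1 is less sensitive to speckle
    #    outliers than L2. (The gradient step on W_ctx is elided in this sketch.)
    target = tubelets[masked] @ W_tgt
    loss = float(np.abs(pred - target).mean())

    # EMA update keeps the target encoder a slow-moving copy of the context one.
    W_tgt = MOMENTUM * W_tgt + (1.0 - MOMENTUM) * W_ctx
    return loss

clip = rng.normal(size=(N_TUBELETS, DIM))  # stand-in tubelet features
print(f"L1 latent loss: {train_step(clip):.3f}")
```

Predicting in latent space rather than pixel space is what lets the objective ignore speckle: the target encoder's embeddings, not raw intensities, define what must be reconstructed.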

EchoJEPA Architecture

EchoJEPA architecture. Encoder processes visible patches; predictor infers masked embeddings.

Multi-View Attentive Probing

Aggregates information across multiple echocardiographic views for study-level predictions.

- Factorized Embeddings: learnable view and clip embeddings encode each token's position within the study
- Attention Masking: tokens from missing views are ignored during cross-attention
- View Dropout: 10% of views are randomly masked during training for robustness to incomplete studies

Multi-view Probing Framework

Multi-view probing. Frozen encoder extracts embeddings; factorized stream embeddings encode position; view dropout improves robustness.
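The probing mechanics described above can be sketched in numpy. This is an illustrative assumption-laden toy, not the paper's code: the study layout, the single-query pooled attention, and the per-view 10% dropout draw are stand-ins chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

N_VIEWS, N_CLIPS, DIM = 4, 3, 8                    # illustrative study layout
tokens = rng.normal(size=(N_VIEWS, N_CLIPS, DIM))  # frozen clip embeddings

# Factorized position embeddings: one vector per view plus one per clip slot.
view_emb = rng.normal(size=(N_VIEWS, 1, DIM))
clip_emb = rng.normal(size=(1, N_CLIPS, DIM))
x = (tokens + view_emb + clip_emb).reshape(-1, DIM)  # (views * clips, dim)

# Availability mask: views 0-2 were acquired; view 3 is missing from this study.
present = np.array([True, True, True, False])

# View dropout: hide ~10% of the present views at random during training.
keep_views = present & (rng.random(N_VIEWS) >= 0.10)
if not keep_views.any():                   # never drop every available view
    keep_views = present
keep = np.repeat(keep_views, N_CLIPS)      # broadcast the view mask to tokens

# Single-query cross-attention pooling (a stand-in for the attentive probe).
query = rng.normal(size=(1, DIM))
scores = (query @ x.T) / np.sqrt(DIM)
scores[:, ~keep] = -np.inf                 # masked tokens get zero attention
attn = softmax(scores, axis=-1)
study_repr = attn @ x                      # study-level feature for the head

print("attention mass on masked tokens:", attn[0, ~keep].sum())
```

Setting masked scores to negative infinity before the softmax guarantees missing or dropped views contribute exactly zero weight, so the same frozen backbone handles complete and incomplete studies alike.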

Results

Comprehensive evaluation across multiple downstream tasks and clinical sites.

- ~20% LVEF improvement over baselines
- 78.6% view-classification accuracy with only 1% of training labels, vs. 42% for the best baseline trained on 100%
- 86% lower degradation under acoustic artifacts than the next-best model
- 4.32 MAE on zero-shot pediatric LVEF, vs. 5.10 for the next-best model

Downstream Evaluation Tasks

Downstream evaluation. Right Ventricular Systolic Pressure (RVSP), Left Ventricular Ejection Fraction (LVEF), and view classification.

Model Configurations

| Model | Backbone | Videos |
| --- | --- | --- |
| VideoMAE-L† | ViT-L (300M) | 525K |
| EchoJEPA-L | ViT-L (300M) | 525K |
| EchoJEPA-G | ViT-G (1.1B) | 18.1M |

†Compute-matched baseline

Comparison Tables

Head-to-head performance against state-of-the-art baselines.

Latent vs. Pixel Prediction (same compute)

| Model | LVEF MAE | View Acc |
| --- | --- | --- |
| EchoMAE-L | 8.15 | 40.4% |
| EchoJEPA-L | 5.97 | 85.5% |
| Improvement | -26.7% | +45.1% |

LVEF Regression (MAE %, lower is better)

| Model | Toronto | Chicago | Stanford† |
| --- | --- | --- | --- |
| EchoPrime | 5.33 | 6.71 | 4.87 |
| PanEcho | 5.43 | 6.52 | 5.45 |
| EchoJEPA-G | 4.26 | 5.44 | 3.97 |

†EchoNet-Dynamic

View Classification (accuracy % by fraction of training data)

| Model | 1% | 10% | 100% |
| --- | --- | --- | --- |
| EchoPrime | 21.6 | 32.1 | 42.1 |
| PanEcho | 21.5 | 30.6 | 41.9 |
| EchoJEPA-G | 78.6 | 84.4 | 87.4 |

Multi-View RVSP (MAE, mmHg)

| Model | Toronto | Chicago |
| --- | --- | --- |
| EchoPrime | 5.65 | 5.29 |
| PanEcho | 5.49 | 5.26 |
| EchoJEPA-G | 4.54 | 4.91 |

Adult→Pediatric Transfer (LVEF MAE %)

| Model | Zero-Shot | Fine-Tuned |
| --- | --- | --- |
| EchoPrime | 5.10 | 4.53 |
| PanEcho | 5.66 | 5.34 |
| EchoJEPA-G | 4.32 | 3.88 |
Robustness to Acoustic Degradation (LVEF MAE % on Stanford)

| Model | Clean | Deg. 1 | Deg. 2 | Deg. 3 | Deg. 4 | Deg. 5 | Deg. 6 | Avg Deg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EchoPrime | 4.87 | 5.58 | 5.71 | 5.91 | 5.55 | 5.61 | 5.78 | +16.8% |
| PanEcho | 5.10 | 5.10 | 5.39 | 5.46 | 5.19 | 5.21 | 5.38 | +3.7% |
| EchoJEPA-G | 3.97 | 4.01 | 4.07 | 4.17 | 4.02 | 4.04 | 4.07 | +2.3% |

The six degradation conditions include depth attenuation, Gaussian, and shadow artifacts; Avg Deg is the mean degraded MAE relative to clean.

Attention Visualization

V-JEPA localizes on anatomy; VideoMAE tracks artifacts

Attention Visualization

Attention maps. Fine-tuned V-JEPA demonstrates precise anatomical localization synchronized to cardiac motion.

Latent Space Organization

EchoJEPA forms distinct anatomical clusters

UMAP Visualization

UMAP of frozen representations. Baselines show diffuse distributions; EchoJEPA separates views cleanly.

Authors

Alif Munim¹,⁶*
Adibvafa Fallahpour¹,⁵,²*
Teodora Szasz³,⁷*
Ahmadreza Attarpour¹
River Jiang⁴
Brana Sooriyakanthan¹
Maala Sooriyakanthan¹
Heather Whitney³
Jeremy Slivnick³
Barry Rubin¹,²
Wendy Tsang¹,²
Bo Wang¹,⁵,²

¹University Health Network · ²University of Toronto · ³University of Chicago · ⁴UC San Francisco · ⁵Vector Institute · ⁶Cohere Labs · ⁷Philips Health

* Equal contribution

Citation

If you find EchoJEPA useful, please cite our paper.

BibTeX
@misc{munim2026echojepalatentpredictivefoundation,
      title={EchoJEPA: A Latent Predictive Foundation Model for Echocardiography}, 
      author={Alif Munim and Adibvafa Fallahpour and Teodora Szasz and Ahmadreza Attarpour and River Jiang and Brana Sooriyakanthan and Maala Sooriyakanthan and Heather Whitney and Jeremy Slivnick and Barry Rubin and Wendy Tsang and Bo Wang},
      year={2026},
      eprint={2602.02603},
      archivePrefix={arXiv},
      primaryClass={eess.IV},
      url={https://arxiv.org/abs/2602.02603}, 
}