Abstract
EchoJEPA is a state-of-the-art foundation model for echocardiography, pretrained on 18 million videos from 300K patients, the largest echocardiography pretraining corpus to date. It outperforms prior methods even when only 1% of labeled data is available for downstream training.
State-of-the-Art
~20% better LVEF estimation via latent prediction
Multi-View Framework
Integrates views without view-specific components
Unified Evaluation
Frozen backbones for fair comparison
Robust to Degradation
86% less sensitivity to acoustic artifacts
Methods
Latent Predictive Pretraining
Mask & Encode
Partition videos into spatio-temporal tubelets, mask, and encode visible context
Predict in Latent Space
Predictor infers embeddings of masked regions from visible patches
EMA Target
L₁ loss against an exponential-moving-average target encoder suppresses speckle noise

EchoJEPA architecture. Encoder processes visible patches; predictor infers masked embeddings.
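The three-step recipe above can be sketched end to end. The following is a minimal NumPy illustration with toy dimensions, where plain linear maps stand in for the ViT context encoder, predictor, and target encoder; all names and sizes here are illustrative, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (illustrative): a video is split into N spatio-temporal
# tubelet tokens of dimension D.
N, D = 16, 8
tokens = rng.normal(size=(N, D))

# 1) Mask & Encode: hide half the tubelets; the context encoder
#    (a linear map here, a video ViT in practice) sees only the rest.
mask = np.zeros(N, dtype=bool)
mask[rng.choice(N, N // 2, replace=False)] = True   # True = to be predicted
W_ctx = rng.normal(size=(D, D)) * 0.1               # context-encoder "weights"
context = tokens[~mask] @ W_ctx                     # visible-patch embeddings

# 2) Predict in latent space: a predictor infers an embedding for each
#    masked position from the visible context (pooled here for brevity).
W_pred = rng.normal(size=(D, D)) * 0.1
pred = np.tile(context.mean(axis=0) @ W_pred, (mask.sum(), 1))

# 3) EMA target: regression targets come from a target encoder whose
#    weights are an exponential moving average of the context encoder's.
W_tgt = np.copy(W_ctx)                # initialized from the context encoder
W_tgt = 0.99 * W_tgt + 0.01 * W_ctx   # EMA update after each optimizer step

# L1 loss in latent space: predicting embeddings rather than pixels means
# speckle noise never has to be reconstructed.
target = tokens[mask] @ W_tgt
loss = np.abs(pred - target).mean()
```

Regressing embeddings instead of pixels is the design choice behind the robustness results: the loss never rewards modeling acoustic texture.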
Multi-View Attentive Probing
Aggregates information across multiple echocardiographic views for study-level predictions.
Factorized Embeddings
Learnable view + clip embeddings encode study position
Attention Masking
Ignores tokens from missing views during cross-attention
View Dropout
10% random masking for robustness to incomplete studies

Multi-view probing. Frozen encoder extracts embeddings; factorized stream embeddings encode position; view dropout improves robustness.
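The probing head described above can be sketched as single-query cross-attention with masking. A minimal NumPy illustration follows; toy sizes, random weights, and a hand-fixed dropout draw stand in for the learned components:

```python
import numpy as np

rng = np.random.default_rng(0)

V, C, D = 4, 3, 8                        # views/study, clips/view, dim (toy)
clip_emb = rng.normal(size=(V, C, D))    # frozen-backbone clip embeddings

# Factorized embeddings: learnable view + clip-position embeddings mark
# where each clip sits within the study.
view_emb = rng.normal(size=(V, 1, D)) * 0.1
clip_pos = rng.normal(size=(1, C, D)) * 0.1
tokens = (clip_emb + view_emb + clip_pos).reshape(V * C, D)

# Attention masking + view dropout: tokens from views absent in the study
# are ignored; during training, present views are also dropped with p=0.10.
present = np.array([True, True, True, False])    # this study lacks view 4
dropout = np.array([False, True, False, False])  # Bernoulli(0.10) draw, fixed here
keep = np.repeat(present & ~dropout, C)

# Attentive probe: one learnable query cross-attends over the kept tokens
# to form a study-level representation.
query = rng.normal(size=D) * 0.1
scores = tokens @ query / np.sqrt(D)
scores[~keep] = -np.inf                  # masked tokens receive zero attention
attn = np.exp(scores - scores[keep].max())
attn /= attn.sum()
study_repr = attn @ tokens               # aggregated study-level embedding
```

Because masked positions are set to negative infinity before the softmax, missing or dropped views contribute exactly zero weight, so the same head handles complete and incomplete studies.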
Results
Comprehensive evaluation across multiple downstream tasks and clinical sites.
- ~20% lower LVEF estimation error vs. baselines
- 78.6% view-classification accuracy with 1% of labels, vs. 42.1% for the best baseline trained on 100%
- 86% less performance degradation under acoustic artifacts than the next best model
- 4.32 pediatric zero-shot error, vs. 5.10 for the next best model

Downstream evaluation. Right Ventricular Systolic Pressure (RVSP), Left Ventricular Ejection Fraction (LVEF), and view classification.
| Model | Backbone | Pretraining Videos |
|---|---|---|
| VideoMAE-L† | ViT-L (300M) | 525K |
| EchoJEPA-L | ViT-L (300M) | 525K |
| EchoJEPA-G | ViT-G (1.1B) | 18.1M |
†Compute-matched baseline
Comparison Tables
Head-to-head performance against state-of-the-art baselines.
| Model | LVEF MAE | View Acc |
|---|---|---|
| EchoMAE-L | 8.15 | 40.4% |
| EchoJEPA-L | 5.97 | 85.5% |
| Improvement | -26.7% | +45.1% |
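Note that the "Improvement" row mixes two conventions: the MAE entry is a relative reduction, while the accuracy entry is an absolute percentage-point gain. Both can be checked directly from the table:

```python
# Values from the compute-matched comparison table above.
mae_base, mae_jepa = 8.15, 5.97
acc_base, acc_jepa = 40.4, 85.5

mae_delta = (mae_jepa - mae_base) / mae_base  # relative change (lower MAE is better)
acc_delta = acc_jepa - acc_base               # absolute gain in percentage points

print(f"LVEF MAE: {mae_delta:+.1%}")    # -26.7%
print(f"View Acc: {acc_delta:+.1f} pp")  # +45.1 pp
```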
LVEF MAE by site:
| Model | Toronto | Chicago | Stanford† |
|---|---|---|---|
| EchoPrime | 5.33 | 6.71 | 4.87 |
| PanEcho | 5.43 | 6.52 | 5.45 |
| EchoJEPA-G | 4.26 | 5.44 | 3.97 |
†EchoNet-Dynamic
View classification accuracy (%) by fraction of labeled training data:
| Model | 1% | 10% | 100% |
|---|---|---|---|
| EchoPrime | 21.6 | 32.1 | 42.1 |
| PanEcho | 21.5 | 30.6 | 41.9 |
| EchoJEPA-G | 78.6 | 84.4 | 87.4 |
RVSP estimation MAE by site:
| Model | Toronto | Chicago |
|---|---|---|
| EchoPrime | 5.65 | 5.29 |
| PanEcho | 5.49 | 5.26 |
| EchoJEPA-G | 4.54 | 4.91 |
Pediatric cohort MAE, zero-shot vs. fine-tuned:
| Model | Zero-Shot | Fine-Tuned |
|---|---|---|
| EchoPrime | 5.10 | 4.53 |
| PanEcho | 5.66 | 5.34 |
| EchoJEPA-G | 4.32 | 3.88 |
Robustness to acoustic degradation (MAE; three severity levels per corruption):
| Model | Clean | Depth Atten. 1 | Depth Atten. 2 | Depth Atten. 3 | Shadow 1 | Shadow 2 | Shadow 3 | Avg Deg |
|---|---|---|---|---|---|---|---|---|
| EchoPrime | 4.87 | 5.58 | 5.71 | 5.91 | 5.55 | 5.61 | 5.78 | +16.8% |
| PanEcho | 5.10 | 5.10 | 5.39 | 5.46 | 5.19 | 5.21 | 5.38 | +3.7% |
| EchoJEPA-G | 3.97 | 4.01 | 4.07 | 4.17 | 4.02 | 4.04 | 4.07 | +2.3% |
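The "Avg Deg" column is the mean MAE over the six corrupted conditions relative to the clean MAE. It can be re-derived from the row values; the last digit may differ slightly because the published per-column numbers are rounded:

```python
# Per-row values from the degradation table above.
clean = {"EchoPrime": 4.87, "PanEcho": 5.10, "EchoJEPA-G": 3.97}
degraded = {
    "EchoPrime":  [5.58, 5.71, 5.91, 5.55, 5.61, 5.78],
    "PanEcho":    [5.10, 5.39, 5.46, 5.19, 5.21, 5.38],
    "EchoJEPA-G": [4.01, 4.07, 4.17, 4.02, 4.04, 4.07],
}

# Average degradation = mean(corrupted MAE) / clean MAE - 1.
avg_deg = {m: sum(v) / len(v) / clean[m] - 1.0 for m, v in degraded.items()}
for m, d in avg_deg.items():
    print(f"{m}: {d:+.1%}")
```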
Attention Visualization
V-JEPA attends to anatomy; VideoMAE tracks imaging artifacts

Attention maps. Fine-tuned V-JEPA demonstrates precise anatomical localization synchronized to cardiac motion.
Latent Space Organization
EchoJEPA forms distinct anatomical clusters

UMAP of frozen representations. Baselines show diffuse distributions; EchoJEPA separates views cleanly.