Efficient Autoregressive Inference for Transformer Probabilistic Models

Abstract

Transformer-based models for amortized probabilistic inference, such as neural processes, prior-fitted networks, and tabular foundation models, excel at single-pass marginal prediction. However, many real-world applications require coherent joint distributions that capture dependencies between predictions. While purely autoregressive architectures efficiently generate such distributions, they sacrifice the flexible set-conditioning that makes these models powerful for meta-learning. Conversely, the standard approach to obtain joint distributions from set-based models requires expensive re-encoding of an updated context set at each autoregressive step.

We introduce a causal autoregressive buffer that preserves the advantages of both paradigms. Our approach decouples context encoding from updating the conditioning set. The model processes the context once and caches it, while a dynamic buffer captures target dependencies: as targets are incorporated, they enter the buffer and attend to both the cached context and previously buffered targets. This enables efficient batched autoregressive generation and one-pass joint predictive density evaluation. Training seamlessly integrates set-based and autoregressive modes at minimal additional cost. Across synthetic functions, EEG signals, cognitive models, and tabular data, our method matches the predictive accuracy of strong baselines while delivering up to 20× faster joint sampling.

Method

Our key insight is to separate the roles of the initial context and predicted targets. We preserve permutation invariance for the initial context (encoded once and cached) while handling target dependencies through a separate causal mechanism.
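
As a reading aid, the joint predictive density that autoregressive inference targets can be written with the chain rule. The notation here is an assumption introduced for illustration, not taken from the paper: \(\mathcal{C}\) is the initial context set of \(N\) points, and \((x^*_k, y^*_k)\) for \(k = 1, \dots, K\) are the target inputs and outputs in a chosen order.

\[
\log p\left(y^*_{1:K} \mid x^*_{1:K}, \mathcal{C}\right) = \sum_{k=1}^{K} \log p\left(y^*_k \mid x^*_k, \mathcal{C}, \{(x^*_j, y^*_j)\}_{j<k}\right)
\]

Each term conditions on the same fixed \(\mathcal{C}\) plus a growing prefix of earlier targets. That split is what allows \(\mathcal{C}\) to be encoded once while only the prefix is handled causally.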

Standard Autoregressive

\(\mathcal{O}(K(N+K)^2)\) complexity

Re-encodes the entire augmented context set at each step. Each new prediction triggers complete re-computation of the context representation.
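
To make the cost of re-encoding concrete, the following is a minimal counting sketch, not the authors' implementation. It assumes one full self-attention pass over the augmented set at every step and counts only pairwise attention scores; the function name and the 0-based step convention are hypothetical.

```python
def standard_ar_attention_pairs(n_context: int, n_targets: int) -> int:
    """Count attention scores when the context set is re-encoded at every step.

    Assumption: at step k (0-based) the model runs full self-attention over
    the n_context original points plus the k targets predicted so far,
    which costs (n_context + k) ** 2 pairwise scores.
    """
    return sum((n_context + k) ** 2 for k in range(n_targets))


# Small worked case: N = 4 context points, K = 3 targets.
# Steps see 4, 5, and 6 points: 16 + 25 + 36 = 77 scores in total.
total = standard_ar_attention_pairs(4, 3)

# Every step sees at most N + K points, so the total is bounded by K * (N + K)^2.
bound = 3 * (4 + 3) ** 2
```

Under these assumptions the count never exceeds \(K(N+K)^2\), which is consistent with the complexity stated above.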

Our Buffered Approach

\(\mathcal{O}(N^2 + NK + K^2)\) complexity

Encodes context once and caches it. New predictions enter a causal buffer that attends to both the cached context and previous buffer entries.
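
The attention pattern described here can be sketched as a boolean mask. This is an illustrative sketch under stated assumptions, not the paper's code: the function name is hypothetical, context and buffer tokens are laid out as one sequence with the context first, and each buffer entry is also allowed to attend to itself (the text only specifies the cached context and previous buffer entries).

```python
def buffer_attention_mask(n_context: int, n_buffer: int) -> list[list[bool]]:
    """Build a mask where mask[q][k] is True if query q may attend to key k.

    Indices 0 .. n_context-1 are context points; the rest are buffer entries
    in the order they were added.
    """
    size = n_context + n_buffer
    mask = [[False] * size for _ in range(size)]
    for q in range(size):
        for k in range(size):
            if k < n_context:
                # Context keys are visible to every query (context and buffer).
                mask[q][k] = True
            elif q >= n_context and k <= q:
                # Buffer keys are visible only to buffer queries at or after them
                # (attending to itself is an assumption of this sketch).
                mask[q][k] = True
    return mask


mask = buffer_attention_mask(n_context=3, n_buffer=2)
allowed = sum(sum(row) for row in mask)  # number of attention scores computed
```

Context rows never see buffer columns, so the cached context representation does not change as targets are added. With this layout the number of allowed entries is \(N^2 + NK + K(K+1)/2\), in line with the \(\mathcal{O}(N^2 + NK + K^2)\) complexity stated above.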

Computational benchmarks. Wall-clock time (log scale) for sampling, density evaluation, and training, along with peak memory usage, versus the number of context points \(N\). Our method closely matches autoregressive baselines in predictive performance while offering significant speedups and lower memory usage.

Contributions

1. We introduce the causal autoregressive buffer, a mechanism that decouples set-based context encoding from sequential prediction, enabling efficient joint sampling and predictive density evaluation from transformer-based amortized probabilistic models.
2. We propose a unified training strategy using masked attention and a buffer-size curriculum that allows a single model to learn both modes of operation at minimal additional cost.
3. We demonstrate that our approach is broadly applicable to transformer-based probabilistic models, including TNPs/PFNs and tabular foundation models (TabICL), achieving up to 20× speedup while maintaining comparable predictive accuracy across diverse tasks.

Experiments

We validate our method across diverse tasks: regression on synthetic functions, interpolation of real-world EEG data, Bayesian model selection on a multisensory perception model, and pre-training of a tabular foundation model.

Task	TNP-D (AR)	TNP-D (Ind)	TNP-A	Ours (K=16)
GP	2.57	2.22	2.24	2.51
Sawtooth	1.05	0.94	0.98	1.00
EEG-Int	0.51	0.36	0.58	0.52
EEG-For	1.07	-0.74	1.23	0.85

Average log predictive density (↑) on synthetic functions and EEG tasks. Our method (\(K=16\)) stays within 0.06 of the expensive TNP-D (AR) baseline on GP, Sawtooth, and EEG-Int, and is 0.22 lower on EEG-For, while being up to 20× faster.

Multisensory causal inference model comparison. Log marginal likelihood (LML) for both \(\rho=1\) and \(\rho=\frac{4}{3}\), together with the LML difference between them. Our method closely aligns with the ground truth for Bayesian model selection.

Citation

If you find this work useful, please cite our paper:

@inproceedings{hassan2026efficient,
  title={Efficient Autoregressive Inference for Transformer Probabilistic Models},
  author={Conor Hassan and Nasrulloh Loka and Cen-You Li and Daolang Huang and Paul E. Chang and Yang Yang and Francesco Silvestrin and Samuel Kaski and Luigi Acerbi},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://arxiv.org/abs/2510.09477},
}