PriTran: Privacy-Preserving Inference for Transformer-Based Language Models under Fully Homomorphic Encryption
Transformer-based language models power many cloud services, but running inference on sensitive data raises confidentiality concerns. Fully Homomorphic Encryption (FHE) enables computation directly on encrypted inputs, yet its high computational cost makes Transformers difficult to deploy. This paper presents PriTran, an efficient CKKS-based library for privacy-preserving Transformer inference on CPUs. Going beyond the only prior work, RoLe, which supports only BERT-Tiny (2 encoders), PriTran introduces two novel algorithms with optimized data layouts that accelerate ciphertext–plaintext (CP) and ciphertext–ciphertext (CC) matrix multiplications (MMs) across all BERT models by reducing costly homomorphic rotations and multiplications. On the MNLI dataset, RoLe fails on inputs longer than 36 tokens within a 5-hour per-token budget, whereas PriTran achieves average speedups of 29.3% for CP-MMs, 22.2% for CC-MMs, and 24.1% end-to-end. We further evaluate PriTran on scaled BERT-Tiny variants with additional encoders and on BERT-Mini (4 encoders), demonstrating correctness and scalability beyond RoLe's limits. Together, these gains and RoLe's failure on longer inputs underscore PriTran's promise as a practical approach to FHE-based Transformer inference within current FHE constraints.
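To make the cost model concrete, the sketch below simulates, in plain Python, the classic diagonal (Halevi–Shoup) method that CKKS libraries commonly use for ciphertext–plaintext matrix–vector products. This is an illustrative assumption, not PriTran's actual algorithm or data layout: it only shows why slot rotations and plaintext multiplications dominate CP-MM cost, which are exactly the operations the abstract says PriTran reduces.

```python
# Illustrative only: a plain-Python stand-in for a CKKS CP matrix-vector
# product via the diagonal (Halevi-Shoup) method. The list `x` plays the
# role of an encrypted slot vector; `rotate` plays the role of a
# homomorphic Rotate; the per-diagonal multiply-add plays the role of a
# plaintext multiplication plus addition on ciphertext slots.
# This is NOT PriTran's optimized layout, just the baseline cost pattern.

def rotate(v, k):
    """Cyclic left rotation of the slot vector by k positions."""
    return v[k:] + v[:k]

def diag_matvec(A, x):
    """Compute A @ x with n rotations and n plaintext multiplications,
    where A is an n x n plaintext matrix and x is the (simulated)
    encrypted slot vector."""
    n = len(x)
    acc = [0.0] * n
    for d in range(n):
        # d-th generalized diagonal of A: diag_d[i] = A[i][(i + d) % n]
        diag_d = [A[i][(i + d) % n] for i in range(n)]
        rot = rotate(x, d)                                      # 1 rotation
        acc = [a + p * r for a, p, r in zip(acc, diag_d, rot)]  # 1 CP mult + add
    return acc
```

Under this baseline, an n x n CP-MM costs n rotations and n plaintext multiplications per slot vector; layout optimizations of the kind the paper describes aim to shrink precisely these counts.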