FORTE: Online DataFrame Query Optimizer
DataFrame libraries are widely adopted in data science for their flexible, Pythonic interfaces, but their fragmented APIs and unstructured query patterns limit systematic optimization.
Existing work has explored parallel execution and SQL-style logical rewrites, yet these approaches fall short of capturing DataFrame-specific semantics and Python control-flow context. We present FORTE, the first online, source-to-source query optimizer that unifies multiple DataFrame libraries under a shared intermediate representation (DFL).
DFL makes DataFrame semantics explicit, enabling composable and portable rewriting rules such as user-defined function (UDF) lifting/lowering, loop lifting, and API tuning, alongside classical rewrites (e.g., predicate pushdown). FORTE uses a lightweight, learned cost model and greedy search to apply these rewrites with negligible overhead, and it supports both intra-library optimization and cross-library transpilation. Our evaluation on TPC-H and real-world Kaggle/GitHub workloads shows that FORTE consistently delivers substantial speedups, up to 52.53× (3.7× on average), across Pandas, Modin, Polars, and Pandas-on-Spark. These results demonstrate that online, IR-guided rewriting can significantly outperform existing DataFrame engines and rewriters while enabling cross-library retargetability.
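
To make the idea of cost-guided greedy rewriting concrete, the following is a minimal sketch and not FORTE's implementation. Everything in it is an assumption made for illustration: the tuple-based plan stands in for DFL, the two rules (`push_filter_before_sort`, `lift_udf`) are simplified stand-ins for predicate pushdown and UDF lifting, and the hand-written `estimate_cost` table stands in for the learned cost model.

```python
from typing import Callable, List, Optional, Tuple

# Toy plan: an ordered tuple of (operator, argument) steps.
# This is a hypothetical stand-in for an IR, not DFL itself.
Plan = Tuple[Tuple[str, str], ...]
Rule = Callable[[Plan], Optional[Plan]]


def push_filter_before_sort(plan: Plan) -> Optional[Plan]:
    """Predicate pushdown (simplified): run a filter before the sort it follows."""
    for i in range(len(plan) - 1):
        if plan[i][0] == "sort" and plan[i + 1][0] == "filter":
            return plan[:i] + (plan[i + 1], plan[i]) + plan[i + 2:]
    return None  # rule does not apply


def lift_udf(plan: Plan) -> Optional[Plan]:
    """UDF lifting (simplified): swap a row-wise UDF for a vectorized step."""
    for i, (op, arg) in enumerate(plan):
        if op == "apply_udf":
            return plan[:i] + (("vectorized", arg),) + plan[i + 1:]
    return None  # rule does not apply


# Assumed per-row operator costs and filter selectivity (made-up numbers
# standing in for a learned cost model).
UNIT_COST = {"scan": 1.0, "filter": 1.0, "sort": 5.0,
             "apply_udf": 20.0, "vectorized": 1.0}
FILTER_SELECTIVITY = 0.1


def estimate_cost(plan: Plan, rows: float = 1000.0) -> float:
    """Sum per-operator costs, shrinking the row count after each filter."""
    total = 0.0
    for op, _ in plan:
        total += UNIT_COST[op] * rows
        if op == "filter":
            rows *= FILTER_SELECTIVITY
    return total


def greedy_optimize(plan: Plan, rules: List[Rule]) -> Plan:
    """Repeatedly apply the single rewrite that lowers estimated cost the most."""
    best_cost = estimate_cost(plan)
    while True:
        candidates = [c for c in (rule(plan) for rule in rules) if c is not None]
        if not candidates:
            return plan
        best = min(candidates, key=estimate_cost)
        if estimate_cost(best) >= best_cost:
            return plan  # no rewrite improves the estimate; stop
        plan, best_cost = best, estimate_cost(best)


plan: Plan = (("scan", "orders"), ("sort", "price"),
              ("filter", "price > 10"), ("apply_udf", "add_tax"))
optimized = greedy_optimize(plan, [push_filter_before_sort, lift_udf])
print(optimized)
```

In this toy setting the greedy loop first moves the filter ahead of the sort and then replaces the row-wise UDF, because each step gives the largest drop in estimated cost at that point. A greedy strategy keeps search overhead low at the price of possibly missing rewrite sequences whose benefit only appears after an initially unprofitable step.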