Automatic Differentiation
Automatic Differentiation (AD), also known as algorithmic differentiation or autodiff, is a technique for computing derivatives of computer programs. I find it a remarkably pleasant and interesting branch of computer science that combines several of my interests: symbolic processing, scientific computing, program transformation, numerical methods, and high performance computing.
Papers
A non-exhaustive list of papers I consider particularly interesting (and that came to mind while writing this list).
Introduction
- Automatic Differentiation in Machine Learning: a Survey (pdf) is an excellent introduction to AD in general, not just for machine learning. I recommend giving this paper to someone who wants to learn what AD is.
Concepts
- Perturbation confusion in forward automatic differentiation of higher-order functions (pdf) presents a common bug in the implementation of higher-order derivatives. I don't think it is common in modern compiler-based implementations of AD (largely because they do not operate on higher-order representations). A sketch of what the bug looks like is shown below.
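To make the phenomenon concrete, here is a minimal sketch (my own, not taken from the paper) of a naive dual-number forward-mode implementation in Python in which the confusion occurs. The Dual class and the d operator are purely illustrative.

```python
class Dual:
    """A dual number: a primal value plus a single, untagged tangent."""
    def __init__(self, primal, tangent=0.0):
        self.primal = primal
        self.tangent = tangent

    def _coerce(self, other):
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = self._coerce(other)
        return Dual(self.primal + other.primal, self.tangent + other.tangent)

    __radd__ = __add__

    def __mul__(self, other):
        other = self._coerce(other)
        return Dual(self.primal * other.primal,
                    self.primal * other.tangent + self.tangent * other.primal)

    __rmul__ = __mul__


def d(f, x):
    """Naive forward-mode derivative: seed a tangent of 1 and read it back.
    The tangent is not tagged with the invocation of d it belongs to, so
    nested uses confuse the inner and outer perturbations."""
    y = f(Dual(x, 1.0))
    return y.tangent if isinstance(y, Dual) else 0.0


# d/dx [ x * d/dy (x + y) ] at x = 1: the inner derivative is 1, so the
# outer derivative should also be 1, but the inner d picks up the
# perturbation on x as well and reports 2, making the final answer 2.
print(d(lambda x: x * d(lambda y: x + y, 1.0), 1.0))  # prints 2.0, not 1.0
```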
Implementations
- Instead of Rewriting Foreign Code for Machine Learning, Automatically Synthesize Fast Gradients (preprint pdf) is the basic paper on Enzyme, and is quite readable.
- Reverse-Mode Automatic Differentiation and Optimization of GPU Kernels via Enzyme (pdf) demonstrates Enzyme's handling of CUDA programs.
- Putting the Automatic Back into AD: Part I, What’s Wrong (pdf) describes the authors' efforts at implementing two fairly simple higher-order (nested) AD problems using various then-extant (2008) tools. This goes quite poorly for the tools based on source transformation, which often generate invalid code, and often do not support higher-order AD. I think the situation has improved since then, as state-of-the-art AD tools often handle nested AD just fine (although it may be a bit verbose sometimes), but the paper is an interesting view of the state of the art in 2008, and some of the criticism (due to high expectations!) is still relevant.
Benchmark suites
- ADBench is a very good polyglot AD benchmark suite developed at Microsoft Research, but unfortunately it was only maintained for a short while. It remains one of the best-engineered benchmark suites I have used (although in some cases perhaps over-engineered).
- GradBench is a spiritual successor to ADBench, which also aims to subsume all other AD benchmark suites. I contribute actively to it, and I believe GradBench has the best design of all current AD benchmark suites.
- cmpad is a benchmark suite by Brad Bell (of CppAD fame). As of this writing it supports C++ and Python.
- AD Rosetta Stone by Barak Pearlmutter (and presumably Jeffrey Siskind, as they do a lot of AD research together). This is not truly a benchmark suite; the goal is rather to have structurally similar implementations in various languages - hence, Rosetta Stone. However, it seems closely related to other work by the same authors that does focus on performance, as the programs used there are essentially these.
Terminology
Reverse over forward means first performing a forward-mode transformation, then applying reverse mode to the result of that. With the ideal API, this is something like `vjp (\x -> ... jvp ...)`. The ideal API is a term that I believe was coined by Barak Pearlmutter, and refers to exposing AD through two functions: Jacobian-vector product (`jvp`, forward mode) and vector-Jacobian product (`vjp`, reverse mode). This turns out to be a very simple and yet very powerful interface. It is used in such languages as Jax and Futhark, and is also present under various other names in other tools.
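As a concrete illustration of both terms, the sketch below uses JAX (whose `jax.jvp`/`jax.vjp` functions expose this interface) to compute a Hessian-vector product by reverse over forward. The particular function `f` and the `hvp` helper are made-up examples, not taken from any of the works above.

```python
import jax
import jax.numpy as jnp


def f(x):
    # An arbitrary scalar-valued example function.
    return jnp.sum(jnp.sin(x) * x)


def hvp(f, x, v):
    """Hessian-vector product of a scalar function f, computed as reverse
    over forward: vjp applied to a function that internally uses jvp."""
    def df_dot_v(x):
        # Forward mode (inner): directional derivative of f along v.
        _, tangent = jax.jvp(f, (x,), (v,))
        return tangent

    # Reverse mode (outer): pull a cotangent of 1 back through the
    # directional derivative, yielding H @ v.
    out, pullback = jax.vjp(df_dot_v, x)
    return pullback(jnp.ones_like(out))[0]


x = jnp.array([0.5, 1.0, 2.0])
v = jnp.array([1.0, 0.0, 0.0])
print(hvp(f, x, v))
print(jax.hessian(f)(x) @ v)  # same result via the full Hessian
```

Other orderings, such as forward over reverse, compute the same quantity; which one performs best depends on the tool and the program.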