Automatic Differentiation
Automatic Differentiation (AD), also known as algorithmic differentiation or autodiff, is a technique for computing derivatives of computer programs. I find it a remarkably pleasant and interesting branch of computer science, as it combines several of my interests: symbolic processing, scientific computing, program transformation, numerical methods, and high performance computing.
Papers
A non-exhaustive list of papers I consider particularly interesting (and happened to think of while writing this list).
Introduction
- Automatic Differentiation in Machine Learning: a Survey (pdf) is an excellent introduction to AD in general, not just for machine learning. I recommend giving this paper to someone who wants to learn what AD is.
Concepts
- Perturbation confusion in forward automatic differentiation of higher-order functions (pdf) presents a common bug in the implementation of higher order derivatives. I don't think it is common in modern compiler-based implementations of AD (largely because they do not operate on higher order representations).
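To make the bug concrete, here is a minimal sketch (my own toy Python example, not taken from the paper or from any real AD library) of a naive dual-number implementation that exhibits perturbation confusion: the inner and outer perturbations are not distinguished, so the nested derivative comes out wrong.

```python
# A naive, untagged forward-mode dual-number implementation (hypothetical,
# for illustration only). It has no way to tell apart the perturbations
# introduced by nested calls to `derivative`.
class Dual:
    def __init__(self, primal, tangent=0.0):
        self.primal = primal
        self.tangent = tangent

    @staticmethod
    def lift(x):
        return x if isinstance(x, Dual) else Dual(x)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.primal + other.primal, self.tangent + other.tangent)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual.lift(other)
        return Dual(self.primal * other.primal,
                    self.primal * other.tangent + self.tangent * other.primal)

    __rmul__ = __mul__

def derivative(f, x):
    # Seed the single, untagged perturbation with 1 and read it back out.
    return Dual.lift(f(Dual(x, 1.0))).tangent

# d/dx (x * d/dy (x + y)) at x = 1 should be 1, since the inner derivative is
# exactly 1. Because the two perturbations are indistinguishable, the inner
# call also picks up x's perturbation and the result comes out as 2.
print(derivative(lambda x: x * derivative(lambda y: x + y, 1.0), 1.0))  # 2.0
```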
Implementations
- Instead of Rewriting Foreign Code for Machine Learning, Automatically Synthesize Fast Gradients (preprint pdf) is the basic paper on Enzyme, and is quite readable.
- Reverse-Mode Automatic Differentiation and Optimization of GPU Kernels via Enzyme (pdf) demonstrates Enzyme's handling of CUDA programs.
- Putting the Automatic Back into AD: Part I, What’s Wrong (pdf) describes the authors' efforts at implementing two fairly simple higher order (nested) AD problems using various then-extant (2008) tools. This goes quite poorly for the tools based on source transformation, which often generate invalid code, and often do not support higher order AD. I think the situation has improved since then, as state-of-the-art AD tools often handle nested AD just fine (although it may be a bit verbose sometimes), but the paper is an interesting view of the state of the art in 2008, and some of the criticism (due to high expectations!) is still relevant.
- Provably Correct, Asymptotically Efficient, Higher-Order Reverse-Mode Automatic Differentiation (pdf) is a paper whose title mentions most of what one might desire in an AD tool, and it does largely deliver. In particular, the first part is an eminently readable exposition of how to build up a reverse mode AD transformation from a PL perspective. The idea, at a high level, is to form a tape that is a composition of linear maps, then show how that composition can be implemented efficiently (a toy sketch of this idea follows below). The paper is mainly theoretical, but the ideas are largely already implemented as the Haskell library ad, and an extension (which is supposedly going to be much more efficient) is being developed as horde-ad. I would say the main downside of the approach is that the transformation does not nest (the output language is not the same as the input language), so higher order derivatives are not straightforward, although the paper does mention some ideas for resolving this.
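As a rough illustration of the high-level idea (a toy Python sketch of my own, not the paper's construction, and eliding everything that makes the real thing efficient and correct in general): the forward sweep records one linear map per primitive operation, and the reverse sweep applies their adjoints in the opposite order.

```python
import math

def f_with_linear_maps(x):
    # f(x) = sin(x)**2, decomposed into primitives. Each step also records the
    # linear map that is its derivative at the value just computed.
    steps = []
    y = math.sin(x)
    steps.append(lambda dx, x=x: math.cos(x) * dx)  # derivative of sin at x
    z = y * y
    steps.append(lambda dy, y=y: 2.0 * y * dy)      # derivative of (.)**2 at y
    return z, steps

def vjp(steps, cotangent):
    # For scalars the adjoint of a linear map is the map itself; apply the
    # maps outermost-first, i.e. in reverse order of the forward sweep.
    for lin in reversed(steps):
        cotangent = lin(cotangent)
    return cotangent

z, steps = f_with_linear_maps(0.5)
print(vjp(steps, 1.0))  # 2 * sin(0.5) * cos(0.5)
```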
Benchmark suites
- ADBench is a very good polyglot AD benchmark suite developed at Microsoft Research, but unfortunately it was only maintained for a short while. It remains one of the best engineered benchmark suites I have used (although in some cases perhaps over-engineered).
- GradBench is a spiritual successor to ADBench, which also aims to subsume all other AD benchmark suites. I contribute actively and I believe GradBench has the best design out of all current AD benchmark suites.
- cmpad is a benchmark suite by Brad Bell (of CppAD fame). As of this writing it supports C++ and Python.
- AD Rosetta Stone by Barak Pearlmutter (and presumably Jeffrey Siskind, as they do a lot of AD research together). This is not truly a benchmark suite, but rather a collection of structurally similar implementations in various languages - hence, Rosetta Stone. However, it seems closely related to other work that does focus on performance, as the programs used there are essentially these.
Terminology
Reverse over forward (or reverse on forward) means first performing a forward-mode transformation, then applying reverse mode to the result of that. With the ideal API, this is something like `vjp (\x -> ... jvp ...)`.

The ideal API is a term that I believe was coined by Barak Pearlmutter. It refers to exposing AD through two functions: Jacobian-vector product (`jvp`, forward mode) and vector-Jacobian product (`vjp`, reverse mode). This turns out to be a very simple and yet very powerful interface. It is used in languages such as JAX and Futhark, and is also present under various other names in other tools.

The Jacobian of a function with n inputs and m outputs is a matrix with m rows and n columns. Forward mode computes one column of the Jacobian, and reverse mode computes one row.
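For concreteness, here is a small JAX sketch of what this looks like in practice (the function `f`, the scalar function `g`, and the seed vectors are just illustrative choices of mine):

```python
import jax
import jax.numpy as jnp

def f(x):
    # f : R^3 -> R^2, so the Jacobian is a 2x3 matrix.
    return jnp.stack([x[0] * x[1], jnp.sin(x[2])])

x = jnp.array([1.0, 2.0, 3.0])

# Forward mode: seeding jvp with a basis vector of the input space yields one
# column of the Jacobian (here column 0).
_, col0 = jax.jvp(f, (x,), (jnp.array([1.0, 0.0, 0.0]),))

# Reverse mode: seeding vjp with a basis vector of the output space yields one
# row of the Jacobian (here row 1).
_, pullback = jax.vjp(f, x)
row1, = pullback(jnp.array([0.0, 1.0]))

# Reverse over forward: reverse mode applied to the result of a forward-mode
# transformation, e.g. a Hessian-vector product of a scalar-valued function g.
def g(x):
    return jnp.sum(jnp.sin(x) * x)

def hvp(g, x, v):
    return jax.grad(lambda x: jax.jvp(g, (x,), (v,))[1])(x)

print(col0)                                   # [x1, 0] = [2, 0]
print(row1)                                   # [0, 0, cos(x2)]
print(hvp(g, x, jnp.array([1.0, 0.0, 0.0])))
```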
A sparse Jacobian is a Jacobian that has (usually many) zero entries.
Cross country is the idea of using both forward and reverse mode to compute rows and columns of a sparse Jacobian, such that eventually all nonzero entries have been computed. Finding the optimal evaluation order is nontrivial.
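As a toy illustration (my own example, not taken from any of the papers above), consider a 4x4 "arrowhead" Jacobian with a dense first row, a dense first column, and a diagonal elsewhere. One reverse pass, one forward pass, and one forward pass with a compressed seed recover all nonzero entries, whereas pure forward or pure reverse mode would each need four passes:

```python
import jax
import jax.numpy as jnp

def f(x):
    # The Jacobian has a dense first row, a dense first column, and is
    # otherwise diagonal: 4x4 with only 10 nonzero entries.
    return jnp.concatenate([jnp.sum(x, keepdims=True), x[0] * x[1:]])

x = jnp.array([2.0, 3.0, 4.0, 5.0])

# One reverse pass: row 0 of the Jacobian.
_, pullback = jax.vjp(f, x)
row0, = pullback(jnp.array([1.0, 0.0, 0.0, 0.0]))

# One forward pass: column 0 of the Jacobian.
_, col0 = jax.jvp(f, (x,), (jnp.array([1.0, 0.0, 0.0, 0.0]),))

# One more forward pass with a "compressed" seed recovers the remaining
# diagonal entries, since rows 1..3 each have a single nonzero among
# columns 1..3 (entry 0 of the result is ignored; we already know row 0).
_, diag = jax.jvp(f, (x,), (jnp.array([0.0, 1.0, 1.0, 1.0]),))

print(row0)      # [1, 1, 1, 1]
print(col0)      # [1, x1, x2, x3] = [1, 3, 4, 5]
print(diag[1:])  # [x0, x0, x0]    = [2, 2, 2]
```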
Tape, trace, and Wengert list mean the same thing: data recorded during the forward sweep of reverse mode, which is then used during the return sweep.
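A minimal operator-overloading sketch (hypothetical, not modelled on any particular library) of how such a tape can work: the forward sweep records one entry per operation, and the return sweep walks the recording backwards while accumulating adjoints.

```python
class Var:
    def __init__(self, value, tape):
        self.value = value
        self.adjoint = 0.0
        self.tape = tape

    def __mul__(self, other):
        out = Var(self.value * other.value, self.tape)
        # Record which inputs contributed, and their partial derivatives.
        self.tape.append((out, [(self, other.value), (other, self.value)]))
        return out

    def __add__(self, other):
        out = Var(self.value + other.value, self.tape)
        self.tape.append((out, [(self, 1.0), (other, 1.0)]))
        return out

def grad(f, x_value):
    tape = []
    x = Var(x_value, tape)
    y = f(x)          # forward sweep: evaluates f and records the tape
    y.adjoint = 1.0
    # Return sweep: propagate adjoints in reverse recording order.
    for out, inputs in reversed(tape):
        for inp, partial in inputs:
            inp.adjoint += partial * out.adjoint
    return x.adjoint

# d/dx (x*x + x) at x = 3 is 2*3 + 1 = 7.
print(grad(lambda x: x * x + x, 3.0))  # 7.0
```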