# Automatic Differentiation

Automatic Differentiation (AD), also known as *algorithmic differentiation* or *autodiff*, is a technique for computing derivatives of computer programs. I find it a remarkably pleasant and interesting branch of computer science that combines several of my interests: symbolic processing, scientific computing, program transformation, numerical methods, and high performance computing.

## Papers

A nonexhaustive list of papers I consider particularly interesting (and that came to mind while writing the list).

### Introduction

* *Automatic Differentiation in Machine Learning: a Survey* ([pdf](https://jmlr.org/papers/volume18/17-468/17-468.pdf)) is an excellent introduction to AD in general, not just for machine learning. I recommend giving this paper to someone who wants to learn what AD is.

### Concepts

* *Perturbation confusion in forward automatic differentiation of higher-order functions* ([pdf](https://engineering.purdue.edu/~qobi/talk-videos/jfp2019.pdf)) presents a common bug in the implementation of higher-order derivatives (a sketch of the bug follows below). I don't think it is common in modern compiler-based implementations of AD (largely because they do not operate on higher-order representations).
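To make the bug concrete, here is a minimal sketch (plain Python, not taken from the paper) of an untagged dual-number implementation falling into perturbation confusion on the classic nested example: d/dx [x * d/dy (x + y)] at x = y = 1 should be 1, but the inner and outer perturbations get mixed up and the sketch prints 2.

```python
class Dual:
    """Naive forward-mode dual number with no perturbation tag."""
    def __init__(self, primal, tangent=0.0):
        self.primal = primal
        self.tangent = tangent

    def _lift(self, x):
        return x if isinstance(x, Dual) else Dual(x)

    def __add__(self, other):
        other = self._lift(other)
        return Dual(self.primal + other.primal,
                    self.tangent + other.tangent)

    __radd__ = __add__

    def __mul__(self, other):
        other = self._lift(other)
        return Dual(self.primal * other.primal,
                    self.primal * other.tangent + self.tangent * other.primal)

    __rmul__ = __mul__


def deriv(f, x):
    """Derivative of f at x via one forward-mode sweep."""
    return f(Dual(x, 1.0)).tangent


# The inner derivative d/dy (x + y) is constantly 1, so the whole
# expression should differentiate to 1. Because both sweeps use the
# same untagged perturbation, the outer seed leaks into the inner
# derivative, and this prints 2.0 instead of 1.0.
print(deriv(lambda x: x * deriv(lambda y: x + y, 1.0), 1.0))
```

Tagging each perturbation (or separating the sweeps at compile time) keeps the two derivatives apart and yields the correct 1.0.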
### Implementations

* *Instead of Rewriting Foreign Code for Machine Learning, Automatically Synthesize Fast Gradients* ([preprint pdf](https://arxiv.org/pdf/2010.01709)) is the basic paper on [Enzyme](https://enzyme.mit.edu/), and is quite readable.
* *Reverse-Mode Automatic Differentiation and Optimization of GPU Kernels via Enzyme* ([pdf](https://dl.acm.org/doi/pdf/10.1145/3458817.3476165)) demonstrates Enzyme's handling of CUDA programs.
* *Putting the Automatic Back into AD: Part I, What’s Wrong* ([pdf](https://docs.lib.purdue.edu/cgi/viewcontent.cgi?article=1369&context=ecetr)) describes the authors' efforts at implementing two fairly simple higher-order (nested) AD problems using various then-extant (2008) tools. This goes quite poorly for the tools based on source transformation, which often generate invalid code and often do not support higher-order AD. I think the situation has improved since then, as state-of-the-art AD tools often handle nested AD just fine (although it may be a bit verbose sometimes), but the paper is an interesting view of the state of the art in 2008, and some of the criticism (due to high expectations!) is still relevant.

## Benchmark suites

* [ADBench](https://github.com/microsoft/ADBench) is a very good polyglot AD benchmark suite developed at Microsoft Research, but unfortunately it was only maintained for a short while. It remains one of the best engineered benchmark suites I have used (although in some cases perhaps over-engineered).
* [GradBench](https://gradben.ch/) is a spiritual successor to ADBench, which also aims to subsume all other AD benchmark suites. I contribute actively, and I believe GradBench has the best design of all current AD benchmark suites.
* [cmpad](https://github.com/bradbell/cmpad) is a benchmark suite by [Brad Bell](https://www.seanet.com/~bradbell) (of CppAD fame). As of this writing it supports C++ and Python.
* [AD Rosetta Stone](https://github.com/qobi/AD-Rosetta-Stone) by Barak Pearlmutter (and presumably [Jeffrey Siskind](https://engineering.purdue.edu/~qobi/), as they do [a lot of AD research together](https://engineering.purdue.edu/~qobi/tweet-talks/)). This is not truly a benchmark suite, but rather a collection of structurally similar implementations in various languages - hence, [Rosetta Stone](https://en.wikipedia.org/wiki/Rosetta_Stone). However, this work seems closely related to other work that does focus on performance, as the programs are [essentially these](https://engineering.purdue.edu/~qobi/stalingrad-examples2009/).

## Terminology

* *Reverse over forward* means first performing a forward-mode transformation, then applying reverse mode to the result of that. With the *ideal API*, this is something like `vjp (\x -> ... jvp ...)` (see the sketch after this list).
* *The ideal API* is a term that I believe was coined by [Barak Pearlmutter](https://www.maynoothuniversity.ie/faculty-science-engineering/our-people/barak-pearlmutter). It refers to exposing AD through two functions: *Jacobian-Vector Product* (`jvp`, forward mode) and *Vector-Jacobian Product* (`vjp`, reverse mode). This turns out to be a very simple and yet very powerful interface. It is used in such languages as JAX and Futhark, and is also present under various other names in other tools.
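As a concrete illustration, here is a small sketch using JAX's `jax.jvp` and `jax.vjp` (the function `f` and the input values are arbitrary examples): one forward-mode call, one reverse-mode call, and a reverse-over-forward composition that yields a Hessian-vector product.

```python
import jax
import jax.numpy as jnp


def f(x):
    # An arbitrary scalar-valued example function.
    return jnp.sum(jnp.sin(x) * x)


x = jnp.arange(3.0)
v = jnp.ones(3)

# Forward mode: Jacobian-vector product, returning f(x) and J(x) @ v.
y, jv = jax.jvp(f, (x,), (v,))

# Reverse mode: vjp returns f(x) and a pullback mapping an output
# cotangent u to u @ J(x); for scalar f, a cotangent of 1 gives the gradient.
y, f_vjp = jax.vjp(f, x)
(grad_x,) = f_vjp(jnp.ones_like(y))

# Reverse over forward: apply vjp to a function that internally uses jvp.
# The result is a Hessian-vector product H(x) @ v.
def hvp(f, x, v):
    g = lambda x: jax.jvp(f, (x,), (v,))[1]  # forward-mode directional derivative
    gy, g_vjp = jax.vjp(g, x)
    return g_vjp(jnp.ones_like(gy))[0]

print(jv, grad_x, hvp(f, x, v))
```

Part of the appeal of the ideal API is that `jvp` and `vjp` compose like ordinary functions, so nestings such as reverse over forward need no special machinery.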