The GradBench Benchmark Suite for Automatic Differentiation
This is a post about the GradBench benchmark suite initiated by Sam Estep. It is a companion post to my presentation at EuroAD 2025, and contains material and elaboration that was not a good fit for the presentation. I'm merely a GradBench contributor, so everything below is my opinion, and although it is well-reasoned and correct, like anything else on this website, it is not the official opinion of GradBench, whatever that might mean. With this post, I hope to achieve the following:
If you are an AD user, or just interested, convince you to contribute benchmark implementations.
If you are the implementer of some AD tool, convince you to add benchmark implementations using your tool, and to use GradBench's implementations in your own work.
If you are a person with a suitable computer, convince you to help test that GradBench runs for you.
Background
In applied computer science, an important part of developing a new tool or algorithm is to compare it against existing work. (Some use the word "competitors", but I feel this sounds a bit too antagonistic.) Qualitative comparisons purely based on the texts of papers can only go so far: a good comparison requires you to solve the same problems using both your new tool and existing tools, and comparing the outcomes. This comparison can be both quantitative ("how fast is it?") and qualitative ("is it easy or elegant to use?"), but it is in most cases also time consuming. It takes significant effort to faithfully solve some problem using different tools, and one is unlikely to be an expert in all of them. At the end, you are often left with some doubt about whether you truly represented your competitors (there's that word again...) in the best way - and your peers will wonder the same!
A solution to this problem is to have a benchmark suite: a collection of problems, along with their solutions using various existing tools. When proposing some new tool, you then only have to solve these hopefully well-described problems once, with your new tool, and compare the result with the existing solutions. Some desirable properties of a benchmark suite include:
1. It should contain a variety of problems that are qualitatively different, in order to have a good likelihood of demonstrating the strengths and weaknesses of any given tool.
2. The implementations included should be of high quality.
3. It should be easy to run, and there should be confidence that the results are meaningful (i.e., you need validation of results).
4. It should be easy to add a new implementation or problem.
While these are technical properties, some of them are best ensured using a combination of technical and social means. In particular, it is unlikely that any single person will be an expert on all tools available in some community, so property (2) is best ensured by making it easy for experts to contribute improvements as they are able. This implies lowering the barrier to entry, particularly by avoiding complicated setup requirements. The most useful benchmark suites are communally maintained by the communities they serve.
ADBench
Through Cosmin Oancea, I became interested in automatic differentiation some years ago, and it remains a research area that I find deeply fascinating. Our main research activity concerned the development of an AD transformation for Futhark. To evaluate the effectiveness of our work, we implemented problems from ADBench, a benchmark suite developed at Microsoft Research in roughly 2018-2020. Compared to the benchmark suites I had previously encountered (mostly for parallel programming), ADBench was much better engineered - among other things, it had a sensible and standardised interface for validation and performance measurements. It was relatively easy to write new code and slot it into the infrastructure. This may seem like a very basic expectation, like praising a chef for washing their hands, but many widely used benchmark suites fail to reach this level.
ADBench consists of four benchmarks implemented in over a dozen "tools" (a term covering both languages and libraries). A paper describes three of the benchmarks (the fourth, lstm, seems to have been added later), and there is also ample documentation, as well as scripts for plotting results, and so on. ADBench has seen fairly wide use in the AD community, and many papers use problems or code taken from ADBench.
Unfortunately, ADBench is not perfect. The biggest problem is that development ceased in 2021 and the repository was marked as "archived" in 2024. Since ADBench was (and remains) free software, one option is to fork it and maintain it ourselves. However, while ADBench was certainly the best engineered benchmark suite of its time, it still has two significant problems.
The first problem is that the architecture is tightly coupled. The various tools and infrastructure programs have a hard-coded idea of which benchmark problems exist, and how they should be handled. Adding a new benchmark problem requires you to make a lot of little changes in many different places, including modifying those tools for which you don't actually intend to implement that benchmark (either now or ever). It is also tricky to add tools in new languages, as you are required to implement I/O routines for various nonstandard data formats - unless your tool is in C++ or Python (or can pretend to be), in which case you can use some shared code. Finally, a bunch of the automation (which you must modify, remember!) is written in PowerShell, which I find difficult to modify, without it being materially better (at least for this purpose) than more common languages.
The second problem with ADBench is that it is difficult to "run everything". ADBench is a polyglot benchmark suite with tools written in different languages or using exotic libraries. This is good, but it is not good when users have to manually set up an environment where all such dependencies are satisfied. There is a Dockerfile that installs a subset of the necessary dependencies (those needed for the major tools, essentially), but it has the problem that all dependencies must coexist within the same Docker image, including ADBench's own infrastructure code. This can cause serious trouble if two tools require mutually exclusive dependencies (e.g. different versions of some library or compiler), or worse yet, are incompatible with the dependencies of the ADBench infrastructure itself. This is a problem that is hard to solve, but solving it is crucial to making a polyglot benchmark suite (one with a rich diversity of languages) viable - and this is something I personally care about a great deal.
The Design of GradBench
In 2024, Sam Estep started the GradBench project - an effort to develop a new benchmark suite for automatic differentiation. While the initial goal was to reach parity with ADBench (by porting its benchmarks), the overall design was rather different, and more ambitious - in particular, it is intended to allow benchmarks that measure things other than raw numerical throughput. However, for the purposes of this post (and my own research interests), the interesting part about GradBench is its highly decoupled design.
GradBench is built around benchmarks, called evals, which communicate with tools through a simple message-passing protocol based on JSON Lines, transmitted over standard input and output. The eval sends commands to the tool (such as "run the function with this name on this input"), and the tool responds with output and runtime information, which the eval then verifies. The messages pass through another program, the intermediary, which inspects the messages and uses them to print a human-readable summary of what is happening. The raw log is also available and can be processed by scripts to perform plotting or any other analysis of interest.
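To give a concrete feel for the shape of this exchange, here is a minimal sketch (in Python) of what a tool process might look like. The message fields and kinds used below ("id", "kind", "input", and so on) are illustrative assumptions only - the protocol specification in the repository is the authoritative reference.

import json
import sys
import time

# Minimal sketch of a tool process: read one JSON message per line from stdin,
# reply with one JSON message per line on stdout. Field names are assumptions
# for illustration, not the actual GradBench protocol.
for line in sys.stdin:
    message = json.loads(line)
    if message.get("kind") == "evaluate":
        start = time.perf_counter()
        result = message["input"] ** 2  # pretend to be the hello "square" function
        elapsed_ns = int((time.perf_counter() - start) * 1e9)
        response = {"id": message["id"], "output": result,
                    "timings": [{"name": "evaluate", "nanoseconds": elapsed_ns}]}
    else:
        response = {"id": message["id"]}  # acknowledge other messages
    print(json.dumps(response), flush=True)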
The tool is the simplest component in all this - it just does as it is told, and there are very few things it can be told to do. Evals are a little more complicated, as they are in control of workloads and validation. The intermediary is by far the most complicated part, not just because it has to deal with a user interface, but also because it must orchestrate execution of the eval and tool, and handle things like timeouts or protocol violations. This complexity gradient is as it should be: we expect a large number of tools to be written, a smaller number of evals, and only one intermediary to ever exist. The tools and evals can be written in whatever language is expedient, as long as they implement the protocol. The intermediary itself is written in Rust, as a program called gradbench.
The interesting part of this design is that due to the extremely simple way the evals and tools communicate with the outside world (exclusively stdin/stdout), it is very easy to run them inside containers such as Docker images. This allows the use of arbitrarily exotic dependencies, as long as they can be described in a Dockerfile. And indeed, although GradBench does not require the eval/tool processes to run inside Docker, all included evals and tools are supplied alongside Dockerfiles, and the intermediary itself has convenience commands for building and running them.
As an example, assuming one has a working Rust compiler (to compile the gradbench CLI itself, although you could also download a precompiled version), this is how to run the hello eval with the enzyme tool:
$ ./gradbench run --eval './gradbench repo eval hello' --tool './gradbench repo tool enzyme'
...
[0] start hello (enzyme)
[1] def hello 4.531 s ✓
[2] eval hello::square 1.0 4ms ~ 0ms prepare, 0ms evaluate ✓
[4] eval hello::double 1.0 5ms ~ 0ms prepare, 0ms evaluate ✓
[6] eval hello::square 2.0 4ms ~ 0ms prepare, 0ms evaluate ✓
[8] eval hello::double 4.0 4ms ~ 0ms prepare, 0ms evaluate ✓
[10] eval hello::square 8.0 5ms ~ 0ms prepare, 0ms evaluate ✓
[12] eval hello::double 64.0 4ms ~ 0ms prepare, 0ms evaluate ✓
[14] eval hello::square 128.0 5ms ~ 0ms prepare, 0ms evaluate ✓
[16] eval hello::double 16384.0 4ms ~ 0ms prepare, 0ms evaluate ✓
The "..." part may consist of a large amount of Docker build output,
depending on whether the images are already available on a given
machine. While the hello
eval is (as the name implies) not an
interesting benchmark, AD veterans will recognise that
Enzyme is not exactly trivial to set up
since it depends on specific versions of LLVM, and yet that is all
hidden by the GradBench automation.
The gradbench CLI accepts various extra options, such as where to write the raw log messages, as well as whether to impose a timeout on tool responses. The --eval and --tool arguments are passed shell commands that run the eval and tool processes respectively. In this case, the ./gradbench repo eval and ./gradbench repo tool convenience commands automate the boilerplate of building and running Docker images. If we wish to run eval and tool processes outside of Docker (perhaps for direct hardware access, or because Docker can be inconvenient during development), we can just pass some other command. Further, the tools and evals can accept additional options to control their behaviour (e.g. which workloads to use), although the defaults are supposed to be sensible.
What GradBench provides today
As of this writing, GradBench contains 11 benchmarks and 17 different tools, with a total of 109 implementations. The coverage is somewhat inconsistent; however, there are benchmarks that are implemented by almost every tool, and some tools that implement every benchmark.
The hello benchmark is the only one to be implemented by every tool, but it is not a very useful benchmark, as the problem is simply x², and every tool can handle this perfectly. It mostly exists as a way to test that the basic infrastructure is working. The zygote and floretta tools only implement hello, but we expect this to improve.
Once we move away from such pathological cases, GradBench does offer a compelling set of implementations. It contains ports of all of the ADBench problems, namely gmm, ba, lstm, and ht. These are each implemented with a minimum of 10 tools, and gmm is implemented with 14. This means implementing one of these with your fancy new tool is a pretty easy way to quickly be able to compare with a lot of prior work.
GradBench also contains ports of all of the benchmarks from cmpad, as well as two of the problems from the AD Rosetta Stone.
The currently implemented tools include a variety of C++ libraries, including both classics (Adept, CppAD, ADOL-C) and newer libraries such as CoDiPack. All benchmarks have also been implemented in Enzyme, using the C++ frontend. Apart from this rich bouquet of C++, GradBench also provides implementations in a variety of more exotic languages. Notably, Futhark (in which I have a particular interest) implements every benchmark, but there are also implementations in Haskell, OCaml, and a variety of Python libraries.
Two of the GradBench tools are not truly AD tools: manual and finite. The manual tool contains programs that have been differentiated by hand. We expect that in most cases, the hand-differentiated versions should be the fastest, as they may exploit mathematical properties that it is not reasonable to expect of an AD tool. However, these are only algorithmic improvements: a tool may beat manual through operational advantages, such as efficient implementations of primitives like matrix multiplication, parallel execution, etc., that the manual implementation does not exploit. It is not expected that manual will contain hand-differentiated versions of all evals; AD is after all most useful in those cases where hand-differentiation is impractical. The finite tool simply uses finite differences to compute the derivatives. This is certainly convenient, but usually much slower than AD.
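For illustration only (this is not GradBench's actual finite implementation), the idea behind finite differences is the textbook central-difference approximation:

import numpy as np

# Generic central-difference gradient: perturb each input coordinate by a small
# step h and approximate the partial derivative from the function values.
def finite_diff_gradient(f, x, h=1e-6):
    x = np.asarray(x, dtype=float)
    grad = np.empty_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

print(finite_diff_gradient(lambda v: (v ** 2).sum(), [1.0, 2.0, 3.0]))  # roughly [2, 4, 6]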
My estimate is that GradBench is currently the largest benchmark suite for automatic differentiation. The closest competitor would be ADBench itself, mostly because ADBench still contains implementations in some languages and libraries that are not in GradBench (in particular, various flavours of MATLAB).
All of the benchmarks naturally come with fully automated validation of results, and timing code that is at least somewhat reliable.
Contributing to GradBench
A person with an interest in AD can contribute to GradBench by improving the current benchmark implementations, adding entirely new implementations, adding new tools, or adding new evals. I will now explain how to do each of these. If you have any trouble, you are welcome to contact us for help, either via GitHub or the Discord platform for automatic differentiation. It may also be a good idea to read CONTRIBUTING.md.
Improving an implementation
Each coloured square on gradben.ch corresponds to an implementation of a benchmark in some tool. Some of them are known to be good, some of them we think are good, and some of them we know are bad. It would be good to have tool experts take a look and either make an improvement, or let us know that they think the implementation is good. It is not important for an implementation to be as fast as it can possibly be - GradBench is supposed to demonstrate high-quality and idiomatic code. Think comparison, not competition - it's not about who can get the smallest number at any cost. This does require good faith on behalf of contributors, but given the prize that is at stake (nothing), I am not too worried.
For these tools I think the implementations are good, but I am not completely sure: adept, adol-c, codipack, and cppad.
I am quite sure that the tensorflow implementations of gmm and ba are too slow, and so are the pytorch implementations of ht and lstm. It would be very good if someone familiar with these libraries could take a look - this probably does not require deep expertise.
Where to find the implementations depends on the tool, although a tool foo will always have a file tools/foo/Dockerfile that shows where the code is located. For the C++ implementations, the code is usually in the tools/ directory itself, which contains a subdirectory for each tool. For example, this is the cppad implementation of gmm. For the Python-based tools, the tools/ directory does not contain the actual implementation code; instead it is in python/gradbench/gradbench/tools.
After making a modification, you can run your changes using the gradbench CLI as shown above. It can be a bit awkward to do this, as rebuilding the Docker images may take a little while, depending on your changes. It is possible to run the tool outside Docker - the specifics vary based on the tool, but here is the command for running the enzyme tool:
$ python3 python/gradbench/gradbench/cpp.py enzyme
You would pass this command as the --tool argument to gradbench run. All of the C++-based tools can be run like this. It does of course require you to set up the necessary dependencies yourself (the shell.nix can help with this, but now we're getting too far afield).
Adding an implementation for an existing tool
For some tools, the infrastructure has been built (speaking the protocol, writing the Dockerfile), but not yet implementations of all benchmarks. Sometimes this is because we have not gotten around to it, but at other times it is because those benchmarks require something that is tricky to do in a specific tool.
To add a new implementation, the easiest approach is to pattern match against an existing implementation. For the C++ tools, you need to add a program foo.cpp, where foo is the name of the benchmark. I suggest looking at the Enzyme implementations for the boilerplate input/output reading code, since all benchmarks have been implemented in Enzyme.
As of this writing, we are missing implementations of the particle and saddle benchmarks in all of the C++ tools based on tape recording. Note that these are quite fiddly benchmarks, so perhaps not a very motivating place to start.
We lack an implementation of kmeans in adept - largely because kmeans requires computing a Hessian, which Adept claims not to support well (but one can probably make the bear dance somehow). We also lack cppad implementations of ba and ht - I think these are not so difficult to do.
Our benchmark implementations in tensorflow, pytorch, and jax are also still quite spotty - these are pretty robust tools, so I think improving coverage is not so difficult for someone sufficiently versed in their mysteries.
Generally, look at the missing tag on the GitHub issue tracker to find missing implementations.
Adding a new tool
Adding a new tool is a bit more laborious. At a basic level, a tool is an appropriately named directory in the tools/ directory containing a Dockerfile that, when built and run, behaves like a tool process as specified in the protocol.
If the new tool you want to add is a C++ or Python library, then you are in luck - you can piggyback on the existing implementations of the protocol. Otherwise, you will have to implement it yourself. If you have access to a JSON library in your chosen language, this is not so difficult. Using gradbench with the -o option, to make it dump the raw message log to a file, is a good way to debug errors in the protocol implementation. Even if your program is not written in Python, you may still find it beneficial to use the Python implementation of the protocol, and then internally execute your program(s) using some bespoke mechanism. That is in fact how the C++ tools work.
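As a rough sketch of that wrapping idea (the message fields and the ./my-tool executable below are hypothetical, and the real C++ tools reuse GradBench's Python protocol code rather than a hand-rolled loop like this), the structure is roughly:

import json
import subprocess
import sys

# Forward each "evaluate" request to an external program that reads JSON input
# on stdin and writes JSON output (including timings) on stdout. The field
# names and the ./my-tool binary are hypothetical.
for line in sys.stdin:
    message = json.loads(line)
    if message.get("kind") == "evaluate":
        proc = subprocess.run(["./my-tool", message["function"]],
                              input=json.dumps(message["input"]),
                              capture_output=True, text=True, check=True)
        response = {"id": message["id"], **json.loads(proc.stdout)}
    else:
        response = {"id": message["id"]}
    print(json.dumps(response), flush=True)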
Once you have implemented the basic boilerplate, and have a working implementation of hello, you can continue adding more evals as discussed above. I recommend starting with llsq, as it is very simple.
Due to GradBench's decoupled design, tools have a low ongoing maintenance cost, and the project is therefore very open to accepting incomplete or experimental tools. In fact, the only real requirement is that you can write a Dockerfile that sets up the tool in a reliable way. My personal ambition is for GradBench to become as polyglot as possible, so I actively encourage everyone to submit implementations in the weirdest tools they can find. In particular, GradBench has explicit support for tools that are unable to handle all workloads for a given benchmark. It is acceptable and expected that some tools will be unable to handle the largest workloads within the time allotment (currently 10 minutes), and this can be explicitly indicated. The tool will still be considered successful, and the workloads for which it does produce a result will still be part of the published graphs (which I will discuss below). This is intended to make GradBench welcoming to tools that are not focused on numerical performance in absolute terms, or have not yet reached the phase in their development where the desired performance has been achieved.
Adding a new benchmark
Adding a benchmark (or eval in GradBench-lingo) is the most laborious form of contribution. A benchmark must be specified in a way that is clear enough for others to understand it (you can judge for yourself to which extent we have succeeded so far), come with some validation mechanism, and also have at least a couple of implementations using various tools.
Similarly to tools, an eval is specified by a subdirectory in evals/ that behaves like an eval process as specified in the protocol. There is no real limit to what an eval can do, although in most cases it will send the start message, then a define message, then evaluate messages with various functions and inputs. All of the GradBench evals are currently written in Python - this is not a hard requirement, but since evals are not performance-sensitive or particularly complicated, writing them in Python means you can reuse existing utility libraries.
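To make that concrete, here is a heavily simplified sketch of an eval in Python. The message fields are again assumptions for illustration; real evals use the utility libraries in the repository and run many workloads with proper validation.

import json
import sys

# Send one JSON message per line on stdout and wait for the tool's reply on
# stdin. Field names are illustrative assumptions, not the protocol spec.
def send(message):
    print(json.dumps(message), flush=True)
    return json.loads(sys.stdin.readline())

send({"id": 0, "kind": "start", "eval": "hello"})
send({"id": 1, "kind": "define", "module": "hello"})
response = send({"id": 2, "kind": "evaluate", "module": "hello",
                 "function": "square", "input": 2.0})
assert abs(response["output"] - 4.0) < 1e-12  # validate the tool's answer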
Beyond the technical effort of specifying and implementing a benchmark, another question is which benchmarks are worthwhile. The whole point of GradBench is comparison, so a benchmark is only worth having if there is an expectation that it shows something interesting related to AD, and will be implemented by multiple tools. In particular, a hugely complicated benchmark that nobody will ever implement is not particularly interesting. Often, once you isolate the AD-specific parts, a large benchmark can be reduced to a much more manageable core. For example, kmeans is actually just the AD-relevant core of a k-means clustering application, rather than the whole thing. I would like for GradBench to eventually have some larger benchmarks (the current ones are all fairly small), but I'm worried that they would not serve well as comparisons.
As a special case, I personally support GradBench adding any benchmark found in an existing benchmark suite, even when those are similar to (or even subsets of) benchmarks already in GradBench. As a specific example, I think it would be a good idea to make sure all benchmarks from the Enzyme benchmark suite are also in GradBench.
Automation and finer points
Apart from simply being a collection of benchmarks, GradBench also has a few amenities for contributors. Perhaps the most significant is a robust Continuous Integration (CI) setup (largely due to Sam's work), by which every eval/tool combination is benchmarked every night, and the results are used to populate gradben.ch - click one of the eval names to see a graph of whatever metrics are appropriate to the benchmark. In most cases this is the runtime of some "primal" or "objective" function, the runtime of using AD to compute the gradient or a whole Jacobian, and the ratio of the two. The benchmarks are all run on virtual machines on GitHub Actions, and they are entirely sequential, so the resulting data is hardly perfect for every use, but I find it rather useful and interesting.
Of course, GradBench is not perfect. One problem is that it is not done. For example:
Some of the benchmarks do not have working plots yet. In particular, particle and saddle are somewhat different in what they measure, and need bespoke plotting code.
Ironically, given my criticism of ADBench, it is actually not trivial with GradBench to run all tools for all benchmarks. You need to manually type in (or script) all of the combinations, and some of them (e.g. the ones where the tool does not implement the benchmark) are expected to fail. The gradbench tool does have some logic for handling this, which is used in CI, but it's tied together with a mixture of YAML and shell script. Some polish remains to be added.
It is not possible to use the website plotting code for locally generated results. I have written a script that uses gnuplot to generate plots based on log files, but it is somewhat crude and very much hidden.
GradBench is not well tested, and some of the automation makes assumptions on how Docker works that are not true in all variants of Docker.
The website could be more useful:
Benchmark implementations should link directly to the corresponding code.
Benchmarks should link to their description.
The raw log files should be linked.
The plots could be more interactive, e.g. with precise values shown on hover.
The issues above will be addressed in time, simply by writing more code. Other problems are more tricky to address, and arise from basic tradeoffs in the design of GradBench.
It is not easy to run GradBench implementations outside of GradBench. In ADBench, most tools were ultimately in the form of a command line program that you passed a data file - this made it easy to disregard ADBench's automation and run things manually. This is not so easy in GradBench, where the only guaranteed interface is the GradBench protocol itself. Some of the tools do have an ADBench-like CLI interface used internally (this is the case for all of the C++ tools), but most GradBench evals do not make use of data files - they tend to generate the input on demand. Thus, in order to use these CLI programs, you must extract the input field of interest from a JSON log file produced by gradbench (script to do this), put it in a JSON file, and pass that to the executable (a rough sketch of this extraction appears after this list).
Docker images are great for isolating software dependencies, but you cannot use containerization to resolve hardware dependencies. For example, some of our tools are able to use special hardware (most commonly GPUs), but none of the Docker images have GPU passthrough support, and we probably do not want to enable that by default.
The tools are currently all run using a single thread, which in 2025 seems almost quaint. On its own, sequential execution is actually fine, as it isolates whether a performance difference between tools can be attributed to parallelism. However, it seems clearly desirable to measure both the sequential and parallel performance of a tool that supports the latter, and it is not clear how this fits into the design. Should parallel and sequential executions of fundamentally the same code be considered distinct tools as far as GradBench is concerned? This could quickly lead to a large proliferation of very similar tools.
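As mentioned in the first point above, extracting an input from a log file is mechanical but a little tedious; a rough sketch (the exact log structure assumed here is a guess, so treat it as a starting point rather than a working script) might look like this:

import json
import sys

# Pull the "input" of the first evaluate message for a given function out of a
# gradbench JSON Lines log. The log structure assumed here (entries wrapping a
# "message" with "kind" and "function" fields) is a guess.
log_file, function_name = sys.argv[1], sys.argv[2]
with open(log_file) as f:
    for line in f:
        entry = json.loads(line)
        message = entry.get("message", entry)
        if message.get("kind") == "evaluate" and message.get("function") == function_name:
            json.dump(message["input"], sys.stdout)
            break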
These issues can't be solved simply by hacking on code, but must be addressed with a combination of documentation, careful design, and probably some hacking as well.