The GradBench Benchmark Suite for Automatic Differentiation
This is a post about the GradBench benchmark suite initiated by Sam Estep. It is a companion post to my presentation at EuroAD 2025, and contains material and elaboration that was not a good fit for the presentation. I'm merely a GradBench contributor, so everything below is my opinion, and although it is well-reasoned and correct, like anything else on this website, it is not the official opinion of GradBench, whatever that might mean. With this post, I hope to achieve the following:
If you are an AD user, or just interested, convince you to contribute benchmark implementations.
If you are the implementer of some AD tool, convince you to add benchmark implementations using your tool, and to use GradBench's implementations in your own work.
If you are a person with a suitable computer, convince you to help test that GradBench runs for you.
Background
In applied computer science, an important part of developing a new tool or algorithm is to compare it against existing work. (Some use the word "competitors", but I feel this sounds a bit too antagonistic.) Qualitative comparisons purely based on the texts of papers can only go so far: a good comparison requires you to solve the same problems using both your new tool and existing tools, and comparing the outcomes. This comparison can be both quantitative ("how fast is it?") and qualitative ("is it easy or elegant to use?"), but it is in most cases also time consuming. It takes significant effort to faithfully solve some problem using different tools, and one is unlikely to be an expert in all of them. At the end, you are often left with some doubt about whether you truly represented your competitors (there's that word again...) in the best way - and your peers will wonder the same!
A solution to this problem is to have a benchmark suite: a collection of problems, along with their solutions using various existing tools. When proposing some new tool, you then only have to solve these hopefully well-described problems once, with your new tool, and compare the result with the existing solutions. Some desirable properties of a benchmark suite include:
1. It should contain a variety of problems that are qualitatively different, in order to have a good likelihood of demonstrating the strengths and weaknesses of any given tool.
2. The implementations included should be of high quality.
3. It should be easy to run, and there should be confidence that the results are meaningful (i.e., you need validation of results).
4. It should be easy to add a new implementation or problem.
While these are technical properties, some of them are best ensured using a combination of technical and social means. In particular, it is unlikely that any single person will be an expert on all tools available in some community, so property (2) is best ensured by making it easy for experts to contribute improvements as they are able. This implies lowering the barrier to entry, particularly by avoiding complicated setup requirements. The most useful benchmark suites are communally maintained by the communities they serve.
ADBench
Through Cosmin Oancea, I became interested in automatic differentiation some years ago, and it remains a research area that I find deeply fascinating. Our main research activity concerned the development of an AD transformation for Futhark. To evaluate the effectiveness of our work, we implemented problems from ADBench, a benchmark suite developed at Microsoft Research in roughly 2018-2020. Compared to the benchmark suites I had previously encountered (mostly for parallel programming), ADBench was much better engineered - among other things, it had a sensible and standardised interface for validation and performance measurements. It was relatively easy to write new code and slot it into the infrastructure. This may seem like a very basic expectation, like praising a chef for washing their hands, but many widely used benchmark suites fail to reach this level.
ADBench consists of four benchmarks implemented in over a dozen "tools" (a term covering both languages and libraries). A paper describes three of the benchmarks (the fourth, lstm, seems to have been added later), and there is also ample documentation, as well as scripts for plotting results, and so on. ADBench has seen fairly wide use in the AD community, and many papers use problems or code taken from ADBench.
Unfortunately, ADBench is not perfect. The biggest problem is that development ceased in 2021 and the repository was marked as "archived" in 2024. Since ADBench was (and remains) free software, one option is to fork it and maintain it ourselves. However, while ADBench was certainly the best engineered benchmark suite of its time, it still has two significant problems.
The first problem is that the architecture is tightly coupled. The various tools and infrastructure programs have a hard-coded idea of which benchmark problems exist, and how they should be handled. Adding a new benchmark problem requires you to make a lot of little changes in many different places, including modifying those tools for which you don't actually intend to implement that benchmark (either now or ever). It is also tricky to add tools in new languages, as you are required to implement I/O routines for various nonstandard data formats - unless your tool is in C++ or Python (or can pretend to be), in which case you can use some shared code. Finally, a bunch of the automation (which you must modify, remember!) is written in PowerShell, which I find difficult to modify, without it being materially better (at least for this purpose) than more common languages.
The second problem with ADBench is that it is difficult to "run everything". ADBench is a polyglot benchmark suite with tools written in different languages or using exotic libraries. This is good, but it is not good when users have to manually set up an environment where all such dependencies are satisfied. There is a Dockerfile that installs a subset of the necessary dependencies (those needed for the major tools, essentially), but it has the problem that all dependencies must coexist within the same Docker image, including ADBench's own infrastructure code. This can cause serious trouble if two tools require mutually exclusive dependencies (e.g. different versions of some library or compiler), or worse yet, are incompatible with the dependencies of the ADBench infrastructure itself. This is a problem that is hard to solve, but solving it is crucial to making a polyglot benchmark suite (one with a rich diversity of languages) viable - and this is something I personally care about a great deal.
The Design of GradBench
In 2024, Sam Estep started the GradBench project - an effort to develop a new benchmark suite for automatic differentiation. While the initial goal was to reach parity with ADBench (by porting its benchmarks), the overall design was rather different, and more ambitious - in particular, it is intended to allow benchmarks that measure things other than raw numerical throughput. However, for the purposes of this post (and my own research interests), the interesting part about GradBench is its highly decoupled design.
GradBench is built around benchmarks, called evals, which communicate with tools through a simple message-passing protocol based on JSON Lines, transmitted over standard input and output. The eval sends commands to the tool (such as "run the function with this name on this input"), and the tool responds with output and runtime information, which the eval then verifies. The messages pass through another program, the intermediary, which inspects the messages and uses them to print a human-readable summary of what is happening. The raw log is also available and can be processed by scripts to perform plotting or any other analysis of interest.
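To give a concrete feel for the shape of this exchange, here is a minimal sketch (in Python) of what a tool process might look like. The message fields and kinds used below ("id", "kind", "input", and so on) are illustrative assumptions only - the protocol specification in the repository is the authoritative reference.

import json
import sys
import time

# Minimal sketch of a tool process: read one JSON message per line from stdin,
# reply with one JSON message per line on stdout. Field names are assumptions
# for illustration, not the actual GradBench protocol.
for line in sys.stdin:
    message = json.loads(line)
    if message.get("kind") == "evaluate":
        start = time.perf_counter()
        result = message["input"] ** 2  # pretend to be the hello "square" function
        elapsed_ns = int((time.perf_counter() - start) * 1e9)
        response = {"id": message["id"], "output": result,
                    "timings": [{"name": "evaluate", "nanoseconds": elapsed_ns}]}
    else:
        response = {"id": message["id"]}  # acknowledge other messages
    print(json.dumps(response), flush=True)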
The tool is the simplest component in all this - it just does as it is told, and there are very few things it can be told to do. Evals are a little more complicated, as they are in control of workloads and validation. The intermediary is by far the most complicated part, not just because it has to deal with a user interface, but also because it must orchestrate execution of the eval and tool, and handle things like timeouts or protocol violations. This complexity gradient is as it should be: we expect a large number of tools to be written, a smaller number of evals, and only one intermediary to ever exist. The tools and evals can be written in whatever language is expedient, as long as they implement the protocol. The intermediary itself is written in Rust, as a program called gradbench.
The interesting part of this design is that due to the extremely simple way the evals and tools communicate with the outside world (exclusively stdin/stdout), it is very easy to run them inside containers such as Docker images. This allows the use of arbitrarily exotic dependencies, as long as they can be described in a Dockerfile. And indeed, although GradBench does not require the eval/tool processes to run inside Docker, all included evals and tools are supplied alongside Dockerfiles, and the intermediary itself has convenience commands for building and running them.
As an example, assuming one has a working Rust compiler (to compile the gradbench CLI itself, although you could also download a precompiled version), this is how to run the hello eval with the enzyme tool:
$ ./gradbench run --eval './gradbench repo eval hello' --tool './gradbench repo tool enzyme'
...
[0] start hello (enzyme)
[1] def hello 4.531 s ✓
[2] eval hello::square 1.0 4ms ~ 0ms prepare, 0ms evaluate ✓
[4] eval hello::double 1.0 5ms ~ 0ms prepare, 0ms evaluate ✓
[6] eval hello::square 2.0 4ms ~ 0ms prepare, 0ms evaluate ✓
[8] eval hello::double 4.0 4ms ~ 0ms prepare, 0ms evaluate ✓
[10] eval hello::square 8.0 5ms ~ 0ms prepare, 0ms evaluate ✓
[12] eval hello::double 64.0 4ms ~ 0ms prepare, 0ms evaluate ✓
[14] eval hello::square 128.0 5ms ~ 0ms prepare, 0ms evaluate ✓
[16] eval hello::double 16384.0 4ms ~ 0ms prepare, 0ms evaluate ✓
The "..." part may consist of a large amount of Docker build output,
depending on whether the images are already available on a given
machine. While the hello
eval is (as the name implies) not an
interesting benchmark, AD veterans will recognise that
Enzyme is not exactly trivial to set up
since it depends on specific versions of LLVM, and yet that is all
hidden by the GradBench automation.
The gradbench CLI accepts various extra options, such as where to write the raw log messages, as well as whether to impose a timeout on tool responses. The --eval and --tool arguments are passed shell commands that run the eval and tool processes respectively. In this case, the ./gradbench repo eval and ./gradbench repo tool convenience commands automate the boilerplate of building and running Docker images. If we wish to run eval and tool processes outside of Docker (perhaps for direct hardware access, or because Docker can be inconvenient during development), we can just pass some other command. Further, the tools and evals can accept additional options to control their behaviour (e.g. which workloads to use), although the defaults are supposed to be sensible.
What GradBench provides today
As of this writing, GradBench contains 11 benchmarks and 17 different tools, with a total of 109 implementations. The coverage is somewhat inconsistent; however, there are benchmarks that are implemented by almost every tool, and some tools that implement every benchmark.
The hello benchmark is the only one to be implemented by every tool, but it is not a very useful benchmark, as the problem is simply x², and every tool can handle this perfectly. It mostly exists as a way to test that the basic infrastructure is working. The zygote and floretta tools only implement hello, but we expect this to improve.
Once we move away from such pathological cases, GradBench does offer a compelling set of implementations. It contains ports of all of the ADBench problems, namely gmm, ba, lstm, and ht. These are each implemented with a minimum of 10 tools, and gmm is implemented with 14. This means implementing one of these with your fancy new tool is a pretty easy way to quickly be able to compare with a lot of prior work.
GradBench also contains ports of all of the benchmarks from cmpad, as well as two of the problems from the AD Rosetta Stone.
The currently implemented tools include a variety of C++ libraries, including both classics (Adept, CppAD, ADOL-C) and newer libraries such as CoDiPack. All benchmarks have also been implemented in Enzyme, using the C++ frontend. Apart from this rich bouquet of C++, GradBench also provides implementations in a variety of more exotic languages. Notably, Futhark (in which I have a particular interest) implements every benchmark, but there are also implementations in Haskell, OCaml, and a variety of Python libraries.
Two of the GradBench tools are not truly AD tools: manual and finite. The manual tool contains programs that have been differentiated by hand. We expect that in most cases, the hand-differentiated versions should be the fastest, as they may exploit mathematical properties that it is not reasonable to expect of an AD tool. However, these are only algorithmic improvements: a tool may beat manual through operational advantages, such as efficient implementations of primitives like matrix multiplication, parallel execution, etc., that the manual implementation does not exploit. It is not expected that manual will contain hand-differentiated versions of all evals; AD is after all most useful in those cases where hand-differentiation is impractical. The finite tool simply uses finite differences to compute the derivatives. This is certainly convenient, but usually much slower than AD.
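For illustration only (this is not GradBench's actual finite implementation), the idea behind finite differences is the textbook central-difference approximation:

import numpy as np

# Generic central-difference gradient: perturb each input coordinate by a small
# step h and approximate the partial derivative from the function values.
def finite_diff_gradient(f, x, h=1e-6):
    x = np.asarray(x, dtype=float)
    grad = np.empty_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

print(finite_diff_gradient(lambda v: (v ** 2).sum(), [1.0, 2.0, 3.0]))  # roughly [2, 4, 6]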
My estimate is that GradBench is currently the largest benchmark suite for automatic differentiation. The closest competitor would be ADBench itself, mostly because ADBench still contains implementations in some languages and libraries that are not in GradBench (in particular, various flavours of MATLAB).
All of the benchmarks naturally come with fully automated validation of results, and timing code that is at least somewhat reliable.
Contributing to GradBench
A person with an interest in AD can contribute to GradBench by improving the current benchmark implementations, adding entirely new implementations, adding new tools, or adding new evals. I will now explain how to do each of these. If you have any trouble, you are welcome to contact us for help, either via GitHub or the Discord platform for automatic differentiation. It may also be a good idea to read CONTRIBUTING.md.
Improving an implementation
Each coloured square on gradben.ch corresponds to an implementation of a benchmark in some tool. Some of them are known to be good, some of them we think are good, and some of them we know are bad. It would be good to have tool experts take a look and either make an improvement, or let us know that they think the implementation is good. It is not important for an implementation to be as fast as it can possibly be - GradBench is supposed to demonstrate high-quality and idiomatic code. Think comparison, not competition - it's not about who can get the smallest number at any cost. This does require good faith on behalf of contributors, but given the prize that is at stake (nothing), I am not too worried.
For these tools I think the implementations are good, but I am not completely sure: adept, adol-c, codipack, and cppad.
I am quite sure that the tensorflow implementations of gmm and ba are too slow, and so are the pytorch implementations of ht and lstm. It would be very good if someone familiar with these libraries could take a look - this probably does not require deep expertise.
Where to find the implementations depends on the tool, although a tool foo will always have a file tools/foo/Dockerfile that shows where the code is located. For the C++ implementations, the code is usually in the tools/ directory itself, which contains a subdirectory for each tool. For example, this is the cppad implementation of gmm. For the Python-based tools, the tools/ directory does not contain the actual implementation code; instead it is in python/gradbench/gradbench/tools.
After making a modification, you can run your changes using the gradbench CLI as shown above. It can be a bit awkward to do this, as rebuilding the Docker images may take a little while, depending on your changes. It is possible to run the tool outside Docker - the specifics vary based on the tool, but here is the command for running the enzyme tool:
$ python3 python/gradbench/gradbench/cpp.py enzyme
You would pass this command as the --tool argument to gradbench run. All of the C++-based tools can be run like this. It does of course require you to set up the necessary dependencies yourself (the shell.nix can help with this, but now we're getting too far afield).
Adding an implementation for an existing tool
For some tools, the infrastructure has been built (speaking the protocol, writing the Dockerfile), but not yet implementations of all benchmarks. Sometimes this is because we have not gotten around to it, but at other times it is because those benchmarks require something that is tricky to do in a specific tool.
To add a new implementation, the easiest approach is to pattern match against an existing implementation. For the C++ tools, you need to add a program foo.cpp, where foo is the name of the benchmark. I suggest looking at the Enzyme implementations for the boilerplate input/output reading code, since all benchmarks have been implemented in Enzyme.
As of this writing, we are missing implementations of the particle and saddle benchmarks in all of the C++ tools based on tape recording. Note that these are quite fiddly benchmarks, so perhaps not a very motivating place to start.
We lack an implementation of kmeans in adept - largely because kmeans requires computing a Hessian, which Adept claims not to support well (but one can probably make the bear dance somehow). We also lack cppad implementations of ba and ht - I think these are not so difficult to do.
Our benchmark implementations in tensorflow, pytorch, and jax are also still quite spotty - these are pretty robust tools, so I think improving coverage is not so difficult for someone sufficiently versed in their mysteries.
Generally, look at the missing tag on the GitHub issue tracker to find missing implementations.
Adding a new tool
Adding a new tool is a bit more laborious. At a basic level, a tool is an appropriately named directory in the tools/ directory containing a Dockerfile that, when built and run, behaves like a tool process as specified in the protocol.
If the new tool you want to add is a C++ or Python library, then you are in luck - you can piggyback on the existing implementations of the protocol. Otherwise, you will have to implement it yourself. If you have access to a JSON library in your chosen language, this is not so difficult. Using gradbench with the -o option, to make it dump the raw message log to a file, is a good way to debug errors in the protocol implementation. Even if your program is not written in Python, you may still find it beneficial to use the Python implementation of the protocol, and then internally execute your program(s) using some bespoke mechanism. That is in fact how the C++ tools work.
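As a rough sketch of that wrapping idea (the message fields and the ./my-tool executable below are hypothetical, and the real C++ tools reuse GradBench's Python protocol code rather than a hand-rolled loop like this), the structure is roughly:

import json
import subprocess
import sys

# Forward each "evaluate" request to an external program that reads JSON input
# on stdin and writes JSON output (including timings) on stdout. The field
# names and the ./my-tool binary are hypothetical.
for line in sys.stdin:
    message = json.loads(line)
    if message.get("kind") == "evaluate":
        proc = subprocess.run(["./my-tool", message["function"]],
                              input=json.dumps(message["input"]),
                              capture_output=True, text=True, check=True)
        response = {"id": message["id"], **json.loads(proc.stdout)}
    else:
        response = {"id": message["id"]}
    print(json.dumps(response), flush=True)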
Once you have implemented the basic boilerplate, and have a working implementation of hello, you can continue adding more evals as discussed above. I recommend starting with llsq, as it is very simple.
Due to GradBench's decoupled design, tools have a low ongoing maintenance cost, and the project is therefore very open to accepting incomplete or experimental tools. In fact, the only real requirement is that you can write a Dockerfile that sets up the tool in a reliable way. My personal ambition is for GradBench to become as polyglot as possible, so I actively encourage everyone to submit implementations in the weirdest tools they can find. In particular, GradBench has explicit support for tools that are unable to handle all workloads for a given benchmark. It is acceptable and expected that some tools will be unable to handle the largest workloads within the time allotment (currently 10 minutes), and this can be explicitly indicated. The tool will still be considered successful, and the workloads for which it does produce a result will still be part of the published graphs (which I will discuss below). This is intended to make GradBench welcoming to tools that are not focused on numerical performance in absolute terms, or have not yet reached the phase in their development where the desired performance has been achieved.
Adding a new benchmark
Adding a benchmark (or eval in GradBench-lingo) is the most laborious form of contribution. A benchmark must be specified in a way that is clear enough for others to understand it (you can judge for yourself to which extent we have succeeded so far), come with some validation mechanism, and also have at least a couple of implementations using various tools.
Similarly to tools, an eval is specified by a subdirectory in evals/ that behaves like an eval process as specified in the protocol. There is no real limit to what an eval can do, although in most cases it will send the start message, then a define message, then evaluate messages with various functions and inputs. All of the GradBench evals are currently written in Python - this is not a hard requirement, but since evals are not performance-sensitive or particularly complicated, writing them in Python means you can reuse existing utility libraries.
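To make that concrete, here is a heavily simplified sketch of an eval in Python. The message fields are again assumptions for illustration; real evals use the utility libraries in the repository and run many workloads with proper validation.

import json
import sys

# Send one JSON message per line on stdout and wait for the tool's reply on
# stdin. Field names are illustrative assumptions, not the protocol spec.
def send(message):
    print(json.dumps(message), flush=True)
    return json.loads(sys.stdin.readline())

send({"id": 0, "kind": "start", "eval": "hello"})
send({"id": 1, "kind": "define", "module": "hello"})
response = send({"id": 2, "kind": "evaluate", "module": "hello",
                 "function": "square", "input": 2.0})
assert abs(response["output"] - 4.0) < 1e-12  # validate the tool's answer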
Beyond the technical effort of specifying and implementing a benchmark, another question is which benchmarks are worthwhile. The whole point of GradBench is comparison, so a benchmark is only worth having if there is an expectation that it shows something interesting related to AD, and will be implemented by multiple tools. In particular, a hugely complicated benchmark that nobody will ever implement is not particularly interesting. Often, once you isolate the AD-specific parts, a large benchmark can be reduced to a much more manageable core. For example, kmeans is actually just the AD-relevant core of a k-means clustering application, rather than the whole thing. I would like for GradBench to eventually have some larger benchmarks (the current ones are all fairly small), but I'm worried that they would not serve well as comparisons.
As a special case, I personally support GradBench adding any benchmark found in an existing benchmark suite, even when those are similar to (or even subsets of) benchmarks already in GradBench. As a specific example, I think it would be a good idea to make sure all benchmarks from the Enzyme benchmark suite are also in GradBench.
Automation and finer points
Apart from simply being a collection of benchmarks, GradBench also has a few amenities for contributors. Perhaps the most significant is a robust Continuous Integration (CI) setup (largely due to Sam's work), by which every eval/tool combination is benchmarked every night, and the results are used to populate gradben.ch - click one of the eval names to see a graph of whatever metrics are appropriate to the benchmark. In most cases this is the runtime of some "primal" or "objective" function, the runtime of using AD to compute the gradient or a whole Jacobian, and the ratio of the two. The benchmarks are all run on virtual machines on GitHub Actions, and they are entirely sequential, so the resulting data is hardly perfect for every use, but I find it rather useful and interesting.
Of course, GradBench is not perfect. One problem is that it is not done. For example:
Some of the benchmarks do not have working plots yet. In particular, particle and saddle are somewhat different in what they measure, and need bespoke plotting code.
Ironically, given my criticism of ADBench, it is actually not trivial with GradBench to run all tools for all benchmarks. You need to manually type in (or script) all of the combinations, and some of them (e.g. the ones where the tool does not implement the benchmark) are expected to fail. The gradbench tool does have some logic for handling this, which is used in CI, but it's tied together with a mixture of YAML and shell script. Some polish remains to be added.
It is not possible to use the website plotting code for locally generated results. I have written a script that uses gnuplot to generate plots based on log files, but it is somewhat crude and very much hidden.
GradBench is not well tested, and some of the automation makes assumptions on how Docker works that are not true in all variants of Docker.
The website could be more useful:
Benchmark implementations should link directly to the corresponding code.
Benchmarks should link to their description.
The raw log files should be linked.
The plots could be more interactive, e.g. with precise values shown on hover.
The issues above will be addressed in time, simply by writing more code. Other problems are more tricky to address, and arise from basic tradeoffs in the design of GradBench.
It is not easy to run GradBench implementations outside of GradBench. In ADBench, most tools were ultimately in the form of a command line program that you passed a data file - this made it easy to disregard ADBench's automation and run things manually. This is not so easy in GradBench, where the only guaranteed interface is the GradBench protocol itself. Some of the tools do have an ADBench-like CLI interface used internally (this is the case for all of the C++ tools), but most GradBench evals do not make use of data files - they tend to generate the input on demand. Thus, in order to use these CLI programs, you must extract the input field of interest from a JSON log file produced by gradbench (script to do this), put it in a JSON file, and pass that to the executable (a rough sketch of this extraction appears after this list).
Docker images are great for isolating software dependencies, but you cannot use containerization to resolve hardware dependencies. For example, some of our tools are able to use special hardware (most commonly GPUs), but none of the Docker images have GPU passthrough support, and we probably do not want to enable that by default.
The tools are currently all run using a single thread, which in 2025 seems almost quaint. On its own, sequential execution is actually fine, as it isolates whether a performance difference between tools can be attributed to parallelism. However, it seems clearly desirable to measure both the sequential and parallel performance of a tool that supports the latter, and it is not clear how this fits into the design. Should parallel and sequential executions of fundamentally the same code be considered distinct tools as far as GradBench is concerned? This could quickly lead to a large proliferation of very similar tools.
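As mentioned in the first point above, extracting an input from a log file is mechanical but a little tedious; a rough sketch (the exact log structure assumed here is a guess, so treat it as a starting point rather than a working script) might look like this:

import json
import sys

# Pull the "input" of the first evaluate message for a given function out of a
# gradbench JSON Lines log. The log structure assumed here (entries wrapping a
# "message" with "kind" and "function" fields) is a guess.
log_file, function_name = sys.argv[1], sys.argv[2]
with open(log_file) as f:
    for line in f:
        entry = json.loads(line)
        message = entry.get("message", entry)
        if message.get("kind") == "evaluate" and message.get("function") == function_name:
            json.dump(message["input"], sys.stdout)
            break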
These issues can't be solved simply by hacking on code, but must be addressed with a combination of documentation, careful design, and probably some hacking as well.