Build systems: content addressed tracing
An idea I have lying around is something I am going to call "ca-tracing" for the purposes of this post. The concept is to instrument builds and observe what they actually did, and record that for future iterations such that excess dependencies can be ignored if, even if inputs changed, the instructions are the same and the files actually observed by the build are the same.
Implementation
Assumptions
This idea assumes a hermetic build system, since we need to know if anything might have differed from build to build, so we need a complete accounting of the inputs to the build. It is not necessarily the case that such a hermetic build system would be Nix-like, however, it is easiest to describe on top of a Nix-like; first one with build identity, then one that lacks build identity like Nix.
This also assumes a content-addressed build system with early cut-off like Nix with ca-derivations. In Nix's case, input-addressed builds are executed, then renamed to a content-addressed path: if a build with different inputs is executed once more with the same output, it is recorded as resolving to that output, and further builds are cut off.
Conceptual implementation
Conceptually, a build is a function:
(inputs, instructions) -> outputs
We wish to narrow inputs to inputsactual, and save this information alongside outputs. In a following build, we can then verify if instructions' matches a previous build (instructions) and if so, extract the values of the same dynamically observed inputs'actual, but relative to inputs' and compare them to the values of inputsactual from the previous build.
Since our build system is hermetic, if this hits cache, it can be assumed to have identical results, modulo any nondeterminism (which we assume to be unfortunate but unproblematic, and is there regardless of this technique).
Making it concrete
A build ("derivation" in Nix) in a Nix-like system is a specification of:
- Inputs (files, other derivations)
- Environment variables
- Command to execute
The point of ca-tracing is to remove excess inputs, so let's contemplate how to do that.
File names
The inputs are files named based on hash(contents)
in Nix, but we don't
know which contents we will actually access. This is a problem, since the file
paths of inputs need to remain constant across multiple executions of the
build (the paths for inputs must equal the paths for inputs'), since the
part of inputs that changed may be irrelevant to this build.
In a system that doesn't look like Nix, the input file paths might be the same across two builds on account of not containing hashes, so this would not be a problem.
We can solve the file names problem by replacing the hash parts in the input filenames with random values per-run. These hashes should never appear, even in part, in the output, if the builder is not doing things with them that would render the build non-deterministic.
Unfortunately the file names may appear in the output through the ordering of deterministic hash tables, for instance, which could be a problem; this exists in practice in ELF hash tables for instance. Realistically we would need file-type-specific rewriters to fixup execution output to a deterministic result following multiple runs.
We would also have to rewrite those hashes within blocks of data read from within the builder, but that's possibly just a few FUSE crimes away to be able to do live, on-demand.
Following the build, the temporary hashes of the inputs can be substituted for their concrete values pointing to the larger inputs †.
Tracing, filesystem
To trace a build, one would have to pull the filesystem activity. This is possible with some BPF tracing constrained to some cgroup on Linux, so that is not the hard part.
The data that would have to be known is:
- Observed directory listings with hashes
- Read file names matching inputs, with associated hashes
- Extremely annoyingly:
fstat(2)
results for all queried files in inputs (this is extremely annoying because everything callsfstat
all the time pointlessly or to check for files being present, and it includes things like the length of a file, which could in principle cause unsoundness if not recorded).
This would then all be compared to the equivalent paths in inputs' and if the hashes match, the previous build could be immediately used.
Avoiding build identity; how would this work in Nix?
Nix is built on top of an on-disk key-value store (namely, the directory
/nix/store
), which is a mapping:
Hash -> Value
Thus, we just need to construct a hash in such a way that both Build and Build' get the same hash value.
We could achieve this by modifying the derivation in a deterministic manner such that two modified-derivations share a hash if they could plausibly have ca-tracing applied. Specifically, rewrite the input hashes to something like the following:
hash("ca-tracing" + name + position-in-inputs) + "-" + name
When a build is invoked, modify the derivation, hash it, and check for the presence of a record of a modified-derivation of the same hash, and then check if the actually-used filesystem objects when applied to inputs' remain the same.
Use cases
This idea is almost certainly best suited for builds using the smallest possible unit of work, both in terms of usefulness and likelihood of bugs in the rewriting. To use the terminology from Build Systems à la Carte, it is likely most useful for systems that are closer to constructive traces than deep constructive traces.
For example, if this is applied to individual compiler jobs in a C++ project, it can eliminate rebuilds from imprecise build system dependency tracking, whereas if the derivation/unit of work is larger, the rebuild might be necessary anyway.
Problems
-
There could exist multiple instances of a modified-derivation with different filesystem activity, due to, say, a bunch of rebuilds against very differently patched inputs. This system would have to be able to either represent that or just discard old ones.
-
Real programs abuse
fstat(2)
way too much and it's very likely that this whole thing might not actually get any cache hits in practice iffstat
calls are considered. Without visibility into processes we cannot know iffstat
calls' results are actually used for anything more than checking if a file exists.This might benefit from some limited dynamic tracing inside processes to determine whether the fstat result is actually read.
-
The whole enterprise is predicated on generalized sound rewriting, which is likely very hard; see below.
Naive rewriting is a bad idea
The implementation of ca-derivations itself, where it just rewrites hashes
appearing in random binaries with the moral equivalent of sed
, is extremely
unsound with respect to compression, ordered structures (even NAR files would
fall victim to this), and any other kind of non-literal storage of store paths,
and this approach just adds yet more naive rewriting that is likely to explode
spectacularly at runtime.
Naively rewriting store paths is an extension of the original idea of Nix doing runtime dependencies by naively scanning for reference paths. However, crucially, the latter does not modify random binaries without any knowledge of their contents, and the worst case scenario for that reference scanning is a runtime error when someone downloads a binary package.
Realistically, this would have to be done with a "diffoscope of rewriters", which can parse any format and rewrite references in it. We can check soundness of a build under rewriting by simply running it more times. The rewriter need not be a trusted component, since its impact is only as far as breaking your binaries (reproducibly so), which Nix is great at already!
In an actual implementation, I would even go so far as saying the rewriter must not be part of Nix since it is generally useful, and it is fundamentally something that would have to move pretty fast and perhaps even have per-project modifications such that it cannot possibly be in a Nix stability guarantee.
Related work
This is essentially the idea of edef's incomplete project Ripple, an arbitrary-program memoizer, among other work, but significantly scaled down to be less general and possibly more feasible. Compared to her project, this idea doesn't look into processes at all, and simply involves tracing filesystem accesses to read-only resources in an already-hermetic build system.
Thanks to edef for significant feedback and discussion about this post. You can sponsor her on GitHub here if you want to support her work on making computers more sound such as the Nix content addressed cache project, tvix, and also her giving these ideas to Arch Linux developers.