Contents

How Caching Works

How Caching Works

The Vesta builder stores the results of function calls in the Vesta cache server. (This includes everything from the result of running tools like compilers to complete builds.) When a build needs to evaluate some function, it first checks to see if the result of a matching function call is stored in the cache. If it finds a match, it re-uses the result instead of doing the work of the function call again. This keeps Vesta builds from unnecessarily repeating work which has been done before.

This page gives a brief overview of how this works.

Recording Dependencies

Dependencies in Vesta are recorded dynamically as a build is performed. This is different from systems like make, where dependencies are specified explicitly in a file of instructions for the builder.

Each value in the builder's language is the result of some expression. Evaluating an expression both computes the value and records along with it what it dependend upon. This dependency information is carried along with the value wherever it goes, even when if it gets stored in the cache and later re-used after a cache hit. An important thing to note is that, because they are recorded dynamically, dependencies propagate from the bottom up.

For example, when evaluating the expression "_append(a,b)", the result value will be the concatenation of a and b plus a record that says "this value depended on the value of the variable a and the value of the variable b".

The dependencies recorded on the return value of a function are used by the cache to check for a match. This allows a previously produced result to be re-used whenever the variables (including source files) it actually depended upon match.

Here's a slightly more interesting example:

f(a:bool, b:binding) {
  return if a then b/x else b/y;
};

The dependencies recorded for this function will depend on the arguments it's called with.

Call	Result	Dependencies
`f(TRUE, [x=1, y=2])`	`1`	a, b/x
`f(FALSE, [x=1, y=2])`	`1`	a, b/y

If both of these function calls were stored in the cache, the evaluator could get cache hits on either of these function calls:

f(TRUE, [x=1, y=3])
f(FALSE, [x=0, y=2])

In both of these cases, all of the dependencies would match the values of the new function call, even though some of the arguments are different.

When running a compiler or other tool, Vesta monitors its filesystem operations. This allows it to record which files a tool run actually depended upon.

The entire filesystem seen by the tool is passed to _run_tool in the binding ./root. (This is called filesystem encapsulation.) Depepndencies on files are recorded on names in this binding. For compiling a simple C program, some of the dependencies might be:

./root/usr/bin/gcc	The compiler binary
./root/lib/i686/libc.so.6	A shared library used by the compiler
./root/.WD/hello.c	The source file being compiled
./root/usr/include/stdio.h	A header file it includes

Primary Key vs. Secondary Dependencies

Each function call has some set of values which we know it will always depends on.

Calling a user-defined function will always depend on the block of statements which define it.
Running a tool will always depend on its command line.

Unlike dependencies recorded while evaluating a function, we know that the result of the function will depend on these values before it's ever called. Vesta uses this fact to make looking up entries in the cache more efficient. It combines the values of these known dependencies into the PrimaryKey of the function call. Once this is known, the cache only needs to search for a dependency match among all entries with the same PrimaryKey. Even though the cache may contain many thousands of entries, there are far fewer with the same PrimaryKey. However, it's possible for tens or even hundreds of entries to share the same PrimaryKey.

The PrimaryKey is actually a 128-bit number which is computed from the value by a process called fingerprinting. This is essentially a checksum of the values. If any of the values changes, the fingerprint will change. Fingerprints are also used to represent the secondary dependency values, to make searching for a match more efficient.

Any recorded dependencies which are not part of the PrimaryKey will be secondary dependencies. (Sometimes these are also called free variables.)

For user-defined SDL functions, the PrimaryKey is made by combining:

The body of the function
Any arguments which are simple types: booleans, integers, and text strings
Any other arguments which are preceded by /**pk**/ in the function definition

The secondary dependencies of SDL functions include:

Arguments which are composite values: bindings, lists, and functions
Variables from the function's definition context
Any values used from the special final parameter dot (.)

For _run_tool calls, the PrimaryKey is made by combining:

The command line
The environment variables
The standard input
The working directory when the tool is started
The machine type the tool is run on

The secondary dependencies of _run_tool calls are the filesystem accesses it makes, recorded relative to ./root.

Models as Cut-off Points

Each SDL model file is itself a function. They're evaluated just like user-defined functions, but they're cached a little differently. Each model gets two cache entries: an inner (or normal) one and an outer (or special) one.

The normal cache entry has secondary dependencies recorded against the variables created in the model's files and import clauses. This is similar to the way that a function records secondary dependencies against variables it uses from its definition context.

The names from the files and import clauses are not like parameters, in that the caller doesn't pass them in. They don't exist in the calling scope, so dependencies on them can't propagate up to the caller. For that reason, the special cache entry only records secondary dependencies against dot (.), which is the only parameter to a model. To ensure correctness, the special cache entry includes the entire immutable directory in which the model is stored in the PrimaryKey. (This necessarily includes the full text of the model itself.)

The special model cache entry is more coarse-grained than the normal cache entry. The evaluator might get a cache miss on the special cache entry because some unused file in the same directory has changed. However, it would then get a cache hit on the correponding normal model cache entry.

Kinds of Dependencies

The evaluator records several different kinds of dependencies that are more fine-grained that the whole value of some variable. For example, testing for the presence of a name in a binding only depends on whether that name is present, not what its associated value is.

Each dependency kind is represented by a different character. An individual dependency is written as one of these characters followed by a slash followed by the path to the value in question. You may see these in debug output from the evaluator or in the output from cache inspection tools.

Character	Meaning	Example
N	Whole value	Using a variable: `x` => `N/x`
!	Existence of a name within a binding	Testing for a name in a binding: b!n => `!/b/n`
T	Type	Checking the type of a value: _is_bool(a) => `T/a`
L	List length	Taking the length of a list: _length(l) => `L/l`
B	The set of names in a binding	Looping over the elements of a binding: _map(f,b) => `B/b`, ...
E	Body of a function	Calling a function: `f()` => `E/f`, ...

When running a tool, different filesystem operations are recorded as different kinds of dependencies:

Operation	Dependency Kind
Opening a file	N
Looking for a file that doesn't exist	!
Listing a directory	B
Looking for a directory but not looking inside it (rare)	T

Cache Organization

The cache is essentially a multi-level hash table.

The first level is based on the PrimaryKey. The collection of all entries with the same PrimaryKey is called a PKFile. Because there can be a very large number of these (theoretically as many as 2^128), the cache groups PKFiles together on disk based on their most significant bits. (Currently this grouping is done using the first 16 bits.) Such a collection of PKFiles with the same prefix is called a MultiPKFile. PrintMPKFile prints the contents of a MultiPKFile.

The remaining levels take advantage of the fact that in most cases there are some secondary dependencies which are common to all cache entries with the same PrimaryKey. The cache keeps a single list of all secondary dependency paths for each PrimaryKey. Along with that list, it keeps a record of which dependency paths all the entries of that PrimaryKey depend upon and which ones only some depend upon. These are called common secondary dependencies and uncommon secondary dependencies.

The second level is indexed by the combination of all the common secondary dependencies. This groups the cache entries of the PKFile into common fingerprint groups, sometimes abbreviated to CFP groups.

The third level is indexed by the combination of the uncommon secondary dependencies for each entry. These are combined for each entry to create an uncommon fingerprint.

Cache Lookup Process

A cache lookup proceeds in several steps.

The evaluator computes the PrimaryKey of a function call. This includes combining any arguments marked with /**pk**/ for user-defined functions, or the entire immutable directory contents for a special model cache entry.
The evaluator makes a FreeVariables call to the cache. The cache locates the PKFile for the given PrimaryKey (if any exists) and returns the complete set of all secondary dependencies in that PKFile (both common and uncommon) to the evaluator.
The evalutor computes the fingerprint of all the secondary dependencies returned by the FreeVariables call.
The evalutor makes a Lookup call to the cache, passing the set of fingerprints it computed for the secondary dependencies.
The cache combines the fingerprints for all the common secondary dependencies to form the common fingerprint. It uses this to look up the right common fingerprint group. If there is no matching CFP group, it returns a miss back to the evaluator.
The cache searches through all entries in the CFP group for a match on the uncommon secondary dependencies. If a match is found, it's returned as a hit back to the evaluator. Otherwise, a miss is returned to the evaluator.

You can observe the communication between the evaluator and cache, including the arguments to and results from each FreeVariables and Lookup call, by adding "-cdebug All" to the evaluator command line.

Some additional details on this process:

When there is no PKFile for the given PrimaryKey, the FreeVariables result has isEmpty=true. This is treated as a cache miss without proceeding to the Lookup step.
Part of the return value of FreeVariables is a PK epoch which is incremented every time the set of secondary dependencies changes (either by adding or deleting entries). This is passed back to the cache in the Lookup call to determine if the set of secondary dependencies has changed since the FreeVariables call. If the PK epoch passed to Lookup doesn't match the current one in the cache, there's been some change since the FreeVariables call and the cache returns FVMismatch. This causes the evaluator to start over with the FreeVariables call.

Adding a new cache entry

Explain how cache entries get added

Cache entry deletion

Explain how cache entries get deleted durig weeding

Wiki

User