One of Vesta's key features is its guarantee of precise repeatability: any build you perform can be repeated exactly in the future. However, there are limits to this guarantee. While Vesta encapsulates all that most build procedures depend upon, it doesn't run your builds in a virtual machine. (That would make Vesta much less portable and have a significant performance cost.)
What is Guaranteed
The source versions used by a build are immutable and all references to different sources point to specific versions which cannot change over time. This is preserved by the fact that the builder will only evaluate models in immutable directories, and the import clause only allows references to specific immutable versions. So the source files used by a specific version of a build cannot be changed once that version is created.
The source files completely specify the filesystem contents and environment variables used during each build step. The SDL model code runs tools with the _run_tool primitive function which gets its filesystem from the value in ./root and environment variables from ./envVars. These can only come from immutable source files, including other immutable SDL model files, and the results produced by previous tools.
As long as tools are deterministic with respect to their inputs and only make use of file and directory contents and environment variables as inputs, then your builds should be completely repeatable.
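The condition above can be illustrated with a minimal sketch. It is not Vesta code: `toy_tool` is a hypothetical stand-in for a build tool, and the dictionaries stand in for the contents of ./root and ./envVars. The point is only that a tool whose output is a pure function of file contents and environment variables gives identical results on every run.

```python
import hashlib


def toy_tool(files, env_vars):
    """Hypothetical tool: its output depends only on file contents and env vars."""
    digest = hashlib.sha256()
    # Iterate in sorted order so the result does not depend on insertion order.
    for path in sorted(files):
        digest.update(path.encode() + b"\0" + files[path] + b"\0")
    for name in sorted(env_vars):
        digest.update(f"{name}={env_vars[name]}".encode() + b"\0")
    return digest.hexdigest()


# Stand-ins for the filesystem and environment handed to a tool (illustrative values).
root = {"src/hello.c": b"int main(void) { return 0; }\n"}
env = {"PATH": "/bin:/usr/bin"}

first = toy_tool(root, env)
second = toy_tool(dict(root), dict(env))
print(first == second)  # same inputs, same output
```

Changing any input, such as a file's contents or an environment variable, changes the output, which is the behavior a cache keyed on inputs relies on.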
How To Break Repeatability
There are several things tools can do that will not be precisely repeatable:
Read data from the network. Imagine running wget or scp as a tool. Obviously, this would depend on data not in the source set. Vesta doesn't have any way to prevent a tool from opening network connections.
- One potentially unavoidable use of the network is reading license information over the network from a license server. (Some proprietary tools require the use of a license server to ensure that no more than a certain number of instances of a licensed tool are run simultaneously.)
Using values which change over time. There are various pieces of information which a program can obtain from the operating system which change over time:
The system clock. A tool could write the current date/time in its output file. It could also use the current time as a seed for a pseudo-random number generator.
The process ID. Similar to reading the current time, using the process ID would produce different results on different executions.
The user ID. While all tools will be run as the special vruntool user, the user ID of that user may be different from one installation to another. If you want your build to produce the same results at remote sites, don't make any use of the user ID.
Using system-provided random sources. Some operating systems provide a source of random data which is better than most pseudo-random number generators (such as /dev/random and /dev/urandom on Linux).
Using the timestamps of derived files. While the timestamps on source files will remain constant once created, files generated by previous tool executions can have different timestamps during different evaluations of the same build. (Suppose the weeder deletes some old build from the cache and it then gets re-evaluated. Also, performing a build at a remote site will re-build the derived files, since replication only copies sources.)
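As a rough illustration of the time-varying values described above, the sketch below contrasts two hypothetical tools. Neither is part of Vesta; the function names and output formats are made up for the example. The first stamps its output with the clock and the process ID, so its bytes differ between executions. The second uses only its input.

```python
import os
import time


def stamped_output(source: bytes) -> bytes:
    """Hypothetical tool that embeds the current time and its process ID."""
    header = f"built at {time.time_ns()} by pid {os.getpid()}\n".encode()
    return header + source


def plain_output(source: bytes) -> bytes:
    """Hypothetical tool whose output depends only on its input."""
    return b"built\n" + source


source = b"int main(void) { return 0; }\n"

# The plain tool yields byte-identical results every time it runs.
print(plain_output(source) == plain_output(source))

# The stamped tool's header changes whenever the clock or the process ID changes,
# so a later re-evaluation would generally not match a cached result.
print(stamped_output(source).splitlines()[0])
```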
Using information about the execution host. Several pieces of information available on the host where tools are executed can change when tools are distributed to different hosts or when builds are performed from replicas at other sites:
The hostname and host's IP address.
The specific CPU model. Usually we expect the ability to run current programs on future CPU products (i.e. backwards compatibility). However, CPUs do change and add features over time which may be detectable by programs.
Hardware serial number or identifying information. For example, some versions of Intel CPUs have an individual ID number which can be retrieved by a program.
The specific OS kernel version (e.g., from the output of uname).
False pass tool status. If a tool fails but reports a successful exit status anyway, then Vesta will cache its results. If the error was from a transient system error (e.g. the RunToolServer host running out of virtual memory, the repository server running out of file descriptors, etc.), then the next invocation of the tool would not fail in the same way. A well-behaved tool should not exit with successful status when it has really failed, but a buggy tool can cause this sort of situation.
Vesta may cache a good version and a bad version of the same tool run; some builds might get a cache hit on the good version and some on the bad. (KenSchalk: Actually, no, it doesn't randomly choose one on each cache lookup. If there are two possible hits for the same invocation, one of them will always be used and the other will never be used. The easiest way to create this situation is if two evaluations simultaneously execute the same tool and add two new entries in parallel. It's also possible that an earlier evaluation might add the good entry, and a later evaluation which encounters a transient failure with a buggy tool adds the bad entry, which subsequently masks the good entry in future lookups. While it's an interesting bit of trivia, do we really need to discuss it here?)
(JVK: Well, then are we saying that the only way this can be non-repeatable is that some of the first N evaluations of this model, run in parallel, can pass and one or more of them can fail, and that after that they all predictably fail (or predictably pass)? It would be worth saying that from then on all cache lookups of the evaluation are repeatable (on this server), and only the first N in parallel might have inconsistent results. That bounds the non-repeatability. I believe future users and projects considering Vesta will want to know the bounds of the issue.)
(JVK: While in most cases this issue might actually be bounded to the first N evaluations, which might seem like a small scope at first, it also seems worthwhile to note that not all evaluations are created equal. That is, if an evaluation happens to be one involved in validating the release of changes to the entire project, and it experiences this issue, it will make it seem like "a bad release" was let through, while a single user's evaluation would have more localized consequences. It will also look like people can't repeat the results of the released build. In this case, we are experiencing negative results from two outcomes of the false pass:)
- the false pass exit status has allowed a bad result to be cached in Vesta
- the false pass exit status has propagated through to the final result of the build, and the build has been judged a success (an issue in using Vesta, not in Vesta itself)
(JVK: we could move this text to a details... section)
Another repository with the exact same build that did not have the transient system error will pass (or fail in a different way) instead of failing. (KenSchalk: This bullet point seems redundant to me. Isn't this the definition of violating repeatability?)
- (jvk: i would say yes, because it describes how the user will experience this non-repeatability)
Tricks for Problematic Tools
You may have a tool in your build process which does something which isn't repeatable. There are some ways to deal with this:
If you have a tool that uses the current time in producing its output, you may be able to intercept calls to get the system time. One way to do this is with the FakeTime Preload Library.
- Note that one could use the same method to intercept other system calls and insulate the tool from other things such as the hostname, process ID, etc.
If you have a tool that uses a character special device that provides random numbers from the operating system, you may be able to replace it. For example, on Linux you could supply /dev/random and /dev/urandom to a tool but make them work like /dev/zero with:
. ++= [ root/dev = [ random = 0x0105, urandom = 0x0105 ] ];
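The value 0x0105 in the line above reads naturally as a device number. The sketch below decodes it under the assumption that it uses the traditional 16-bit encoding, with the major number in the high byte and the minor number in the low byte. That interpretation is an assumption made for illustration, not something this page states. The lookup table holds the standard Linux assignments for these character devices, and `split_device_number` is a hypothetical helper.

```python
def split_device_number(dev: int) -> tuple[int, int]:
    """Split a 16-bit device number into (major, minor).

    Assumes the traditional encoding: major in the high byte, minor in the low byte.
    """
    return (dev >> 8) & 0xFF, dev & 0xFF


# Standard Linux character-device numbers for the memory devices (major 1).
LINUX_MEM_DEVICES = {
    (1, 5): "/dev/zero",
    (1, 8): "/dev/random",
    (1, 9): "/dev/urandom",
}

major, minor = split_device_number(0x0105)
print(major, minor, LINUX_MEM_DEVICES[(major, minor)])  # 1 5 /dev/zero
```

Under that assumption, both names are bound to the device that backs /dev/zero, which matches the stated goal of making the random devices return a constant stream.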
If you have a tool that tries to access the network but doesn't really need to, you may be able to artificially restrict its network access with a trick like netjail.
If you need hostname to return a static value, for example "localhost", then you can create a file with executable mode bits which just echoes "localhost". For example, this file is called "fakehostname":
#!/bin/sh
/bin/echo localhost
- Add it to your model like this:
files
    fakehostname;
{
    ...
    root ++= [ bin/hostname = fakehostname ];