Bridge Dissection

Introduction

This document is here to help explain how Vesta bridges work. You might want to read it if you need to write bridges, or if you just use them and you want to understand how they work.

We'll be taking a detailed look at the bridge for the standard lexical analyzer generator lex(1). If you're not familiar with it, you might want to take a quick look over the lex man page. All you really need to know, however, is that it takes in a description file and produces a C source file.

In Vesta, this bridge can be found in the package /vesta/vestasys.org/bridges/lex. This document was written based on lex/0.

This document assumes that you are already familiar with basic Vesta concepts (packages, the checkout/checkin process, vmake, etc.) and the Vesta evaluator language (on a basic level). If you're not, you should probably go through the Vesta tutorial and the Vesta SDL walkthrough first.

An Overview of the Bridge Model

Before getting into all the details, let's take a quick look at the bridge model to see what's inside. Here's the full text of the bridge:

// Copyright (C) 2001, Compaq Computer Corporation
// 
// This file is part of Vesta.
// 
// Vesta is free software; you can redistribute it and/or
// modify it under the terms of the GNU Lesser General Public
// License as published by the Free Software Foundation; either
// version 2.1 of the License, or (at your option) any later version.
// 
// Vesta is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
// Lesser General Public License for more details.
// 
// You should have received a copy of the GNU Lesser General Public
// License along with Vesta; if not, write to the Free Software
// Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA

// Created on Wed Apr 23 09:56:40 PDT 1997 by heydon
// Last modified on Sat Sep 25 10:13:59 EDT 2004 by ken@xorian.net
//      modified on Thu Feb  5 00:53:13 PST 1998 by yuanyu
//      modified on Wed Jun 11 22:42:32 PDT 1997 by heydon

// bridges/lex/build.ves -- the lex(1) bridge model

{
// Parameters specializing this bridge to the target platform -----------------

  // The command to invoke.
  command = ./command;

  // The root filesystem to use for this platform (which must include
  // the executable named by "command").
  root = ./root;

// Functions exported by the bridge -------------------------------------------

    lex(/**pk**/ in_file: NamedFile, output_name: text = "lex.yy"): NamedFile
    /* Invoke lex(1) on the file "in_file", returning a binding
       that maps the name "output_name" to the output generated by
       lex. The argument "output_name" should not have an extension; a
       ".c" extension is automatically added. */
    {
        // form lex(1) command-line
        cmd =
          <command>        		             // name of executable
          + <"-t">              		     // "-t" writes to stdout
          + ./generic/binding_values(./lex/switches) // client switches
          + <_n(in_file)>;                           // input file

	// augment "." to include the lex executable and files required by lex
	. ++= [ root ];

	// add "in_file" to working directory
        . ++= [ root/.WD = in_file ];

        // invoke lex
        r = _run_tool(./target_platform, cmd,
          /*stdin=*/ "", /*stdout_treatment=*/ "value");

        // construct return result
        return if _assert(r != ERR && r/code == 0 && r/signal == 0,
 			  "lex failed")
	       then [ $(output_name + ".c") = r/stdout ]
	       else ERR;
    };

// The bridge result itself --------------------------------------------------

    bridge_name = "lex";
    switches = [];
    return [ $bridge_name = [ lex, switches ] ];
}

On lines 30 and 34, the "command" and "root" variables are set. These are both taken from binding lookups within the special variable "." (also called "dot"). These parameters are used to specialize the lex bridge to a particular platform.

The value of command is used exactly once, on line 46, and the value of root is used exactly once, on line 52. The way these variables are used assumes that command is a text value that specifies the path of the lex executable, and that root is a binding representing a root filesystem with all the files needed to run lex. (Remember: bindings and directories are essentially interchangeable.)

Lines 38-66 define a function named "lex", which happens to be the central piece of the bridge. It accepts two parameters:

An input file for lex. (Note that this parameter is given a type of NamedFile, one of a collection of predefined types documented in the vtypes(5) man page. It is possible to define more type names yourself, and if you look at other bridges you will come across this technique.) This parameter is used once on line 49 and once on line 55.
The name of the output file to be generated. (This parameter is declared to be type text.) The default name is "lex.yy". This parameter is used only once, on line 64.

The lex function returns a single NamedFile as output, which is the C source file lex generates.

Lines 70-72 define the return value for the bridge model. This result fully defines the bridge and everything it needs to do its job. Normally, a package like std_env collects the results of different bridge models in order to construct the value of dot used by most models.

The `lex` Executable

You may be wondering why the bridge gets passed a binding containing, among other things, the lex executable. The reason is simple: every single file that contributes to a build in any way, including all source files, libraries, and compiler executables, must be checked into Vesta. (To be more precise, there must be an immutable copy in the appendable portion of the repository; checkout session versions are just as acceptable as checked-in versions.) This is part of how Vesta provides one of its fundamental guarantees: that every build will be exactly repeatable forever. Imagine what would happen if Vesta just used /usr/bin/lex (instead of one in a package). If the installed version of lex were upgraded, the precise results of a build could suddenly change. A build that worked in the past might even fail (if the new version of lex is sufficiently different). Keeping tool executables under the same version control as sources prevents this problem. Just as a build refers to a precise set of source versions, it also refers to a precise set of tool versions.

Why does this bridge take the root filesystem as a parameter? Why not just include the lex executable and its associated files in the package and get them with a files clause, or store the files in another package and use an import to get them? The problem with that approach is that it would make the bridge specific to the platform of the included executable. If it were an i386 Linux executable, you'd need a different bridge on PowerPC Linux (or Alpha Tru64 UNIX, etc.). Making the filesystem and command path parameters makes the majority of the lex bridge reusable across platforms.

The `lex` Function

Now let's dig into the real meat of the bridge: the lex function.

Preparing the Command-Line

The first part of the body of the function constructs the command line used to invoke lex:

// form lex(1) command-line
        cmd =
          <command>        		             // name of executable
          + <"-t">              		     // "-t" writes to stdout
          + ./generic/binding_values(./lex/switches) // client switches
          + <_n(in_file)>;                           // input file

This assignment statement constructs a list of text values which will be used as the command line when invoking lex. The value assigned to cmd is constructed by concatenating together several small list values with the + operator. The angle brackets enclosing several of the sub-expressions turn them into lists (which is one way to make a list).

The first such list is made up of just one value, the variable command which was defined earlier. This holds the first element of the command-line to be executed: the command to execute. It's worth noting that the value of command is referenced in the function, but not explicitly passed into it as a parameter. Since functions in the Vesta evaluator language are closures, they capture the values of variables in the environment where they are defined. All the variables defined before the function in the scope of the overall bridge model (root and command) are captured by the function definition and carried with it.
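For readers more used to mainstream languages, the capture behavior described above can be modeled with an ordinary closure. The following is only a Python sketch, not Vesta SDL: make_bridge and the placeholder filesystem values are hypothetical names invented for the illustration, and a nested dictionary stands in for a binding.

```python
def make_bridge(command, root):
    """Model of the bridge model: returns a function that remembers
    the command and root it was specialized with."""
    def lex(in_file_name):
        # "command" is not a parameter of lex; it is captured from the
        # enclosing scope, just as the bridge's lex function captures it.
        # ("root" is captured the same way; this sketch does not use it.)
        return [command, "-t", in_file_name]
    return lex

# Two specializations of the same bridge code for different platforms
# (both paths are made up for the example).
lex_a = make_bridge("/usr/bin/lex", {"usr": {"bin": {"lex": "<executable>"}}})
lex_b = make_bridge("/usr/ccs/bin/lex", {"usr": {"ccs": {"bin": {"lex": "<executable>"}}}})

print(lex_a("lexer.l"))  # ['/usr/bin/lex', '-t', 'lexer.l']
print(lex_b("lexer.l"))  # ['/usr/ccs/bin/lex', '-t', 'lexer.l']
```

Each returned function carries its own command value, which is why one bridge model can serve several platforms.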

The second list is made up of a single text literal: "-t". This command-line flag tells lex to write the C source it generates to standard output rather than to a file. A little later we'll see that the bridge captures the standard output of lex and returns it in its result.

The third list is generated by a call to the function ./generic/binding_values, passing it ./lex/switches. (Remember, dot is usually a binding containing the different bridges and their associated data, constructed by the std_env model. The expression ./lex/switches is a nested binding lookup: it looks up the value bound to switches in the binding bound to lex in the binding assigned to dot.) This is a common mechanism in bridges allowing users to specify arbitrary additional command-line arguments to pass to a tool. ./lex/switches is a binding where each bound value is of type text. ./generic/binding_values takes a binding and returns a list of all the values from the name/value pairs making up the binding. In other words, this expression turns a binding of named command-line arguments into a list of unnamed command-line arguments. For example, if you wanted to pass the -v flag to lex (which requests a summary of the generated finite state machine statistics), you might do something like this before invoking the lex bridge:

. ++= [ lex/switches/stats = "-v" ];

Then, when the lex bridge was invoked, it would gather your flag and any others in ./lex/switches up into the list it's storing in cmd through ./generic/binding_values. (If you're interested, you can find the definition of the function ./generic/binding_values in the package /vesta/vestasys.org/bridges/generics.)

The last list appended into the value for cmd is the name of the input file. This is obtained by applying the _n primitive function to the in_file argument.
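To see how the four pieces combine, here is a rough Python model of the command-line construction. It is a sketch under stated assumptions rather than the bridge's actual code: dictionaries stand in for bindings, Python lists for Vesta lists, the binding_values function below is a guess at the behavior described above (not the definition from the generics package), and the file contents are placeholders.

```python
def binding_values(binding):
    """Return the values of a binding, dropping the names
    (assumed behavior of ./generic/binding_values)."""
    return list(binding.values())

def name_of(singleton_binding):
    """Model of the _n primitive: the name in a one-element binding."""
    (name,) = singleton_binding.keys()
    return name

command = "/usr/bin/lex"                  # bridge parameter
switches = {"stats": "-v"}                # ./lex/switches after the overlay above
in_file = {"lexer.l": "<lex source>"}     # a NamedFile: one name, one file

# Python's + on lists plays the role of Vesta's list concatenation.
cmd = [command] + ["-t"] + binding_values(switches) + [name_of(in_file)]
print(cmd)  # ['/usr/bin/lex', '-t', '-v', 'lexer.l']
```

Note that the name "stats" never reaches the command line. It only gives the switch a handle, so that a later overlay can replace or refer to it.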

Preparing the Filesystem

The next part of the body of the lex function deals with preparing the filesystem to be used by the tool invocation:

51    // augment "." to include the lex executable and files required by lex
52    . ++= [ root ];
53
54    // add "in_file" to working directory
55    . ++= [ root/.WD = in_file ];

It's important to remember the principle of filesystem encapsulation here. When we invoke the lex executable by running the command line we've built up in cmd, the binding stored in ./root will completely define the entire filesystem the tool sees. This bridge depends on its caller (normally the std_env model) to provide a filesystem complete with the lex executable, as well as any other support files it needs (such as shared libraries like the C run-time library). In order to run lex and process the input file, we need to combine the bridge parameter root and the lex function parameter in_file to make the filesystem which will be used.

The statement on line 52 takes the bridge parameter root and merges it into ./root. Note that the recursive overlay assignment operator is used, which will merge the binding on the right side with the current value of dot. This leaves any other files and directories already in the filesystem defined by ./root in place. The bridge could use the non-recursive overlay operator here, relying on the bridge parameter root to include absolutely everything necessary for running lex. However, as it's written, the std_env model can place files that nearly every tool needs in ./root and have tools like lex inherit them by default.

On line 55, the input file is placed in the working directory. in_file is a binding, and here we again use the recursive overlay assignment operator to merge it into ./root/.WD. (This is the default working directory when running a tool, but a different working directory can be specified with an option when invoking the _run_tool primitive function.) Placing this file in the working directory for the tool invocation means that it can be referred to with just a filename, which is exactly how it is specified in the command line generated in cmd.
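The difference between the two overlay operators is easiest to see on a small example. The following Python sketch models bindings as nested dictionaries. The functions overlay and recursive_overlay are illustrative stand-ins for the + and ++ operators, and the file names and placeholder contents are invented.

```python
def overlay(left, right):
    """Model of the simple overlay: names in right replace names in left."""
    return {**left, **right}

def recursive_overlay(left, right):
    """Model of the recursive overlay: where both sides bind a name to a
    binding, the two bindings are merged instead of one replacing the other."""
    result = dict(left)
    for name, value in right.items():
        if isinstance(value, dict) and isinstance(result.get(name), dict):
            result[name] = recursive_overlay(result[name], value)
        else:
            result[name] = value
    return result

# dot as std_env might leave it: ./root already holds a shared file.
dot = {"root": {"etc": {"passwd": "<file>"}}}
root = {"usr": {"bin": {"lex": "<executable>"}}}   # bridge parameter
in_file = {"lexer.l": "<lex source>"}              # function argument

simple = overlay(dot, {"root": root})              # would drop ./root/etc

dot = recursive_overlay(dot, {"root": root})              # . ++= [ root ];
dot = recursive_overlay(dot, {"root": {".WD": in_file}})  # . ++= [ root/.WD = in_file ];

print(sorted(simple["root"]))  # ['usr']
print(sorted(dot["root"]))     # ['.WD', 'etc', 'usr']
```

With the simple overlay, the shared files under ./root/etc disappear, which is exactly what the bridge avoids by using the recursive form.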

Running the Tool

After setting up the command-line and the filesystem for the tool invocation, the next step is actually running lex:

57    // invoke lex
58    r = _run_tool(./target_platform, cmd,
59      /*stdin=*/ "", /*stdout_treatment=*/ "value");

The first parameter to _run_tool tells Vesta what platform this tool should be executed on. This is a text value that's used to look up a list of hosts that can be used to run the tool in the Vesta configuration file. A typical value is "Linux2.4-i386" (meaning an Intel 386 or higher running Linux kernel version 2.4 or higher). (The vesta(1) man page describes more about how _run_tool host selection works.) The value of ./target_platform is used, which is normally set up by std_env and is uniform across an entire build (which is what you would expect).

The second parameter is just the command-line which was built up earlier.

Those are in fact the only two parameters required by _run_tool. While it can accept up to 10 parameters (11 if you count the implicit parameter providing the value for dot), all but the platform and the command-line have default values.

The third parameter specifies the standard input for the tool invocation. Since files and text values are interchangeable, this is just a text value. In this case, the same value as the default is provided (the empty string). This is necessary, because we're going to provide a non-default value for the fourth parameter. Since the evaluator's defaulting mechanism has the same semantics as C++ (only omitted trailing parameters get default values), a value must be given for the third parameter in order to get to the fourth.
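The same trailing-defaults rule can be tried out in Python, whose positional-only parameters (the / marker, Python 3.8 or later) behave in a comparable way. This is only an analogy: run_tool_model is a hypothetical stand-in that models just the first four parameters of _run_tool and does not run anything.

```python
def run_tool_model(platform, command, stdin="", stdout_treatment="report", /):
    """Hypothetical stand-in for _run_tool. It only records which value
    ended up in which parameter."""
    return {"platform": platform, "command": command,
            "stdin": stdin, "stdout_treatment": stdout_treatment}

cmd = ["/usr/bin/lex", "-t", "lexer.l"]

# Both trailing parameters omitted: both take their defaults.
defaults = run_tool_model("Linux2.4-i386", cmd)

# To reach the fourth parameter, the third must be spelled out,
# even though the value given is the same as its default.
captured = run_tool_model("Linux2.4-i386", cmd, "", "value")

# Skipping the third parameter silently does the wrong thing:
# "value" lands in stdin and stdout_treatment keeps its default.
mistake = run_tool_model("Linux2.4-i386", cmd, "value")

print(captured["stdout_treatment"])  # value
print(mistake["stdin"])              # value
```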

The fourth parameter tells _run_tool what to do with the standard output of lex. The default value is "report", which means "make the output visible to the user by sending it to the standard output of the evaluator". We override this behavior by passing "value", which means "capture the standard output as a text value and bind it to the name stdout in the binding returned by _run_tool". (There are several other possible values for the stdout_treatment parameter.) The important thing to note is that we told lex to send the C code it generates to standard output (with the -t command-line flag) and that will be captured as a text value.
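As a loose analogy for the two treatments discussed here, the Python sketch below runs a child process and either lets its output flow through to the caller's standard output ("report") or captures it as a text value ("value"). The function run_tool_model is hypothetical, the other treatments are not modeled, and the assumption that the result has no stdout entry unless "value" is requested is mine rather than something stated above. A Python one-liner stands in for lex so the example runs anywhere.

```python
import subprocess
import sys

def run_tool_model(cmd, stdout_treatment="report"):
    """Run cmd. Capture its standard output only when stdout_treatment is
    "value"; otherwise let it go to our own standard output."""
    capture = stdout_treatment == "value"
    proc = subprocess.run(cmd,
                          stdout=subprocess.PIPE if capture else None,
                          text=True)
    result = {"code": proc.returncode}
    if capture:
        result["stdout"] = proc.stdout
    return result

# Stand-in for "lex -t": a command that writes generated code to stdout.
fake_lex = [sys.executable, "-c", "print('int yylex(void);')"]

r = run_tool_model(fake_lex, "value")
print(r["code"], repr(r["stdout"].strip()))
```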

Preparing the Tool Result

After the tool is invoked, the final step is to prepare the result of the lex function:

61    // construct return result
62    return if _assert(r != ERR && r/code == 0 && r/signal == 0,
63                      "lex failed")
64           then [ $(output_name + ".c") = r/stdout ]
65           else ERR;

The first thing to note about this is that it is a single return statement with a complicated if expression.

The test in the if expression is a call to the _assert primitive function. If its first argument is TRUE, _assert simply returns TRUE. Otherwise, it halts the evaluation with a run-time error, printing its second argument. The first argument is a boolean expression which checks three things: that the tool was invoked successfully (r != ERR), that the tool itself didn't return a non-zero exit status (indicating an error), and that it wasn't terminated by a signal (such as that produced by a segmentation fault). (The invocation itself can fail if there is no known machine matching the target platform specified to _run_tool, or if the command-line cannot be invoked due to a problem like the executable not being present in ./root, the executable not having the right file format, or a needed shared library not being present in ./root.) If _run_tool doesn't return ERR, it returns a binding with details of the result of the tool invocation. The tool exit status is stored in r/code, and the signal that terminated the process (or 0 if the process exited voluntarily) is stored in r/signal. If any of these three possible error conditions is detected (the tool couldn't be invoked, the tool returned a non-zero exit status, or the tool was terminated by a signal), _assert will halt the evaluation.

To be syntactically correct, an else clause is still required in the if expression. The function returns ERR in that case (which will never be reached in normal usage).

Assuming that there wasn't an error, the lex function returns a singleton binding. The name in the binding is derived from the specified output file name (from the output_name parameter) with an extension added indicating that it is a C source file (".c"). This uses a binding syntax which allows the bound name to be generated by an expression returning a text value. The value of the binding is the standard output of the lex command, which was stored in r/stdout because we requested it by passing "value" for the stdout_treatment parameter of _run_tool.
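The logic of this return statement can be restated as a short Python model. It is a sketch only: ERR is represented by a sentinel object, an exception stands in for _assert halting the evaluation, a dictionary stands in for the binding returned by _run_tool, and lex_result is a hypothetical name.

```python
ERR = object()  # stand-in for the evaluator's ERR value

def lex_result(r, output_name="lex.yy"):
    """Model of the final return statement of the lex function."""
    # "and" short-circuits, so r is never indexed when it is ERR.
    ok = r is not ERR and r["code"] == 0 and r["signal"] == 0
    if not ok:
        # _assert halts the evaluation; an exception plays that role here.
        raise RuntimeError("lex failed")
    # The bound name is computed: output_name plus the ".c" extension.
    return {output_name + ".c": r["stdout"]}

good = {"code": 0, "signal": 0, "stdout": "/* generated C */"}
print(lex_result(good, "lexer"))  # {'lexer.c': '/* generated C */'}
```

Calling lex_result with ERR, with a non-zero code, or with a non-zero signal raises the exception, mirroring the three error conditions listed above.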

The Bridge Result

The final piece of the lex bridge model prepares its own result. (Remember: every system model is itself a function.)

68    // The bridge result itself --------------------------------------------------
69
70    bridge_name = "lex";
71    switches = [];
72    return [ $bridge_name = [ lex, switches ] ];

The result is a singleton binding, which binds the name "lex" to another binding which contains everything the lex bridge needs to operate correctly. Within the return value:

lex/lex is the central lex function which compiles a lex source file and returns the generated C source file.
lex/switches is an empty binding. This can be modified by a model invoking ./lex/lex to add flags to the lex command-line. It is referenced as ./lex/switches when building the lex command-line.

Normally, the lex bridge model is used by a model such as std_env which collects together many bridges and makes them available in dot. In fact, the lex bridge assumes that its result will be merged into dot before it is invoked. (It expects to be able to reference part of the result through ./lex/switches.) The std_env models for most Linux platforms use a process something like this to build the value which other system models use for dot:

from /vesta/vestasys.org/bridges import
  lex/0;
  // ...import other bridges, libraries, etc....
{
  // Initialize dot to the empty binding
  . = [];

  // Merge the lex bridge into dot
  . ++= lex([ command = "/usr/bin/lex",
              root = ./build_root(<"flex", "glibc">) ]);
  
  // ...add other bridges, libraries, etc....

  // Return dot with all the bridges, libraries, etc. included
  return .;
}

Here the function ./build_root combines root filesystem pieces taken from OS component packages (e.g. RPMs or Debian packages). In this example, all that's needed to run lex is the "flex" package (which contains /usr/bin/lex) and the "glibc" package (which contains the C run-time library, which is needed to run the lex executable).

(Of course the above is a simplification of what actually appears in the real std_env model, but it's essentially the same.)

How it All Works

For the sake of review, let's look at how all this fits together in a real use of the lex bridge. Let's suppose we have a build package which uses the lex bridge. It has a top-level model linux_i386.main.ves and a build.ves file. The linux_i386.main.ves model imports std_env to generate dot, makes a modification to it, and then invokes build.ves to do the real work of the build:

import
  self = build.ves;
from /vesta/vestasys.org/platforms/linux/redhat/i386 import
  std_env/8;
{
  // Construct the standard environment
  . = std_env()/env_build();

  // Have lex generate a "fast" scanner
  . ++= [ lex/switches/fast = "-F" ];

  return self();
}

Note that in the value of dot from std_env, ./lex/switches is an empty binding. (This comes directly from the result of the lex bridge model.) We augment that on line 10 by adding one name/value pair to it with the recursive overlay assignment operator. We know that the lex bridge will pick this up when building the command-line. (It's best to do this sort of thing in a platform-specific top-level model, as that avoids putting potentially platform-specific command-line options in a model that could be used for multiple platforms, like the build.ves below.)

Once we've generated and augmented our value for dot, it will be implicitly passed into this package's build.ves model when we call it on line 12. Now let's look inside that model:

files
  lex_src = [ lexer.l ];
  c_src = [ main.c ];
{
  // Invoke lex
  lex_c_src = ./lex/lex(lex_src, "lexer");

  // libraries to link against
  c_libs = ./libs/c;
  libs = < c_libs/lex, c_libs/libc >;

  // Compile the program
  return ./C/program("foo", c_src + lex_c_src, /*headers=*/ [], libs);
}

On line 6, the lex bridge is invoked to process a lex source file (lexer.l, brought in through the files clause on line 2). The central function is ./lex/lex, right where the bridge result placed it. The result of invoking it is stored in the variable lex_c_src. Assuming there were no errors, this will hold a singleton binding, with the name "lexer.c" mapping to the C source produced by lex. Later, on line 13, lex_c_src is combined with c_src (which contains main.c, brought in through the files clause) with the binding overlay operator. The result of this expression is a binding of two C source files, which is passed to the C bridge function program to build a complete executable.

When the lex bridge is invoked, our modified value for dot is implicitly passed into it. Our addition to ./lex/switches is used to generate the lex command line.
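The whole flow can be traced in a small Python model. This is a simplified sketch, not real Vesta: dictionaries stand in for bindings, dot is passed explicitly because Python has no implicit parameter, the hypothetical lex_bridge function stops at building the command line instead of calling _run_tool, and recursive_overlay models the ++ operator.

```python
def recursive_overlay(left, right):
    """Model of ++: merge nested bindings, with right taking precedence."""
    result = dict(left)
    for name, value in right.items():
        if isinstance(value, dict) and isinstance(result.get(name), dict):
            result[name] = recursive_overlay(result[name], value)
        else:
            result[name] = value
    return result

def lex_bridge(params):
    """Model of the bridge model: specialize, then return the bridge result."""
    command = params["command"]            # captured when lex is defined

    def lex(dot, in_file, output_name="lex.yy"):
        # dot is supplied at call time, so switches added later are seen.
        switches = list(dot["lex"]["switches"].values())
        (in_name,) = in_file.keys()
        return [command, "-t"] + switches + [in_name]

    return {"lex": {"lex": lex, "switches": {}}}

# std_env: merge the bridge result into dot.
dot = recursive_overlay({}, lex_bridge({"command": "/usr/bin/lex", "root": {}}))

# Top-level model: . ++= [ lex/switches/fast = "-F" ];
dot = recursive_overlay(dot, {"lex": {"switches": {"fast": "-F"}}})

# build.ves: ./lex/lex(lex_src, "lexer")
cmd = dot["lex"]["lex"](dot, {"lexer.l": "<lex source>"}, "lexer")
print(cmd)  # ['/usr/bin/lex', '-t', '-F', 'lexer.l']
```

The sketch makes the two sources of information visible: command comes from the closure created when std_env instantiated the bridge, while the -F switch comes from the value of dot in effect when the function is called.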

Kenneth C. Schalk <ken@xorian.net> / Vesta SDL Programmer's Reference