Background

See MergingFuture/Food4Thought/NonTextFiles for an introduction to the problem.

The Vesta merge prototype known as vmerge (which can be found in the package /vesta/beta.vestasys.org/vesta/merging) has been through at least three stages in how it identifies and handles different kinds of files. It's continuing to evolve.

Approaches Used (so far)

At first, vmerge treated every file as a text file and applied its text merge algorithm to combine parallel edits. The obvious problem is that this can produce unusable results when applied to non-text files. Binary files are more common in Vesta than in most other version control systems because Vesta stores the entire tool-chain used during builds, which makes the problem more serious for Vesta than it is elsewhere.

A set of changes contributed by BrannonBatson (see merging/5.batson_vmerge_2) incorporated in January 2008 (see merging/10) included a first cut at an automatic test for non-text files. It used the file(1) command with the -i flag to have it return a mime type. Here's the exact Python code which implemented the test:

def istext(path):
    # use the UNIX 'file' command to determine if a file is text
    find_pipe = os.popen("file -ib "+path, 'r')
    ret = find_pipe.readline()
    find_pipe.close()
    return (ret[0:4] == "text")

This was later identified as a problem for some users at Intel, whose text files were being identified as binaries. As a result the text file merge algorithm was not applied, leaving the user to compare the two files and incorporate changes manually. To try to avoid this, we improved the test by running file(1) twice, once with -i and once without. (The two invocations use different sets of criteria for determining the file type.) This was implemented in early April 2008 (see merging/18). This version also included an "escape hatch" configuration setting: users for whom the default test did not work could supply their own test that understood all their file types and decided which merge method to apply in whatever way they preferred. The default command line used was:

file -b $FILE | grep -qi text || file -ib $FILE | grep -qi '^text/'

Users can configure any alternate command, even a script or other program of arbitrary complexity, to make the determination of whether a file should be treated during merging as a text file or a binary file.

Proposal for Improvements (May 2008)

In order to further improve the ability of vmerge to treat files in the way users expect, we propose moving to a tiered system. The following methods of determining how a file should be treated will be tried in order:

  1. Explicit specification. Attributes on the enclosing package will be able to specify how an individual file should be treated.

  2. Guess from filename. A set of filename patterns may be provided in a configuration file with a corresponding file type for each.

  3. User-configured guess. A command may be configured by the user to guess the type of file.

  4. Default guess based on contents. An internal simple check of file contents to choose between treating a file as text or binary.

Each of these will be described in more detail below.

First let's state that we'll use the strings "text" and "binary" to refer to the two primary ways of treating files during a merge. Files of type "text" will be merged line-by-line using Bram Cohen's precise Codeville merge algorithm (which vmerge has been using for some time). Files of type "binary" will be treated as indivisible pieces, with a conflict reported whenever changes are made on both sides.

The important thing is that we want to leave open the possibility of other strings denoting other file types handled by other merge algorithms. While we don't have a plug-in API for vmerge today, we intend to support such a thing in the future. (The diffxml and patchxml tools provide one example of how we might merge changes to other file formats.)
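The tiered lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names choose_type, KNOWN_TYPES and the stand-in guessers are hypothetical, and the final fall-back to "binary" when every tier declines is an assumption, not something the proposal specifies.

```python
from typing import Callable, Optional, Sequence, Tuple

# A guesser takes a path and returns a type string, or None for "no opinion".
Guesser = Callable[[str], Optional[str]]

# The set of recognized type strings; it could grow if other merge
# algorithms are added later.
KNOWN_TYPES = {"text", "binary"}


def choose_type(path: str, tiers: Sequence[Tuple[str, Guesser]]) -> Tuple[str, str]:
    """Return (file type, name of the tier that decided)."""
    for name, guess in tiers:
        result = guess(path)
        if result in KNOWN_TYPES:
            return result, name
    # Assumption: if even the last tier declines, use "binary", the
    # treatment that never rewrites file contents.
    return "binary", "fallback"


# Stand-in guessers for illustration only (hypothetical data).
explicit = {"src/foo.w": "text", "bin/bar": "binary"}
tiers = [
    ("explicit", explicit.get),
    ("pattern", lambda p: "text" if p.endswith(".ves") else None),
    ("user-supplied guess", lambda p: None),
    ("internal guess", lambda p: "binary"),
]

print(choose_type("src/foo.w", tiers))      # decided by the explicit tier
print(choose_type("build.ves", tiers))      # decided by the pattern tier
print(choose_type("lib/libbar.so", tiers))  # falls through to the internal guess
```

Returning the name of the deciding tier alongside the type keeps enough information around to explain to the user why a file was treated a particular way.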

Explicit Specification

To state that the file foo/bar.x should be treated as type Y, the user would add the value "foo/bar.x" to the "merge-as-Y" attribute on the package. To do this with vattrib, they would use a command like:

vattrib -a merge-as-Y foo/bar.x /vesta/example.com/dir/pkg

To be more concrete, suppose file src/foo.w is a text file and bin/bar is a binary. Those explicit specifications could be recorded with these commands:

vattrib -a merge-as-text src/foo.w /vesta/example.com/dir/pkg
vattrib -a merge-as-binary bin/bar /vesta/example.com/dir/pkg

To make this a little easier for the user, we will provide a command-line option to vmerge that will ask it to add these file type attributes for specified files. When run in a checkout working directory or an immutable version, it will find the enclosing package and add the attributes there. For example, the user could type commands such as:

vmerge --learn-text src/*.w
vmerge --learn-binary bin/*

These would add attribute values for all files in the current directory matching those patterns, setting them on the package to which the current directory belongs.

Finding The Attributes

  1. The algorithm to search for attributes specifying how files should be treated will start from either a checkout working directory or an immutable version.
    • If the starting directory has type mutableDirectory, it will go to the parent directory of the old-version attribute if it has one, or the parent directory of the new-version attribute if it has one, or the parent directory of the session-dir attribute if it has one. If the starting mutableDirectory has none of these attributes, then we have no explicit specification attributes.

    • If the starting directory has type immutableDirectory, it will go to its parent directory.

  2. Now the algorithm should have a current directory of type appendableDirectory. Loop until there is no current directory:

    1. While the current directory does not have package in its type attribute, set the current directory to its parent directory. If we reach the root of the repository, we have no current directory and exit the loop.

    2. For every known file type X ("text" and "binary" initially), get the merge-as-X attribute of the package. Remember these associations in a data structure mapping from relative path to file type.

      • When remembering a type specification for a file, ignore it if we already have a type specification for that file. This allows explicit specifications on branches to shadow those on enclosing packages.
      • If a file appears in more than one such attribute on the same directory we should warn the user but just pick one type.
    3. If the current directory (a package) also has branch in its type attribute and has an old-version attribute, set the current directory to the parent directory of the old-version attribute value and continue. (This proceeds from branches to their base packages looking for file type specifications.)

This could be implemented as one function that returns a list of directories to check for the attributes and another that builds the data structure of file type associations based on that list of directories. Splitting it this way would probably be more useful for debugging purposes.
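A minimal sketch of that two-function split follows. It does not talk to a real repository: a plain dictionary (the hypothetical Repo type) stands in for directory and attribute queries, and the names first_attr, dirs_to_check, type_map and ROOT are made up for illustration. It also assumes that the loop ends once it reaches a package that is not a branch, and that only the first value of an attribute such as old-version matters.

```python
import posixpath
from typing import Dict, List, Optional

# Hypothetical in-memory stand-in for the repository: path -> directory record.
# "kind" is the directory type; "attrs" maps attribute name -> list of values.
Repo = Dict[str, dict]
ROOT = "/vesta"


def first_attr(repo: Repo, path: str, name: str) -> Optional[str]:
    values = repo.get(path, {}).get("attrs", {}).get(name, [])
    return values[0] if values else None


def dirs_to_check(repo: Repo, start: str) -> List[str]:
    """Packages to consult for merge-as-X attributes, nearest first."""
    if repo[start]["kind"] == "mutableDirectory":
        for name in ("old-version", "new-version", "session-dir"):
            target = first_attr(repo, start, name)
            if target is not None:
                current = posixpath.dirname(target)
                break
        else:
            return []  # none of the attributes: no explicit specifications
    else:  # immutableDirectory
        current = posixpath.dirname(start)

    found: List[str] = []
    while current is not None:
        # Climb until we reach a package, or give up at the root.
        while "package" not in repo.get(current, {}).get("attrs", {}).get("type", []):
            parent = posixpath.dirname(current)
            if current == ROOT or parent == current:
                return found
            current = parent
        found.append(current)
        # From a branch, continue at the package it was branched from.
        base = first_attr(repo, current, "old-version")
        is_branch = "branch" in repo[current]["attrs"]["type"]
        current = posixpath.dirname(base) if (is_branch and base) else None
    return found


def type_map(repo: Repo, packages: List[str], known=("text", "binary")) -> Dict[str, str]:
    """Map relative path -> file type; earlier packages shadow later ones."""
    mapping: Dict[str, str] = {}
    for pkg in packages:
        local: Dict[str, str] = {}
        for ftype in known:
            for rel in repo[pkg]["attrs"].get("merge-as-" + ftype, []):
                if rel in local and local[rel] != ftype:
                    print(f"warning: {rel} has conflicting types on {pkg}")
                    continue  # keep the first type seen
                local[rel] = ftype
        for rel, ftype in local.items():
            mapping.setdefault(rel, ftype)
    return mapping


# Made-up example: a branch that overrides one file of its base package.
repo = {
    "/vesta/example.com/dir/pkg": {
        "kind": "appendableDirectory",
        "attrs": {"type": ["package"],
                  "merge-as-text": ["src/foo.w"],
                  "merge-as-binary": ["bin/bar"]},
    },
    "/vesta/example.com/dir/pkg/3.fix": {
        "kind": "appendableDirectory",
        "attrs": {"type": ["package", "branch"],
                  "old-version": ["/vesta/example.com/dir/pkg/3"],
                  "merge-as-binary": ["src/foo.w"]},
    },
    "/vesta/example.com/dir/pkg/3.fix/1": {"kind": "immutableDirectory", "attrs": {}},
}

packages = dirs_to_check(repo, "/vesta/example.com/dir/pkg/3.fix/1")
print(packages)
print(type_map(repo, packages))
```

Because dirs_to_check returns the branch before its base package, the setdefault call in type_map is what lets a specification on the branch shadow one on the enclosing package.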

When setting the attributes for the user, we should set them on the final current directory in the loop above (i.e. the main package, not a branch).

Ideally we should build the data structure with file type associations once for the working directory being modified and once for the immutable version being merged in. We don't want to perform this search once for each file.

Guess From Filename

We'll add a new vesta.cfg section named "Merging:FileTypePatterns". The name of each setting will be a glob(7)-style shell pattern for matching filenames. The value of each setting will be a type string (i.e. "text" or "binary"). Here's a simple example:

[Merging:FileTypePatterns]
*.[cChH] = text
*.[ch]pp = text
*.[ch]xx = text
*.gz = binary
{foo,bar}*.x = binary
*.xml = binary
*.xsl = text

The order in which patterns are checked against a filename is not defined. If a filename matches multiple patterns whose values are different file type strings, the file will be treated as one of those types, but which one is unspecified.

Note that because of the vesta.cfg file syntax, the patterns cannot begin with an open square bracket and cannot contain embedded equals signs.
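As a rough illustration of pattern-based guessing, here is a sketch using Python's fnmatch module. The names PATTERNS and guess_from_name are hypothetical, and matching against only the last path component is an assumption. Note that fnmatch does not perform brace expansion, so a pattern such as {foo,bar}*.x from the example above would need extra handling that this sketch leaves out.

```python
import fnmatch
import posixpath
from typing import Dict, Optional

# Patterns as they might be read from [Merging:FileTypePatterns].
PATTERNS: Dict[str, str] = {
    "*.[cChH]": "text",
    "*.[ch]pp": "text",
    "*.gz": "binary",
    "*.xsl": "text",
}


def guess_from_name(path: str, patterns: Dict[str, str] = PATTERNS) -> Optional[str]:
    """Return the type for the first matching pattern group, or None."""
    name = posixpath.basename(path)  # assumption: match the last component only
    matches = {ftype for pat, ftype in patterns.items()
               if fnmatch.fnmatchcase(name, pat)}
    if not matches:
        return None
    if len(matches) > 1:
        print(f"warning: {path} matches patterns with different types")
    # Patterns have no defined order, so any choice is allowed here;
    # sorting just makes the choice repeatable.
    return sorted(matches)[0]


print(guess_from_name("src/main.c"))
print(guess_from_name("dist/pkg.tar.gz"))
print(guess_from_name("README"))
```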

User-Configured Guess

A new optional vesta.cfg setting [Merging]guess_file_type will allow the user to specify a command which can provide a guess of the file type. This can use any method they like, but will probably be used primarily for tests based on the file contents.

The command will be executed through a shell, so simple in-line commands will be possible. The path to the file whose type is to be guessed will be passed in the FILE environment variable. The command should print a file type string (i.e. "text" or "binary") as a single line of output and exit with successful status if it has determined the type of the file. If the command exits with error status, produces no output, or prints a string that is not a recognized file type, it will be treated as not providing a guess and vmerge will fall back on the next method of determining how the file should be treated.

As an example, this might be a useful setting for people working with XML files generated by a program rather than edited by a human:

[Merging]
guess_file_type = file $FILE | grep -q XML && echo binary

We'll deprecate the existing configuration setting [Merging]is_text_check in favor of this more flexible alternative.
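The contract for the guess command could be honored along these lines. This is a sketch, not the vmerge implementation: run_guess_command and KNOWN_TYPES are hypothetical names, the timeout is an added assumption, and treating output of more than one line as "no guess" is one possible reading of "a single line of output". It assumes a POSIX shell is available.

```python
import os
import subprocess
from typing import Optional

KNOWN_TYPES = {"text", "binary"}


def run_guess_command(command: str, path: str, timeout: float = 30.0) -> Optional[str]:
    """Run the configured command; return a recognized type or None."""
    env = dict(os.environ, FILE=path)  # the path is passed in $FILE
    try:
        proc = subprocess.run(
            command,
            shell=True,  # executed through a shell, as the proposal states
            env=env,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            text=True,
            timeout=timeout,  # assumption: do not wait forever on a stuck command
        )
    except subprocess.TimeoutExpired:
        return None
    if proc.returncode != 0:
        return None  # error status: no guess
    lines = proc.stdout.splitlines()
    if len(lines) != 1:
        return None  # no output, or more than one line: no guess (assumption)
    guess = lines[0].strip()
    return guess if guess in KNOWN_TYPES else None


# A command that only has an opinion about .xml files.
cmd = 'case "$FILE" in *.xml) echo binary ;; *) exit 1 ;; esac'
print(run_guess_command(cmd, "src/bar.xml"))
print(run_guess_command(cmd, "src/foo.w"))
```

Returning None in every "no guess" case lets the caller fall through to the next method without distinguishing why the command declined.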

Default Guess Based on Contents

To try to avoid scrambling binary files, as a final fall-back vmerge will use its own method of determining whether a file is likely to be a text file appropriate for merging using Bram Cohen's precise Codeville merge algorithm.

The standard commands diff(1) and grep(1) automatically determine whether a file is text or binary before printing comparison or search results by checking for any null (zero) bytes within the first several kilobytes of the file. We may use that method, or we may use a similar method that excludes other obscure control characters that don't typically occur in text files.
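Both variants are easy to sketch in Python. The function names, the 8 KB sample size, and the particular set of control characters tolerated by the stricter variant (tab, newline, carriage return, form feed) are assumptions chosen for illustration, not values taken from diff(1), grep(1), or vmerge.

```python
SAMPLE_SIZE = 8192  # assumption: how much of the file to inspect

# Control bytes below 0x20 that are not expected in ordinary text.
# Tab, newline, carriage return and form feed are tolerated (assumption).
SUSPECT = bytes(b for b in range(32) if b not in b"\t\n\r\f")


def read_sample(path: str, size: int = SAMPLE_SIZE) -> bytes:
    with open(path, "rb") as f:
        return f.read(size)


def looks_like_text(path: str) -> bool:
    """Null-byte test: text unless the sample contains a zero byte."""
    return b"\0" not in read_sample(path)


def looks_like_text_strict(path: str) -> bool:
    """Stricter test: also reject other unusual control characters."""
    sample = read_sample(path)
    # translate() with a delete set removes suspect bytes; if nothing was
    # removed, the sample contained none of them.
    return sample.translate(None, SUSPECT) == sample
```

With either function, an empty file counts as text, since it contains nothing that looks binary.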

Scripts implementing two possible such tests are attached to this page.

File Treatment Check

We'll add another flag to vmerge which will tell you the way it will treat a file and why it chose that method:

% vmerge --check-type src/foo.w
text, explicit (in merge-as-text attribute)

It will work for multiple files:

% vmerge --check-type src/*.w
src/foo.w : text, explicit (in merge-as-text attribute)
src/bar.w : text, pattern (matches "*.w")

And without arguments it will check every file in the current directory:

% vmerge --check-type
src/foo.w : text, explicit (in merge-as-text attribute)
src/bar.w : text, pattern (matches "*.w")
src/build.ves : text, pattern (matches "*.ves")
src/bar.xml : binary, user-supplied guess ([Merging]guess_file_type in vesta.cfg)
bin/bar : binary, explicit (in merge-as-binary attribute)
lib/libbar.so : binary, internal guess
build.ves : text, pattern (matches "*.ves")