How To Implement Advanced Merging For Vesta
The Vesta repository is an appendable namespace of immutable objects. Without fundamental modifications to the repository, this suggests that whatever metadata we need to store to support history-aware merging would be stored as immutable objects in the repository.
For example, suppose that we decided to use a representation based on patches/changesets. For a version N, we might store the representation of the changes relative to the version it was based upon (presumably N-1) in a path like changesets/N within the same directory (a package, session, or branch). The obvious benefit of doing this is that this additional data would get all the properties of normal sources:
- Immutable storage
- Immortal names
- Replication to peer repositories
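As a rough illustration of the layout described above, the following Python sketch models the repository as a toy append-only store and writes a changeset next to each new version. The names AppendOnlyStore and check_in, and the choice of a unified diff as the changeset format, are assumptions made for this example; none of them are part of Vesta.

```python
import difflib


class AppendOnlyStore:
    """Toy stand-in for an appendable namespace of immutable objects."""

    def __init__(self):
        self._objects = {}

    def add(self, path, data):
        # Names can be added but never overwritten or removed.
        if path in self._objects:
            raise ValueError(f"{path} already exists and is immutable")
        self._objects[path] = data

    def get(self, path):
        return self._objects[path]


def check_in(store, directory, n, text, base=None):
    """Store version n and, if it has a basis, a changeset at changesets/n."""
    store.add(f"{directory}/{n}", text)
    if base is not None:
        old = store.get(f"{directory}/{base}")
        # Hypothetical changeset format: a unified diff against the basis.
        delta = "".join(difflib.unified_diff(
            old.splitlines(keepends=True),
            text.splitlines(keepends=True),
            fromfile=str(base), tofile=str(n)))
        store.add(f"{directory}/changesets/{n}", delta)
```

Because the changeset is written once, at the same time as the version it describes, it never needs to be modified afterwards, which is what makes this layout compatible with immutable storage.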
Unfortunately, immutability is somewhat at odds with some of the different methods used by other systems:
- Darcs uses the rules set out in its theory of patches to perform patch commutation. This means taking an existing change and modifying it for application in a new context, which essentially creates a new, distinct change.
- The weave data structure, which is used by SCCS and BitKeeper, must be updated with each new revision.
Some other methods are more compatible with the Vesta repository's method of an appendable store of immutable objects:
- CVS and RCS both store history in files that are logically appendable logs of deltas (changes from one revision to another), though the files are not physically organized that way.
- I believe GNU Arch and some derivatives may also use an appendable store for changesets, but I haven't investigated closely.
Potential issues with storing deltas/changesets or some other representation of fine-grained file history as sources in the repository in this way include:
- Data bloat: most formats repeat a subset of the data stored in the source files and add further data of their own. This could more than double the amount of storage space used for source files.
- Shortid consumption: All files in the Vesta repository are identified by a 32-bit number called a shortid. Storing more data in separate files with each change will increase the rate at which shortids are consumed. (Maybe this shouldn't really be a concern, as some of the largest repositories which have been running for many years have under 2 million shortids, which leaves 2-3 orders of magnitude in remaining headroom.)
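The shortid headroom can be checked with a little arithmetic. This sketch assumes the whole 32-bit range is available for files; if part of the range is reserved, the real headroom is lower. The function names and the consumption rate used in the usage note are hypothetical.

```python
import math

SHORTID_BITS = 32  # from the text: shortids are 32-bit numbers


def shortid_headroom(used, bits=SHORTID_BITS):
    """Return (factor, orders_of_magnitude) between total and used shortids."""
    total = 2 ** bits
    factor = total / used
    return factor, math.log10(factor)


def years_until_exhausted(used, new_per_year, bits=SHORTID_BITS):
    """Years left at a constant (assumed) rate of shortid consumption."""
    return (2 ** bits - used) / new_per_year
```

With 2 million shortids in use, shortid_headroom(2_000_000) gives a factor of a little over 2,000, or roughly three orders of magnitude. Even if extra changeset files doubled the consumption rate, years_until_exhausted shows that exhaustion stays far off for any plausible rate.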
Should the Repository be Modified to Support Merging?
It would of course be possible to change the underlying storage format of the repository. Some changes may turn out to be necessary to deal with a number of tricky merging issues:
- Determining the ancestry graph which relates different versions is currently done with the old-version attributes, but those can't really be trusted as they're mutable. Internally, the repository does know which immutable directory each one is based on. A new representation which exposes that in a more accessible manner to client code could be very helpful.
- One of the better solutions for tracking renames, so that merging works across them, is to assign each file a unique ID which follows it when it gets renamed. Implementing that would definitely require modifications to the repository.
The problem with building support for a specific text file history representation and/or merge algorithm into the repository is that such representations usually work very poorly with other file formats. For this reason, some revision control systems (notably CVS) have special annotations which users add to binary files to tell the system to treat them differently. Files in Vesta have always been treated very simply (as a sequence of bytes, just like any other filesystem). It would seem like a step backwards to burden the Vesta user with additional work to differentiate binary files from text files.
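To make the two points above concrete, here is a minimal Python sketch of what client code could do if the repository exposed trustworthy ancestry and stable file IDs. The based_on mapping (version to the versions it was derived from) and the per-version trees mapping file ID to path are hypothetical data structures invented for this example; the repository does not currently provide them in this form.

```python
from collections import deque


def ancestors(based_on, version):
    """Return `version` plus everything reachable through based-on links."""
    seen = set()
    queue = deque([version])
    while queue:
        v = queue.popleft()
        if v in seen:
            continue
        seen.add(v)
        queue.extend(based_on.get(v, ()))
    return seen


def common_ancestors(based_on, a, b):
    """Versions that both a and b descend from (candidate merge bases)."""
    return ancestors(based_on, a) & ancestors(based_on, b)


def renames(old_tree, new_tree):
    """Trees map file ID -> path. Return {old_path: new_path} for moved files."""
    return {old_tree[fid]: new_tree[fid]
            for fid in old_tree.keys() & new_tree.keys()
            if old_tree[fid] != new_tree[fid]}
```

With immutable based-on links the ancestry result cannot be invalidated by someone editing an attribute, and with stable IDs a rename shows up as the same ID at a different path rather than as an unrelated delete and add.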
Also, perhaps some users will have need of a merge algorithm that works in significantly different ways (for example, using special algorithms on XML files). It would be best if we left open the possibility of using different merging techniques for different kinds of files.
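One way a client application could leave that possibility open is a small registry of merge functions, with a content-agnostic fallback for everything else. The sketch below is only an illustration: selecting by file extension, and the names merge_bytes, register, and merge_file, are assumptions made for this example.

```python
import os


def merge_bytes(base, ours, theirs):
    """Fallback three-way rule that never looks inside the file."""
    if ours == theirs or theirs == base:
        return ours
    if ours == base:
        return theirs
    raise ValueError("both sides changed; manual merge required")


# Hypothetical registry: file extension -> merge function.
MERGERS = {}


def register(extension, func):
    """Associate a merge function with a file extension such as '.xml'."""
    MERGERS[extension.lower()] = func


def merge_file(path, base, ours, theirs):
    """Merge one file using the registered function, else the fallback."""
    ext = os.path.splitext(path)[1].lower()
    return MERGERS.get(ext, merge_bytes)(base, ours, theirs)
```

A site with special needs could call register(".xml", merge_xml), where merge_xml is a function it supplies, without any change to the repository and without users having to annotate individual files.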
While we may need to improve the repository in some ways to support better merging, I believe that the Vesta repository should continue to treat files as raw sequences of bytes. We may want to add some form of space-saving technique to the repository, but it seems to me that involving the internals of the repository in a history-aware merging algorithm is a mistake. That should instead be done by a repository client application.