Criss-Cross Merge
One kind of merging implemented by many revision control systems is referred to as "3-way merge". This is because it requires three input files/directories:
- The working copy to be modified
- The state with changes to be merged into the working copy
- The common ancestor of the two
3-way merge incorporates all changes that lead from the common ancestor (3) to the other version (2) into the working copy (1). (Some systems have merge operators that act on two committed states, rather than on uncommitted working copies; the advantage is that local changes will not be lost.) Any cases where the same portions of the file are changed in both new versions are flagged as conflicts for the user to resolve. CVS and Subversion both use 3-way merge, as do several other systems.
3-way merge is significantly better than 2-way merge. 2-way merge means simply comparing two files and asking the user to resolve any differences between the two. 3-way merge can handle more cases automatically by detecting when parallel edits don't modify the same portions of the file.
Unfortunately, long-lived branches create conflicts which 3-way merge cannot automatically resolve. The simplest example of how 3-way merge breaks down is the criss-cross merge, which Bram Cohen explained in a post to the git mailing list. The basic problem is that there is no single common ancestor for the whole file. Or, to put it another way, on successive merge operation different parts of the file should be treated as having different common ancestors.
Bram's post doesn't include the complete file contents, so let's go through a simple example of how things go wrong. To make the text shorter, we'll use a 5 line file with parallel changes on lines 2 and 4 (rather than Bram's 10 line file with changes on lines 3 and 8). Here's the initial version A:
1 2 3 4 5
Starting from this state, one user creates version B:
1 2 3 4b 5
And another user creates version C:
1 2c 3 4 5
Both users decide that they like what the other has done and that they want to merge it into their working copy. 3-way merge handles this nicely, with A being the common ancestor. Both users get this in their working copy:
1 2c 3 4b 5
However, this state does not become the basis for new branches for our two users. In fact, it might not ever exist in any version at all, as it could just be in their working copy.
As in Bram's diagram, our two users proceed on their own branches after merging, each continuing to modify their own distinc portion of the file. The first user creates version D:
1 2c 3 4d 5
And the second user creates version E:
1 2e 3 4b 5
Now our two users wish to merge again. Unfortunately, thw only common ancestor is A. If we do a 3-way merge using A as the common ancestor, both sides seem to have modified both lines 2 and 4. This would result in conflicts on both lines. With CVS-style conflict markers, it would look like this:
1 <<<<<<< D 2c ======= 2e >>>>>>> E 3 <<<<<<< D 4d ======= 4b >>>>>>> E 5
This is unfortunate, because it forces the user to resolve more conflicts. In fact, these are both false conflicts, because the two users were editing different parts of the file. Ideally, the user should only be presented with true conflicts: cases where two users changed the same section of a file in different ways in parallel to each other.
We could use B or C as the common ancestor, since the current version includes all the changes in each of those. In this simple case it does eliminate one of the two conflicts. However in general it's not possible to determine which possible common ancestor will result in fewer conflicts without evaluating the merge once for each possible common ancestor. There's also another unfortunate problem with this approach: if the two branches change the same line in different ways, using previously merged versions as common ancestors can silently drop changes which should be treated as conflicts. (More on this in a moment.)
Of course the ideal solution would be if the two users truly re-converged and created new branches based on the merged version. However this puts a burden of synchronization on developers, and may not even be possible depending on the situation. (Imagine, for example, collaborating with people that have a 12-hour timezone difference from you, or users working on their laptop while disconnected from the network.)
There is another, worse case. Suppose that we have a common ancestor A:
this is A
Suppose Bob now edits this one line file to make state B:
this is B
And Carol, in parallel, makes state C:
this is C
Now Bob and Carol both, in parallel, merge these two states, and naturally get a conflict. Bob, of course, thinks his changes are far superior, so his merge, D, is identical to B. Carol, likewise, commits a merge E which is identical to C.
Now the graph again has two heads, D and E, which have to be merged. Obviously, since Bob and Carol disagreed on how to resolve the previous merge, the system should not pick either side; it should pass the buck to the user. There are two least common ancestors -- B, and C. Here's where things go wrong -- if B is used as the common ancestor, then E will silently(!) win the merge. Contrariwise, if C is used as the common ancestor, then D will silently(!) win the merge. In this case, one can instead use A as the common ancestor, but this approach creates its own problems -- using old ancestors will create spurious conflicts in other parts of the file, which experience with Monotone suggest are bad enough to be unacceptable to users; worse, it makes it impossible to properly handle merging of renames. These problems are discussed on the monotone list: criss-cross merge, 3-way merge considered harmful.
This problem is the essential basis of what is commonly called history-aware merging. That is, making use of finer-grained information about the operations which created the current state of a source file. In our first example here, a history-aware merge algorithm might:
Treat B as the common ancestor of line 4
Treat C as the common ancestor of line 2
Treat A as the common ancestor of lines 1, 3, and 5
This would result in no conflicts. In the second example, it might be able to use A as the ancestor for merging that particular line, while still using closer ancestors for other lines.
Some systems which implement history-aware merging include:
Does Criss-Cross Merge Matter?
It's interesting to note that GNU Arch (a distributed revision control system developed in the last few years, which has spawned several derivatives) does not include features to avoid the false conflicts described in the example above. This is essentially outlined in the Arch tutorial's description of merging between different branches, using what they call "star-merge", where it says (emphasis added):
Whenever we have a such a diagram in which none of the merge lines cross, that is a simple development branch.
In spite of this, some Arch users ran into this limitation and sparked a discussion on the gnu-arch-users mailing list. In that discussion, Tom Lord (Arch's original developer) called this use-case "a user error".
Essentially what this means is that Arch has defined its merging capabilities such that it's reasonable to merge back and forth between branches only when criss-cross merging doesn't happen. To me, this seems like a user-hostile position to take, as this is clearly a situation which users can run into without realizing it. Even if the system could detect this case and prevent the user from getting themselves into trouble (which seems unlikely in cases of disconnected operation), it seems to me that it would be better to recognize that this is a valid kind of user-level operation and support it as best as possible.
Merge Across Rename
Another merging complication is what happens when files and directories get renamed. Suppose we start with a file named "foo" in an initial version A. Two versions are derived from A:
In B the contents of file foo are changed
In C file foo is renamed to "bar"
Now suppose we want to merge the changes made in these two versions. We'd like to incorporate the changes made in B into the file in it's new name.
This problem is also described in the RFE titled "Tracking renames".
Persistent File IDs
One common solution to this problem which is used in several other revision control systems (notably BitKeeper and GNU Arch) is to assign each file a unique ID when it is created. This ID remains the same throught its entire history (regardless of edits to the file) and follows it when it gets renamed. This means that it is possible to match up files in diferent revisions (in time proportional to the number of files in the versioned directory), regardless of how many rename operations have happened between those revisions. (Without persistent file IDs, it may be necessary to traverse a significant portion of the history looking for rename operations, which may not scale well.)
This particular problem can be viewed as a subset of the problem of merging directory tree structure, which has recently been discussed on the monotone development mailing list. (There are several other tricky and subtle problems in merging directory structure, such as the possibility of conflicting renames, or merging creating directory loops.) They seem to have come to the conclusion that persistent file IDs are a good idea.
The current Vesta implementation does not store any identifiers which could be used for this purpose. (A file's shortid and fingerprint change with edits.)