What is Rename/Identity Tracking?

Suppose there are two users working in parallel on the same project. Alice makes some edits to file A. Bob also edits file A, and in addition renames it to B. Now they would like to merge the changes they made in parallel. Ideally, the merge tool would know to combine the edits both users made in the file which is now named B. However, unless the system knows that Bob's B and Alice's A are really the same file, it's unlikely that an automatic merge tool could do this.

This is of course a fairly simple example, and much more complex ones are also possible. On a long-lived branch a file could be renamed multiple times. Entire directories could be renamed.

Most people call the feature of remembering rename operations "rename tracking". It could also be called "identity tracking" as the system is remembering the identity of a file/directory regardless of how many times its name and location have changed.

Why Rename/Identity Tracking is Important

There are two main situations which keeping track of file/directory identity across rename operations allows a merge algorithm to support:

Some people argue that these don't come up very often and aren't worth supporting. In some mature projects that may be the case. However, younger projects tend to undergo a larger amount of flux and re-factoring which requires renaming.

In a Vesta context, such restructuring might currently turn into package re-arrangements (migrating files between different packages, combining packages, splitting packages, etc.). (See also MergingFuture/Food4Thought/PackageSuturing for some thoughts on how file identity and merging could be used across that sort of restructuring.)

Another important issue to consider is that in some cases file naming has semantic meaning in the target language. Java, Perl, and Python all have this property (and they are certainly not the only ones). When working with such languages, there are certain source-level changes which require file renaming.

Possible Implementations

There are at least two ways to store records of file identity:

  1. Record rename operations in each version's set of changes over the previous version.
  2. Assign each file/directory a unique identifier when it is first created which stays with it across rename operations.

These are essentially equivalent in terms of the information stored because you can derive one from the other. If each version explicitly records rename operations, unique identities can be assigned by walking the revision graph. If each file/directory has a unique identifier, two versions can be compared to find which renames occurred to transform one into the other.
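The equivalence described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: history is linear (a real revision graph has branches and merges), change records are hypothetical tuples such as ("create", path), ("rename", old, new), and ("delete", path), and identities are sequential integers. None of these names come from Vesta or any other tool.

```python
from itertools import count


def assign_identities(changes_per_version):
    """Walk a linear history of change records and derive a
    path -> identity map for every version (hypothetical record format)."""
    next_id = count(1)
    current = {}   # path -> identity for the version being built
    versions = []
    for changes in changes_per_version:
        current = dict(current)  # each version gets its own copy
        for op, *args in changes:
            if op == "create":
                current[args[0]] = next(next_id)
            elif op == "rename":
                old, new = args
                current[new] = current.pop(old)  # identity follows the file
            elif op == "delete":
                del current[args[0]]
        versions.append(current)
    return versions


def renames_between(old, new):
    """Compare two path -> identity maps and recover which renames
    transform the old version into the new one."""
    old_paths = {ident: path for path, ident in old.items()}
    return sorted(
        (old_paths[ident], path)
        for path, ident in new.items()
        if ident in old_paths and old_paths[ident] != path
    )


history = [
    [("create", "A"), ("create", "C")],
    [("rename", "A", "B")],
    [("delete", "C"), ("create", "A")],  # a new, unrelated A
]
versions = assign_identities(history)
print(renames_between(versions[0], versions[-1]))
```

Note that the A created in the last version gets a fresh identity, so comparing the first and last versions reports only the rename of A to B.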

However, there are compelling reasons to choose each and it may be best to implement both.

When performing a merge operation, having unique identities recorded avoids the need to compute identities by walking the revision graph. Computing identities this way could be a costly operation, depending on how much history must be traversed.
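As a rough illustration of why stored identities make this cheap, the sketch below pairs up files from two versions by identity alone, with no history traversal. The path -> identity dictionaries and the function name are assumptions made for the example, not an existing interface.

```python
def pair_by_identity(ours, theirs):
    """Given two path -> identity maps, return identity -> (our_path, their_path)
    for every file/directory present in both versions."""
    our_paths = {ident: path for path, ident in ours.items()}
    their_paths = {ident: path for path, ident in theirs.items()}
    shared = our_paths.keys() & their_paths.keys()
    return {ident: (our_paths[ident], their_paths[ident]) for ident in shared}


# Alice still calls the file A; Bob renamed it to B. Identity 7 ties them together.
alice = {"A": 7, "README": 8}
bob = {"B": 7, "README": 8}
pairs = pair_by_identity(alice, bob)
print(pairs[7])
```

A merge tool could then combine the contents of each pair and decide separately which name the merged file should carry.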

Two users at two separate, disconnected repositories could create a file with the same name at the same time. Arguably they should be treated as two different files. One option would be to assign identities in a way that attempts to ensure they are globally unique. An identity could be something like a fingerprint or other hash of several things which would make the probability of collision vanishingly small, but that would make the identity a significant amount of data to carry around.

Another option is to have the stored identities be local to the repository (similar to shortids). However, this means that when communicating with more than one repository (e.g. during replication), the stored file/directory identities cannot be used. Instead one would want to communicate in terms of recorded rename operations.

Maybe we could combine these two approaches for storing identities. We could have a globally unique identity that's some long piece of information like a fingerprint. To avoid storing this in every directory entry, we could have a global data structure that maps global identities to local identities (sequentially-assigned integers). Each directory entry could then store the small local identity. When replicating from peer repositories, we would map the source repository's local identity to a global identity and then map that global identity to a local identity (possibly assigning a new local identity) in the destination repository.
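A minimal sketch of this combined scheme follows. Everything in it is hypothetical: the Repository class, the choice of SHA-256 over the repository name plus a caller-supplied seed as the "fingerprint", and the in-memory dictionaries standing in for the global data structure. It is meant only to show the two-step mapping during replication.

```python
import hashlib


class Repository:
    """Toy repository holding the global <-> local identity tables."""

    def __init__(self, name):
        self.name = name
        self.global_to_local = {}
        self.local_to_global = {}
        self.next_local = 1  # sequentially-assigned local identities

    def create(self, seed):
        """Mint a global identity for a newly created object and return its
        local identity. Hashing the repository name with a seed is an
        assumption standing in for a real fingerprint."""
        global_id = hashlib.sha256(f"{self.name}:{seed}".encode()).hexdigest()
        return self.local_for(global_id)

    def local_for(self, global_id):
        """Map a global identity to a local one, assigning a new one if needed."""
        local = self.global_to_local.get(global_id)
        if local is None:
            local = self.next_local
            self.next_local += 1
            self.global_to_local[global_id] = local
            self.local_to_global[local] = global_id
        return local


def replicate(src, dst, src_local):
    """Translate a source-local identity into a destination-local identity
    by going through the global identity."""
    return dst.local_for(src.local_to_global[src_local])


a, b = Repository("a"), Repository("b")
b.create("unrelated file")          # b already uses local identity 1
a_local = a.create("foo.c")         # local identity 1 in a
b_local = replicate(a, b, a_local)  # becomes local identity 2 in b
print(a_local, b_local)
```

Only the small integers would need to appear in directory entries; the long global identities live in the per-repository tables and are consulted when talking to a peer.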

Interaction with History Replication

The choice of how to store file/directory identity and communicate it between servers when replicating has a strong relationship with how history records are copied between repositories. (See MergingFuture/Food4Thought/HistoryManagement.)

If identities are tracked primarily by explicitly recording renames between versions, then you must replicate the history of a version in order to determine the correct identities of the files/directories it contains at the destination. The chain of edit/rename/create operations is the only thing tying together the identity of the files/directories, so if you don't have that you can't match up files/directories which are logically the same.

If identities are instead globally unique, then it's possible to match up file/directory identities across a fractured history.

monotone

ToolComparison/Monotone records renames in both of the ways described above.

The record monotone calls a "revision" is used to store and communicate a version's changes over previous versions. Each monotone revision record includes rename operations over the previous versions.

There's a separate format called a "roster" which is used internally to store versions but not to communicate between repositories. It includes file/directory identifiers as integers. These are sequentially assigned in each repository.

This dichotomy exists for the reasons outlined above: local operations can quickly match up file identity across versions to be merged, and yet the identifiers are local to each repository, which guarantees uniqueness without using a large amount of storage.

monotone does not have any concept of a global file/directory identity. The only identifiers it has are the repository-local ones.

monotone seems to have a particularly well-disciplined approach to handling file/directory identity and directory structure merge operations.

Rename/Identity Tracking in Vesta

When a rename is performed in a working copy, the Vesta repository currently records it as a deletion of the old object and a creation of a new one. In other words, Vesta lacks rename/identity tracking.

Probably the easiest way for us to implement this would be to record file/directory identities with each directory entry in each immutable version. Recording rename operations instead would require introducing a new data structure which represents the entire version rather than an individual directory (as rename operations can move objects across directories within a package). Another benefit of simply attaching an identifier to each file/directory is that moving a file or directory into another package would have its identity follow it. (See also MergingFuture/Food4Thought/PackageSuturing.)

There are some complications with implementing this. It would be ideal to simply record all rename operations from the filesystem interface as rename operations. However, some editing mechanisms (notably Emacs) will rename the file being edited and create a new file with the original name (e.g. with Emacs renaming "foo" to "foo~" as a backup and then creating a new "foo"). In cases like this, we would prefer not to act as though the original was deleted and a new file was created with the same name.

The approach taken by virtually every other version control system is to have the user explicitly tell the version control system "I'm renaming A to B" with a special rename command which records additional information for the version control system. It would be nice if we could avoid that, but it's not clear that there's a method which will correctly distinguish between a rename performed by the user and one performed on their behalf by a tool like an editor.

We might be able to get a 90% solution by simply leaving new files without an identity until they are first placed in an immutable snapshot. At that point we can look for cases where a file/directory with no identity has the same name as one in the old version and no other object being placed in the snapshot has that identity. In such cases, we could assign the new file the same identity as the identically named file in the old version. However, it seems that this method could still make mistakes. Suppose the user renames A to B, then edits B with an editor that does a rename/create to replace the file, then creates a new A that should be a different file. Assuming the editor's backup copy is not placed in the snapshot, neither B nor the new A has an identity at snapshot time, so the new A would inherit the old A's identity even though B is its logical continuation.
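The snapshot-time heuristic can be sketched as follows, which also makes the failure case concrete. The function name, the path -> identity dictionaries (with None meaning "no identity yet"), and the assumption that editor backup files are left out of the snapshot are all illustrative choices, not a description of existing Vesta behavior.

```python
from itertools import count


def assign_snapshot_identities(old_version, working, fresh_ids):
    """old_version: path -> identity in the previous immutable version.
    working: path -> identity, or None for objects created since then.
    fresh_ids: iterator of identities not used anywhere else.
    Returns path -> identity for the new snapshot."""
    in_use = {ident for ident in working.values() if ident is not None}
    result = {}
    for path, ident in working.items():
        if ident is None:
            inherited = old_version.get(path)
            if inherited is not None and inherited not in in_use:
                ident = inherited        # same name, identity otherwise unused
                in_use.add(ident)
            else:
                ident = next(fresh_ids)  # genuinely new object
        result[path] = ident
    return result


# Editor replaced foo via rename/create; the backup is assumed not to be
# part of the snapshot. The new foo recovers the old identity.
print(assign_snapshot_identities({"foo": 1}, {"foo": None}, count(100)))

# Failure case: A was renamed to B, B was then replaced by an editor, and a
# new, unrelated A was created. The new A wrongly inherits identity 1.
print(assign_snapshot_identities({"A": 1}, {"B": None, "A": None}, count(100)))
```

In the second call the heuristic has no way to tell that B, not A, continues the old file, which is the kind of mistake an explicit rename command would avoid.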