July 13, 2009

Saving disk space on branched +S files

I was recently asked to solve a problem a customer was having with the +S and +S<n> file types.

The customer generates very large files with almost every submit. These files are needed for testing and must be checked into Perforce, but since each file can be 500 MB or more, only the latest revision needs to be kept in the depot.

Sounds like a perfect job for the +S type, which instructs Perforce to purge previous revisions on every submit. There is only one slight catch: branches.

When Perforce branches a file of type +S, it deviates from its usual practice: it copies the file to the target location instead of creating a lazy copy. With files this large, the extra copies add up quickly.

I was asked to provide a trigger that avoids this problem. The result is RevisionPurger.py, which is designed to be used as a change-commit trigger. It checks the submitted files for predecessors it can purge, taking any branches into account. To use this trigger, change the file type of the temporary files back to plain binary or binary+F. Instead of relying on the automatic +S purge, the trigger itself removes the superfluous revisions.

The trigger - as usual :-) - is written in Python using P4Python 2008.2.

You enable the trigger by adding an entry to the trigger table, similar to this:

Triggers:
	purger change-commit //....big "/usr/bin/python /triggers/RevisionPurger.py %change%"

Make sure to adjust the parameters at the top of the trigger script. The current parameters are:

P4USER
P4PORT
DEBUG
REVISIONS_TO_KEEP
REMOVED_COMMENT
filterMap

The filterMap should match the trigger path (in the above example "//....big"). It can contain more than one entry.
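As a rough illustration of what such a parameter block and path filter might look like, here is a minimal sketch. It is not taken from RevisionPurger.py: every value is a made-up placeholder, the meaning given to REVISIONS_TO_KEEP is a guess, and both the assumption that filterMap is a plain list of depot path patterns and the helper names pattern_to_regex and is_filtered are mine.

```python
import re

# Hypothetical placeholder values, not the script's real defaults.
P4USER = "trigger-user"
P4PORT = "localhost:1666"
DEBUG = True                # only report what would be purged
REVISIONS_TO_KEEP = 1       # assumed meaning: revisions kept per file
REMOVED_COMMENT = "purged"
filterMap = ["//....big"]   # assumed here to be a list of depot patterns


def pattern_to_regex(pattern):
    """Translate Perforce wildcards: '...' spans directories, '*' does not."""
    pieces = []
    for part in re.split(r"(\.\.\.|\*)", pattern):
        if part == "...":
            pieces.append(".*")
        elif part == "*":
            pieces.append("[^/]*")
        else:
            pieces.append(re.escape(part))
    return re.compile("^" + "".join(pieces) + "$")


def is_filtered(depot_path, patterns=filterMap):
    """True if the depot path matches at least one filter pattern."""
    return any(pattern_to_regex(p).match(depot_path) for p in patterns)
```

With the pattern above, is_filtered("//depot/tests/test.big") is true and is_filtered("//depot/tests/test.txt") is false, which mirrors what the trigger path "//....big" selects.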

Unlike the +S type, this trigger will not remove the file from the depot completely; instead, it replaces the content of the file with a small comment. The REMOVED_COMMENT parameter defines what the file content is replaced with. Set it to "" (the empty string) if you do not want any comment in the replacement file. The trigger then adjusts the stored digest so that "p4 verify" does not report an error.
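To see why the digest has to change: "p4 verify" compares a stored MD5 digest against the archive content, so once the content is replaced the old digest no longer matches. The sketch below only computes the digest the placeholder content would have. How the trigger actually writes it back to the server is not shown here, and the function name placeholder_digest and the UTF-8 encoding are my assumptions.

```python
import hashlib


def placeholder_digest(comment):
    """MD5 of the replacement content, as an uppercase hex string."""
    data = comment.encode("utf-8")  # assumption: comment is stored as UTF-8
    return hashlib.md5(data).hexdigest().upper()
```

For an empty REMOVED_COMMENT this gives D41D8CD98F00B204E9800998ECF8427E, the well-known MD5 of zero bytes.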

Example:

So, how does this work in practice?

Let's say you have enabled the trigger in DEBUG mode (to see the output) to purge revisions of all binary files with a ".big" ending. Then you create, add and submit a file called "test.big" with file type "binary" (this trigger cannot purge text files). Nothing unusual happens.

Now edit the file and submit it again. You should see a message from the server (if you are running p4d 2007.3+) along the lines of:

Would delete /perforce/depot/tests/test.big,d/1.1233245.gz

With DEBUG set to False, the trigger would have replaced this file in the archive with a dummy file.

Let's branch "test.big" to "test2.big" and submit, then edit and submit the original "test.big" again. Nothing happens, since the previous revision #2 has a second reference from "test2.big". So let's edit "test2.big" and submit. With the last reference to the depot file for "test.big#2" having a successor, the depot file will finally be replaced in the archive.

The script works recursively through all integration history to determine whether a revision can be purged. Why not try it out? Top Tip: Revision graph is your friend :-)
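The walkthrough above boils down to one rule: an archive file may be replaced only when every revision that points to it, including lazy copies in branches, has a successor. Here is a toy in-memory model of that rule. It does not talk to a Perforce server and is not the logic of RevisionPurger.py; the Depot class and its methods are purely illustrative, and because a branch of a branch simply inherits the same archive reference, the recursion through the integration history is flattened away.

```python
from collections import defaultdict


class Depot:
    """Toy model of revisions and the archive files they point to."""

    def __init__(self):
        self.head = {}                 # path -> head revision number
        self.archive_of = {}           # (path, rev) -> archive id
        self.users = defaultdict(set)  # archive id -> {(path, rev), ...}

    def _record(self, path, rev, archive):
        self.head[path] = rev
        self.archive_of[(path, rev)] = archive
        self.users[archive].add((path, rev))
        return rev

    def submit(self, path):
        """New content: the revision gets an archive file of its own."""
        rev = self.head.get(path, 0) + 1
        return self._record(path, rev, (path, rev))

    def branch(self, source, target):
        """Lazy copy: the target's new head reuses the source's archive."""
        archive = self.archive_of[(source, self.head[source])]
        rev = self.head.get(target, 0) + 1
        return self._record(target, rev, archive)

    def purgeable(self, archive):
        """True when every revision using the archive has a successor."""
        return all(rev < self.head[path] for path, rev in self.users[archive])
```

Replaying the example: after two submits of "test.big", the archive of revision #1 is purgeable. After branching to "test2.big" and submitting "test.big" again, the archive of revision #2 is not purgeable, because the head of "test2.big" still uses it. Only after "test2.big" is edited and submitted does it become purgeable.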

Caveat:

There is a situation where this script potentially removes a revision still in use, although the chances of this happening are pretty slim:

User 1: Integrates the file into a branch, not submitted yet

User 2: Updates the file, trigger finds no lazy copies, removes previous revision

User 1: Submits the integration. The branched file now points to a purged depot file with no content.

If you use this trigger, make sure that branches are created only while the parent codeline is frozen for the duration of the branch creation.

Alternative:

The more I thought about the problem, the more I realized that the customer could have made their life a lot easier. Instead of branching the temporary files, they could have excluded them in the branch spec. For example, assuming that the files in question end in .pak, the following branch spec would have avoided the problem of Perforce copying temporary files:

//depot/MAIN/... //depot/BRANCH/...
-//depot/MAIN/....pak //depot/BRANCH/....pak

Of course, you could create a form-out trigger that automatically adds the exclusion line to new branch specs. But this is a topic for another blog.
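Just to hint at what the core of such a trigger might do, here is a small sketch of the view manipulation only. Reading and writing the branch spec form is left out entirely, the function name add_exclusions is hypothetical, and it assumes view lines without spaces or quotes in the paths.

```python
def add_exclusions(view, extension):
    """Append an exclusion line for each '...' mapping in a branch view.

    view is a list of "source target" strings; extension is e.g. "pak".
    """
    result = list(view)
    for line in view:
        if line.startswith("-"):
            continue  # leave existing exclusions alone
        source, target = line.split()
        if source.endswith("...") and target.endswith("..."):
            exclusion = "-%s.%s %s.%s" % (source, extension, target, extension)
            if exclusion not in result:
                result.append(exclusion)
    return result
```

Called with ["//depot/MAIN/... //depot/BRANCH/..."] and "pak", it returns the original line followed by "-//depot/MAIN/....pak //depot/BRANCH/....pak". The exclusions go at the end because later lines in a view override earlier ones.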

Happy hacking

Sven Erik