June 17, 2016

I Test, You Test

Test Management

In a previous blog post, I briefly mentioned that as customers reported problems with the old v2 engine, those reports were logged in the form of detailed test cases.  More than one customer has asked how we test the integration logic, so I thought this might be a good opportunity to go into the integration test harness I use, "itest", in a little more detail.

I first became involved in the problem of how to test our integration logic when I was in technical support, serving as the main point of contact for any customer encountering problems with integration.  When customers hit problems that didn't have workarounds, I would advocate for specific fixes: sometimes small tweaks that would solve a particular problem, sometimes deeper changes that seemed like the only way to solve the more fundamental problems.  Development was hesitant to implement any of these suggestions, largely because even if they were good suggestions (and one or two of them might have been), it would be difficult to verify their efficacy.

Before itest existed, the body of tests for integration (and everything else) worked by running a large set of commands and diffing the output against a large set of canonical output. Adding a new test in the middle of this suite, and making sure all of the tests that followed it still functioned correctly, could be nontrivial. Testing the semantic effect of a change that touched a large number of systems was harder still: if a low-level change caused a set of commands to produce a functionally identical result, but with a different number of changelists or with a different (but identical) revision selected as the base, the entire test would "fail", and a large amount of human effort was needed to see whether the change had worked as intended in all of those cases.
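
To give a rough (and entirely invented) flavor of that style, a test amounted to running some commands, capturing everything they printed, and diffing the capture against a canonical copy:

---

# Hypothetical sketch of the old diff-based style; the paths and file names are made up.
p4 integrate //depot/main/... //depot/rel/... > test042.out 2>&1
p4 resolve -am >> test042.out 2>&1
p4 submit -d "merge main into rel" >> test042.out 2>&1

# Any textual difference at all (a different changelist number, an equivalent
# but differently numbered base revision) counts as a failure.
diff test042.out canonical/test042.out || echo "test042 FAILED"

---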

Add to this state of affairs what happened when we did make a few fundamental changes to the v2 algorithm and customers began to report behavioral differences.  The changes in question are described in this 2007.2 relnote:

---

#119955 (Bug #23698, #24251, #24207, #23469, #24150) **

The 'p4 integrate' algorithms for suppressing reintegration and for picking the optimum base, which were reimplemented in 2006.1, have been tuned significantly for this release. The following new changes have been made:

Integrating a single 'copy from' revision now gives credit for all earlier revisions, so that a subsequent 'p4 integrate' of any earlier revision will find no work to do. This can only come about by 'cherry picking' (providing to 'p4 integrate' specific revisions to integrate).

Pending integration records (files opened with 'p4 integrate' but not yet submitted with 'p4 submit') are now considered when finding the most appropriate base. This makes integrating into a related file already opened for branch possible without the 'p4 integrate -i' flag.

'p4 integrate' follows indirect integrations through complicated combinations of merge/copy/ignore integration records better. This should result in fewer integrations being scheduled, and closer bases being picked, for integration between distant files.

'p4 integrate' could wrongly choose a base on the source file after the revisions needing to be integrated if the revisions needing to be integrated were before revisions already integrated. This normally only comes about in cases of 'cherry picking' (providing to 'p4 integrate' specific revisions to integrate).

'p4 integrate' in certain cases wouldn't find a base (or choose a poorer base) if the source file was branched and reopened for add, and then the original file was changed further and branched again.

---

As customers reported that merges in some cases had become more difficult after the upgrade to 2007.2, I wanted hard data to determine whether there had been a verifiable regression in behavior, and possibly to make a case to development to roll some of the changes back.  So I got a lot of practice at looking at the data I had available (mostly screenshots of Revision Graph) and coming up with sets of commands that would reconstruct the same scenario.  Rather than logging a bug report that said "some merges are different after the upgrade", or including a full customer checkpoint as an attachment, I had something that development could simply copy and paste into a shell to recreate the situation the customer was experiencing.
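
To give a (completely made-up) sense of what those recipes looked like, here is the general shape of one; the real ones were reconstructed, step by step, from whatever history the customer's Revision Graph showed:

---

# Hypothetical reproduction recipe; the files and history are invented.
echo base > main.txt
p4 add main.txt
p4 submit -d "add main.txt"

p4 integrate main.txt rel.txt        # branch rel.txt from main.txt
p4 submit -d "branch rel.txt"

p4 edit main.txt
echo more >> main.txt
p4 submit -d "edit main.txt"

p4 integrate main.txt rel.txt        # attempt the merge the customer attempted
p4 resolve -am
p4 submit -d "merge main.txt into rel.txt"

---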

I started writing the itest.pl test suite while I was on the long flight back from the 2008 European User Conference; we had just had a long conversation about the possibility of making deep changes to the integration algorithm to cut down on conflicts, and the problem had come up that we didn't have adequate test data to verify that sort of deep change.  The following requirements were foremost in my mind:

  • It needed to be easier for me to generate test cases than my then-current method of typing out individual shell commands.
  • The tests needed to be able to check for semantic correctness rather than relying on byte-for-byte equality of output.
  • I needed to be able to easily run the same set of tests in different environments to report on differences between versions.

The script started with a simple method to generate text files with a sequence of non-conflicting edits that I could use to produce clean merges at will (try doing that in a simple batch script with "echo" statements and redirects; it's torment), and it quickly evolved into a language that could represent everything in my library of batch scripts with a very small fraction of the typing.  This is an example of a test script written using the itest tool (I'll come back to the file-generation trick a little further down):

---

add X
branch X Z
branch X Y
edit X 1 B1
edit Y 1 B2
edit Z 2 D
dirty Y X 1 B3
merge Z Y
merge Z X
test base X Y Y#3 Y#2 Z#2 X#1

---

In English this would read as: "add a file called X, create branches Z and Y, edit different text into X and Y at a common location to force a conflict, and something else into Z at a non-conflicting location.  Do a dirty merge from Y to X, resolving the conflict with yet another edit in the same location, and then do clean merges from Z to both Y and X.  When merging from X to Y, the ideal base is Y#3, with other acceptable (but not ideal) bases being, from best to worst: Y#2, Z#2, and X#1."  When the script is executed, the output is a letter grade from A (for the ideal base being chosen) to F (for none of the presented options being chosen).  The 2006.1 server receives a grade of "C" on this test; the server as of 2013.2 receives an "A".
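
As an aside, here's the file-generation trick I mentioned earlier, rendered as a rough shell sketch (a simplified illustration of the idea; the real logic lives in itest.pl): each file is seeded with one distinct line per numbered slot, so an edit rewrites only its own slot.  Edits to different slots always merge cleanly, and two edits to the same slot always conflict.

---

# Illustration only: seed file X with one recognizable line per slot.
for i in 1 2 3 4 5; do echo "line $i: original"; done > X

# An edit like "edit X 1 B1" then amounts to rewriting just slot 1.
sed 's/^line 1:.*/line 1: B1/' X > X.tmp && mv X.tmp X

---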

Once I had migrated all of my existing test cases into this tool, I was able to very quickly create a table (with color coding and everything) demonstrating differences in behavior between releases -- cases where 2006.1 was a regression from 2005.2, and cases where 2007.2 was a regression from 2006.2, as well as cases where the newer release was a step forward.  As time went on, I continued adding more and more data to this test suite, and slowly started to assemble a picture of what an algorithm that addressed every case at once might look like.  Having so many examples of cases that could "fool" each existing approach into picking a demonstrably non-optimal answer made it much easier to think about what approach we could use to produce an optimal answer in each instance.

These examples also served me very well when I began work on a prototype for a system that I thought might produce that optimal answer; once my prototype was able to produce the right answers to everything that the current software got wrong, it was easy to make a case to development that it was worth working to improve or rewrite what we had.  A few years later I had myself been absorbed into development to take over that task, and I hope this serves as a cautionary tale to others about the dangers of writing demonstrably useful prototypes, even just for fun.

Here's an example of a table of test output, with the same set of tests run across three different server versions:

[Table image: the same set of itest cases graded across three server versions]

The current version of itest.pl can be found in the Workshop.  Some time ago I decided to put it out there so that customers who are experiencing problems with integration, and who want to document those cases for themselves the same way I used to when I was working directly with customers, would have access to the same tool that I use.

More recently I've submitted a selection of the test cases that we have assembled over the years.  This isn't the full suite (I've removed a large number of cases that are based directly on customer data, leaving only the more abstract cases) but hopefully it provides an idea of the breadth of situations that we test and develop for.

As promised in my last post, I will soon be getting to a description of the current integration engine, which is based heavily on the prototype I mentioned in this post.  Stay tuned!