September 8, 2011

Death by Label

Andrew Reynolds joins us as the first brave volunteer to submit a guest blog entry and has chosen a "from the trenches war story" approach that many of you Perforce admins may be able to relate to.  I encourage others to follow his lead and share your own war stories; Perforce has been around for 15 years, so there are bound to be a few! Thanks to Andrew, and we look forward to hearing from the rest of you.

One of the pleasures of administering a P4 site is that Perforce is about as rock solid a service as you can get.

There are a few things that can kill a P4 server: losing power, p4 admin stop, running out of disk space - stuff like that. I always have an eye on my disk space and how fast the repository, db files, journals and logs are growing.
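Keeping that eye on things can be scripted. Here is a minimal sketch in Python; the paths and threshold are illustrative, not from the actual setup in this story:

```python
import os
import shutil

# Illustrative paths; substitute your own server root and journal location.
P4_PATHS = ["/p4root", "/extra/perforce/journal"]

def partition_warnings(paths, warn_pct=90):
    """Flag any partition whose usage meets or exceeds warn_pct percent."""
    warnings = []
    for p in paths:
        total, used, _free = shutil.disk_usage(p)
        pct = 100.0 * used / total
        if pct >= warn_pct:
            warnings.append((p, round(pct, 1)))
    return warnings

def db_sizes(db_dir):
    """Report the size in bytes of each db.* file so growth can be tracked."""
    return {name: os.path.getsize(os.path.join(db_dir, name))
            for name in os.listdir(db_dir)
            if name.startswith("db.")}
```

Run something like this from cron and mail yourself the output; the point is simply to see the db.* files and free space day over day, so a sudden jump stands out.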

Except for that one time…

Perforce disk usage can best be described as "grow only." There are things you can do to reclaim disk space, but over time you'll find yourself always asking for more. The trick is to time it right. Well, this isn't on my resume, but I missed it once, and one of my group's engineers stopped by and asked what this P4 message meant:

"write: /extra/perforce/journal/journal: No space left on device"

"Well, it means that my afternoon just got very bad and you can go home early if you want," I replied. A quick 'df' command on the server confirmed that the journal partition was full.  Even worse, there were active p4d processes waiting to write to the journal and p4 admin stop wasn't working, so I had to kill the server hard, risking corruption of the db files and the journal.

Now, first things first: not my fault. I had just started at the company and was taking over from "someone else"; I didn't set it up this way. But the db files and the journal were on the same partition. Never do this, ever.  I went to see what could be moved or deleted to free up disk space and ran into something strange in the db file sizes.  The db.have file was 7 GB, normal for a site this size, but the db.label file was 56 GB. What?  Most of the time db.label is smaller than db.have, or within a few GB of it. This was wrong, very wrong. I cleaned out some old log files and moved a little-used branch in the repository to an NFS mount (I know, but it was an emergency). That finally freed enough disk space to bring p4 back on-line. I ran p4d -xv to check the db files.

I wanted to figure out why the db.label file was so large, but before I could deal with that I discovered that the last record of the journal was incomplete and the db.label file was corrupted. Then, to really mess with my day, I found that the last checkpoint had been taken six months earlier because "we were low on disk space and it wouldn't fit." Have I mentioned yet that the server hardware was really old and slow? And while I was able to get some space on the NAS, it was very slow. It was a long day and a late night of backing up, rebuilding db files, replaying the journal, p4d -xv, p4 verify, etc. I'd like to paint myself as the hero here, but I got a lot of help from Perforce support.

After the db was restored, I was able to dig into the label problem.  It turned out that a rogue script on a build server was creating static labels for every build, passed or failed, using the label view "//depot/..."  We have about 2 million files in that depot.  The script was generating hundreds of labels a day, and it didn't take long to fill the disk.
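The growth rate is easy to underestimate. A back-of-the-envelope check, assuming each db.label record costs on the order of 100 bytes (the real per-record size varies; this figure is purely an assumption for illustration):

```python
FILES_IN_DEPOT = 2_000_000   # depot size from the story above
BYTES_PER_RECORD = 100       # assumed average db.label record size

# A static label over //depot/... writes one db.label record per file.
bytes_per_label = FILES_IN_DEPOT * BYTES_PER_RECORD
print(f"per wide label: about {bytes_per_label / 2**20:.0f} MiB")

# How many such labels account for the 56 GB db.label file?
target = 56 * 2**30
labels_needed = target / bytes_per_label
print(f"labels to reach 56 GB: about {labels_needed:.0f}")
```

Under those assumptions, each wide label costs a couple of hundred megabytes, so roughly three hundred of them account for the whole 56 GB: at "hundreds of labels a day," that is only a day or two of builds.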

The short-term solution was to move the journal file to a new partition, take a checkpoint, and rebuild the db files from it. It turned out that the script had been deleting labels as well as creating them, which meant I was able to get 12 GB out of the db.label file right away.

The long-term solution was to identify which labels were really needed and delete the rest.  This brought the db.label file down to a reasonable 5 GB. Then there was the whole "educating the user" talk I had with the build engineers.  In addition, I put disk space monitoring in place, along with a p4 statistics script that emailed me daily with information like the size of db.have and db.label, free disk space, the number of static labels, the number of p4 clients and other relevant details.  The other action was to write a couple of form triggers to restrict the use of wide clients and labels that point at everything.
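Perforce form-save triggers make that last item concrete: the trigger script receives the form via %formfile%, and a non-zero exit rejects the save, with the script's output shown to the user. Here is a sketch of what such a label check might look like; the trigger name, script path, and exact "too wide" policy are illustrative, not the actual triggers from this story:

```python
#!/usr/bin/env python
"""Form-save trigger sketch: reject label specs that point at everything.

Illustrative triggers table entry:
    widelabel form-save label "python /p4/triggers/check_label.py %formfile%"
"""
import re
import sys

# Views considered "too wide": a bare //... or a whole-depot //depot/...
WIDE_VIEW = re.compile(r"^\s*//(?:[^/\s]+/)?\.\.\.\s*$")

def wide_views(form_text):
    """Return the View: lines that map an entire depot (or all depots)."""
    in_view = False
    hits = []
    for line in form_text.splitlines():
        if line.startswith("View:"):
            in_view = True
            continue
        if in_view:
            if line.startswith((" ", "\t")):       # indented view mapping
                if WIDE_VIEW.match(line):
                    hits.append(line.strip())
            elif line.strip():                      # next form field
                in_view = False
    return hits

if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as f:
        bad = wide_views(f.read())
    if bad:
        # Non-zero exit rejects the form; stdout is shown to the user.
        print("Label views that select an entire depot are not allowed: "
              + ", ".join(bad))
        sys.exit(1)
```

An equivalent form-save trigger on client specs can flag wide client views the same way.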

The real lesson is that when you take over a P4 installation, look around and make sure everything is set up right.  You never know: there may be a reason why they needed a p4 admin…