June 3, 2010

Content Auditing in Perforce

In many environments, we need to periodically check to make sure that sensitive information is not being put into the wrong parts of the depot. For instance, if we feed our live web site from Perforce, we want to make sure that confidential information doesn’t make it into the public areas of the web site. In the defense world, we need to make sure that our source code doesn't contain dirty words. (We're not being prudish about profanity -- a dirty word in this sense violates the security restrictions on our server.)

Proactive Scanning

Perforce offers us a couple of ways to approach this task. First, we can proactively prevent sensitive content from being submitted to the wrong parts of the depot. We use a change-content trigger to scan the contents of submitted files before the changelist is committed. If the contents of the files violate any of our rules, the trigger exits with a nonzero status and the server rejects the submit.

Below is a snippet of a trigger that rejects a changelist if any of the files in the changelist contain the word bazinga. The trigger is written in P4Python.

import sys

# connection details omitted for brevity...
exitcode = 0

# get trigger arguments
Changelist = sys.argv[1]

# the word we search for
Target = 'bazinga'

# get list of files included in changelist
describe = p4.run_describe("-s", Changelist)

# for each file, search content for target word
# (We should probably check for binary files and
# other conditions we can't handle...)
p4.tagged = False
for afile in describe[0]["depotFile"]:
    content = p4.run_print("-q", afile + "@=" + Changelist)
    if content[0].find(Target) != -1:
        print("Submit fails: '" + Target + "' found in " + afile)
        print("Violates content policies")
        exitcode = 1

# handle error checking and disconnect from server, then...
sys.exit(exitcode)
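
The comment in the loop above mentions checking for binary files and other content we can't handle. A minimal sketch of that check, kept free of any server connection so it can be tried on its own, might look like the following. The names is_scannable, find_violations, and FORBIDDEN are hypothetical, and the sketch assumes the trigger can learn each file's Perforce file type (for example from the type field that describe returns alongside depotFile).

```python
import re

# Hypothetical policy list; the trigger above checks a single word.
FORBIDDEN = ["bazinga"]

# Base Perforce file types whose content can sensibly be searched as text.
# Older aliases such as "xtext" are deliberately not handled in this sketch.
TEXT_BASE_TYPES = {"text", "unicode", "utf16"}


def is_scannable(filetype):
    """Return True if a Perforce file type such as 'text+x' is text-like."""
    base = filetype.split("+", 1)[0]
    return base in TEXT_BASE_TYPES


def find_violations(content, words=FORBIDDEN):
    """Return the forbidden words that occur in content (case-insensitive)."""
    return [w for w in words
            if re.search(re.escape(w), content, re.IGNORECASE)]
```

Whether an unscannable file should be let through or rejected outright is a policy decision; the sketch only reports which case we are in.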

We activate this trigger with an entry in the trigger table, using it on all files in our live web site depot:

    content-scan change-content //web/live/...
        "/path/to/content_scan.py %change%"

Reactive Scanning

We can also periodically scan the depots for any existing files that violate our content policies. We may do this because our policies have changed, or just as an extra precaution. Prior to the 2010.1 release, we had to use external tools for this sort of scan. We might populate a workspace and run normal file system tools like grep on it, or use Google Desktop Search. But now with the 2010.1 release (currently in beta), we can use the p4 grep command. p4 grep runs on the server without transferring file content, so we don't need to populate a workspace with every file that we want to scan.

The p4 grep command has syntax that will look very familiar to users of the Unix grep command. Here's a quick usage example:

    p4 grep -l -a -e "my search terms" //web/live/...
    ... //web/live/foo.html#2
    ... //web/live/foo.html#1

In this case we are looking in the //web/live/... path for the pattern given with -e. The -l option tells p4 grep to print only the names of matching files, while the -a option searches all revisions of each file rather than just the head revision. We can run these sorts of queries on a scheduled basis, taking care not to overload the server during active hours.
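
For a scheduled scan, we would probably want the results in a form that a script can act on. The sketch below is one way to do that, assuming output lines shaped like the example above (a depot path followed by # and a revision number, with an optional leading "... "). The function names parse_grep_output and run_scan are hypothetical, and run_scan assumes a p4 client on the PATH that is already configured to reach the server.

```python
import re
import subprocess

# Matches a depot path with a revision, e.g. "//web/live/foo.html#2".
# The optional "... " prefix allows for the output style shown above.
_LINE = re.compile(r"^(?:\.\.\.\s+)?(//.+)#(\d+)\s*$")


def parse_grep_output(output):
    """Turn p4 grep -l output into a list of (depot_path, revision)."""
    hits = []
    for line in output.splitlines():
        match = _LINE.match(line)
        if match:
            hits.append((match.group(1), int(match.group(2))))
    return hits


def run_scan(pattern, path):
    """Run p4 grep -l -a against path and return the matching revisions."""
    result = subprocess.run(
        ["p4", "grep", "-l", "-a", "-e", pattern, path],
        capture_output=True, text=True, check=True,
    )
    return parse_grep_output(result.stdout)
```

A nightly job could call run_scan("my search terms", "//web/live/...") and mail the resulting list to whoever owns the content policy.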

Once we identify file revisions that violate our policies, we can take whatever remedial measures we need. These might include obliterating file revisions in Perforce and tracking down any client workspace copies. Note that this quickly becomes a tricky clean-up problem, so it's preferable to catch the problem using proactive scanning.
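
Because obliterating is irreversible, it can help to generate the commands for review before anyone runs them. The following is only a sketch, assuming we already hold a list of (depot path, revision) pairs for the offending revisions; the function name obliterate_preview_commands is hypothetical. It relies on the fact that p4 obliterate without the -y flag only reports what it would remove.

```python
def obliterate_preview_commands(hits):
    """Build one p4 obliterate preview command per (path, revision).

    No -y flag is included, so running these commands only previews
    the revisions that would be removed.
    """
    return [["p4", "obliterate", "{}#{}".format(path, rev)]
            for path, rev in hits]
```

An administrator can inspect the preview output for each command and rerun with -y by hand once satisfied that only the intended revisions are affected.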

For another approach to auditing, one that uses the information in the server's audit log, check out this post by Jason Gibson.