October 23, 2013

How to Version Character Sets: Solving the Problem of Having Multiple Files with Multiple Different Charsets



The functionality described in this blog entry is "undocumented", meaning that it is in some respects a work in progress, is subject to change in future releases, and should be treated with a bit more caution than "documented" functionality. With that disclaimer out of the way, let's take a look at how you can now version the character sets for individual Unicode files!

As of 2013.2, if you have a Unicode-enabled server, your "unicode" type files will begin versioning the charset that was associated with them at the time of creation (i.e. the P4CHARSET that was used to submit them to the depot). By default this information isn't used for anything, but once you switch on this new configurable (see "p4 help undoc"):

      server.filecharset       0 Enable per-file charset storage

the server will instruct the client to perform content translation using the charset associated with that particular file, rather than the P4CHARSET in the client's environment.

This is intended to solve the problem of having multiple files with multiple different charsets -- e.g. documentation that has been translated into multiple languages with different encodings. Previously it was necessary to either have different clients set up to work with files in different charsets, or submit them as "binary" so that no content translation would be performed at all; now it is possible to specify different charsets for different groups of files, and have a single client honor all of those charsets.

The charset can be set explicitly with the "-Q" flag on "p4 add", "p4 edit", and "p4 reopen", similar to how the "-t" flag is used to explicitly specify a filetype. Like the filetype, a file's charset stays the same from revision to revision unless explicitly changed.
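
As a rough illustration of assigning charsets to groups of files, the sketch below builds "p4 add -Q" command lines from a mapping of path to charset. It assumes "-Q" takes the charset name as its argument in the same position that "-t" takes a filetype. The helper name build_add_commands, the paths, and the charset choices are all hypothetical, and nothing is run against a server unless you hand the resulting lists to something like subprocess.run yourself.

```python
from collections import defaultdict


def build_add_commands(charset_by_path):
    """Group paths by charset and return one 'p4 add -Q' argv list per charset.

    Assumption: 'p4 add' accepts '-Q <charset>' the way it accepts '-t <filetype>'.
    """
    grouped = defaultdict(list)
    for path, charset in charset_by_path.items():
        grouped[charset].append(path)
    # Sort charsets and paths so the generated commands are stable.
    return [
        ["p4", "add", "-Q", charset] + sorted(paths)
        for charset, paths in sorted(grouped.items())
    ]


# Hypothetical workspace files and the charset each one is encoded in.
commands = build_add_commands({
    "docs/ja/readme.txt": "shiftjis",
    "docs/ja/install.txt": "shiftjis",
    "docs/ru/readme.txt": "cp1251",
})
for argv in commands:
    print(" ".join(argv))
```

Grouping by charset keeps it to one command per encoding, which mirrors the "different charsets for different groups of files" idea; the same shape should work for "p4 edit" and "p4 reopen" if you swap the subcommand.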

A couple of things to watch out for when enabling this feature:

  1. Old clients will continue to use P4CHARSET for all Unicode translations, which is liable to make things confusing if you have a mix of client versions accessing the same files. Use the "minClient" configurable to force client upgrades.
  2. Setting the charset on a file tells the client how to handle translation for that file in future operations, but does not modify the contents of the file in place, so if you change the charset on an existing file you'll need to make sure to translate the content in your workspace to match that new charset.
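
The second point means the re-encoding is up to you. One possible way to do it, sketched in Python: read the workspace file's bytes, decode them with the old encoding, and write them back in the new one. The function name reencode_file is hypothetical, and the encoding names are Python codec names, which are not guaranteed to match Perforce charset names one-to-one.

```python
from pathlib import Path


def reencode_file(path, old_encoding, new_encoding):
    """Rewrite a workspace file in place from old_encoding to new_encoding.

    Works on raw bytes so that line endings are left exactly as they were.
    """
    file_path = Path(path)
    # Strict decoding raises UnicodeDecodeError if the file is not really
    # in old_encoding, rather than silently corrupting the content.
    text = file_path.read_bytes().decode(old_encoding)
    # Strict encoding raises UnicodeEncodeError if new_encoding cannot
    # represent some character in the file.
    file_path.write_bytes(text.encode(new_encoding))
```

For example, reencode_file("docs/ja/readme.txt", "shift_jis", "utf-8") would convert a (hypothetical) Shift-JIS file to UTF-8. Try it on a copy first, since the file is overwritten in place.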

We're hoping that as we see how this feature helps (or hurts) in handling charset differences between versioned files, we'll be able to keep the parts that work while smoothing out the rough edges. If this change interests you, try it out on a test server and let us know what you think!