Internationalization Notes for P4D, the Perforce Server
and Perforce client applications
January 3rd, 2006

Introduction

    The Perforce clients and server have an optional mode of operation
    in which all metadata and some file content are stored in the server
    in the UTF-8 Unicode character set and are translated into another
    character set on the client.

    When running in internationalized mode, all non-file data
    (identifiers, descriptions, and so on), as well as the content of
    all files of type "unicode", are translated between the character
    set specified by the P4CHARSET variable on the client and UTF-8 in
    the server.

Server configuration

    Before you use Perforce in an internationalized environment, you
    must first instruct your server to run in internationalized mode.

    Setting your server to run in internationalized mode:

    1. Run "p4d -xi"

       This verifies that all existing metadata is valid UTF-8 and sets
       a protected counter, "unicode", that instructs future invocations
       of p4d to operate in internationalized mode.

       (The "p4d" process invoked by "p4d -xi" does *not* start a
       Perforce server; rather, it terminates after instructing future
       invocations of p4d to run in internationalized mode. After
       p4d -xi sets up internationalized mode, you may then invoke p4d
       with your site's usual flags.)

    After setting the server to run in internationalized mode, your
    users must also set the P4CHARSET environment variable.

    Once set on the server, internationalized mode cannot be
    deactivated. (That is, you cannot return to non-internationalized
    mode.)

Client configuration

    To use Perforce in an internationalized environment, you must also
    set the P4CHARSET environment variable on your client machines.

    If you do not set P4CHARSET on a client machine that is accessing a
    2001.2 server running in internationalized mode, files synced to
    the client workspace can be corrupted.
    For example, if you attempt to retrieve a file of type "unicode"
    without setting P4CHARSET, you'll get only the raw UTF-8 data from
    the server.

    The following table lists recommended P4CHARSET values for
    supported character sets and platforms.

    Language      Platform   Windows     Unix     P4CHARSET
                             Code page   LOCALE   setting
    ------------------------------------------------------------------
    Japanese      Windows    932         n/a      shiftjis
    Japanese      UNIX       n/a         varies   eucjp
    Japanese      UNIX       n/a         varies   shiftjis
    High-ASCII    Windows    1252        n/a      winansi
    High-ASCII    UNIX       n/a         varies   iso8859-1
    High-ASCII    UNIX       n/a         varies   iso8859-15
    High-ASCII    MacOS      n/a         varies   macosroman
    untranslated  All        n/a         n/a      utf8*
    All           All        n/a         n/a      utf16*

    Unicode        Client                            Byte-Order-Mark
    P4CHARSET      Unicode                           written to
    settings       Format                            files
    ------------------------------------------------------------------
    utf8           UTF-8                             No
    utf8-bom       UTF-8                             Yes
    utf16          UTF-16 in client byte order       Yes
    utf16le        UTF-16 in Little Endian order     No
    utf16be        UTF-16 in Big Endian order        No
    utf16-nobom    UTF-16 in client byte order       No
    utf16le-bom    UTF-16 in Little Endian order     Yes (Windows Style)
    utf16be-bom    UTF-16 in Big Endian order        Yes

    (Note that eucjp is not a supported P4CHARSET value under Windows.)

    *Note that utf16 requires that P4COMMANDCHARSET be set to a
    different (non-utf16) charset for the p4 command line to function.
    Also, many p4 api based applications will not be able to support
    utf16 charsets without special work.

    *Note also that utf16 can be one of utf16, utf16-nobom, utf16le,
    utf16le-bom, utf16be, or utf16be-bom, which indicate whether
    Byte-Order-Marks (BOMs) are desired and whether a particular byte
    order is desired. Windows platforms will probably want to use
    utf16le-bom, which most closely matches the Windows concept of
    Unicode files.

    The following notes about P4CHARSET also apply to P4COMMANDCHARSET.
    P4COMMANDCHARSET allows for a different charset for command input
    and output while allowing P4CHARSET to set the charset of file
    contents.
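
    The Byte-Order-Mark column of the Unicode table above can be made
    concrete with a short sketch. The Python below is illustrative only:
    the mapping from P4CHARSET values to Python codecs and the helper
    name encode_like are assumptions made for this example, not part of
    Perforce, and a real client may differ in details.

```python
import codecs
import sys

# "Client byte order" resolved for the machine running this script.
LITTLE = sys.byteorder == "little"
NATIVE_CODEC = "utf-16-le" if LITTLE else "utf-16-be"
NATIVE_BOM = codecs.BOM_UTF16_LE if LITTLE else codecs.BOM_UTF16_BE

# Assumed mapping: P4CHARSET value -> (Python codec, BOM bytes to prepend).
# It mirrors the table in these notes; it is not taken from Perforce itself.
UNICODE_CHARSETS = {
    "utf8":        ("utf-8",      b""),
    "utf8-bom":    ("utf-8",      codecs.BOM_UTF8),
    "utf16":       (NATIVE_CODEC, NATIVE_BOM),
    "utf16-nobom": (NATIVE_CODEC, b""),
    "utf16le":     ("utf-16-le",  b""),
    "utf16le-bom": ("utf-16-le",  codecs.BOM_UTF16_LE),
    "utf16be":     ("utf-16-be",  b""),
    "utf16be-bom": ("utf-16-be",  codecs.BOM_UTF16_BE),
}


def encode_like(p4charset, text):
    """Encode text in the format the table lists for p4charset (hypothetical helper)."""
    codec, bom = UNICODE_CHARSETS[p4charset]
    return bom + text.encode(codec)


if __name__ == "__main__":
    # Show the bytes each setting would produce for the single letter "A".
    for name in UNICODE_CHARSETS:
        print(name.ljust(12), encode_like(name, "A").hex())
```

    Running the script prints one line per setting, which makes it easy
    to see which settings prepend a BOM and which byte order they use.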

    *Note that utf8 is untranslated, but clients built with the p4 api
    of 2005.2 or later validate that file contents are in fact utf8.
    Earlier clients did not validate utf8 file contents.

    Setting P4CHARSET on Windows:

    1. Log in to Windows and open an MS-DOS command prompt.

    2. Run chcp.exe ("CHangeCodePage") without any arguments to display
       your active code page. Windows displays a message like the
       following:

           Active code page: 1252

    3. Select the character set based on the active code page as
       follows:

           Code page       Set P4CHARSET to:
           1252            winansi
           932             shiftjis

    To set P4CHARSET for all users on this workstation, you will need
    Administrator privileges. Issue the following command:

        p4 set -s P4CHARSET=[character_set]

    If you don't have Administrator privileges, you can use:

        p4 set P4CHARSET=[character_set]

    to set P4CHARSET for the user currently logged in. Other users on
    the same machine will have to set P4CHARSET independently.

    Setting P4CHARSET on UNIX:

    1. Set P4CHARSET to the proper value from either a command shell or
       in a startup script such as .kshrc, .cshrc, or .profile.

    You can determine the proper value for P4CHARSET by examining the
    current setting of the LANG or LOCALE environment variable.

        Sample $LANG value:     Set P4CHARSET to:
        en_US.ISO_8859-1        iso8859-1
        ja_JP.EUC               eucjp
        ja_JP.PCK               shiftjis

    In general:

        For a Japanese installation, set P4CHARSET to eucjp
        For a European installation, set P4CHARSET to iso8859-1

Unicode file type

    Files of type "unicode" are stored in the depot in UTF-8. Perforce
    client programs use the P4CHARSET environment variable to determine
    how to translate the UTF-8 data in "unicode" files into the local
    character set.

    Only files of type "unicode" are translated; Perforce ignores
    P4CHARSET when retrieving or storing files of other file types.
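
    The translation of "unicode" files described above can be pictured
    with a minimal Python sketch. It assumes that Python's standard
    codecs are a reasonable stand-in for the translation tables used by
    Perforce, which they may not be for every character; the codec
    mapping and the function names to_client and to_depot are
    hypothetical and exist only for this example.

```python
# Assumed stand-ins: P4CHARSET value -> Python codec name.
# Perforce uses its own tables, so individual characters may map differently.
PY_CODEC = {
    "shiftjis":   "shift_jis",
    "eucjp":      "euc_jp",
    "winansi":    "cp1252",
    "iso8859-1":  "iso8859-1",
    "iso8859-15": "iso8859-15",
    "macosroman": "mac_roman",
}


def to_client(depot_bytes, p4charset):
    """UTF-8 bytes as stored in the depot -> bytes in the client charset."""
    return depot_bytes.decode("utf-8").encode(PY_CODEC[p4charset])


def to_depot(client_bytes, p4charset):
    """Bytes in the client charset -> UTF-8 bytes for storage in the depot."""
    return client_bytes.decode(PY_CODEC[p4charset]).encode("utf-8")


if __name__ == "__main__":
    stored = "日本語".encode("utf-8")        # 9 bytes of UTF-8
    local = to_client(stored, "shiftjis")    # 6 bytes, 2 per character
    print(len(stored), len(local))
    print(to_depot(local, "shiftjis") == stored)
```

    The round trip at the end shows the property that matters for
    "unicode" files: text that is representable in the client character
    set survives translation in both directions unchanged.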

    The first time you try to submit any file to the depot, Perforce
    attempts to determine its type by examining a portion (currently
    the first 8192 bytes) of the file.

    If P4CHARSET is unset:

        Files are (by default) assigned the filetypes "text" or
        "binary" depending on the presence of characters with the high
        bit set in the first part of the file. This is the default
        behavior of Perforce in a non-internationalized environment.

    If P4CHARSET is set:

        If nonprintable characters are detected, the file is assigned
        the type "binary".

        If there are no nonprintable characters, and there are
        high-ASCII characters, *and* those high-ASCII characters are
        translatable in the defined P4CHARSET, the file is deemed to be
        "unicode".

        Otherwise, the file is stored as type "text" (that is, both
        plain text files without high-ASCII characters, and files with
        high-ASCII characters that are undefined in the character set
        specified by P4CHARSET, are stored as type "text".)

    To override Perforce's default file type detection, you can:

        Specify the desired filetype on the command line, as in
        "p4 add -t unicode file.txt"

    or:

        Use the p4 typemap command to assign Perforce filetypes
        according to a file's extension.

    For example, the following table assigns the Perforce "unicode"
    filetype to text and html files, and the Perforce "binary" filetype
    to PDF files:

        Typemap:
                unicode //....txt
                unicode //....html
                binary //....pdf

    For more about using the typemap feature, refer to the Perforce
    System Administrator's Guide, or the "p4 typemap" page of the
    Command Reference.

Diffing files

    The p4 diff2 command, which compares two files, can only compare
    files that have the same Perforce file type, either text or
    unicode. You cannot compare a text file to a unicode file.

"CANNOT TRANSLATE" error message

    This message is displayed if your client machine is configured with
    a character set that does not include characters being sent to it
    by the Perforce server. Your client machine cannot display unmapped
    characters.
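
    The idea of an unmapped character can be checked outside Perforce
    with a small Python sketch. It assumes that Python's codecs
    approximate the client character sets, and the helper names
    can_translate and untranslatable are hypothetical; the sketch only
    illustrates why a translation can fail, not how Perforce reports
    the failure.

```python
def can_translate(text, codec):
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


def untranslatable(text, codec):
    """List the characters of text that have no mapping in the given codec."""
    return [ch for ch in text if not can_translate(ch, codec)]


if __name__ == "__main__":
    name = "price-10€.txt"
    # The euro sign exists in ISO 8859-15 and code page 1252,
    # but not in ISO 8859-1.
    for codec in ("iso8859-1", "iso8859-15", "cp1252"):
        print(codec, can_translate(name, codec), untranslatable(name, codec))
```

    A name that fails this kind of check for the character set of any
    client at your site is a likely candidate for translation errors.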

    For example, if your client machine is configured to use the
    shift-JIS character set and your depot contains files named using
    characters from the Japanese EUC character set that do not have
    mappings in shift-JIS, you will see the "Cannot translate..." error
    message when you execute a p4 files or p4 changes command that
    lists those files.

    To avoid translation errors, do not use unmapped characters (for
    example, Japanese EUC characters that do not have mappings in
    shift-JIS) in the following Perforce elements:

    - user names or specifications
    - client names or specifications
    - jobs
    - file names

    Translation failures during file transfers report a line number
    near where the translation failure occurred.

Length limit for Unicode identifiers

    The Perforce server has internal limits on the lengths of strings
    used to index job descriptions, specify filenames, control view
    mappings, and identify client names, label names, and other
    objects. The most common limit is 1024 bytes.

    Because some characters in Unicode can expand to more than one
    byte, it's possible for certain Unicode entries to exceed Perforce
    internal limits. Because no basic Unicode character expands to more
    than three bytes in UTF-8, dividing the Perforce internal limit by
    three ensures that no Unicode sequence exceeds the limit
    (1024 / 3 = 341, rounded down).

    To ensure that no Unicode sequence exceeds the Perforce limit, do
    not create client names or view patterns that exceed 341 Unicode
    characters. Under normal usage conditions, this is not expected to
    pose a significant limitation.

Localization of error and informational messages

    The error and informational messages in Perforce have been
    internationalized. This means that you can read messages in your
    native language, if a translation has been provided (localization).

    If P4LANGUAGE is unset:

        By default all messages (info and error) are reported in
        English.

    If P4LANGUAGE is set:

        If a localization is available and your administrator has
        loaded the language-specific messages into the Perforce
        database, then you can activate native messages by setting
        P4LANGUAGE.

        Example:

        To have your messages returned in French, set P4LANGUAGE to
        "fr".

Administrator Notes

    The Perforce server operates in either an internationalized or a
    non-internationalized mode. For release 2001.2, internationalized
    mode is activated upon invocation of "p4d -xi" as described above.

    Only Perforce client programs at 2001.2 or above are able to
    interact with an internationalized server. P4CHARSET must be set
    for all such clients.

    The command line client ("p4") has a new global flag (-C) that
    overrides P4CHARSET settings. For instance:

        p4 -C winansi files //...

    displays all filenames in the depot, as translated using the
    winansi code page.

Instructions for Translators (system integrators)

    To get a copy of the "English" message text file for translation,
    contact technical support.

    To build a localized version of this file, edit the text strings,
    taking care not to change any of the key parameters (except for the
    language code; in the example below, "en" is changed to "fr"). It
    is also important not to change the named parameters (specified
    between %'s); that is, %depot% must remain %depot% (even if there
    is a valid translation).

    Example:

        @pv@ 0 @db.message@ @en@ 822220833 @Depot '%depot%' unknown - use 'depot' to create it.@

    translated into French becomes:

        @pv@ 0 @db.message@ @fr@ 822220833 @Depot '%depot%' inconnu - utilisez 'depot' pour le creer.@

    Once this file has been completely translated, it can be loaded
    into Perforce with the following command:

        p4d -jr /fullpath/message.txt

    Users must then set the correct language code to get the native
    messages; in this case, P4LANGUAGE would be set to "fr".
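
    The rules for translators above (change only the language code and
    the message text, and leave every %name% parameter intact) lend
    themselves to an automated check before the file is loaded. The
    following Python is a sketch under stated assumptions: the line
    layout is inferred solely from the single example in these notes,
    the function names parse_message and check_translation are
    hypothetical, and the script is not a Perforce tool.

```python
import re
from collections import Counter

# Layout inferred from the one example line in these notes (an assumption):
#   @pv@ <version> @db.message@ @<lang>@ <message id> @<text>@
LINE_RE = re.compile(r"^@pv@ (\d+) @db\.message@ @(\w+)@ (\d+) @(.*)@$")
PARAM_RE = re.compile(r"%(\w+)%")


def parse_message(line):
    """Split one message line into its fields; raise ValueError if it does not fit."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognized message line: " + line)
    version, lang, msg_id, text = match.groups()
    return {"version": version, "lang": lang, "id": msg_id, "text": text}


def check_translation(original, translated):
    """Return a list of problems found in a translated line (empty if none)."""
    src, dst = parse_message(original), parse_message(translated)
    problems = []
    if src["version"] != dst["version"] or src["id"] != dst["id"]:
        problems.append("key parameters changed")
    if Counter(PARAM_RE.findall(src["text"])) != Counter(PARAM_RE.findall(dst["text"])):
        problems.append("named parameters changed")
    return problems


if __name__ == "__main__":
    en = "@pv@ 0 @db.message@ @en@ 822220833 @Depot '%depot%' unknown - use 'depot' to create it.@"
    fr = "@pv@ 0 @db.message@ @fr@ 822220833 @Depot '%depot%' inconnu - utilisez 'depot' pour le creer.@"
    print(check_translation(en, fr))
```

    Running such a check over each pair of original and translated
    lines before invoking "p4d -jr" catches the two mistakes the
    instructions warn about while they are still easy to correct.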

---

Changes since 2005.2/91006 (first release)

#91576 *

    The following character mappings have changed to match newer
    conventions more closely and to use 'fullwidth' characters
    consistently where possible. (Bug #19817)

    shiftjis  eucjp  new      old      new                     old
    code      code   unicode  unicode  description             description
    8160      A1C1   U+FF5E   U+301C   FULLWIDTH TILDE         WAVE DASH
    8161      A1C2   U+2225   U+2016   PARALLEL TO             DOUBLE VERTICAL LINE
    817C      A1DD   U+FF0D   U+2212   FULLWIDTH HYPHEN-MINUS  MINUS SIGN
    8191      A1F1   U+FFE0   U+00A2   FULLWIDTH CENT SIGN     CENT SIGN
    8192      A1F2   U+FFE1   U+00A3   FULLWIDTH POUND SIGN    POUND SIGN

$Id: //depot/r05.2/p4-doc/user/i18nnotes.txt#4 $