        Internationalization Notes for P4D, the Helix Versioning Engine
                     and Helix client applications
                            Version 2024.1

Introduction

    The Helix clients and server have an optional mode of operation
    where all metadata and some file content are stored in the server
    in the UTF8 Unicode character set and are translated into another
    character set on the client.

    When running in internationalized mode, all non-file data
    (identifiers, descriptions, and so on), as well as the content of
    all files of type "unicode", are translated between the character
    set specified by the P4CHARSET variable on the client and UTF8 in
    the server.

Server configuration

    Before you use Perforce in an internationalized environment, you
    must first instruct your server to run in internationalized mode.

    Setting your server to run in internationalized mode:

    1.  Run "p4d -xi"

        This will verify that any and all existing metadata is valid
        UTF8 and set a protected counter "unicode" to instruct future
        invocations of p4d to operate in internationalized mode.

        (The "p4d" process invoked by "p4d -xi" does *not* start a
        Perforce server; rather, it terminates after instructing
        future invocations of p4d to run in internationalized mode.
        After p4d -xi sets up internationalized mode, you may then
        invoke p4d with your site's usual flags.)

    After setting the server to run in internationalized mode, your
    users must also set the P4CHARSET environment variable.

    Once set on the server, internationalized mode cannot be
    deactivated. (That is, you cannot return to non-internationalized
    mode.)

Client configuration

    As of 2014.2, the P4CHARSET environment variable is no longer
    required. P4CHARSET may be set to avoid clients having to detect
    the Unicode nature of a server. If it is not set, clients may
    detect and remember the Unicode nature by setting an environment
    variable of the form 'P4__CHARSET'. See the server release notes
    for 2014.2 or documentation for details.

    A P4CHARSET value of 'auto' indicates that the client should make
    operating system specific checks to determine what character set
    should be used.

    The following table lists recommended P4CHARSET values for
    supported character sets and platforms.

    Language      Platform   Windows      Unix LOCALE   P4CHARSET
                             Code page    setting
    ------------------------------------------------------------------
    Japanese      Windows    932          n/a           shiftjis
    Japanese      UNIX       n/a          varies        eucjp
    Japanese      UNIX       n/a          varies        shiftjis
    Chinese       All        936          varies        cp936
    Chinese       All        950          varies        cp950
    Korean        All        949          varies        cp949
    High-ASCII    Windows    1252         n/a           winansi
    High-ASCII    Windows    1250         n/a           cp1250
    High-ASCII    Windows    850          n/a           cp850
    High-ASCII    Windows    852          n/a           cp852
    High-ASCII    Windows    858          n/a           cp858
    High-ASCII    Windows    437          n/a           winoem
    High-ASCII    UNIX       n/a          varies        iso8859-1
    High-ASCII    UNIX       n/a          varies        iso8859-2
    High-ASCII    UNIX       n/a          varies        iso8859-15
    High-ASCII    MacOS      n/a          varies        macosroman
    untranslated  All        n/a          n/a           utf8*
    Cyrillic      All        n/a          varies        iso8859-5
    Cyrillic      All        n/a          varies        koi-r
    Cyrillic      All        1251         n/a           cp1251
    Greek         Windows    1253         n/a           cp1253
    Greek         UNIX       n/a          n/a           iso8859-7
    All           All        n/a          n/a           utf16*

    Unicode P4CHARSET   Client Unicode Format            Byte-Order-Mark
    settings                                             written to files
    ------------------------------------------------------------------
    utf8                UTF-8                            No
    utf8-bom            UTF-8                            Yes
    utf8unchecked       UTF-8 (not validated)            No
    utf8unchecked-bom   UTF-8 (not validated)            Yes
    utf16               UTF-16 in client byte order      Yes
    utf16le             UTF-16 in Little Endian order    Yes
                        (Windows Style)
    utf16be             UTF-16 in Big Endian order       Yes
    utf16-nobom         UTF-16 in client byte order      No
    utf16le-nobom       UTF-16 in Little Endian order    No
    utf16be-nobom       UTF-16 in Big Endian order       No
    utf32               UTF-32 in client byte order      Yes
    utf32le             UTF-32 in Little Endian order    Yes
    utf32be             UTF-32 in Big Endian order       Yes
    utf32-nobom         UTF-32 in client byte order      No
    utf32le-nobom       UTF-32 in Little Endian order    No
    utf32be-nobom       UTF-32 in Big Endian order       No

    (Note that eucjp is not a supported P4CHARSET value under Windows.)
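
    The translation between UTF-8 on the server and a client character
    set from the table above can be illustrated with a minimal Python
    sketch. It is not how the Helix clients are implemented. The
    mapping from P4CHARSET values to Python codec names and the helper
    names server_to_client and client_to_server are assumptions made
    for illustration only.

```python
# Hypothetical mapping from a few P4CHARSET values to Python codec names.
# This mapping is an assumption for illustration, not part of Helix.
P4CHARSET_TO_PYTHON = {
    "shiftjis": "shift_jis",
    "eucjp": "euc_jp",
    "winansi": "cp1252",
    "cp1251": "cp1251",
    "iso8859-1": "latin-1",
    "utf8": "utf-8",
}


def server_to_client(utf8_bytes: bytes, p4charset: str) -> bytes:
    """Translate UTF-8 bytes (as stored on the server) to the client charset."""
    codec = P4CHARSET_TO_PYTHON[p4charset]
    return utf8_bytes.decode("utf-8").encode(codec)


def client_to_server(client_bytes: bytes, p4charset: str) -> bytes:
    """Translate bytes in the client charset back to UTF-8."""
    codec = P4CHARSET_TO_PYTHON[p4charset]
    return client_bytes.decode(codec).encode("utf-8")


stored = "café".encode("utf-8")            # 5 bytes in UTF-8
on_client = server_to_client(stored, "winansi")
print(on_client)                           # 4 bytes in code page 1252
print(client_to_server(on_client, "winansi") == stored)

# Characters with no mapping in the client charset cannot be translated.
try:
    server_to_client("日本語".encode("utf-8"), "winansi")
except UnicodeEncodeError as err:
    print("cannot translate:", err.reason)
```

    The last call fails because code page 1252 has no mapping for the
    Japanese characters, which is the same kind of mismatch that a
    client reports when its character set lacks a character sent by
    the server.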

    *Note that utf16 and utf32 require that P4COMMANDCHARSET be set to
    a different (non-utf16 and non-utf32) charset for the p4 command
    line to function. Also, many p4 api based applications will not be
    able to support utf16 or utf32 charsets without special work.

    *Note also that utf16 can be one of utf16, utf16-nobom, utf16le,
    utf16le-nobom, utf16be, utf16be-nobom, which indicate whether
    Byte-Order-Marks (BOMs) are desired and whether a particular byte
    order is desired. Windows platforms will probably want to use
    utf16le, which matches most closely the Windows concept of Unicode
    files.

    The following notes about P4CHARSET also apply to
    P4COMMANDCHARSET. P4COMMANDCHARSET allows for a different charset
    for command input and output while allowing P4CHARSET to set the
    charset of file contents.

    *Note that utf8 is untranslated, but clients built with the p4 api
    of 2006.1 or later validate that file contents are in fact utf8.
    Previous clients did not validate utf8 file contents.

    Setting P4CHARSET on Windows:

    1.  Log in to Windows and open an MS-DOS command prompt.

    2.  Run the chcp ("CHangeCodePage") command without any arguments
        to display your active code page. Windows displays a message
        like the following:

            Active code page: 1252

    3.  Select the character set based on the active code page as
        follows:

            Code page    Set P4CHARSET to:
            1252         winansi
            932          shiftjis
            949          cp949
            1250         cp1250
            1251         cp1251
            850          cp850
            852          cp852
            858          cp858
            437          winoem
            1253         cp1253

        To set P4CHARSET for all users on this workstation, you will
        need Administrator privileges. Issue the following command:

            p4 set -s P4CHARSET=[character_set]

        If you don't have Administrator privileges, you can use:

            p4 set P4CHARSET=[character_set]

        to set P4CHARSET for the user currently logged in. Other users
        on the same machine will have to set P4CHARSET independently.

    Setting P4CHARSET on UNIX:

    1.  Set P4CHARSET to the proper value from either a command shell
        or in a startup script such as .kshrc, .cshrc, or .profile.

        You can determine the proper value for P4CHARSET by examining
        the current setting of the LANG or LOCALE environment
        variable.

            Sample $LANG value:    Set P4CHARSET to:
            en_US.ISO_8859-1       iso8859-1
            ja_JP.EUC              eucjp
            ja_JP.PCK              shiftjis

        In general:

            For a Japanese installation, set P4CHARSET to eucjp
            For a European installation, set P4CHARSET to iso8859-1

Unicode file type

    Files of type "unicode" are stored in the depot in UTF-8. Perforce
    client programs use the P4CHARSET environment variable to
    determine how to translate the UTF-8 data in "unicode" files into
    the local character set. Only files of type "unicode" are
    translated; Perforce ignores P4CHARSET when retrieving or storing
    files of other file types.

    The first time you try to submit any file to the depot, Perforce
    attempts to determine its type by examining a portion (currently
    the first 8192 bytes) of the file.

    If P4CHARSET is unset:

        Files are (by default) assigned the filetypes "text" or
        "binary" depending on the presence of characters with the high
        bit set in the first part of the file. This is the default
        behavior of Perforce in a non-internationalized environment.

    If P4CHARSET is set:

        If nonprintable characters are detected, the file is assigned
        the type "binary".

        If there are no nonprintable characters, and there are
        high-ASCII characters, *and* those high-ASCII characters are
        translatable in the defined P4CHARSET, the file is deemed to
        be "unicode".

        Otherwise, the file is stored as type "text". (That is, both
        plain text files without high-ASCII characters, and files with
        high-ASCII characters that are undefined in the character set
        specified by P4CHARSET, are stored as type "text".)
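
    The detection rules above can be sketched in Python as follows.
    This is only a reading of the rules as written here, not the
    actual Perforce detection code. The function name guess_filetype,
    the definition of "nonprintable" (control bytes other than tab,
    line feed, form feed and carriage return), and the use of a Python
    codec to stand in for "translatable in the defined P4CHARSET" are
    all assumptions.

```python
import codecs
from typing import Optional

SAMPLE_SIZE = 8192  # the notes say the first 8192 bytes are examined

# Assumption: control bytes other than tab, LF, FF and CR are "nonprintable".
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0C, 0x0D}


def guess_filetype(data: bytes, client_codec: Optional[str]) -> str:
    """Return "text", "binary" or "unicode" following the rules in the notes.

    client_codec is a Python codec name standing in for P4CHARSET,
    or None when P4CHARSET is unset.
    """
    sample = data[:SAMPLE_SIZE]
    has_high = any(b >= 0x80 for b in sample)

    if client_codec is None:
        # Non-internationalized behavior: high-bit bytes decide the type.
        return "binary" if has_high else "text"

    if any((b < 0x20 and b not in ALLOWED_CONTROLS) or b == 0x7F for b in sample):
        return "binary"

    if has_high:
        decoder = codecs.getincrementaldecoder(client_codec)()
        try:
            # final=False tolerates a multibyte character cut off at the
            # end of the sample.
            decoder.decode(sample, final=False)
        except UnicodeDecodeError:
            return "text"      # high-ASCII bytes undefined in the charset
        return "unicode"

    return "text"              # plain text without high-ASCII characters


print(guess_filetype(b"caf\xe9\n", "cp1252"))
print(guess_filetype(b"hello\n", "cp1252"))
print(guess_filetype(b"caf\xe9\n", None))
```

    The incremental decoder is used so that a sample ending in the
    middle of a multibyte character is not misclassified; whether the
    real implementation handles that case the same way is not stated
    in these notes.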

    To override Perforce's default file type detection, you can:

        Specify the desired filetype on the command line, as in
        "p4 add -t unicode file.txt"

    or:

        Use the p4 typemap command to assign Perforce filetypes
        according to a file's extension.

    For example, the following table assigns the Perforce "unicode"
    filetype to text and html files, and the Perforce "binary"
    filetype to PDF files:

        Typemap:
            unicode //....txt
            unicode //....html
            binary //....pdf

    For more about using the typemap feature, refer to the Perforce
    System Administrator's Guide, or the "p4 typemap" page of the
    Command Reference.

UTF16 file type

    Files of type "utf16" are stored in the depot in UTF-8. These
    files are in utf16 only in the client workspace. Commands that
    output file contents, such as p4 diff, p4 annotate, etc., will
    attempt to translate content from the UTF-16 file into the
    P4CHARSET when in unicode mode rather than mixing UTF-16 content
    with non-UTF-16 content.

    Note that "p4 print" with the "-o" flag will write a file as
    UTF-16, while without the "-o" flag the command will attempt to
    translate the output to the P4CHARSET.

    When adding files, UTF-16 files will prefer to be stored with the
    "utf16" filetype rather than the "unicode" filetype even if
    P4CHARSET is set to a utf16 encoding. This should allow UTF16
    files to live side by side with other character sets.

    The automatic type detection requires that a BOM be present at the
    start of the file. Files without a BOM are assumed to be in client
    byte order. When utf16 files are written to a client, such as with
    the 'p4 sync' command, they are written with a BOM and in client
    byte order.

Diffing files

    The p4 diff2 command, which compares two files, can only compare
    files that have the same Perforce file type, either text or
    unicode. You cannot compare a text file to a unicode file.

"CANNOT TRANSLATE" error message

    This message is displayed if your client machine is configured
    with a character set that does not include characters being sent
    to it by the Perforce server. Your client machine cannot display
    unmapped characters.

    For example, if your client machine is configured to use the
    shift-JIS character set and your depot contains files named using
    characters from the Japanese EUC character set that do not have
    mappings in shift-JIS, you will see the "Cannot translate..."
    error message when you execute a p4 files or p4 changes command
    that lists those files.

    To avoid translation errors, do not use unmapped characters (for
    example, Japanese EUC characters that do not have mappings in
    shift-JIS) in the following Perforce elements:

        - user names or specifications
        - client names or specifications
        - jobs
        - file names

    Translation failures during file transfers will report a line
    number near where the translation failure occurred.

Length limit for Unicode identifiers

    The Perforce server has internal limits on the lengths of strings
    used to index job descriptions, specify filenames, control view
    mappings, and identify client names, label names, and other
    objects. The most common limit is 1024 bytes.

    Because some characters in Unicode can expand to more than one
    byte, it's possible for certain Unicode entries to exceed Perforce
    internal limits. Because no basic Unicode character expands to
    more than three bytes, dividing the Perforce internal limit by
    three (1024 / 3, rounded down to 341) ensures that no Unicode
    sequence will exceed the limit.

    To ensure that no Unicode sequence exceeds the Perforce limit, do
    not create client names or view patterns that exceed 341 Unicode
    characters. Under normal usage conditions, this is not expected to
    pose a significant limitation.

Localization of error and informational messages

    The error and informational messages in Perforce have been
    internationalized.
    This means that you can read messages in your native language, if
    a translation has been provided (localization).

    If P4LANGUAGE is unset:

        By default, all messages (info and error) are reported in
        English.

    If P4LANGUAGE is set:

        If a localization is available and your administrator has
        loaded the language-specific messages into the Perforce
        database, you can activate native messages by setting
        P4LANGUAGE.

        Example:

        To have your messages returned in French, set P4LANGUAGE to
        "fr-FR".

Administrator Notes

    The Perforce server operates in either an internationalized or a
    non-internationalized mode. For release 2001.2, internationalized
    mode is activated upon invocation of "p4d -xi" as described above.

    Only Perforce client programs at 2001.2 or above are able to
    interact with an internationalized server. P4CHARSET must be set
    for all such clients.

    The command line client ("p4") has a new global flag (-C) that
    overrides P4CHARSET settings. For instance:

        p4 -C winansi files //...

    displays all filenames in the depot, as translated using the
    winansi code page.

Instructions for Translators (system integrators)

    To get a copy of the "English" message text file for translation,
    contact technical support.

    To build a localized version of this file, edit the text strings,
    taking care not to change any of the key parameters (except for
    the language code; note "en" changed to "fr" in the example
    below). It is also important not to change the named parameters
    (specified between %'s); that is, %depot% must remain %depot%
    (even if there is a valid translation). Square brackets also
    require special care.

    Example:

        @pv@ 0 @db.message@ @en@ 822220833 @Depot '%depot%' unknown - use 'depot' to create it.@

    to translate into French:

        @pv@ 0 @db.message@ @fr@ 822220833 @Depot '%depot%' inconnu - utilisez 'depot' pour le creer.@

    The character set of this file should be utf8.
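
    The rules above for translating a message line (keep the key
    parameters, change only the language code, leave the %named%
    parameters intact) can be checked mechanically. The following
    Python sketch is one hypothetical way to do so. The line pattern
    is inferred only from the single example above and may not cover
    every line in the real message file, and check_translation is an
    illustrative name, not a Perforce tool.

```python
import re

# Assumed shape of a message line, inferred from the example above:
#   @pv@ <n> @db.message@ @<lang>@ <id> @<text>@
LINE_RE = re.compile(r"^@pv@ (\d+) @db\.message@ @([^@]+)@ (\d+) @(.*)@$")

# Named parameters are written between percent signs, e.g. %depot%.
PARAM_RE = re.compile(r"%([^%\s]+)%")


def check_translation(original: str, translated: str) -> list:
    """Return a list of problems found in a translated message line."""
    src = LINE_RE.match(original.strip())
    dst = LINE_RE.match(translated.strip())
    if src is None or dst is None:
        return ["line does not match the assumed message format"]

    problems = []
    if (src.group(1), src.group(3)) != (dst.group(1), dst.group(3)):
        problems.append("key parameters other than the language code changed")
    if src.group(2) == dst.group(2):
        problems.append("language code was not changed")
    if sorted(PARAM_RE.findall(src.group(4))) != sorted(PARAM_RE.findall(dst.group(4))):
        problems.append("named parameters differ")
    return problems


english = ("@pv@ 0 @db.message@ @en@ 822220833 "
           "@Depot '%depot%' unknown - use 'depot' to create it.@")
french = ("@pv@ 0 @db.message@ @fr@ 822220833 "
          "@Depot '%depot%' inconnu - utilisez 'depot' pour le creer.@")

print(check_translation(english, french))
```

    A translation that renames %depot% or alters the numeric key would
    produce a non-empty list of problems.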

    Once this file has been completely translated, it can be loaded
    into Perforce with the following command:

        p4d -jr /fullpath/message.txt

    The user would have to set the correct language code to get the
    native messages; for this case, P4LANGUAGE would be set to "fr".

---

No new functionality for 2024.1
No new functionality for 2023.2
No new functionality for 2023.1
No new functionality for 2022.2
No new functionality for 2022.1
No new functionality for 2021.2
No new functionality for 2021.1
No new functionality for 2020.2
No new functionality for 2020.1
No new functionality for 2019.2
No new functionality for 2019.1