Using P4LOG

The P4LOG environment variable allows you to specify the name and path of the file to which Helix Server errors are written.

The default level of verbose logging is governed by hard-coded thresholds. Verbosity of the log file produced by Helix Server can be set by invoking p4d with the -vtrack flag.

The p4d server produces diagnostic output to help identify performance problems and is on by default but can be turned off or adjusted with the -vtrack=x flag to the server. Any user commands that exceed certain thresholds for resource usage (CPU, lapse time, database I/O, network I/O, among other things) automatically get logged into the server error log P4LOG. The levels that can be set with -vtrack=x are:

  • 0 turn off tracking
  • 1 track all commands
  • 2 track excess usage for a server < 10 users
  • 3 track excess usage for a server < 100 users
  • 4 track excess usage for a server < 1000 users
  • 5 track excess usage for a server > 1000 users

If -vtrack is not provided on the server command line or set with P4DEBUG, the tracking level is computed from the number of users listed in the server license file.

The exact format of the tracking output is not documented, and subject to change.

Unless a Helix Server is started using the -v track=1 option, it is not guaranteed to log every command. To ensure that every command is logged (not just those commands exceeding the default performance thresholds), set the -v server=1 flag in the p4d invocation for the installation. Note that multiple -v flags can be added to a p4d start command. For more information on server trace flags, see Diagnostic flags for monitoring the server.

Deciphering the first lines

Helix Server info:

2018/08/02 14:25:10 pid 16256 cpflaum@work2 127.0.0.1
[p4/2018.2/LINUX26X86_64/165458] 'user-sync -p //depot/WORK/...'
--- lapse 109s

The first two lines provide the information about what command was run, when it was run, what server was run on, and by whom. The lapse information on the third line is wall clock time, and shows that the command took 109 seconds to run.

The usage information format is based on the output of the getrusage command, common to many Linux systems.

Deciphering usage

--- usage 9383+3937us 0+0io 0+0net 0k 29916pf

9383+3937us

Time spent executing user instructions + Time spent in operating system code on behalf of processes. Times are represented in milliseconds (us = user and system)

0+0io

The number of times the file system had a 512-byte physical read from the disk on behalf of processes + the number of times the file system had a 512-byte physical write on behalf of processes.

The block reads, that is the "X" of the "X+Yio, are the physical reads charged to the process. A value of zero means that all data that was needed by the command was in the filesystem cache. Note that the block reads that are reported are not specific to any one table.

If the io is seen in a parallel sync such as

'user-transmit -t 61063 -b 8 -s 524288'=20
--- usage 56739+97512us 3448+21128824io 0+0net 12064k 0pf

this would be interpreted as a large number of physical disk writes (21,128,824 512-byte disk blocks) which includes pages written to db.have to update the client's have list after receiving files and pages written to db.sendq removing records corresponding to the files that this transmit thread sent to the client. The physical disk writes are those that occur prior to the fsync() call on a modified db.*.

0+0net

IPC (Inter Process Communication) messages received on behalf of processes + IPC message sent on behalf of processes.

0k

The maximum number of kilobytes of physical memory that processes used simultaneously.

29916pf

The number of page faults that were serviced by doing I/O.

Deciphering RPC (remote procedure call)

--- rpc msgs/size in+out 3+431751/0mb+1417mb {himarks 64835/64835}

The above output example displays the number of RPC messages and size of the data transfers to and from the disk, in megabytes. In this case, a 1.417 GB was transferred from the server to the local client during a sync command. While 3 messages were sent back, the size of those messages was small enough to be a rounding error, thus zero megabytes. A submit command would show the opposite condition, with a large number of messages and data being sent to the server, with fewer coming back to the client.

Optionally, "himarks" is displayed to show the send/receive window size used for the RPC messages. This can be useful to catch some client issues where the window sizes (in bytes) may be set incorrectly by the OS. For example, an issue with older Perforce Windows clients with an artificially low send window size would appear as:

msgs/size in+out 0+2/0mb+0mb himarks 64835/2000
--- rpc receive errors, duplexing F/R 0/0

server-to-server network usage on replicas

2022.2 release and later: The track output in server log files has been expanded to include network data for upstream server-to-server communication, such as edge server to commit server.

Prior to 2022.2, server log files only tracked the network data for the client or downstream server.

The upstream server-to-server communication is recorded as additional rpc lines.

edge to commit example 1: unstructured log with additional lines

These lines show the target server's P4PORT in parentheses in the P4LOG file. In the examples below of an edge server communicating with its commit server, note the rpc (commit:1666) format, where 1666 represents the commit server's port.

Perforce server info: 
2022/07/19 07:34:58 pid 11392 user@ws1 127.0.0.1 [p4/2022.2.main/NTX64/2249297] 'user-submit -d test'
--- lapse .484s 
--- rpc msgs/size in+out 83+9/4mb+0mb himarks 64836/64836 snd/rcv .000s/.031s
--- rpc (commit:1666) msgs/size in+out 13+87/0mb+4mb himarks2000/2000 snd/rcv .016s/.077
--- db.counters
--- pages in+out+cached 6+0+2
--- locks read/write 2/1 rows get+pos+scan put+del 3+0+0 0+0

edge to commit example 2: structured log with NetworkPerformance events

For the track.csv structured log, the same network usage data is presented with the rpc-commit%3A1666 format, with the %3A ASCII expansion encoding for the colon (:) that precedes the target server's port number:

8.55,1658241299,291000000,2022/07/19 07:34:59 291000000,11392,36AC4D651743EC5D0D614E4FB911719F,replica2,1,user,ws1,user-submit,127.0.0.1,p4,2022.2.main/NTX64/2249297,-d:test,,rpc,83,9,5005350,1235,64836,64836,31,0,,,0,0
8.55,1658241299,291000000,2022/07/19 07:34:59 291000000,11392,36AC4D651743EC5D0D614E4FB911719F,replica2,1,user,ws1,user-submit,127.0.0.1,p4,2022.2.main/NTX64/2249297,-d:test,,rpc-commit%3A1666,13,87,1101,5004588,2000,2000,77,16,,,0,0

Note that any characters that might cause issues in the comma-separated value (CSV) file, such as the colon, undergo ASCII expansion. See Limitations on characters in filenames and entities in Helix Core Command-Line (P4) Reference.

Both examples show himarks at the default value of 2000. An edge server communicating with its commit server is considered a client of the upstream server. The himarks are not used on client connections, and therefore himarks remain at the default value.

See also NetworkPerformance (type 8) events under Structured Log Events.

Deciphering lock times

This section concerns how many files and revisions were scanned in affected database files, and information about file locking on the Helix Server.

Read Locks

A read lock still allows other processes to read the particular table in the Perforce database. However, any write lock must wait for the read lock to complete.

Below is a command to find all read locks held for more than 10 seconds. Replace the <log> with the log filename. The results will be stored in file: longReadHeldTimes.log. You can then search the log for these times to find the processes that were holding read locks.

grep "total lock wait+held read/write " <log> | awk '{print $6}' | awk -F/ '{print $1}' | awk -F+ '{print $2}' | sed s/ms//g | sort -unr | sed '/.\{5\}/!d' > longReadHeldTimes.log

Write Locks

A write lock blocks any other processes from reading or writing to the particular table in the database.

If a process is waiting its turn for an action to complete, it is a victim. If a process locks the database table and other processes must wait, then it is a culprit. If a large number for read or write lock wait time is seen, look for an earlier corresponding large write lock hold time. This will identify the process holding other processes off.

Below is a command to find all write locks held for more than 10 seconds. You will need to replace the <log> with actual log filename. The results will be stored in file: longWriteHeldTimes.log. You can then search the log for these times to find the processes that were holding write locks.

grep "total lock wait+held read/write " <log> | awk '{print $6}' | awk -F+ '{print $3}' | sed s/ms//g | sort -unr | sed '/.\{5\}/!d' > longWriteHeldTimes.log

Good performance Example

--- db.resolve
---   total lock wait+held read/write 0ms+252ms/0ms+0ms

No time was spent waiting for the db.resolve table locks, and the process held a read lock for 252ms. The lock is only held for a fraction of of a second and therefore is not tying up the db.resolve database file.

Slow performance example

--- total lock wait+held read/write 23763ms+463ms/582568ms+343ms

The process was blocked for 23.763s waiting for a read lock (so it is a victim of another process holding the table).

The process held a read lock for 0.463 seconds (a relatively fast operation).

The process waited for 582.568 seconds before it could issue its own write lock (so now it is very much a victim of one or more processes that could be holding locks on multiple tables).

The process wrote for 0.343 seconds (still a fairly fast operation).

Therefore this process is not causing a performance slowdown, but it was a victim. As an example, the user might believe that Perforce was stalled or crashed during this time, however the server is still running, albeit very slowly. This is typical of a server undergoing a heavy usage spike, and will often "unwind" when the load drops.

Total Lock versus Max Lock

The total lock time is the total time that all locks of each type (read or write) for the table were blocked (wait) and held for the operation. The max lock time is the longest time that any one lock of each type for the table was blocked and held for the operation so it can be seen if one lock used up most of the time. If no more than one lock of each type was taken for the table, no "max lock" tracking entry is reported.

For example, the db.monitor output below lists both total and max locks:

...
$ p4 -Ztrack changes -m366 //depot/blaster/engine/... | grep "\-\-\-"
...
--- db.monitor
---   pages in+out+cached 3+4+2
---   locks read/write 15/2 rows get+pos+scan put+del 15+0+0 1+1
---   total lock wait+held read/write 0ms+1ms/0ms+0ms
---   max lock wait+held read/write 0ms+1ms/0ms+0ms

For this changes command, 15 read locks and 2 write locks were taken for

db.monitor. For these locks:

The time that the 15 read locks were blocked totaled 0ms.

The time that the 15 read locks were held totaled 1ms.

The time that the 2 write locks were blocked totaled 0ms.

The time that the 2 write locks were held totaled ~0ms.

The maximum time that any one of the 15 read locks was blocked was 0ms.

The maximum time that any one of the 15 read locks was held was 1ms.

The maximum time that any one of the 2 write locks was blocked was 0ms.

The maximum time that any one of the 2 write locks was held was ~0ms.

Deciphering change/changelistnumber()

You may see an entry similar to change/3082654(W). It is an exclusive server lock on change 3082654 to ensure that there is no concurrent access to shelf 3082654 while it is being worked on.

Perforce server info:
2018/10/19 09:04:39 pid 11571 bruno@gabriel 172.91.82.126/172.91.10.167 
p4/2018.1/LINUX26X86_64/1559260 (brokered)] 'user-shelve -c 3082654 -d'
--- lapse 4427s
--- change/3082654(W)
---   total lock wait+held read/write 0ms+0ms/0ms+4427001ms

Buffering

In the course of processing a command the server reads and writes data pages to and from a buffer cache:

--- db.revhx
---   pages in+out+cached 182412+0+64

in = number of pages read into the buffer/cache.

out = number of pages written out of the buffer/cache.

cached = number of pages retained in the buffer/cache.

The "cached" value is not an indication that the db.* table is in the filesystem cache. Rather, the "cached" value reflects the maximum number of pages in the (small) process-private memory for each db.* file that is used to minimize reads by caching pages that might be read more than once by only that process.

Table scanning

"Get" fetches a single row given a key. "Position" and "Scan" work together to fetch many rows -- the key is given to "Position", and then "Scan" returns rows after that key.

For example, "Get" is used to get the db.domain record for a client. "Position/Scan" are used to get all db.have records for a client.

--- db.revhx
---   locks read/write 1/0 rows get+pos+scan put+del 0+1+26635 0+0

The database only had to position once and scanned 26,635 revisions. If you see a lot of positioning to read relatively few rows (records), you might need to examine your protection table, client mappings, and command structure. Typically the fewer positions per file, the more efficient the process.

The -Ztrack flag

You can also use the -Ztrack flag on the p4 command-line to give database usage and locking information:

p4 -Ztrack submit

Once the command is completed the resulting database statistics are displayed in the same format as the log file.

Additional tools

To help analyze your server and log files, consider using:

  • TRACK2SQL - a publicly available script that will create a mysql database of process information from a log file. This can be useful for identifying large processes that may affect overall server performance.
  • Simple P4D Log Analysis - a simple command to analyze vtrack results.

If you have questions about your server performance and want assistance in your log file analysis, contact Perforce support.