System resources

The Helix Core Server can monitor CPU pressure and memory usage. When resources become limited, the server can also reduce the amount of work it accepts by

  • pausing incoming commands when medium thresholds are reached

  • terminating incoming commands when high thresholds are reached

The server thereby regulates resource usage and maintains more consistent performance. Such resource monitoring can prevent large spikes of resource usage because the server spreads the load over a longer period of time.

Servers with a high number of concurrently-running commands benefit the most. Configure your server monitoring so that the thresholds are above the day-to-day ceiling of load. This results in the least amount of performance change from the baseline. Setting the thresholds too low will result in resource under-utilization and unnecessary slowdowns.

For additional information and advice about how to set the limits, see the output of p4 help server-resources.

Important

The system resource monitoring configurables described below are in Technology Preview.

Features offered in Technology Preview are experimental and not guaranteed to always work as expected. If you have feedback and functionality suggestions, email [email protected].

Configurables

When any resource has exceeded its configured threshold, the server begins pausing incoming commands until resource usage has dropped below the threshold.

pausing

Configurable

Meaning

sys.pressure.max.pause.time

Number of seconds a command is able to spend in the paused state before the server returns an error to the client. Setting this configurable to 0 disables pausing commands entirely.

Tip

Before enabling the full resource monitoring configuration, run in preview mode for a while to gauge the effect of the configured thresholds. In preview mode, the feature is configured but sys.pressure.max.pause.time is set to 0, so the p4 admin resource-monitor background task samples resources and sets pressure levels, but commands are not subject to pausing.

sys.pressure.max.paused

Maximum number of concurrently-paused client commands on the server. New incoming commands above this threshold will be rejected with an error.

Some administrative and replication commands are not subject to pausing.

A Helix Core Server Extension could register a "pressure-pause" hook to be called periodically while a command is paused. Such an Extension could continue the pause, unpause, or defer the choice to the server.

A paused command is visible with p4 monitor show. The time spent in the paused state is recorded in the tracking entries of the server log files. See paused state in P4LOG in Helix Core Server Administrator Guide.

percentage-based memory

Percentage-based memory thresholds, in the range 0-100, are based on the ratio of total system memory to the memory available for use without swapping:

Configurable

Meaning

sys.pressure.mem.medium When commands begin pausing.
sys.pressure.mem.high The operating system is about to thrash into swap or become unstable, such as having processes subject to the Out Of Memory Killer. Existing commands that request more memory while the server is above the 'high' threshold might be canceled and return an error to the client.
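
To get a rough feel for where a host sits relative to a percentage threshold, the ratio described above can be approximated from the operating system's own counters. The sketch below is an illustration only: the helper names are hypothetical, and the formula (used = total minus available, as a percentage of total) is an assumption about how such a ratio might be computed, not the server's documented calculation.

```shell
# Hypothetical helper: memory in use as an integer percentage (0-100).
# Arguments: total_kb available_kb, e.g. MemTotal and MemAvailable on Linux.
# ASSUMPTION: used% = 100 * (total - available) / total.
mem_used_pct() {
    total_kb=$1
    avail_kb=$2
    echo $(( (total_kb - avail_kb) * 100 / total_kb ))
}

# Hypothetical helper: read both values from a meminfo-style file.
# Defaults to /proc/meminfo, which exists on Linux only.
mem_used_pct_from_file() {
    file=${1:-/proc/meminfo}
    total_kb=$(awk '/^MemTotal:/ {print $2}' "$file")
    avail_kb=$(awk '/^MemAvailable:/ {print $2}' "$file")
    mem_used_pct "$total_kb" "$avail_kb"
}
```

On a Linux host, running mem_used_pct_from_file over a typical working day gives a baseline to compare against candidate values for sys.pressure.mem.medium and sys.pressure.mem.high.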

OS-supplied resource pressure thresholds

Where available, the server also uses more accurate operating system resource interfaces to get information about high memory usage. This is available on Linux with the cgroup v2 mechanism. See the Linux kernel's definition of Pressure Stall Information (PSI) for further details.

Configurable

Meaning

sys.pressure.os.mem.medium (Linux) The server tries to keep memory usage below the 'medium' level.
sys.pressure.os.mem.high (Windows and Linux) Amount of time some processes on the system spend stalled waiting for memory. New incoming commands received by the server while at this threshold are rejected. Existing commands that request more memory while the server is above this threshold might be canceled and return an error to the client. When the Helix Core Server limits its work, it does not distinguish between its own memory use and memory used by other processes on the operating system. For example, if a large external process consumes a large amount of memory, the Helix Core Server can throttle itself in response.
sys.pressure.os.cpu.high

(Linux) This configurable represents the amount of time some processes on the system are waiting for CPU time. For Linux cgroup support, only the system-wide /proc/pressure/* files are considered.

CPU monitoring is only available if cgroups v2 support is enabled. For example, on RHEL 8:

Configure the system to mount 'cgroup-v2':

    `grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=1 psi=1"`

Reboot, then verify that `cgroup-v2` was mounted:

    mount -l | grep cgroup
    cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel,nsdelegate)
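
Once PSI is enabled, the kernel exposes the system-wide pressure figures as text under /proc/pressure/. The sketch below shows one way to inspect the "some" line, which reports the share of time at least one task was stalled. The function name is hypothetical, and how the server maps these figures onto sys.pressure.os.cpu.high is not described here, so treat this only as a way to see what the kernel reports.

```shell
# Hypothetical helper: print the 10-second average from the "some" line
# of PSI-formatted input on stdin. A PSI line looks like:
#   some avg10=0.00 avg60=0.00 avg300=0.00 total=0
psi_some_avg10() {
    awk '$1 == "some" { sub(/^avg10=/, "", $2); print $2 }'
}

# Example against a live Linux system with PSI enabled:
#   psi_some_avg10 < /proc/pressure/cpu
#   psi_some_avg10 < /proc/pressure/memory
```

If /proc/pressure/cpu does not exist after the reboot, PSI is probably not enabled in the running kernel.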

Duration

Number of milliseconds for averaging samples of resource pressure. A larger duration makes the server less sensitive to changes in pressure. The 'high' threshold must be set higher than the corresponding 'medium' threshold. Otherwise, the default values are used.

Configurable

Meaning

sys.pressure.mem.high.duration Number of milliseconds for averaging sys.pressure.mem.high
sys.pressure.mem.medium.duration Number of milliseconds for averaging sys.pressure.mem.medium
sys.pressure.os.cpu.high.duration Number of milliseconds for averaging sys.pressure.os.cpu.high
sys.pressure.os.mem.high.duration Number of milliseconds for averaging sys.pressure.os.mem.high
sys.pressure.os.mem.medium.duration Number of milliseconds for averaging sys.pressure.os.mem.medium
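
Because a 'high' threshold that is not above its 'medium' counterpart causes the defaults to be used, it can help to check a pair before applying it. The following is a minimal sketch for the percentage-based memory pair. The function names and the values 70 and 90 are hypothetical, and the commands are only printed, not run.

```shell
# Hypothetical helper: succeed if the argument is an integer in 0-100.
is_pct() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -le 100 ]
}

# Hypothetical helper: succeed only when high is strictly above medium.
# Arguments: medium high
valid_threshold_pair() {
    is_pct "$1" && is_pct "$2" && [ "$2" -gt "$1" ]
}

# Print (do not run) the p4 commands for a valid memory threshold pair.
print_mem_thresholds() {
    if valid_threshold_pair "$1" "$2"; then
        echo "p4 configure set sys.pressure.mem.medium=$1"
        echo "p4 configure set sys.pressure.mem.high=$2"
    else
        echo "refusing: high ($2) must exceed medium ($1), both in 0-100" >&2
        return 1
    fi
}
```

For example, print_mem_thresholds 70 90 prints two p4 configure set lines for review, while print_mem_thresholds 90 70 prints an error and returns a non-zero status.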

Prerequisites

The prerequisites for the server to be able to respond to resource pressure are:

  • Operating system support

  • Real-time monitoring counters enabled

  • The p4 admin resource-monitor server startup command running. See the command-line output of p4 help admin-resource-monitor

  • Existing resource usage baselines have been established

  • Enough headroom between the baseline usage and the medium/high thresholds that command pausing is not always active

For example:

p4 serverid $name
p4 configure set rt.monitorfile=$monitor_file
p4 configure set "$name#startup.1=admin resource-monitor"
p4 admin restart

Exceptions to pausing

The following commands are not subject to being paused under pressure:

admin configure counter counters dbstat dbverify depots diskspace export extension failback failover heartbeat info journalcopy journaldbchecksums journals lockstat logappend login login2 logout logparse logrotate logschema logstat logtail monitor passwd ping protect pull serverid servers topology triggers user
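
When reviewing logs or writing tooling around pausing, it can be handy to test whether a given command is on the list above. This is a small sketch with hypothetical names. The list is copied from this section and takes the bare command name, without the user- prefix that appears in server logs.

```shell
# Commands that are not paused under pressure, as listed in this section.
EXEMPT_COMMANDS="admin configure counter counters dbstat dbverify depots diskspace export extension failback failover heartbeat info journalcopy journaldbchecksums journals lockstat logappend login login2 logout logparse logrotate logschema logstat logtail monitor passwd ping protect pull serverid servers topology triggers user"

# Hypothetical helper: succeed if the command name is exempt from pausing.
is_pause_exempt() {
    case " $EXEMPT_COMMANDS " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, is_pause_exempt info succeeds, while is_pause_exempt fstat fails because fstat is not in the list.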

Preview and activate monitoring

Follow these steps to preview monitoring, and then activate monitoring.

  1. Set a server ID name with p4 serverid $name

  2. Enable real-time monitoring with p4 configure set rt.monitorfile=$monitor_file

  3. Enable the resource monitoring background process with p4 configure set "$name#startup.1=admin resource-monitor"

  4. Enable preview mode (no pausing) with p4 configure set sys.pressure.max.pause.time=0

  5. Restart the server with p4 admin restart

  6. To preview what the pauses would be, check the "Server under resource pressure. Pause rate " message in the log entries of the p4 admin resource-monitor background task.

  7. Adjust the configurables, if necessary, until you are satisfied with the preview results

  8. Activate monitoring with p4 configure unset sys.pressure.max.pause.time
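
Steps 1 through 5 above can be gathered into a single reviewable script. The sketch below only prints the commands so they can be inspected first. The function name is hypothetical, and it assumes that the startup.1 slot is not already used by another startup command on the server.

```shell
# Hypothetical helper: print the preview-mode setup commands in order.
# Arguments: server_id_name monitor_file_path
# ASSUMPTION: startup.1 is free; pick another startup.N if it is not.
print_preview_setup() {
    name=$1
    monitor_file=$2
    cat <<EOF
p4 serverid $name
p4 configure set rt.monitorfile=$monitor_file
p4 configure set "$name#startup.1=admin resource-monitor"
p4 configure set sys.pressure.max.pause.time=0
p4 admin restart
EOF
}
```

After checking the output, the commands can be run by hand or piped to sh. Step 8 remains a separate, deliberate action taken once the preview results look acceptable.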

Disable monitoring

To disable monitoring, do any of the following:

  • Turn off the p4 admin resource-monitor startup command, either with p4 monitor terminate, or by removing the startup configurable and restarting the server.

  • Change the sys.pressure.max.pause.time configurable to 0

  • Change the values of the configurable thresholds (sys.pressure.os.*.high) to 0.

Example of resource pressure actions

When resource pressure thresholds are reached or exceeded, you might see results similar to these examples.

Server log:

2024/06/19 12:25:31 560465376 pid 1056102: Server is now using 55 active threads.
2024/06/19 12:25:31 560486548 pid 1056102: Server now has 10 paused threads.

where paused threads are due to resource monitoring being active.

For any paused command, if the track configurable is set to 1, you might see:

Perforce server info:
2024/06/19 12:25:31 pid 1056864 perforce@ip-10-0-0-106 127.0.0.1 [p4/2024.1/LINUX26X86_64/2611120] 'user-fstat -Ob //...'
--- lapse 8.39s
--- paused 1.20s
--- usage 598+67us 304+0io 0+0net 68864k 0pf
--- memory cmd/proc 74mb/74mb
...

which indicates that the command was paused for 1.20 seconds.
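
To see how much time commands spend paused overall, the "--- paused" tracking lines can be totaled. This is a minimal sketch that assumes the lines have exactly the shape shown in the example above. The function name is hypothetical.

```shell
# Hypothetical helper: sum the seconds from "--- paused N.NNs" lines on stdin.
# Lines of any other shape are ignored.
sum_paused_seconds() {
    awk '$1 == "---" && $2 == "paused" { sub(/s$/, "", $3); total += $3 }
         END { printf "%.2f\n", total }'
}

# Example, where the log path is a placeholder:
#   sum_paused_seconds < /path/to/server/log
```

Comparing this total across preview runs with different thresholds gives a rough measure of how aggressive a configuration would be.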

For a command that was rejected due to resource monitoring thresholds being exceeded, the entry from track output would say exited on fatal server error:

Perforce server error:
Date 2024/06/19 12:25:31:
Pid 1056860
Operation: user-fstat
Operation 'user-fstat' failed.
Too many commands paused;  terminated.
Perforce server info:
2024/06/19 12:25:31 pid 1056860 completed .008s 0+0us 0+8io 0+0net 12828k 0pf
Perforce server info:
2024/06/19 12:25:31 pid 1056860 perforce@ip-10-0-0-106 127.0.0.1 [p4/2024.1/LINUX26X86_64/2611120] 'user-fstat -Ob //...'
--- exited on fatal server error
--- lapse .008s
--- memory cmd/proc 21mb/21mb
...

and the error entry would also be in the errors.csv structured log.
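
A quick way to check whether sys.pressure.max.paused is being hit is to count these rejections in the server log. The sketch below matches the message text from the example above. The function name is hypothetical, and the wording of the message may differ between server versions.

```shell
# Hypothetical helper: count "Too many commands paused" errors on stdin.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the
# function usable in scripts that run with "set -e".
count_pause_rejections() {
    grep -c 'Too many commands paused' || true
}

# Example, where the log path is a placeholder:
#   count_pause_rejections < /path/to/server/log
```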