Helix Core 2018.2 Failover
November 8, 2018

New, Better Control for Helix Core Service Interruptions


The new p4 failover command provides for better administrative control of both planned and unplanned service interruptions of a Helix Core server providing standard, commit-server, or edge-server services.

Helix Core has supported standby and forwarding-standby server services, generally referred to as standby servers, for some time. A server with either of these services makes a byte-for-byte copy of the journal from a target server. That copy is often referred to as the journalcopy, which is easily confused with the standby server's "journalcopy" background thread. Context is everything! The journalcopy thread is responsible for retrieving transactions from its target server's journal. The target server is generally referred to as the master server, though it could be a Helix Core server providing standard, commit-server, or edge-server services.

Customers have used custom scripts to move activity, or fail over, from a master server to a standby server. The functionality of some of these scripts has been described as "transferring the flag from one server to another." Additional Helix components, such as the Helix Broker, were sometimes used with the scripts to return an informative message to users during the failover.

While the scripted solutions address some of the functionality needed for a successful failover, a number of customers have requested that the failover functionality be folded into the Helix Core server.

Failover Functionality

The p4 failover command has been introduced as part of the 2018.2 release of Helix Core. This command is run on a standby server and orchestrates the functionality needed to fail over from its master server.

The command can be used to fail over from a master server that is no longer accessible. But if the master server is accessible, the command will by default fail over with the master server participating for the best failover possible – including an orderly shutdown of the master server at the right time during the failover sequence.

Though it's possible to ignore an accessible master server while failing over, care should be taken to ensure that ignoring a healthy master server does not result in two master servers processing divergent datasets. If instructed to ignore a master server that is accessible, the p4 failover command will issue a warning about the potential split-brain scenario.
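The two modes described above can be sketched as a small helper that assembles the command line to run on the standby server. This is only an illustration: the helper name and its parameters are hypothetical, and the -y (perform the failover rather than only report what would be done) and -i (ignore the master server) flags are not mentioned in this post. They are assumptions based on the p4 failover command reference and should be confirmed with p4 help failover before use.

```python
from typing import List


def build_failover_command(commit: bool = False, ignore_master: bool = False) -> List[str]:
    """Assemble an argv list for 'p4 failover', to be run on the standby server.

    Hypothetical helper. The flags are assumptions to verify with 'p4 help failover':
      -y  perform the failover instead of only reporting what would be done
      -i  ignore the master server (risks split-brain if the master is healthy)
    """
    argv = ["p4", "failover"]
    if commit:
        argv.append("-y")
    if ignore_master:
        argv.append("-i")
    return argv


if __name__ == "__main__":
    # Default: the master participates and nothing is committed yet.
    print(" ".join(build_failover_command()))
    # Master unreachable: commit the failover without its participation.
    print(" ".join(build_failover_command(commit=True, ignore_master=True)))
```

Building the command as a list rather than a single string keeps it safe to hand to subprocess.run without shell quoting concerns.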

Setting a Mandatory Standby Server

Also new in the 2018.2 release of Helix Core is a “mandatory” option for standby servers. If the master server does not participate, failing over to a mandatory standby server ensures that the downstream replicas are consistent with the new master server.

Consistency of the downstream replicas is assured since transactions must be journalcopy'd by all mandatory standby servers before the transactions are replicated to any downstream replicas. But if the master server does participate in the failover, either a mandatory or non-mandatory standby server can be used for the failover.
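The gating rule in the paragraph above can be pictured with a toy model. This is a conceptual sketch only, not how the Helix Core server is implemented: the Standby class, the replicable_offset function, and the use of a single integer as a journal position are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Standby:
    server_id: str      # hypothetical server ID
    mandatory: bool     # whether the mandatory option is set
    copied_offset: int  # journal position this standby has journalcopy'd so far


def replicable_offset(master_offset: int, standbys: Iterable[Standby]) -> int:
    """Return the journal position up to which downstream replicas may replicate.

    Toy rule: replicas may not get ahead of the slowest mandatory standby.
    With no mandatory standbys, only the master's own journal position limits them.
    """
    mandatory_offsets = [s.copied_offset for s in standbys if s.mandatory]
    if not mandatory_offsets:
        return master_offset
    return min(master_offset, min(mandatory_offsets))


if __name__ == "__main__":
    standbys = [
        Standby("standby-rack", mandatory=True, copied_offset=900),
        Standby("standby-remote", mandatory=False, copied_offset=700),
    ]
    # The non-mandatory standby at 700 does not hold replicas back;
    # the mandatory standby at 900 does.
    print(replicable_offset(1000, standbys))
```

In this model, any transaction a downstream replica has seen is already held by every mandatory standby, which is why failing over to one of them cannot leave a replica ahead of the new master.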

Since the performance of a mandatory standby server's journalcopy thread affects when transactions are replicated, mandatory standby servers should be deployed at locations relative to the master server that favor good journalcopy performance. For example, a standby server deployed in the same rack as the master server for the purposes of high availability is generally a good candidate as a mandatory standby server.

It is recommended that a standby server first be deployed without the mandatory option, and that the option be changed only after validating the standby server's journalcopy performance during heavier usage on the master server. A standby server's journalcopy performance can be monitored by running the p4 servers -J command on the master server. Once validated, the mandatory option can be set for the standby server by running the p4 server <standby-serverID> command on the innermost master server. Changing the mandatory option takes effect immediately if the master server is running, and the option can be changed while the standby server is running.
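For administrators who script spec changes instead of editing them interactively, the change can be sketched as a text transformation on the server spec. This is a hedged sketch: the set_mandatory function is hypothetical, the spec excerpt is abbreviated and illustrative, and it assumes the option appears in the spec's Options field as either mandatory or nomandatory, which should be checked against the output of p4 server -o for a real standby server.

```python
import re

# Matches an Options line whose value is 'mandatory' or 'nomandatory' (assumed field layout).
_OPTIONS_RE = re.compile(r"^(Options:[ \t]*)(?:no)?mandatory\b", re.MULTILINE)


def set_mandatory(spec_text: str, mandatory: bool = True) -> str:
    """Return spec_text with the Options field switched to mandatory or nomandatory."""
    wanted = "mandatory" if mandatory else "nomandatory"
    new_text, count = _OPTIONS_RE.subn(lambda m: m.group(1) + wanted, spec_text)
    if count == 0:
        raise ValueError("no Options field with a mandatory setting found")
    return new_text


if __name__ == "__main__":
    # Abbreviated, illustrative spec text; a real spec contains more fields.
    sample = "ServerID:\tstandby-1\nOptions:\tnomandatory\n"
    print(set_mandatory(sample))
```

In practice the input would come from p4 server -o <standby-serverID> and the result would be fed back with p4 server -i, both run against the master server as described above.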

Consistent Replicas Following Failover

Other changes have been made in the Helix Core 2018.2 release to ensure the consistency of servers following a failover. One such change can affect when the transactions journalcopy'd to a standby server are applied to that standby server's metadata.

In prior releases, transactions were applied to a standby server's metadata by its "pull -L" thread as the transactions were journalcopy'd from the master server. As of the current release, if at least one of the standby servers is mandatory, the transactions must be journalcopy'd by all mandatory standby servers before the transactions are applied to the metadata of any standby server. This change, along with other related changes, ensures that the metadata on all standby servers is consistent with the new master server when failing over to a mandatory standby server without the participation of the old master server.

We encourage you to configure and test this new functionality so that you can use it when it is needed most. And as always, we appreciate your feedback. For additional information on the new failover functionality, see the Helix Core Server Administrator Guide and the Helix Core 2018.2 release notes.