Failover

Failover is the process by which a standby server (or forwarding-standby server) replaces a server that provides standard, commit-server, or edge-server services. The server replaced during a failover is generally referred to as a "master" server.

High Availability and Disaster Recovery

The Failover feature supports two scenarios:

  • High Availability (HA)
    • The master can be configured as a standard server, a commit server, or an edge server.

    • Typically, the standby server is in the same hardware rack as the master server

    • Typical use case: scheduled maintenance, but also possible if the master hardware fails

    • Typically, the master server participates in the failover process:

      • disabling itself in an orderly fashion

      • waiting for the journalcopy of the remaining transactions to the standby

      • allowing the standby to stop the master

      Note

      If the master server does not participate in the failover, a check is made to ensure that the standby server to which failover is to occur has the mandatory option set. Without the participation of the master server, failing over to a mandatory standby server is required to ensure that the other replicas remain consistent with the new master server after failover. Consistency is assured because during production operations, metadata must be journalcopy'd by all mandatory standby servers before that metadata is replicated to the other replicas. Deploying one or more mandatory standby servers local to the master server is recommended. This is because journalcopy performance of the mandatory standby servers can affect the production replication to the other replicas.

  • Disaster Recovery (DR)
    • Typical use case: a sudden catastrophe leaves the master server and any HA standbys unable to operate.
    • Contact support for assistance with failing over to a non-mandatory standby server when the master server is inaccessible.

Consistency of the downstream replicas is assured for failing over when:

  • the master server participates, in which case:
    • the standby server need not be a "mandatory" standby
    • the standby server's journalcopy, pull -L, and pull -u threads are an integral part of the failover
  • the master server does not participate and the standby server is a "mandatory" standby, in which case only the standby server's pull -L thread is an integral part of the failover
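
The two cases above reduce to one rule that can be checked before attempting a failover. The following is a minimal sketch of that rule only. The function name and its yes/no arguments are hypothetical, and the helper is not part of p4; the operator supplies both answers.

```shell
# Hypothetical operator-side helper, not part of p4.
# Encodes the rule above: consistency of the downstream replicas is assured
# if the master participates, or if the standby is a mandatory standby.
# Usage: consistency_assured <master_participates: yes|no> <standby_mandatory: yes|no>
consistency_assured() {
    master_participates=$1
    standby_mandatory=$2
    if [ "$master_participates" = "yes" ]; then
        # With the master participating, the standby need not be mandatory.
        return 0
    fi
    # Without the master, only a mandatory standby qualifies.
    [ "$standby_mandatory" = "yes" ]
}
```

For example, `consistency_assured no no` returns a non-zero status, which corresponds to the case where support should be contacted.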

Prerequisites for a successful failover

  • The p4 failover command must be run on a server of Type standby or forwarding-standby. See the Standby and forwarding-standby server topic.

  • The standby (or forwarding-standby) server must be appropriately licensed for its new role following the failover. We therefore recommend that you submit a Duplicate Server Request.
  • Make sure that monitoring (see p4 monitor) is enabled on the standby server that will take over from the master or commit server.
    • Monitoring must be enabled at server startup of the standby prior to running the p4 failover command because the monitor subsystem is used to terminate the journalcopy, pull -L, and pull -u threads during the failover sequence.

  • Open the server spec for each standby and forwarding-standby server. In the ReplicatingFrom field, enter the value of the master server's P4PORT. This will allow the statefailover file in the P4ROOT of the new master server to be automatically deleted when it is no longer needed.

  • If an edge server is being failed over, the service user of the edge server should be logged into the commit (or master) server using the file specified by the P4TICKETS configurable (and likely the P4TRUST configurable) defined for the standby of the edge server. For example, issue the following command on the standby server that will become the new master:
    p4 -E P4TICKETS=directory/.p4tickets -p master:port -u service-user login
  • We recommend that a DNS alias point to the IP address of the master server. After failover, the same DNS alias can then be updated to point to the new master server (former standby server).
Note

Failing over to a standby or forwarding-standby

Failing over to a dedicated standby is generally faster than failing over to a forwarding-standby. For situations where failover completion is less time-critical, you might want to consider a forwarding-standby. See "standby" and "forwarding standby" in p4 server in Helix Core P4 Command Reference.

High availability with the mandatory server specification option

Important

A high availability standby within an existing installation should not be initially deployed as mandatory.

To deploy standby servers with minimal interruption to replication, make sure the journalcopy thread of the new standby server is caught up with the server from which it is journalcopying BEFORE you set the standby to mandatory. Follow this process:

  1. Deploy the standby with the default, which is nomandatory
  2. To monitor the progress of the standby's journalcopy, on the server from which the standby is journalcopying, invoke p4 servers -J

    In this example, we have invoked p4 servers -J on master, and we see that standby2 has 400, which does not yet match the 682 value on master:

    Later, again on master, that is, the server from which the standby is journalcopying, we invoke p4 servers -J again.
    This example shows that standby2 has progressed to 682, which matches master and indicates that standby2 has a current journalcopy.


  3. Change the server spec for standby2 to specify mandatory

    On the innermost master server, in the server specification for standby2, under Options, mandatory is now appropriate for a standby (or forwarding-standby) server. This option ensures that no replica has metadata that has not been copied to the journalcopy of all mandatory standby (or forwarding-standby) servers.

    If the master were unavailable, standby1, which is not a mandatory standby, could not be used for failover

    If the master is available, all four of the standbys could be used for failover.
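
The comparison in step 2 can be sketched as a small helper. It takes the two values that the operator has read from the p4 servers -J output, so it makes no assumption about the layout of that output. The function name is hypothetical, and the numbers are the ones used in the example above.

```shell
# Hypothetical helper, not part of p4: compares the value reported for the
# master with the value reported for the standby. Both values are read by
# the operator from `p4 servers -J`; parsing that output is not shown here.
# Usage: journalcopy_caught_up <master_value> <standby_value>
journalcopy_caught_up() {
    master_value=$1
    standby_value=$2
    if [ "$standby_value" -ge "$master_value" ]; then
        echo "caught up"
    else
        echo "behind by $((master_value - standby_value))"
    fi
}

journalcopy_caught_up 682 400   # prints "behind by 282"
journalcopy_caught_up 682 682   # prints "caught up"
```

On a busy master the value keeps advancing, so a result of "caught up" describes only the moment at which the two values were read.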

Note

If the server from which failover is to occur is not participating in the failover (because the master is unavailable or the -i option causes the master to be ignored), the p4 failover command returns an error if it is running on a standby (or forwarding-standby) server that is not properly configured with the mandatory option.

Disaster recovery with the nomandatory server specification option

For disaster recovery failover, the remote standby typically has a server specification with the Options field set to the default value, which is nomandatory. This is because the journalcopy performance of a mandatory standby can affect the speed of replication to the replicas of the master.

Potential data loss

If the master participates

  • Any commands that were not completed when failover began might need to be executed again on the new master server.

  • There should not be any data loss.

If the master does not participate

  • Standby is mandatory:

    • The downstream replicas are consistent with the new master server.

    • The downstream replicas will not have data loss relative to the new master server.

  • Any commands that were not completed when failover began might need to be executed again on the new master server.

  • The transactions that were done directly on the master prior to the failover that had not yet been journalcopy'd to the standby being used for the failover will be lost.

  • To minimize data loss, the standby used for the failover should be the standby that was the most current with the master at the time of the failover. Typically, this is the standby that is in the same rack with the master.

Failover process

The Failover feature allows the super user to:

  1. Get a report of whether conditions look good for a successful failover.
    Warning

    If the report indicates that the existing master server is still accessible and ignoring that server has been requested with the -i option, this could result in two separate servers, each of which is unaware of the other. This "split-brain" situation can produce inconsistencies that compromise the integrity of your data.

  2. Initiate the failover process.
    1. This automatically stops the standby (or forwarding standby) server that will become the new master.

    2. During the failover process, the master server does not process any new commands and end-users get the "failoverMessage" (see the p4 failover command).

    3. A verification process ensures that recent file content was correctly replicated to the new master. See the p4 failover command for the -v option.

    4. During the failover process, the P4ROOT directory will get a new file named statefailover. This file is the last consistency point journalcopy'd by the standby immediately prior to the failover. This file will be deleted by the new master server when it is no longer needed.

      For example,
      p4 failover
      Make sure the preview looks OK.
      If so, then run
      p4 failover -y

  3. Monitor the steps that are reported during the process. If the Failover process encounters an error, the process is designed to inform the superuser and to stop the failover process so that corrective action can be taken and a new attempt can occur.

  4. If an error is encountered after the standby server has stopped the master server, the standby server will not restart the master server.

  5. Verify, after the completion of a successful failover, that the former standby (or forwarding standby) has been restarted as the new master by issuing the p4 info command and checking the ServerID to ensure that it is the same ServerID that the previous master server used.

  6. Following a successful failover, site-specific changes might be needed to use the new master server. It might be necessary to make DNS changes so that users and replicas can connect to the new master server. For example,

    • If you have a DNS alias set up, update the IP address of that DNS alias to point to the IP address of the new master or commit server.
    • If you do not have a DNS alias set up,
      • change the P4TARGET setting for each replica or edge server: issue the p4 configure show allservers command to review the current values, then issue
        p4 configure set "replica-name#P4TARGET=new-master-server:port-number"
      • update your server specifications with the proper hostname and port number by issuing the p4 server servername command.
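
When there are several replicas and no DNS alias, the P4TARGET commands can be generated for review before any of them is run. The following is a sketch under stated assumptions: the function name is hypothetical, the replica names and address are placeholders, and the helper only prints commands of the form shown above without executing them.

```shell
# Hypothetical helper, not part of p4: prints (does not run) one
# `p4 configure set` command per replica so that the commands can be
# reviewed first.
# Usage: print_p4target_updates <new-master-host:port> <replica-name>...
print_p4target_updates() {
    new_target=$1
    shift
    for replica in "$@"; do
        printf 'p4 configure set "%s#P4TARGET=%s"\n' "$replica" "$new_target"
    done
}

# Placeholder replica names and address, for illustration only.
print_p4target_updates new-master:1666 replica1 edge1
```

Printing rather than executing keeps the change reviewable. Once the list looks right, the printed lines can be run by a super user on the new master.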

The end users can now issue new commands.
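
The ServerID check in step 5 can be scripted. This sketch assumes that the p4 info output contains a line of the form "ServerID: value"; confirm this against your server's output. The function names are hypothetical, and the expected ID is whatever the previous master used.

```shell
# Reads `p4 info` output on stdin and prints the value of the ServerID line.
# Assumption: the line has the form "ServerID: <value>".
extract_server_id() {
    sed -n 's/^ServerID: //p'
}

# Hypothetical check, not part of p4: succeeds only if the ServerID read
# from stdin equals the expected one.
# Usage: p4 info | server_id_matches <expected-server-id>
server_id_matches() {
    expected=$1
    actual=$(extract_server_id)
    [ "$actual" = "$expected" ]
}
```

A non-zero exit status means that the server answering on that port is not reporting the ServerID of the previous master, which is worth investigating before users are redirected.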

Note that you can Failback after failover.

  • Configurables affected

    The failover process:

    Configurables and edge server

    When failing over to a standby from an edge (or other replica) server, the updated configurables for the edge server will need to be manually changed on the commit server. This is because the update of the configurables cannot be propagated back to the commit (or upstream) server automatically, given that the edge server might, or might not, be participating in the failover.