Messages received by a replica may be marked with an special flag that indicates the message is permanent. Custom replicated applications will receive notification of this flag via the DB_REP_ISPERM return value from the method. There is no hard requirement that a replication application look for, or respond to, this return code. However, because robust replicated applications typically do manage permanent messages, we introduce the concept here.
A message is marked as being permanent if the message affects transactional integrity. For example, transaction commit messages are an example of a message that is marked permanent. What the application does about the permanent message is driven by the durability guarantees required by the application.
For example, consider what the replication framework does when it has permanent message handling turned on and a transactional commit record is sent to the replicas. First, the replicas must transactional-commit the data modifications identified by the message. And then, upon a successful commit, the replication framework sends the master a message acknowledgment.
For the master (again, using the replication framework), things are a little more complicated than simple message acknowledgment. Usually in a replicated application, the master commits transactions asynchronously; that is, the commit operation does not block waiting for log data to be flushed to disk before returning. So when a master is managing permanent messages, it typically blocks the committing thread immediately before commit() returns. The thread then waits for acknowledgments from its replicas. If it receives enough acknowledgments, it continues to operate as normal.
If the master does not receive message acknowledgments — or, more likely, it does not receive enough acknowledgments — the committing thread flushes its log data to disk and then continues operations as normal. The master application can do this because replicas that fail to handle a message, for whatever reason, will eventually catch up to the master. So by flushing the transaction logs to disk, the master is ensuring that the data modifications have made it to stable storage in one location (its own hard drive).
There are two reasons why you might choose to not implement permanent messages. In part, these go to why you are using replication in the first place.
One class of applications uses replication so that the application can improve transaction through-put. Essentially, the application chooses a reduced transactional durability guarantee so as to avoid the overhead forced by the disk I/O required to flush transaction logs to disk. However, the application can then regain that durability guarantee to a certain degree by replicating the commit to some number of replicas.
Using replication to improve an application's transactional commit guarantee is called replicating to the network.
In extreme cases where performance is of critical importance to the application, the master might choose to both use asynchronous commits and decide not to wait for message acknowledgments. In this case the master is simply broadcasting its commit activities to its replicas without waiting for any sort of a reply. An application like this might also choose to use something other than TCP/IP for its network communications since that protocol involves a fair amount of packet acknowledgment all on its own. Of course, this sort of an application should also be very sure about the reliability of both its network and the machines that are hosting its replicas.
At the other end of the extreme, there is a class of applications that use replication purely to improve read performance. This sort of application might choose to use synchronous commits on the master because write performance there is not of critical performance. In any case, this kind of an application might not care to know whether its replicas have received and successfully handled permanent messages because the primary storage location is assumed to be on the master, not the replicas.
With the exception of a rare breed of replicated applications, most masters need some view as to whether commits are occurring on replicas as expected. At a minimum, this is because masters will not flush their log buffers unless they have reason to expect that permanent messages have not been committed on the replicas.
That said, it is important to remember that managing permanent messages involves a fair amount of network traffic. The messages must be sent to the replicas and the replicas must then acknowledge the message. This represents a performance overhead that can be worsened by congested networks or outright outages.
Therefore, when managing permanent messages, you must first decide on how many of your replicas must send acknowledgments before your master decides that all is well and it can continue normal operations. When making this decision, you could decide that all replicas must send acknowledgments. But unless you have only one or two replicas, or you are replicating over a very fast and reliable network, this policy could prove very harmful to your application's performance.
Therefore, a common strategy is to wait for an acknowledgment from a simple majority of replicas. This ensures that commit activity has occurred on enough machines that you can be reliably certain that data writes are preserved across your network.
Remember that replicas that do not acknowledge a permanent message are not necessarily unable to perform the commit; it might be that network problems have simply resulted in a delay at the replica. In any case, the underlying DB replication code is written such that a replica that falls behind the master will eventually take action to catch up.
Depending on your application, it may be possible for you to code your permanent message handling such that acknowledgment must come from only one or two replicas. This is a particularly attractive strategy if you are closely managing which machines are eligible to become masters. Assuming that you have one or two machines designated to be a master in the event that the current master goes down, you may only want to receive acknowledgments from those specific machines.
Finally, beyond simple message acknowledgment, you also need to implement an acknowledgment timeout for your application. This timeout value is simply meant to ensure that your master does not hang indefinitely waiting for responses that will never come because a machine or router is down.
How you implement permanent message handling depends on which API you are using to implement replication. If you are using the replication framework, then permanent message handling is configured using policies that you specify to the framework. In this case, you can configure your application to:
Ignore permanent messages (the master does not wait for acknowledgments).
Require acknowledgments from a quorum. A quorum is reached when acknowledgments are received from the minimum number of electable replicas needed to ensure that the record remains durable if an election is held.
The goal here is to be absolutely sure the record is durable. The master wants to hear from enough electable replicas that they have committed the record so that if an election is held, the master knows the record will exist even if a new master is selected.
This is the default policy.
Require an acknowledgment from at least one replica.
Require acknowledgments from all replicas.
Require an acknowledgment from a peer. (The replication framework allows you to designate one environment as a peer of another).
Require acknowledgments from all peers.
Note that the replication framework simply flushes its transaction logs and moves on if a permanent message is not sufficiently acknowledged.
For details on permanent message handling with the replication framework, see Permanent Message Handling.
If these policies are not sufficient for your needs, or if you want your application to take more corrective action than simply flushing log buffers in the event of an unsuccessful commit, then you must use write a custom replication implementation.
For custom replication implementation, messages are sent from the master to its replica using a send() callback that you implement. Note, however, that DB's replication code automatically sets the permanent flag for you where appropriate.
If the send() callback returns with a non-zero status, DB flushes the transaction log buffers for you. Therefore, you must cause your send() callback to block waiting for acknowledgments from your replicas. As a part of implementing the send() callback, you implement your permanent message handling policies. This means that you identify how many replicas must acknowledge the message before the callback can return 0. You must also implement the acknowledgment timeout, if any.
Further, message acknowledgments are sent from the replicas to the master using a communications channel that you implement (the replication code does not provide a channel for acknowledgments). So implementing permanent messages means that when you write your replication communications channel, you must also write it in such a way as to also handle permanent message acknowledgments.
For more information on implementing permanent message handling using a custom replication layer, see the Berkeley DB Programmer's Reference Guide.