Background:
There are two types of ACKs in play here.
- First is an inter-indexer ACK for data replication in an indexing cluster. When an indexer replicates a slice of data to a replication peer (either when the slice hits 128K, or after 60s if the slice is still under 128K), it expects an ACK from that peer once the data has been received.
- Second is a forwarder ACK (useACK). This gets sent from the indexer to the forwarder when the indexer has received ACKs indicating that RF-1 instances of that slice have been successfully replicated. So, if RF=4, the indexer will send the ACK back to the forwarder once it has received inter-indexer ACKs for 2 replicates, thus writing 3 copies (2 + itself) and satisfying the RF-1 (3) requirement.
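To spell out the arithmetic in the second bullet, here is a minimal sketch of the counting as I understand it. It only models the description above and is not Splunk's actual implementation; the function name `inter_indexer_acks_needed` is made up for illustration.

```python
def inter_indexer_acks_needed(replication_factor: int) -> int:
    """Inter-indexer ACKs the receiving indexer waits for before it
    ACKs the forwarder, per the description above (an assumption)."""
    # RF-1 copies must exist, counting the receiving indexer's own copy.
    copies_required = replication_factor - 1
    # The receiving indexer already holds one of those copies.
    return max(copies_required - 1, 0)


print(inter_indexer_acks_needed(4))  # 2
```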
Scenario:
Right now I have a bunch of indexers split between two sites, none of which are clustered together. I would like to set up a multi-site indexing cluster, with SF=2, RF=4 (origin:2, site1: 2). I'll be turning on useACK on the forwarders so I don't lose any data. My team does Disaster Recovery testing, and I want to make sure that Splunk will still work (forwarders will get ACKs, indexers will index data, etc.) during the DR test. The DR test itself will last 48 hours and consist of severing the links connecting the 2 data centers (siteA, siteB).
Splunk in both sites must continue to function properly during the test & when the links are brought back online.
The concern is that the indexer receiving the chunk of data from the forwarder MUST successfully complete RF-1 replications of the raw data to other peers before the ACK is sent back to the forwarder. With the WAN links between the two sites disabled, the best an indexer will ever be able to muster is 1 replicate, so it will never get to the 2 replicates required to return the ACK to the forwarder. We would then lose all visibility into what's going on in the environment, which makes this a show stopper for rolling out multi-site indexer clustering.
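To make the concern concrete, here is a small sketch of the model I'm assuming. It encodes my understanding of the ACK rule (2 inter-indexer ACKs needed when RF=4), not confirmed Splunk behavior, and the function name `forwarder_ack_possible` and its parameters are hypothetical.

```python
def forwarder_ack_possible(replication_factor: int,
                           origin_copies: int,
                           wan_up: bool) -> bool:
    """True if the receiving indexer can collect enough inter-indexer
    ACKs to ACK the forwarder, under the assumed RF-1 rule."""
    # Assumed rule: RF-1 copies in total, one of which is the receiver's own.
    acks_needed = replication_factor - 2
    # Local peers that can hold a copy, excluding the receiver itself.
    local_acks = origin_copies - 1
    # Remote-site copies are only reachable while the WAN link is up.
    remote_acks = (replication_factor - origin_copies) if wan_up else 0
    return local_acks + remote_acks >= acks_needed


print(forwarder_ack_possible(4, 2, wan_up=True))   # True
print(forwarder_ack_possible(4, 2, wan_up=False))  # False
```

If this model is right, the forwarders would stop receiving ACKs for the full 48 hours of the test.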
What's going to happen?