Hi guys,
since I still cannot open a support case, I can only try it here (I've tried many times to get that sorted out, but yeah, it's not like we're paying a lot of money for support).
We've had issues with duplicate events in the past and always managed to resolve them. Not this time, it seems, and it has become quite a problem.
Recently there was a storage outage in our datacenter. The Splunk VMs kept running the whole time (it's a single-site cluster with its VMs spread across two datacenters), but there were quite a few disruptions and replication issues.
We have two environments. Because of our many different (V)LAN zones, we have a couple of Heavy Forwarders set up in each zone we need to collect data from.
**Now here's the problem:**
In our testing environment there are four HFs in total, but **only one** is sending duplicate data to our two indexers. That is not a lot of data, so I could live with resolving it later.
In our production environment, however, there are five HFs in total, and we now have 80-100 hosts sending their data duplicated (in the last 15 minutes: 87 hosts from 15 different sources, into 37 different indexes).
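In case it helps, a quick-and-dirty search along these lines should show the same picture (hashing _raw is just one way to spot identical events, and it can get expensive over longer time ranges; host/source/index are the standard default fields):

```
index=* earliest=-15m
| eval raw_hash=md5(_raw)
| stats count AS copies BY raw_hash, host, source, index
| where copies > 1
| stats dc(host) AS dup_hosts, dc(source) AS dup_sources, dc(index) AS dup_indexes, sum(copies) AS dup_events
```

Dropping the final stats shows the individual duplicated events per host/source/index instead of just the totals.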
I have been searching through errors and warnings for a while now and found a few log events; I googled them, but the problem persists.
Restarting the indexer cluster, removing excess buckets from the master, and restarting the Heavy Forwarders did not resolve anything. No configuration has been changed (I did try disabling and re-enabling useACK on one HF, but with no luck whatsoever).
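For reference, the outputs.conf stanza in question looks roughly like this (group and server names are placeholders, not our real ones). As far as I understand, useACK matters here because a forwarder that never receives its ACK will resend the data, which is one classic way to end up with duplicates:

```
# outputs.conf on the Heavy Forwarder -- group and server names are placeholders
[tcpout]
defaultGroup = prod_indexers

[tcpout:prod_indexers]
server = indexer1.example.local:9997, indexer2.example.local:9997
useACK = true
```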
Here are a few errors:
SERVER2 14:13:25.737 +0200 ERROR S2SFileReceiver - event=statSize replicationType=eJournalReplication bid=_internal~380~B3A9C962-6814-49A6-A47E-593741B331A3 path=/opt/splunk/var/lib/splunk/_internaldb/db/380_B3A9C962-6814-49A6-A47E-593741B331A3/rawdata/journal.gz status=failed
SERVER2 14:12:44.855 +0200 ERROR S2SFileReceiver - event=statSize replicationType=eJournalReplication bid=_internal~337~E432BEC8-63C5-4DCE-A500-90756157F30F path=/opt/splunk/var/lib/splunk/_internaldb/db/337_E432BEC8-63C5-4DCE-A500-90756157F30F/rawdata/journal.gz status=failed
SERVER2 14:12:44.732 +0200 ERROR S2SFileReceiver - event=statSize replicationType=eJournalReplication bid=00_p_INDEX4_14~264~E432BEC8-63C5-4DCE-A500-90756157F30F path=/opt/splunk/var/lib/splunk/00_p_INDEX4_14/db/264_E432BEC8-63C5-4DCE-A500-90756157F30F/rawdata/journal.gz status=failed
SERVER2 14:12:44.735 +0200 ERROR S2SFileReceiver - event=statSize replicationType=eJournalReplication bid=_audit~112~E432BEC8-63C5-4DCE-A500-90756157F30F path=/opt/splunk/var/lib/splunk/audit/db/112_E432BEC8-63C5-4DCE-A500-90756157F30F/rawdata/journal.gz status=failed
SERVER2 14:12:44.718 +0200 ERROR S2SFileReceiver - event=statSize replicationType=eJournalReplication bid=60_p_INDEX1_14~96~E432BEC8-63C5-4DCE-A500-90756157F30F path=/opt/splunk/var/lib/splunk/60_p_INDEX1_14/db/96_E432BEC8-63C5-4DCE-A500-90756157F30F/rawdata/journal.gz status=failed
SERVER2 14:12:44.844 +0200 ERROR S2SFileReceiver - event=statSize replicationType=eJournalReplication bid=00_p_INDEX2_14~301~E432BEC8-63C5-4DCE-A500-90756157F30F path=/opt/splunk/var/lib/splunk/00_p_INDEX2_14/db/301_E432BEC8-63C5-4DCE-A500-90756157F30F/rawdata/journal.gz status=failed
SERVER2 14:12:44.825 +0200 ERROR S2SFileReceiver - event=statSize replicationType=eJournalReplication bid=00_p_INDEX3_14~533~E432BEC8-63C5-4DCE-A500-90756157F30F path=/opt/splunk/var/lib/splunk/00_p_INDEX3_14/db/533_E432BEC8-63C5-4DCE-A500-90756157F30F/rawdata/journal.gz status=failed
...and a few warnings:
SERVER2 14:12:44.825 +0200 WARN S2SFileReceiver - unable to remove dir=/opt/splunk/var/lib/splunk/00_p_INDEX2_14/db/533_E432BEC8-63C5-4DCE-A500-90756157F30F for bucket=00_p_INDEX2_14~533~E432BEC8-63C5-4DCE-A500-90756157F30
SERVER1 09-28-2017 14:13:25.737 +0200 WARN S2SFileReceiver - unable to remove dir=/opt/splunk/var/lib/splunk/_internaldb/db/380_B3A9C962-6814-49A6-A47E-593741B331A3 for bucket=_internal~380~B3A9C962-6814-49A6-A47E-593741B331A3
SERVER2 09-28-2017 14:12:44.855 +0200 WARN S2SFileReceiver - unable to remove dir=/opt/splunk/var/lib/splunk/_internaldb/db/337_E432BEC8-63C5-4DCE-A500-90756157F30F for bucket=_internal~337~E432BEC8-63C5-4DCE-A500-90756157F30F
SERVER2 14:12:44.844 +0200 WARN S2SFileReceiver - unable to remove dir=/opt/splunk/var/lib/splunk/00_p_INDEX1_14/db/301_E432BEC8-63C5-4DCE-A500-90756157F30F for bucket=00_p_INDEX1_14~301~E432BEC8-63C5-4DCE-A500-90756157F30F
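In case anyone wants to see how widespread this is, the messages can be pulled out of _internal with something like the following (component and log_level are standard splunkd fields; the rex just grabs the bucket id from the raw message):

```
index=_internal sourcetype=splunkd component=S2SFileReceiver (log_level=ERROR OR log_level=WARN)
| rex field=_raw "(?:bid|bucket)=(?<bucket_id>\S+)"
| stats count BY host, log_level, bucket_id
| sort - count
```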
These messages only occur after a rolling restart of the indexer cluster. Interestingly, the "Indexer Clustering Status" page says everything is fine, and the Health Check does not find any issue either.
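(For anyone wondering, the CLI equivalent on the cluster master would be something like this:)

```
# run on the cluster master: peer status plus replication/search factor fulfilment
splunk show cluster-status
splunk list excess-buckets
```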
Does this mean all of those buckets are corrupt (around 20-30 different ones are listed)? That will be interesting to have the storage people explain. But even if so: what does this have to do with newly incoming data being duplicated?
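To at least partially answer my own corruption question from the search layer: dbinspect has a corruptonly option, though I'm not sure it flags this particular journal.gz/statSize situation:

```
| dbinspect index=* corruptonly=true
| table splunk_server, index, bucketId, state, path
```

There is also a `splunk fsck scan --all-buckets-all-indexes` CLI command on the indexers themselves, if someone thinks that is the better way to check.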
**Edit**: Two indexers are clustered with RF=2, one indexer in datacenter X and one in datacenter Y, plus three Search Heads with SF=2. The Search Heads seem to be working fine (two in datacenter X, one in datacenter Y).
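For completeness, the clustering stanza on the master corresponds to that setup; this is just the generic form, not a dump of our actual server.conf:

```
# server.conf on the cluster master
[clustering]
mode = master
replication_factor = 2
search_factor = 2
```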
Skalli