Hi all,
I've noticed some odd behavior on one (and only one) of my indexers. A customer complained about data "suddenly disappearing" in the middle of the day. When my team investigated, they found that all the buckets in cold resembled "frozen" buckets: each contained only a rawdata directory with journal.gz inside. When I search the logs for a representative bucket, I see only the following two messages, repeated over and over:
12-30-2015 00:13:53.533 +0900 INFO DbMaxSizeManager - Will freeze bucket=/opt/splunk/var/lib/splunk/jcare/colddb/db_1449606229_1449572203_26083 LT=1449606229 size=30715904 (29MB)
12-30-2015 00:13:55.068 +0900 ERROR BucketMover - aborting move because recursive copy from src='/opt/splunk/var/lib/splunk/jcare/colddb/db_1449606229_1449572203_26083' to dst='/opt/splunk/var/lib/splunk/frozen/jcare/inflight-db_1449606229_1449572203_26083' failed (reason='No space left on device')
The disk space error is due to a known problem with our backup client that we're working on. However, that still leaves me with two questions:
1. If Splunk fails to move a bucket to frozen, why does it continue freezing other buckets in cold? Shouldn't it just stop altogether?
2. If Splunk fails to move a bucket to frozen, why does it leave the bucket in "frozen" status on cold? My understanding was that the bucket was first moved, then frozen.
Curious whether anyone else has seen this and can suggest why it is happening. Again, the root cause appears to be disk-related, but we are dependent on a support ticket with our backup vendor to resolve that, and I'd love to hear any workarounds we can implement in the meantime.
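In case it helps anyone check their own indexers for the same symptom, here is a minimal sketch of how the affected buckets could be listed. It assumes the layout described above (a bucket directory holding nothing but rawdata with journal.gz inside) and the db_<latest>_<earliest>_<id> naming that I'm inferring from the log lines, where LT matches the first number in the directory name. The function names are just illustrative and nothing here comes from Splunk itself.

```python
import os
import re
import sys

# Naming inferred from the log lines above: db_<latest>_<earliest>_<id>.
BUCKET_RE = re.compile(r"^db_(\d+)_(\d+)_(\d+)$")


def parse_bucket_name(name):
    """Split a bucket directory name into epoch times and id, or return None."""
    match = BUCKET_RE.match(name)
    if match is None:
        return None
    latest, earliest, bucket_id = (int(g) for g in match.groups())
    return {"latest": latest, "earliest": earliest, "id": bucket_id}


def is_stripped(bucket_path):
    """True if the bucket holds only rawdata/ and rawdata/ has journal.gz."""
    try:
        entries = os.listdir(bucket_path)
    except OSError:
        return False
    if entries != ["rawdata"]:
        return False
    rawdata = os.path.join(bucket_path, "rawdata")
    return os.path.isdir(rawdata) and "journal.gz" in os.listdir(rawdata)


def find_stripped_buckets(colddb):
    """Return paths of buckets under colddb that look like frozen buckets."""
    hits = []
    for name in sorted(os.listdir(colddb)):
        if parse_bucket_name(name) is None:
            continue  # skip anything that is not a db_* bucket directory
        path = os.path.join(colddb, name)
        if is_stripped(path):
            hits.append(path)
    return hits


if __name__ == "__main__":
    # Usage: python find_stripped.py /opt/splunk/var/lib/splunk/jcare/colddb
    for colddb in sys.argv[1:]:
        if os.path.isdir(colddb):
            for path in find_stripped_buckets(colddb):
                print(path)
```

This only reads directory listings, so it should be safe to run against a live index, but it says nothing about why the buckets ended up that way.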