We recently set up a new Splunk environment with one search head, multiple indexers, and one heavy forwarder, all running version 6.3.0. Data is sent to the heavy forwarder, which forwards it on to the indexer layer.
Since we started indexing data (~200GB daily volume), I have already run into quite a few incidents where a search took very long to respond. The search was not dead and it did return some results. From the Job Inspector, I can see that one indexer was not returning any results, since the dispatch.stream.remote.<indexer_name> output count was missing for that indexer. The search job did report some timeouts related to socket errors from peers, but there are no errors linking them to bucket/index issues.
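For what it's worth, when that happens I also check the distributed search peer status from the search head (Settings > Distributed search, or via REST). A rough sketch, with field names from memory, so they may differ slightly in 6.3:
| rest /services/search/distributed/peers | table peerName status version
The idea is just to see whether the peer that drops out of the Job Inspector output is reported as anything other than Up.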
For example, I had a search that took 15 hours to complete (it alerts me when it's done). Here is the log showing it started at 1:00 AM and ended at 15:59:
10-29-2015 01:00:04.050 INFO dispatchRunner - initing LicenseMgr in search process: nonPro=0
10-29-2015 01:00:04.051 INFO dispatchRunner - registering build time modules, count=1
10-29-2015 01:00:04.051 INFO dispatchRunner - registering search time components of build time module name=vix
10-29-2015 01:00:04.051 INFO dispatchRunner - Splunkd starting (build aa7d4b1ccb80).
10-29-2015 01:00:04.051 INFO dispatchRunner - System info: Linux, us1455splksh01.houston-us1455.slb.com, 2.6.32-504.23.4.el6.x86_64, #1 SMP Fri May 29 10:16:43 EDT 2015, x86_64.
10-29-2015 01:00:04.051 INFO dispatchRunner - Detected 40 (virtual) CPUs, 20 CPU cores, and 64375MB RAM
10-29-2015 01:00:04.051 INFO dispatchRunner - Maximum number of threads (approximate): 32187
......
......
10-29-2015 01:10:04.807 ERROR HttpClientRequest - HTTP client error: Read Timeout (while accessing https://:8089/services/streams/search?sh_sid=scheduler__swong2_U0xCX0lUX1BlcmltZXRlcl9TZWN1cml0eQ__RMD5d5a127b32213fcba_at_1446080400_384)
10-29-2015 01:10:04.807 WARN SearchResultParserExecutor - Socket error during transaction. Timeout error. for collector=
10-29-2015 01:10:10.626 ERROR HttpClientRequest - HTTP client error: Read Timeout (while accessing https://:8089/services/streams/search?sh_sid=scheduler__swong2_U0xCX0lUX1BlcmltZXRlcl9TZWN1cml0eQ__RMD5d5a127b32213fcba_at_1446080400_384)
10-29-2015 01:10:10.626 WARN SearchResultParserExecutor - Socket error during transaction. Timeout error. for collector=
10-29-2015 01:20:10.744 ERROR HttpClientRequest - HTTP client error: Read Timeout (while accessing https://:8089/services/streams/search?sh_sid=scheduler__swong2_U0xCX0lUX1BlcmltZXRlcl9TZWN1cml0eQ__RMD5d5a127b32213fcba_at_1446080400_384)
10-29-2015 01:20:10.744 WARN SearchResultParserExecutor - Socket error during transaction. Timeout error. for collector=
10-29-2015 15:59:12.781 ERROR HttpClientRequest - HTTP client error: Connection closed by peer (while accessing https://:8089/services/streams/search?sh_sid=scheduler__swong2_U0xCX0lUX1BlcmltZXRlcl9TZWN1cml0eQ__RMD5d5a127b32213fcba_at_1446080400_384)
10-29-2015 15:59:12.782 WARN SearchResultParserExecutor - Socket error during transaction. ReadWrite error. for collector=
.....
.....
10-29-2015 15:59:12.966 INFO ShutdownHandler - shutting down level "ShutdownLevel_Queue"
10-29-2015 15:59:12.966 INFO ShutdownHandler - shutting down level "ShutdownLevel_Exec"
10-29-2015 15:59:12.966 INFO ShutdownHandler - shutting down level "ShutdownLevel_CallbackRunner"
10-29-2015 15:59:12.967 INFO ShutdownHandler - shutting down level "ShutdownLevel_HttpClient"
10-29-2015 15:59:12.967 INFO ShutdownHandler - Shutdown complete in 625 microseconds
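If it's relevant: I believe the timeouts involved here are controlled by distsearch.conf on the search head. From memory (not verified against the 6.3 spec), the defaults look roughly like this:
[distributedSearch]
connectionTimeout = 10
sendTimeout = 30
receiveTimeout = 600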
I suspected bucket issues on my indexers. I ran >splunk fsck scan --all-buckets-all-indexes on all 5 indexers, and they all reported similar errors on different indexes, like the following:
Error reading Manifest file inside "/opt/splunk/var/lib/splunk/pan_logs/db/db_1445859145_1445855519_208/rawdata": discontinuity in journal between 358642577 and 358774108
Corruption: Cannot get slices.dat count
Error reading Manifest file inside "/opt/splunk/var/lib/splunk/sbr/db/db_1446050336_1445929226_23/rawdata": discontinuity in journal between 820385205 and 820516896
Corruption: Cannot get slices.dat count
Corruption: count mismatch tsidx=4123501 host-metadata=3642145
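To see when the corruption shows up each day, a search along these lines over the indexers' internal logs should surface the related messages (keywords lifted from the fsck output above, so this is just a sketch):
index=_internal sourcetype=splunkd ("discontinuity in journal" OR "slices.dat" OR "Corruption")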
I tried a repair (fsck repair) on one indexer, and it seems to have fixed one of my search problems. But the problem is that I keep seeing new bucket corruption every day.
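For reference, the repair I ran was (roughly) the repair counterpart of the same command:
>splunk fsck repair --all-buckets-all-indexes
(I'm not sure whether it needs splunkd stopped first; I'd double-check splunk fsck --help and the docs before running it everywhere.)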
Has anybody experienced a similar problem on 6.3.0? Or can anyone point me in the right direction to troubleshoot whether it is a software or hardware problem?
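On the hardware side, I assume the first things to check would be the kernel log and SMART status on each indexer's storage, e.g.:
dmesg | grep -iE 'i/o error|sector|scsi'
smartctl -H /dev/sda
(device name is just an example), but I'd appreciate any pointers on what else to look at.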