Dear fellow Splunkers,
I have seen the [docs](http://docs.splunk.com/Documentation/Splunk/6.1.3/Indexer/Indextimeversussearchtime) on index-time field extractions and a few related answers [here](https://answers.splunk.com/answers/67170/index-time-field-extraction.html), [there](https://answers.splunk.com/answers/57247/index-time-field-extraction.html) or [there](https://answers.splunk.com/answers/5817/search-time-versus-index-time-field-extractions.html) with the general guidance that an index-time extraction is rarely ever needed or beneficial.
However, I have a dedicated index that holds Apache logfiles for a lot of different virtual hosts. I have set up search-time field extractions for the virtual host (apache_virtualhost) and the HTTP status code (status).
Now the following search
index=web apache_virtualhost=some.virtual.host | timechart count by status
is _very_ slow and does not complete even after a few minutes, keeping the CPU 100% busy. However,
index=web source=/path/to/logs/for/this/vhost-only.log | timechart count by status
returns a result in an acceptable amount of time.
Being used to relational DBs, I immediately thought "sure, in the second case Splunk can retrieve the small subset of matching rows from the index, whereas the first case needs to push _all_ rows through the regexp first". But that is probably not the way Splunk works.
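To make that mental model concrete, here is a toy sketch in Python. It is only an illustration of my assumption, not a description of Splunk's actual internals: the event data, the `BY_SOURCE` lookup and the regex are all made up for this example.

```python
import re

# Hypothetical toy data: (source, raw event) pairs, not real Splunk structures.
EVENTS = [
    ("/logs/a.log", 'a.example.com 10.0.0.1 "GET / HTTP/1.1" 200'),
    ("/logs/a.log", 'a.example.com 10.0.0.2 "GET /x HTTP/1.1" 404'),
    ("/logs/b.log", 'b.example.com 10.0.0.3 "GET / HTTP/1.1" 200'),
]

# Hypothetical search-time extraction: the first token is the virtual host.
VHOST_RE = re.compile(r"^(?P<apache_virtualhost>\S+)")

# Lookup keyed by source, built once up front (stands in for indexed metadata).
BY_SOURCE = {}
for source, raw in EVENTS:
    BY_SOURCE.setdefault(source, []).append(raw)


def search_by_source(source):
    """Return (matching events, number of events examined) via the lookup."""
    matches = BY_SOURCE.get(source, [])
    return matches, len(matches)


def search_by_vhost(vhost):
    """Return (matching events, number of events examined) via regex on every event."""
    matches = []
    examined = 0
    for _source, raw in EVENTS:
        examined += 1
        m = VHOST_RE.match(raw)
        if m and m.group("apache_virtualhost") == vhost:
            matches.append(raw)
    return matches, examined


print(search_by_source("/logs/a.log")[1])    # events touched with the lookup
print(search_by_vhost("a.example.com")[1])   # events touched with the regex scan
```

In this toy version both searches return the same events, but the second one has to touch every event in the index to do so, which is what I suspect is happening with my first search.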
So, is there a simple explanation for this? Have I found one of the rare cases where index-time field extraction would make sense?
Thanks for sharing your insights!