This is more of a question about the "right" way of doing things versus what is possible.
I want to know if there is anything I am forgetting or not considering that will make the following solution problematic. I have never seen this documented or discussed in any Splunk documentation, apps, or forums, so I wanted to make sure there is a reason for its absence that I did not know about.
The scenario I have is the need to handle a large set of sensor data (> 15 fields) from thousands of endpoints (i.e., gigabytes of data per day). The sensor data is sampled periodically, and I almost always look at averages, minimums, maximums, and weighted averages over 5-minute intervals.
This seems like a good place to use summary indexing instead of data models/Pivot, so that is the path I went down.
The issue I have is that a lot of disk space is wasted because of how the summary aggregation fields (psrsvd_*) produced by sistats are written to the summary index in "Field=Value" format. In some cases I actually see errors because the _raw field gets too big (when I compute avg, min, and max on every sensor field).
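For reference, the scheduled search is roughly like the following (the index, sourcetype, and field names here are placeholders; the real search covers all 15+ sensor fields). When it is saved with summary indexing enabled, each result row lands in the summary index as an event whose _raw is a long string of psrsvd_*=value pairs, which is where the bloat comes from:

    index=sensor_data sourcetype=sensor_readings earliest=-5m@m latest=@m
    | sistats avg(temperature) min(temperature) max(temperature)
              avg(voltage) min(voltage) max(voltage)
        by endpoint_name sensor_location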
The solution I devised to get around this (and to be more efficient) is to write the summary data from sistats as "|"-delimited raw events that look like the following, where the numbers represent the sistats output for my sensor fields:
Timestamp|Search_Time|Endpoint_Name|Sensor_Location|5|5|5|5|5|5|5|5|5|5|5|5|5|423|13|150966|0|1782.1|426|14|1514905|0|0|0|2123|...
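For what it's worth, here is a rough sketch of how I build those events. It is illustrative rather than my exact search: the names are placeholders, only one sensor field is shown, and the exact psrsvd_* field names to concatenate depend on which aggregates you request from sistats (I checked the sistats output in the search UI first to get the list and order). As far as I can tell, collect keeps an existing _raw field as-is (the key=value serialization only applies when _raw is absent), so overriding _raw lets me control exactly what gets written to the summary index:

    index=sensor_data sourcetype=sensor_readings earliest=-5m@m latest=@m
    | bin _time span=5m
    | sistats avg(temperature) min(temperature) max(temperature)
        by _time endpoint_name sensor_location
    | eval _raw = strftime(_time, "%Y-%m-%d %H:%M:%S") . "|" . now() . "|"
                  . endpoint_name . "|" . sensor_location . "|"
                  . psrsvd_ct_temperature . "|" . psrsvd_sm_temperature . "|"
                  . psrsvd_nn_temperature . "|" . psrsvd_nx_temperature
    | fields _time _raw
    | collect index=sensor_summary sourcetype=sensor_summary_psv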
I then defined a new sourcetype for my summary index that specifies the appropriate field names for the "|"-delimited summary statistics fields (psrsvd_*, etc.).
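Concretely, the sourcetype definition is along these lines (again illustrative; the real FIELDS list enumerates every column in exactly the order the search writes them, and the field names are placeholders). The search-time extraction restores the psrsvd_* field names, so a downstream stats over the summary index behaves the same as it would against the standard key=value summary events:

    # props.conf
    [sensor_summary_psv]
    SHOULD_LINEMERGE = false
    TIME_PREFIX = ^
    TIME_FORMAT = %Y-%m-%d %H:%M:%S
    REPORT-sensor_summary_fields = sensor_summary_psv_delims

    # transforms.conf
    [sensor_summary_psv_delims]
    DELIMS = "|"
    FIELDS = "timestamp","search_time","endpoint_name","sensor_location","psrsvd_ct_temperature","psrsvd_sm_temperature","psrsvd_nn_temperature","psrsvd_nx_temperature"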
This seems to work fine in terms of retrieving and processing the summary index data, and it saves around 25% of the disk space.
So, is this OK to do for a large-scale deployment? Are there other things I need to consider? Is there a better solution that is more maintainable?