So I'm new to the Machine Learning Toolkit and I'm trying to model something that I thought would be somewhat straightforward, but I'm beginning to realize that I might need more of an understanding of what Splunk needs from me to create an accurate report.
**What I want is a model that will report any event_id that reports an uncharacteristic volume of events.** I started off by throwing as many stats commands as I could at the data, but I ended up with a model that may have been modeling something completely different. I simplified the search to the following:
base search earliest=-30d | bin _time span=1h as event_window | stats sum(count) avg(count) dc(hostname) by event_id event_window
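To make the shape of the resulting data concrete, here is a minimal Python sketch of what that search computes per event_id and hour. It assumes each event carries a timestamp, an event_id, a hostname, and a count field; the sample rows and the names `events` and `hourly_features` are made up for illustration and are not part of my actual data or of Splunk.

```python
from collections import defaultdict
from datetime import datetime

# Made-up sample events: (timestamp, event_id, hostname, count)
events = [
    ("2024-01-01 10:05:00", 4624, "host-a", 3),
    ("2024-01-01 10:40:00", 4624, "host-b", 5),
    ("2024-01-01 11:15:00", 4624, "host-a", 2),
    ("2024-01-01 10:20:00", 4625, "host-c", 7),
]

def hourly_features(rows):
    """Group rows by (event_id, hour) and compute sum, mean, and distinct hosts."""
    buckets = defaultdict(list)
    for ts, event_id, host, count in rows:
        # Truncate the timestamp to the start of its hour, similar to bin span=1h
        window = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").replace(minute=0, second=0)
        buckets[(event_id, window)].append((host, count))

    features = {}
    for key, items in buckets.items():
        counts = [c for _, c in items]
        features[key] = {
            "sum_count": sum(counts),                   # like sum(count)
            "avg_count": sum(counts) / len(counts),     # like avg(count)
            "dc_hostname": len({h for h, _ in items}),  # like dc(hostname)
        }
    return features

features = hourly_features(events)
for (event_id, window), row in sorted(features.items()):
    print(event_id, window, row)
```

Each output row is one (event_id, hour) pair with three numeric features, which is the table that gets handed to the clustering step.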
So at this point, I have several questions:
**Do I include the event_id in the k-means algorithm?** I've been told that the answer you hope to find should not be included in the clustering, but here I feel like it's necessary to describe the data. In my mind, we have to attribute each data point to an event_id or the frame of reference is lost. Am I correct?
**The bin/timespan.** Right now I'm grouping everything by the hour because I anticipate running a saved search over this data on an hourly basis. Do the bucket spans have to be like for like here, or can I potentially run the clustering with different bucket spans and still obtain accurate reports?
**Is this even a decent model?** Like I said previously, I went nuts with multiple stats commands, deltas, etc., but felt like I began modeling something else instead of the volume of event_ids. I have run this model and tested it with some data out in the wild, and I see outliers, but I don't understand why Splunk is reporting them as such. I'm not sure if it's the variance in the amount of event_ids overall, but these numbers are well within the limits I would have expected the clustering to set. Are there stats that I should be including to address this?
Again, I am a super newb at this. I've looked for a primer to ML in Splunk but I haven't found anything that goes into this level of explanation. Any assistance would be greatly appreciated.