I'm using `streamstats` to pair up events by username so that timestamps, IPs, latitudes, and longitudes can be analyzed for land-speed violations as a possible indicator of account compromise. However, I'm running into an issue: this works perfectly on small datasets (thousands of events), and even on some large datasets (millions of events), but when the number of users represented in the event set climbs into the hundreds of thousands, `streamstats` seems to start dropping events. I have a few conjectures, but I would like to understand which specific limitation it is running into, and the best way to work around it.
The relevant portion of the full query is as follows:
| streamstats current=t global=f window=2
    earliest(client_ip) as client_ip_1 latest(client_ip) as client_ip_2
    earliest(_time) as time_1 latest(_time) as time_2
    earliest(timestamp) as timestamp_1 latest(timestamp) as timestamp_2
    earliest(latlon) as latlon_1 latest(latlon) as latlon_2
    by username
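(For completeness: the paired fields then feed a great-circle distance/speed check further down the pipeline. A simplified sketch of that kind of calculation is below; it assumes latlon is a "lat,lon" string and uses an illustrative 500 mph threshold, so the exact downstream math may differ, but the drop appears to happen at the streamstats step itself, not here.)

| eval lat_1=tonumber(mvindex(split(latlon_1, ","), 0)), lon_1=tonumber(mvindex(split(latlon_1, ","), 1))
| eval lat_2=tonumber(mvindex(split(latlon_2, ","), 0)), lon_2=tonumber(mvindex(split(latlon_2, ","), 1))
| eval rlat_1=lat_1*pi()/180, rlat_2=lat_2*pi()/180, dlat=(lat_2-lat_1)*pi()/180, dlon=(lon_2-lon_1)*pi()/180
| eval a=pow(sin(dlat/2),2)+cos(rlat_1)*cos(rlat_2)*pow(sin(dlon/2),2)
| eval distance_mi=3959*2*atan2(sqrt(a), sqrt(1-a))
| eval hours=abs(time_2-time_1)/3600
| eval speed_mph=if(hours>0, distance_mi/hours, null())
| where speed_mph > 500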
I collected the following data, which may be of some use (the event/user counts were measured along the lines of the sketch shown after the list):
+ Can handle 100,000 events and 100,000 users
+ Can handle 200,000 events and 50,000 users
+ Can handle 500,000 events and 50,000 users
+ Can handle 1,000,000 events and 50,000 users
+ Can handle 3,000,000 events and 50,000 users
+ CANNOT handle 200,000 events and 150,000 users
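(The "events" and "users" figures above are just the total event count and distinct username count for each test set; a search along these lines produces them, with index=auth and the 24-hour window standing in for my actual base search and time range:)

index=auth earliest=-24h
| stats count as events dc(username) as users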
The above leads me to believe that the biggest factor is the number of groups that the `by` clause forces `streamstats` to split the event set into. However, any more concrete information you can provide would be greatly appreciated.