Channel: Questions in topic: "splunk-enterprise"

Best practice for representing bit flag fields in input data?

Suppose I have a field that consists of a byte value, where each bit can represent a "flag": a property whose value is either true or false. In the definition of the record layout, the "parent" field (the byte) has a name, and so does each of the "child" bit flags. For example, suppose I have a field named `toppings` that occupies one byte, where each bit represents whether or not a particular topping was added to a pizza:

1. `anchovies`
2. `bacon`
3. `chilli`
4. `mushrooms`
5. `olives`
6. `pepperoni`

(These names are fictional, but the structure matches actual fields in my data.) Two of the bits are currently unused.

Now suppose I have the freedom to format that data in any way I choose before I get it into Splunk. Some considerations:

- Should I bother including the original byte value, as a number? I'm tending towards "no", but suppose (I know, there's a lot of *supposing* going on here) we have zillions of these records, and for the foreseeable future we're only interested in whether the toppings included `bacon` or `mushrooms`, but there's a slim chance we might at some point also be interested in the others... so maybe we only break out `bacon` and `mushrooms` as separate properties for now. This runs the risk of forgetting what the other bits mean, or that their meaning has changed over time... but it's cheaper to index fewer fields, unless you have an "all you can ingest" license.
- Should the data be "sparse" or "dense"? Let's say we've decided that we're interested in all of the toppings, and that the absence of a flag means "false". One problem: record formats can change over time; new flags can appear in data, and existing flags can become obsolete. If we introduce new toppings (say, `onion` and `capers`) and we've assumed that the absence of a flag means "false", then, when we analyze our data, if we don't keep in mind when onions and capers became available, we might mistakenly think that pizza eaters before a certain date eschewed those toppings.
  We've lost the distinction between "false" and absent (or `null`). (More realistically, for my use case: we might mistakenly think that a particular software property was "false", when in fact that property did not even exist in the version of the software that created the log record.)
- If I use a data format such as JSON that supports nested structures, should I nest the bit flags under their parent, or should I keep a flat structure?

Some examples:

#### Example 1: dense JSON, nested

All available toppings represented.

```
"toppings": {
    "anchovies": true,
    "bacon": true,
    "chilli": true,
    "mushrooms": true,
    "olives": false,
    "pepperoni": false
}
```

#### Example 2: dense JSON, flat

```
"toppings_anchovies": true,
"toppings_bacon": true,
"toppings_chilli": true,
"toppings_mushrooms": true,
"toppings_olives": false,
"toppings_pepperoni": false
```

#### Example 3: sparse JSON, nested

No overall byte value; only "true" properties present (others assumed "false"; literally, missing):

```
"toppings": {
    "bacon": true,
    "mushrooms": true
}
```

#### Example 4: sparse JSON, flat

There might have been other toppings.

```
"toppings_bacon": true,
"toppings_mushrooms": true
```

#### Example 5: sparse JSON, flat, with original byte value

There were other toppings than bacon and mushrooms, but you'd have to know how to interpret the byte value 240.

```
"toppings": 240,
"toppings_bacon": true,
"toppings_mushrooms": true
```

## Summary

I think the "sparse" options (especially where missing means false) are asking for trouble, but I thought I'd at least mention these options, because indexing data costs money. So I think it's down to the "dense" options. In which case, I don't see the point in indexing the original numeric byte value. But nested or flat? Nested means less data ingested (less repetition of the `toppings` qualifier), and I don't see any problems referring to nested properties such as `toppings.anchovies`.
But if I choose nested, then I think that rules out offering users the freedom to ingest from either CSV or JSON while using the same search strings in Splunk regardless of the input data format, because the data ingested from CSV won't have the nested structure `toppings.anchovies`.

Thoughts and advice welcome.
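For what it's worth, expanding the byte into dense named flags is cheap to do in a pre-ingestion step. Here's a minimal sketch in Python using the fictional topping names from above; the assignment of list item 1 to the lowest bit is an assumption, since the question doesn't specify bit order:

```python
import json

# Bit position -> flag name, lowest bit first (assumed ordering;
# the two highest bits are currently unused).
TOPPING_BITS = [
    "anchovies", "bacon", "chilli", "mushrooms", "olives", "pepperoni",
]

def decode_toppings(byte_value: int) -> dict:
    """Expand a bit-flag byte into explicit true/false properties,
    producing the dense, nested shape of Example 1."""
    return {name: bool(byte_value & (1 << bit))
            for bit, name in enumerate(TOPPING_BITS)}

# A record with only bacon (bit 1) and mushrooms (bit 3) set:
record = {"toppings": decode_toppings(0b001010)}
print(json.dumps(record))
```

Because the mapping lives in one place, a new topping (`onion`, say) is a one-line change, and records written before that change simply never carry the new key — which is exactly the "absent vs. false" distinction discussed above.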
