Ingest Delay Variance: A Race Condition Affecting Modern SIEM Systems

This is a description of a class of issues in SIEM systems that many products do not account for. Its relevance to this website is as a portfolio item. It may also help SIEM users avoid writing queries affected by the issue. I have reported the issue, and suggested related warnings in documentation, to one vendor: https://github.com/MicrosoftDocs/azure-docs/pull/122521.

—————————————————————————————————————-

Index time delays are a known issue in query design: https://learn.microsoft.com/en-us/azure/sentinel/ingestion-delay. The problem, in essence, is that you want to consider events within a fixed time window, but the tool you have is the “scheduled search”, whose pre-configured time range is relative to the time the search is set to execute.

The simplest case is where you schedule a search to run every N units of time and it looks back over N units of time of data (that is, the time between runs and the lookback period are equal). You wouldn’t think there would be any problems with this design, but that’s not quite true. Search processing systems are complex and layered (somewhat like database management systems). Because of that complexity, it can take quite a long time between when an event occurs on a system and when that event actually makes it into the event processing system’s data storage component and becomes available for querying.

There’s a lot of “stuff” between the system on which the event was generated and the data store in which the event processing system stores its events: the operating system of the system that generates the event, the network, the system(s) on which the event processing system lives and, finally, the event processing system’s internals.

Because of all that, there can be a delay between the time the event is actually generated and the time it is available for querying. An event with an actual time of, say, 10:12 might not exist in the event processing system’s data store until 10:22. If you scheduled a search every 15 minutes with a 15-minute lookback, and that search kicked off at 10:00, 10:15, 10:30, etc., then you’d miss this event: at 10:15 it has not been ingested yet, and at 10:30 its event time falls outside the 10:15 to 10:30 window.
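To make the miss concrete, here is a minimal Python sketch of the naive schedule. It is only a model, not how any particular SIEM evaluates searches: the event dictionary, the `visible_events` helper and the calendar date are all made up for illustration, and I am assuming the lookback window is half-open and applied to event time.

```python
from datetime import datetime, timedelta

INTERVAL = timedelta(minutes=15)  # cadence and lookback are equal


def visible_events(events, run_time):
    """Return the events a naive scheduled search would see at run_time.

    An event is visible only if it has already been ingested and its event
    time falls inside the lookback window [run_time - INTERVAL, run_time).
    """
    window_start = run_time - INTERVAL
    return [
        e for e in events
        if e["ingested"] <= run_time and window_start <= e["occurred"] < run_time
    ]


day = datetime(2024, 1, 1)  # arbitrary date, only the clock times matter
event = {
    "name": "e1",
    "occurred": day.replace(hour=10, minute=12),
    "ingested": day.replace(hour=10, minute=22),
}
runs = [day.replace(hour=10, minute=m) for m in (0, 15, 30)]
naive_hits = {run.strftime("%H:%M"): visible_events([event], run) for run in runs}
print(naive_hits)  # every run returns an empty list
```

The 10:15 run cannot see the event because it has not arrived yet, and the 10:30 run ignores it because 10:12 is older than its window.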

The fix for this, however, is a known design pattern. Microsoft’s documentation contains a good description of how to solve this issue in its simplest form. However, if you want to correlate on more than a single event, things get complicated. With two events you aren’t just dealing with a single index time delay, you’re dealing with two index time delays, and they may be different.
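For the single-event case, the pattern can be sketched roughly as below. This is my own Python paraphrase of the idea (select by ingestion time, and only loosely bound the event time), not Microsoft’s KQL. `MAX_DELAY` is an assumed upper bound on ingestion delay that you would have to measure for your own data, and `delay_aware_search` is a hypothetical helper.

```python
from datetime import datetime, timedelta

INTERVAL = timedelta(minutes=15)   # search cadence
MAX_DELAY = timedelta(minutes=30)  # assumed worst-case ingestion delay


def delay_aware_search(events, run_time):
    """Select events ingested since the previous run.

    Event time is only bounded loosely (INTERVAL + MAX_DELAY back), so a late
    event is still picked up by the run that follows its ingestion.
    """
    ingest_start = run_time - INTERVAL
    oldest_event = run_time - INTERVAL - MAX_DELAY
    return [
        e for e in events
        if ingest_start <= e["ingested"] < run_time and e["occurred"] >= oldest_event
    ]


day = datetime(2024, 1, 1)  # arbitrary date, only the clock times matter
late_event = {
    "name": "e1",
    "occurred": day.replace(hour=10, minute=12),
    "ingested": day.replace(hour=10, minute=22),
}
pattern_hits = {}
for minute in (0, 15, 30):
    run = day.replace(hour=10, minute=minute)
    found = delay_aware_search([late_event], run)
    pattern_hits[run.strftime("%H:%M")] = [e["name"] for e in found]
print(pattern_hits)  # only the 10:30 run returns e1
```

Because consecutive ingestion windows neither overlap nor leave gaps, each event is returned by exactly one run, provided its delay stays under the assumed bound.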

Consider the case below, where two alerts, a1 and a2, are intended to be correlated based on the time each underlying event actually occurred on the system. The search uses the design pattern above, executing every 15 minutes and only looking at events that have been indexed in the last 15 minutes.

Suppose a1 and a2 have the following ingestion and actual event times:

a1 raw event time: 10:09:03, a1 ingestion time: 10:15:03

a2 raw event time: 10:10:00, a2 ingestion time: 10:13:00

And the search executes at 10:00, 10:15 and 10:30.

In the case above, a1 occurred on the system at 10:09:03 but was ingested at 10:15:03 (a six-minute delay), while a2 occurred at 10:10:00 and was ingested at 10:13:00 (a three-minute delay).

If the ingestion delay were the same for both events, the design pattern recommended above would suffice to account for the delay. However, because the ingestion delay varies between the two events, the search would fail to pick up the case above: a1 would only be considered by the search execution that started at 10:30, while a2 would only be considered by the search execution that started at 10:15.
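The same scenario can be replayed in a few lines of Python. Again this is a sketch under assumptions, not a model of any specific product: `ingested_in_window` and `correlates` are hypothetical helpers, the date is arbitrary, and I am assuming each run considers a half-open ingestion window ending at its start time.

```python
from datetime import datetime, timedelta

INTERVAL = timedelta(minutes=15)


def ingested_in_window(events, run_time):
    """Names of events ingested in [run_time - INTERVAL, run_time)."""
    start = run_time - INTERVAL
    return [e["name"] for e in events if start <= e["ingested"] < run_time]


def correlates(names):
    """Hypothetical correlation rule: fires only if a1 and a2 are both seen."""
    return {"a1", "a2"} <= set(names)


day = datetime(2024, 1, 1)  # arbitrary date, only the clock times matter
events = [
    {"name": "a1", "occurred": day.replace(hour=10, minute=9, second=3),
     "ingested": day.replace(hour=10, minute=15, second=3)},
    {"name": "a2", "occurred": day.replace(hour=10, minute=10),
     "ingested": day.replace(hour=10, minute=13)},
]

seen_by_run = {}
for minute in (0, 15, 30):
    run = day.replace(hour=10, minute=minute)
    seen_by_run[run.strftime("%H:%M")] = ingested_in_window(events, run)

fired = [run for run, names in seen_by_run.items() if correlates(names)]
print(seen_by_run)  # a2 lands in the 10:15 run, a1 in the 10:30 run
print(fired)        # no run sees both events, so nothing fires
```

Both events are processed, so nothing looks lost when you check each one individually. It is only the correlation that silently never happens.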

Put generally: even though each event will be considered by some search execution, the ingestion delay may differ between the two events. If they are generated and indexed close to the time a search kicks off, they can be considered by different search executions, and therefore correlation does not occur.