Deniz Saklı | 14 MIN READ

LAST UPDATED ON SEPTEMBER 24, 2024

Top 3 Tips to Improve Detection Rules for Efficient Detection Engineering

In modern Security Operations Centers (SOCs), effective threat detection relies heavily on the performance of detection rules within SIEM systems. While accurate detection is essential, the speed and efficiency with which these rules operate are equally crucial. Poorly performing rules can lead to delayed responses, missed threats, and an overload of system resources, directly impacting the overall security posture.

This blog dives deep into the factors that influence detection rule performance in SIEM environments, including rule execution time, resource utilization, and the effects of operator usage. By understanding these factors, security teams can optimize their SIEM rules to detect threats faster, minimize system strain, and improve both security outcomes and operational efficiency.

Let’s explore why optimizing detection rule performance is critical and how you can address common issues to ensure your SOC is running at peak efficiency.

The Critical Importance of Detection Rule Performance

Performance is one of the most critical factors in determining the effectiveness of a security system. Fast and efficient operation of detection rules is vital to ensuring a timely and effective response to security threats. Good performance not only ensures that threats are detected quickly but also supports optimal utilization of system resources. Among problematic rules used by SIEM users, 38% have log issues, and nearly 40% have performance issues. Due to these problems, SIEM users cannot utilize their rules effectively. You can find these details and more in the Blue Report 2024.

To understand the full impact of these issues, it’s essential to break down the key aspects of detection rule performance. In the following sections, we will examine the importance of rule performance under two main categories: speed and efficiency, and system resource utilization.

Speed and Efficiency in Detection Rules

The speed and efficiency of a detection rule determine how quickly and effectively a security system can detect and respond to threats. Security threats often occur as sudden events that need to be detected and isolated swiftly. Therefore, the speed and efficiency of detection rules play a critical role in the success of security operations.

The Importance of Speed: A rapid response to security threats minimizes the spread and damage of attacks. Fast detection gives attackers less time in the system and limits their potential damage. Response time is vital, especially in scenarios that require real-time detection and response.
The Importance of Efficiency: Efficient detection correlations reduce the number of false positives and false negatives. This allows security analysts to encounter fewer unnecessary alerts and focus on real threats. An efficient system can also perform more detections while consuming fewer resources, improving overall system performance.

Utilization of System Resources

Effective use of system resources is essential to ensure the sustainable performance of security systems. The efficient use of resources both reduces costs and improves the overall performance of the system.

Resource Management: Optimizing the resource usage of detection rules directly affects system performance. Efficient use of resources such as CPU, memory, and disk ensures faster and more efficient detection processes. These resources are needed not only to run the rules but also to ingest, without delay, the logs that the rules depend on. For example, optimizing compute-intensive analysis ensures that the system can allocate sufficient resources for other operations.
Scalability: Optimizing resource utilization is important for the system to cope with growing data volumes and increasing threat intensity. A scalable system can respond quickly to increasing demands and expand without performance degradation. This maintains the effectiveness of security operations in the long term.
Cost Effectiveness: Efficient use of resources reduces operating costs. Doing more with fewer resources reduces hardware and energy costs. Additionally, efficient systems prevent unnecessary expenditures, allowing budgets to be used more effectively.

Top Three Factors Affecting Detection Rule Performance

Performance is critical to the effectiveness of detection rules and is shaped by many factors. Some of these can be measured quantitatively, while others can only be assessed by their impact on the system. Measurable metrics include data ingestion rate, query execution time, query result size, resource utilization, data latency, and alert creation time.

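For instance, the following query estimates the ingestion rate of a single table (here SecurityAlert) as the average number of records per hour over the last day. Substitute the table you want to measure: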

SecurityAlert
| where TimeGenerated >= ago(1d)
| summarize IngestionRate = count() by bin(TimeGenerated, 1h)
| summarize AvgIngestionRate = avg(IngestionRate)

Similarly, query execution time can be easily determined through specific fields in the system, such as the ResponseDurationMs field in the LAQueryLogs table, or by using the QueryStartTimeUTC and QueryEndTimeUTC fields from the SentinelHealth table.

Figure 1. Microsoft Sentinel Rule Runtime Result
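As a rough sketch of the LAQueryLogs option, the query below summarizes ResponseDurationMs per query text over the last day. It assumes that query auditing is enabled so that LAQueryLogs is populated. The one-day window and the top-20 cut-off are arbitrary values chosen for illustration.

```kql
// Sketch: slowest queries over the last day, based on LAQueryLogs.
// Assumes query auditing is enabled for the workspace.
LAQueryLogs
| where TimeGenerated >= ago(1d)
| summarize Runs = count(),
            AvgDurationMs = avg(ResponseDurationMs),
            P95DurationMs = percentile(ResponseDurationMs, 95)
    by QueryText
| top 20 by P95DurationMs desc
```

Looking at the 95th percentile alongside the average helps separate queries that are consistently slow from those that are slow only occasionally.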

With these metrics in mind, it's important to explore the main factors that influence detection rule performance. In the following sections, we will break down key factors such as

rule execution time,
resource-consuming operator usage, and
dependency degradation,

all of which can significantly impact the efficiency and responsiveness of detection rules.

Rule Execution Time

Rule execution time, also known as rule response time, refers to how long it takes for a detection rule to identify a threat and communicate that detection to a security analyst or an automated response system. This response time is crucial for ensuring a rapid reaction to security incidents. Longer response times can delay the detection of cyber incidents, allowing attackers to remain in the system longer and potentially cause more damage.

Several factors can impact rule response time, including data processing speed, the size of the data being analyzed, and the overall system load (resource utilization). The faster the detection rule can operate, the quicker the system can identify and respond to threats. Conversely, if rule response times are slow, threats may not be detected and mitigated in time, leaving the system vulnerable and compromising its security.

Figure 2. Microsoft Sentinel Query Runtime Result

We will now discuss the factors that impact rule execution time. These include

detection rule complexity,
data volume,
resource utilization, and the
number of detection rules.

To begin, we will focus on detection rule complexity.

Detection Rule Complexity: Complex rules require more processing power, which can increase the rule's runtime. For example, using too many regex patterns or lists in the rules, or including a high number of log sources simultaneously, will increase the rule's complexity. Optimizing the rules and reducing unnecessary complexity can improve the rule's performance.

Figure 3. Microsoft Sentinel Rule Sample

Data Volume: Working with large data sets requires more time and resources. It is important to keep the data volume at manageable levels and use efficient data processing techniques. Likewise, scoping a detection rule to a smaller data set, for example by filtering early on time range and event type, shortens the rule's response time.

Figure 4. Relationship Between Data Volume And Query Runtime

Resource Utilization: Effective utilization of system resources such as CPU, memory and storage directly affects response time. Proper allocation and efficient use of resources can improve performance. Systems operating under heavy loads, where CPU and memory utilization is inefficient, will experience longer response times and may be unable to perform new operations due to their overloaded state.

Detection Rule Count: As the number of detection rules running on the system increases, the response time of each rule may increase. It can be useful to disable unnecessary rules and optimize existing rules.
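The complexity and data volume points above can be illustrated with a small, self-contained sketch. The ProcessEvents table and its rows are hypothetical stand-ins for a real process-creation log, and the regex is only an example pattern. The idea is to narrow by time and by a cheap term filter first, keep only the needed columns, and apply the regex last so that it runs on as few rows as possible.

```kql
// Hypothetical sample data; replace with your own process-creation table.
let ProcessEvents = datatable(TimeGenerated: datetime, Computer: string, CommandLine: string)
[
    datetime(2024-09-18 10:00:00), "host-a", "powershell.exe -enc SQBFAFgAIAAoAE4AZQB3AC0ATwBiAGoAZQBjAHQA",
    datetime(2024-09-18 10:05:00), "host-b", "cmd.exe /c whoami",
    datetime(2024-09-18 10:10:00), "host-c", "powershell.exe Get-Date"
];
ProcessEvents
| where TimeGenerated >= datetime(2024-09-18)      // 1. limit the time range first (ago(1h) or similar in a real rule)
| where CommandLine has "powershell"               // 2. cheap term filter before anything expensive
| project TimeGenerated, Computer, CommandLine     // 3. keep only the columns the rule needs
| where CommandLine matches regex @"(?i)-enc(odedcommand)?\s+[A-Za-z0-9+/=]{20,}"   // 4. regex last, on the reduced set
```

On the sample rows, only the host-a record passes all four steps. The other two are discarded before or by the regex.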

Resource Consuming Operator Usage

Resource-intensive operators are operators that cause detection rules to run slower than expected. These operators increase processing time, leading to delayed responses from detection rules. Slow performance can prevent detection rules from running effectively and negatively impact overall system performance.

Operators Requiring Effort: Some operators (such as contains, '*', extract, etc.) can make writing rules much easier, but they often degrade performance by consuming system resources intensively. The use of such operators should be minimized, or they should be replaced with more efficient alternatives. The image below, taken from the Kusto best practices documentation [1], lists more efficient alternatives for many common cases.

Figure 5. Kusto Query Best Practice
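To make the trade-off concrete, the sketch below compares contains with has on a few hypothetical values. The two are not interchangeable: has matches whole terms, while contains matches any substring, so any replacement should be checked against the values the rule is meant to catch.

```kql
// Hypothetical sample values for illustration only.
let Samples = datatable(CommandLine: string)
[
    "cmd.exe /c systeminfo",
    "mysysteminfo2.exe",
    "cmd.exe /c SYSTEMINFO > out.txt"
];
Samples
| extend MatchContains = CommandLine contains "systeminfo",   // substring match, case-insensitive
         MatchHas      = CommandLine has "systeminfo",        // whole-term match, case-insensitive
         MatchHasCs    = CommandLine has_cs "systeminfo"      // whole-term match, case-sensitive
```

The second row matches contains but not has, because "systeminfo" appears there only as part of a longer term. The third row matches has but not has_cs, because of its casing.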

Dependency Degradation

Dependency degradation refers to the deterioration of external dependencies (e.g., data sources like lists or logs, or other software components like APIs) that adversely affects the performance of a detection rule. The reliability and performance of dependencies can directly impact the accuracy and speed of detection rules. Therefore, it is important to monitor dependencies and address performance issues quickly.

Slowdown of External Services (APIs, etc.): Delays in external APIs or services that are required for detection rules to work can lead to performance issues. It is important to monitor such dependencies and prepare alternatives.
Database Performance: Slow databases affect the data access and processing speed of detection rules. Optimizing database performance and using appropriate indexing methods can be beneficial.
Network Delays: Delays in data transmission over the network can degrade the performance of detection rules. Network performance should be monitored and necessary improvements should be made.
Use of Lists and Functions: The use of lists or complex functions in rules may cause performance degradation. Identifying and removing or optimizing possible unnecessary uses contributes to improved performance.
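One dependency that can be watched directly from the workspace is log arrival delay. As a minimal sketch, the query below compares ingestion_time() with TimeGenerated for a single table. SecurityEvent is only an example source, and the one-day window and hourly bins are assumptions to adapt to your environment.

```kql
// Sketch: hourly ingestion delay for one log source.
SecurityEvent
| where TimeGenerated >= ago(1d)
| extend DelaySeconds = datetime_diff('second', ingestion_time(), TimeGenerated)
| summarize AvgDelaySeconds = avg(DelaySeconds),
            P95DelaySeconds = percentile(DelaySeconds, 95)
    by bin(TimeGenerated, 1h)
| sort by TimeGenerated asc
```

A sustained rise in delay for a source that a rule depends on suggests that the rule may be evaluating incomplete data within its query period.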

Example Scenarios: Use Cases and Solutions for Detection Rules

In order to understand and solve performance issues, it is extremely useful to work through real-world examples. In this section, we will discuss common scenarios that can be encountered and how performance can be affected in these scenarios. We will also provide practical solutions for each scenario, aiming to make detection rules work more effectively and efficiently.

The example scenarios and solutions will help you identify performance issues and develop appropriate strategies.

Use Case - Wide Time Range CPU Effect

One of the most common situations that can negatively impact the performance of detection rules is queries that cover very large time intervals. Such queries require processing large amounts of data and, as a result, can lead to over-utilization of system resources and increased response times.

How can we identify rules that cover a wide range of time periods?

Microsoft Sentinel:

SentinelHealth
| where OperationName == "Scheduled analytics rule run"
| extend QueryPeriod = tostring(ExtendedProperties.QueryPeriod)
| extend QueryFrequency = tostring(ExtendedProperties.QueryFrequency)
| distinct QueryFrequency, QueryPeriod, SentinelResourceName
| sort by QueryFrequency

This query quickly lists the query frequency and query period of all your scheduled rules.
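As a possible refinement, the sketch below converts both values to timespans so that they can be filtered and compared numerically. It assumes that the QueryPeriod and QueryFrequency values in ExtendedProperties are in a format that totimespan() can parse. The one-day cut-off is an example value.

```kql
// Sketch: latest settings per rule, limited to rules that look back more than one day.
SentinelHealth
| where OperationName == "Scheduled analytics rule run"
| extend QueryPeriod = totimespan(tostring(ExtendedProperties.QueryPeriod)),
         QueryFrequency = totimespan(tostring(ExtendedProperties.QueryFrequency))
| summarize arg_max(TimeGenerated, QueryPeriod, QueryFrequency) by SentinelResourceName
| where QueryPeriod > 1d
| extend PeriodToFrequency = QueryPeriod / QueryFrequency   // roughly how many runs re-read the same data
| sort by QueryPeriod desc
```

A high PeriodToFrequency ratio points to rules that repeatedly scan the same logs, which makes them natural candidates for a shorter query period.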

What do you do when a rule covering a wide time range is detected?

If a rule covers a large time period, it will try to process a large amount of data. This increases both CPU usage and the time required to analyze the data. In the example rule below, the rule originally looked back over one day of data. After its time range was changed, it analyzed data with lower CPU usage and in less time.

Figure 6. Detecting The Rule With Poor Performance

The rule 'NOBELIUM - Script payload stored in Registry' was initially configured to analyze one day's worth of data once a day, which led to significant issues with both CPU usage and the volume of processed data. After making improvements, both CPU usage and the amount of scanned data have decreased significantly.

Figure 7. Improvement Results Of The Rule With Poor Performance

Use Case - Resource Consuming Operator Usage

Another important factor that can negatively affect the performance of detection rules is the use of operators that slow queries down. These operators may differ for each SIEM platform. Examples include operators like Contains, EndsWith, StartsWith, and Regex. The use of these operators is sometimes unavoidable. However, using too many of them in a query or using them more often than needed can delay the results.

How can we detect the operators that slow down the query from within the rules?

There may be no direct way to list them, but the query text stored with each alert offers an easy workaround. In this example, we build our query around the 'contains' operator. The query can be changed for different operators as needed.

Microsoft Sentinel:

SecurityAlert
| extend Query_ = tostring(parse_json(ExtendedProperties).Query)
| extend QueryCount_Contains = countof(Query_, "contains")
| where QueryCount_Contains > 7 //The number here can change according to your system.
| distinct QueryCount_Contains, DisplayName

With this query, rules that use the operator more than a chosen number of times can be detected easily.
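The same idea can be extended to several operators at once. The sketch below is one possible variation: the operator list and the cut-off of 7 are examples, and the column names introduced with extend are made up for illustration. Because countof() performs a plain substring count here, it also counts variants such as "!contains" and "contains_cs", as well as any occurrence inside string literals or comments.

```kql
// Sketch: count several potentially expensive operators per rule query.
SecurityAlert
| where ProductComponentName == "Scheduled Alerts"
| extend Query_ = tolower(tostring(parse_json(ExtendedProperties).Query))
| where isnotempty(Query_)
| summarize arg_max(TimeGenerated, Query_) by DisplayName     // latest query text per rule
| extend ContainsCount   = countof(Query_, "contains"),
         StartsWithCount = countof(Query_, "startswith"),
         EndsWithCount   = countof(Query_, "endswith"),
         RegexCount      = countof(Query_, "matches regex")
| extend ExpensiveOperatorCount = ContainsCount + StartsWithCount + EndsWithCount + RegexCount
| where ExpensiveOperatorCount > 7   // example cut-off; adjust to your system
| sort by ExpensiveOperatorCount desc
```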

What should be done when a rule that uses operators that slow down the query is detected?

In these cases, you can rewrite your query with less expensive operators or, where possible, prepare queries that match values directly without these operators.

When the above query is run, it shows that 8 “contains” statements are used in the “System Information Discovery” rule. Microsoft's “Best practices for Kusto Query Language queries” document recommends using the “has” operator instead of “contains” where possible.

Figure 8. Query CONTAINS Count

Figure 9. Kusto Query Best Practice For CONTAINS Operator

After the modifications, unnecessary “contains” operators have been removed or replaced and the rule has improved in terms of performance.
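The rule's actual query is not reproduced here, so the following is only a hypothetical sketch of this kind of change: a chain of contains conditions collapsed into a single has_any. The table, its rows, and the command list are invented for illustration. Since has_any matches whole terms, the result set should be compared with the original rule before switching.

```kql
// Hypothetical sample data standing in for process-creation logs.
let ProcessEvents = datatable(Computer: string, CommandLine: string)
[
    "host-a", "cmd.exe /c systeminfo",
    "host-b", "cmd.exe /c hostname",
    "host-c", "notepad.exe report.txt"
];
// Before: one substring scan per keyword.
// ProcessEvents
// | where CommandLine contains "systeminfo" or CommandLine contains "hostname" or CommandLine contains "whoami"
// After: a single term-based check over the whole list.
let DiscoveryCommands = dynamic(["systeminfo", "hostname", "whoami"]);
ProcessEvents
| where CommandLine has_any (DiscoveryCommands)
```

On the sample rows, both versions return the host-a and host-b records.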

Use Case - More than One Day Query Period and More than 10 Hits

Some detection rules analyze large time intervals. As detailed in the previous use case scenario, this can lead to high CPU utilization and cause the rule to generate alerts with delays, which keeps the rule from working efficiently. Because such rules sometimes generate an alert for every matching log, they also burden the system with excessive alert records.

How can we detect alerts with a period of more than one day and a hit count more than 10 above the Trigger Threshold?

When you analyze the logs of the rules that have triggered an alarm, each log shows the hit count. This hit count contains various clues that the performance of the system is being negatively affected.

For example, a rule with a Trigger Threshold of 1 has a hit count of more than 10. In this case, one or more of the following may apply to the rule:

It uses a wide time window, negatively impacting CPU, RAM, and disk usage.
It generates a high number of false alarms due to an erroneous query.
The log triggering the alarm occurs multiple times, resulting in duplicate alerts.

Now, let us take a look at how the query looks on Microsoft Sentinel.

Microsoft Sentinel:

SecurityAlert
| where ProductComponentName == "Scheduled Alerts"
| extend end_time = todatetime(parse_json(ExtendedProperties).["Query End Time UTC"])
| extend start_time = todatetime(parse_json(ExtendedProperties).["Query Start Time UTC"])
| extend diff_day = datetime_diff('day',end_time,start_time)
| where diff_day > 1
| extend result_count = toint(parse_json(ExtendedProperties).["Search Query Results Overall Count"])
| extend threshold = toint(parse_json(ExtendedProperties).["Trigger Threshold"])
| where result_count > threshold + 10 // The number here can change according to your system.

When the query above is run, we can quickly detect rules with a time period of more than one day and more than 10 hits above the threshold value. Here, the number 10 is chosen as an example. Depending on the SIEM you are using, this number can be adjusted.

What should be done in such a case?

When such a situation is detected, first shorten the log time interval that the rule checks, if possible. With a shorter interval, an event can be detected sooner. Second, examine all triggers of the rule to determine whether the issue is momentary, which also helps identify when the problem started. Third, check that the rule's threshold value is still correct, since it may have been set from an analysis made when the rule was first created and may now need updating. Finally, check the incoming logs in terms of time and content for any repetition.

For example, we found that the “Suspicious Mimikatz Usage” rule looks at data from the last 12 days and triggers when the result count is greater than 0, meaning that it will generate an alarm on any matching log.

Figure 10. Detection Of Ineffective Rule Usage

When we evaluated the checkpoints mentioned above, we found that the rule was running with an incorrect time interval. Correcting the interval allowed the rule to detect a suspicious situation in a shorter time.
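The second check, examining all triggers of a rule to see when the problem started, could be sketched as follows. The rule name is taken from the example above, and the 30-day window and daily bins are assumptions. Replace them to suit the rule under review and your retention settings.

```kql
// Sketch: daily alert volume and hit counts for one rule, to see when its behavior changed.
let RuleName = "Suspicious Mimikatz Usage";   // example rule name; replace with the rule under review
SecurityAlert
| where TimeGenerated >= ago(30d)
| where ProductComponentName == "Scheduled Alerts"
| where DisplayName == RuleName
| extend result_count = toint(parse_json(ExtendedProperties).["Search Query Results Overall Count"])
| summarize Alerts = count(),
            MaxHits = max(result_count),
            AvgHits = avg(result_count)
    by bin(TimeGenerated, 1d)
| sort by TimeGenerated asc
```

A sudden jump in Alerts or MaxHits on a particular day narrows down when a log source, threshold, or query change began to affect the rule.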

Conclusion

Security tools such as SIEM and EDR are vital for fast and accurate threat detection as well as timely response. Optimizing performance not only enables faster response to security threats but also allows for more efficient use of system resources. In the long run, this reduces costs and improves the overall effectiveness of security operations. Ultimately, the success of security operations depends on continuous performance monitoring and improvement.

It's important to note that SIEM systems are complex and constantly changing. As a result, developing and maintaining automation to address potential issues will require significant time and effort.

The Picus Detection Rule Validation (DRV) module identifies broken and inefficient detection rules within your SIEM. It has a continuously updated checklist for identifying these detection rules and uncovers the root causes behind rules not performing as expected. By doing so, it saves significant time and effort compared to manual operations while also enhancing the effectiveness and efficiency of security controls. To use DRV, simply complete the integration process with your SIEM, which takes only a few minutes.

You can request a demo to identify issues related to SIEM rules and gain insights to enhance your rules here.

Reference

[1] “Best practices for Kusto Query Language queries.” Available: https://learn.microsoft.com/en-us/kusto/query/best-practices?view=microsoft-fabric. [Accessed: Sep. 18, 2024]