Health Status Page

The Health Status page is an on-demand assessment of the Exabeam pipeline. It is broken down into three categories:

General Health - General health tests that all of the back-end services are running - database storage, log feeds, snapshots, CPU, and memory.

Connectivity - Checks that Exabeam is able to connect to external systems, such as LDAP and SIEM.

Log Feeds - This section reports on the health of the DC, VPN, Security Alerts, Windows Servers, and session management logs.

In all of the above areas, GREEN indicates the status is good, YELLOW for a warning, and RED if the system is critical.

Located on the homepage are the Proactive Health Checks that will alert administrations when:

• Any of the core Exabeam services are not running

• There is insufficient disk storage space

• Exabeam has not been fetching logs from the SIEM for a configurable amount of time

Proactive and On-Demand System Health Checks

System Health is used to check the status of critical functionality across your system and assists Exabeam engineers with troubleshooting. Exabeam provides visibility on the backend data pipeline via Health Checks. Graphs and tables on the page visually represent the health status for all of the key systems, as well as indexes and the appliance, so you are always able to check statuses at a glance and track health over time.

Proactive health checks run automatically and periodically in the background.

On-demand health checks can be initiated manually and are run immediately. All newly gathered health check statuses and data is updated in the information panes on the page. All proactive and on-demand health checks are listed on the Health Checks page. Proactive health checks are visible by any user in your organization. Only users with administrator permission can reach the page.

When a health check is triggered, a notification message is displayed in the upper right corner of the UI. Select the alert icon to open a side panel that lists the alerts and provides additional detail. A panel listing all notifications is expanded.

These alerts are also listed under the Health Alerts tab in the System Health page. In general:

• Warning: There is an issue that should be brought to the attention of the user.

To reach the Health Checks page, navigate to the System Health page from the Settings tab at the top right corner of any page, then select the Health Checks tab.

Health check categories are:

• Service Availability – License expiration, database, disaster recovery, Web Common application engine, directory service, aggregators, and external connections

• Node Resources – Load, performance, and retention capacity

• Service Availability (Incident Processors and Repositories) - IR, Hadoop, and Kafka performance metrics

• Log Feeds – Session counts, alerts, and metrics

• System Health Information – Core data and operations processor metrics

• Elasticsearch Storage (Incident Responder) – Elasticsearch capacity and performance metrics

Configure Alerts for Worker Node Lag

When processing current or historical logs, an alert will be triggered when the worker node is falling behind the master node. How far behind can be configured in /opt/config/exabeam/tequila/custom/health.conf. The parameters are defined below:

• RTModeTimeLagHours - During real-time processing the default setting is 6 hours.

• HistoricalModeTimeLagHours - During historical processing the default setting is 48 hours.

• syslogIngestionDelayHour - If processing syslogs, the default setting is 2 hours.

}

slaveMasterLagCheck {

printFormats = {

json = "{ \"lagTimeHours\": \"$lagTimeHours\", \"masterRunDate\": \"$masterRunDate\", \"slaveRunDate\": \"$slaveRunDate\", \"isRealTimeMode\": \"$isRealTimeMode\"}"

plainText = "Worker nodes processing lagging by more than $lagTimeHours hours. Is in real time:$isRealTimeMode "

}

RTModeTimeLagHours = 6

HistoricalModeTimeLagHours = 48

}

limeCheck {

}

System Health Checks

Martini Service Check: Martini is the name Exabeam has given to its Analytics Engine. In a multi-node environment, Martini will be the Master node.

Tequila Service Check: Tequila is the name Exabeam has given to its User Interface layer.

Lime Service Check: LIME (Log Ingestion and Message Extraction) is the service within Exabeam that ingests logs from an organization's SIEM, parses and then stores them in HDFS. The main service mode parses message files and creates one message file per log file. This mode is used to create message files that will be consumed by the main node.

Mongo Service Check: MongoDB is Exabeam's chosen persistence database. A distributed MongoDB system contains three elements: shards, routers, and configuration servers (configsvr). The shards are where the data is stored; the routers are the piece that distributes and collect the data from the different shards; and the configuration servers which tracks where the various pieces of data are stored in the shards.

Zookeeper Service Check: Zookeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. In a distributed multi-node environment, we need the ability to make a value change inside one process on a machine and have that change be seen by a different process on a different machine. Zookeeper provides this service.

Hadoop Service Check: Master - Hadoop is Exabeam's distributed file system, where the raw logs and parsed events are stored. These files are available to all nodes.

Ganglia Service Check: Ganglia is a distributed monitoring system for computing systems. It allows us to view live or historical statistics for all the machines that are being monitored.

Available on the System Health page, the Storage Usage tab provides details regarding the current data retention settings for your Advanced Analytics deployment. Advanced Analytics sends notifications when available storage capacity dwindles to a critical level. Admins have the option to enable and configure automatic data retention and data purging for both HDFS and MongoDB usage.

Default Data Retention Settings

The table below lists default retention periods for data elements in your Advanced Analytics deployment:

Element

Description

Default Retention Period

Logs

Original logs

90 days

Events on Disk

Parsed events on disk

180 days

Events in Database

Parsed events in the database

180 days

Triggered Rules and Sessions in Database

Container and triggered rule collections in the database

365 days

System Health Alerts for Low Disk Space

Get notified when your disk is running low on space.

Your Advanced Analytics system may go down for several hours when it upgrades or restarts. During this down time, your log source continues to send logs to a disk, accumulating a backlog of unprocessed logs. Log Ingestion and Messaging Engine (LIME) tries to process this backlog when it starts running again. If your log source sends logs faster than what your system size can handle, LIME may struggle to process these logs, run out of disk space, and stop working correctly.

If LIME is running, you receive two system health alerts. When the disk has 25 percent capacity remaining, the first health alert notifies you that you're running low on disk space. In the rare case that your disk has 15 percent capacity remaining, a second health alert notifies you that your system has deleted files, starting with the largest one, as a last resort to keep your system running.

If LIME goes down, Advanced Analytics can't send health alerts. When LIME is running again, you may receive a belated health alert. In the rare case that your disk reached 15 percent remaining capacity while LIME was down, this health alert notifies you that your system has deleted files, starting with the largest one, as a last resort to keep your system running.

When you receive these health alerts, consider tuning your system so it ingests less logs or ingests logs more slowly. If you ingest logs from Data Lake, consider setting a lower log forwarding rate.

System Health Alerts for Paused Parsers

Get notified automatically when Advanced Analytics pauses a parser.

To protect your system from going down and ensure that it continues to process data in real time, Advanced Analytics detects when a parser is taking an abnormally long time to parse a log, then pauses it.

When Advanced Analytics pauses a parser, you receive a health alert that describes which parser was paused, on which node, and recommended actions you could take. When Advanced Analytics pauses additional parsers, it resolves the existing health alert and sends a new alert that lists all paused parsers.

Advanced Analytics also resolves an existing health alert any time you restart the Log Ingestion and Messaging Engine (LIME) and resume a paused parser. For example, if you change the threshold for when parsers get paused, you must restart LIME, which resumes any paused parsers and prompts Advanced Analytics to resolve any existing health alerts. If a resumed parser continues to perform poorly, Advanced Analytics pauses it again and sends another health alert about that parser.

View Storage Usage and Retention Settings

By default, Advanced Analytics implements data retention settings for logs and messages. This allows the system to automatically delete old data, which reduces chances of performance issues due to low repository space.

Available on the System Health page, the Storage Usage tab provides details regarding the current data retention settings for your Advanced Analytics deployment, including:

• HDFS Data Retention – Number of days the data has been stored, the number of days remaining before it is automatically deleted, and a toggle to enable/disable auto-deletion of old data.

Note

If enabling or disabling Auto-delete Old Data, you must also restart the Log Ingestion Engine before the change goes into effect. For more information on restarting the Log Ingestion Engine, please refer to the Restart the Analytics Engine section.

A warning message appears if the average rate of HDFS usage exceeds the intended retention period.

• HDFS Usage – Total HDFS usage, including specific breakdowns for logs, events on disk, and events in database .

• MongoDB Retention – A dialog to set retention settings and a toggle to enable/disable disk-based deletion of old data.

You can edit the MongoDB data retention settings by clicking the pencil icon.

The capacity used percentage threshold is set to 85% by default, with a maximum value of 90%. It helps to prevent MongoDB from reaching capacity before hitting the default retention period threshold. As soon as MongoDB meets the percentage on any node, the system will start purging events until it is back below the capacity used threshold.

In addition to the capacity used percentage, Advanced Analytics keeps 180 days of event data in MongoDB and 365 days of triggered rules and containers data by default.

Note

The Days for Triggered Rules & Sessions value cannot by less than the Days for Events value.

• MongoDB Usage – Total MongoDB usage.

Set Up System Optimization

This tab is a single aggregated page for auditing and viewing disabled data types (including models, event types, and parsers) and system load redistribution.

Disabled Models

When a model takes up too much memory, it is disabled. If you enable these models, the system may suffer performance issues.

Note

Exabeam disables models as a last resort to ensure system performance. We have other defensive measures to prevent models from using all of the system's memory. These measures include enabling model aging and limiting the model bin size count. If these safeguards cannot prevent a model from consuming too much memory, the system prevents itself from shutting down as it runs out of memory.

Configure Model Maximum Bin Limit
Hardware and Virtual Deployments Only

Models are disabled once they reach their maximum bin limit. You can either set a global configuration for the maximum number of bins or an individual configuration for each model name.

This is done by setting a max number of bins, MaxNumberOfBins in exabeam_default.conf or the model definition, for categorical models. The new limit is 10 million bins, although some models, such as ones where the feature is "Country" have a lower limit. We have put many guardrails in place to make sure models do not consume excessive memory and impact overall system health and performance. These include setting a maximum limit on bins, enabling aging for models, and verifying data which goes into models to make sure it is valid. If a model is still consuming excessive amounts of memory then we will proceed to disable that model.

To globally configure the maximum bin limit, navigate to the Modeling section located in the exabeam_default.conf file at /opt/exabeam/config/default/exabeam_default.conf.

All changes should be made to /opt/exabeam/config/custom/custom_exabeam_config.conf:

Modeling {
...
# To save space we limit the number of bins saved for a histogram. This defaults to 100 if not present
# This parameter is only for internal research
MaxSizeOfReducedModel = 100
MaxPointsToCluster = 250
ReclusteringPeriod = 24 hours
MaxNumberOfBins = 1000000

Additionally, it is possible to define a specific bin limit on a per model basis via the model definition section in the models.conf file at /opt/exabeam/config/custom/models.conf. Here is an example model, where MaxNumberOfBins is set to 5,000,000:

WTC-OH {
ModelTemplate = "Hosts on which scheduled tasks are created"
Description = "Models the hosts on which scheduled tasks are created"
Category = "Asset Activity Monitoring"
IconName = ""
ScopeType = "ORG"
Scope = "org"
Feature = "dest_host"
FeatureName = "host"
FeatureType = "asset"
ModelType = "CATEGORICAL"
BinWidth = "5"
MaxNumberOfBins = "5000000"
AgingWindow = ""
CutOff = "10"
Alpha = "2"
ConvergenceFilter = "confidence_factor>=0.8"
HistogramEventTypes = [
]
Disabled = "FALSE"
}

Paused Parsers

If your parser takes too long to parse a log and meets certain conditions, Advanced Analytics pauses the parser.

To keep your system running smoothly and processing data in real time, Advanced Analytics detects when a parser is taking an abnormally long time to parse a log, then pauses it.

To identify a slow parser, your system places the parser in a cache. Every configured period, OutputParsingTimePeriodInMinutes (five minutes by default), it calculates how long it takes, on average, for the parser to parse a log. It compares this average to a configurable threshold, ParserDisableThresholdInMills, in lime.conf.

To calculate a percentage of how much the parser makes up of the total parsing time, your system divides the previously-calculated average by the total time all parsers took to parse an event in the same five minute period. It compares the percentage to another configurable threshold, ParserDisableTimePercentage, in lime.conf.

Your system conducts a second round of checks if the parser meets certain conditions:

• The average time it takes for the parser to parse a log exceeds a threshold, ParserDisableThresholdInMills (seven milliseconds by default).

• The parser constitutes more than a certain percentage, ParserDisableTimePercentage (50 percent by default), of the total parsing time of all parsers.

During another five-minute period, your system checks the parser for a second time. If the parser meets the same conditions again, your system pauses the parser.

When a parser meets these conditions on a Log Ingestion and Messaging Engine (LIME) node, it is paused only on that node. If you have multiple LIME nodes, it is not automatically paused on all nodes unless it meets these conditions on every node.

If you have a virtual or hardware deployment, you can configure OutputParsingTimePerioidInMinutes, ParserDisableThresholdInMills, and ParserDisableTimePercentage to better suit your needs. If you have a cloud-delivered deployment, contact Exabeam Customer Success to configure these variables.

View Paused Parsers

To keep your system running smoothly and processing data in real time, Advanced Analytics detects slow or inefficient parsers, then pauses them. View all paused parsers in Advanced Analytics, under System Health., or in your system.

View Paused Parsers in Advanced Analytics
1. In the navigation bar, click the menu , then select System Health.

2. Select the System Optimization tab.

3. Select Paused Parsers.

View a list of paused parsers, sorted alphabetically by parser name, and information about them, including:

• Parser Name – The name of the paused parser.

• Average Log Line Parse Time – Average time the parser took to parse each event.

• Paused Time – Date and time when the parser was paused.

View Paused Parsers in Your System

If you a hardware or virtual deployment, you can view paused parsers in disabled_parsers_db.current_collection, located in the MongoDB database. Run:

bash\$ sos; mongo
mongos> use disable_parser_db
mongos> db.current_collection.findOne()

For example, a paused parser collection may look like:

{
"_id" : ObjectId("5cf5bb9e094b83c6f7f89b86"),
"_id" : ObjectId("5cf5bb9e094b83c6f7f89b86"),
"parser_name" : "raw-4625",
"average_parsing_time" : 9.774774774774775,
"time" : NumberLong(1559608222854)
}
Configure Conditions for Pausing Parsers
Hardware and Virtual Deployments Only

Change the conditions required for Advanced Analytics to pause slow, stuck, or failed parsers.

Change how Advanced Analytics defines and identifies parsers that are taking an abnormally long time to parse logs. You can change three variables: OutputParsingTimePeriodInMinutes, ParserDisableThresholdInMills, or ParserDisablePercentage.

1. If this is the first time you're changing this configuration, copy it from lime_default.conf to custom_life_config.conf:

1. Navigate to /opt/exabeam/config/default/lime_default.conf, then navigate to the LogParser section.

2. Copy the LogParser section with the variables you're changing, then paste it in /opt/exabeam/config/custom/custom_lime_config.conf. For example:

LogParser{
#Output parsing performance in debug mode, be cautious this might affect performance in parsing
OutputParsingTime = true
OutputParsingTimePeriodInMinutes = 5
AllowDisableParser = true // If this is enabled, output parsing time will be enabled by default.
ParserDisableThresholdInMills = 7 //If average parsing time pass this threshold, we will disable that parser
ParserDisableTimePercentage = 0.5
}
2. In custom_lime_config.conf, change OutputParsingTimePeriodInMinutes, ParserDisableThresholdInMills, or ParserDisablePercentage.

• OutputParsingTimePeriodInMinutes – The intervals in which your system calculates a parser's average parse time and total parsing percentage. The value can be any integer, representing minutes.

• ParserDisableThresholdInMills – For a parser to be paused, its average parse time per log must exceed this threshold. The value can be any integer, representing millseconds.

• ParserDisablePercentage – For a parser to be paused, it must constitute more than this percentage of the total parsing time. The value can be any decimal between 0.1 and 0.9. The higher you set this percentage, the more severe the parsers you pause.

3. Save custom_lime_config.conf.

1. Source the shell environment:

source /opt/exabeam/bin/shell-environment.bash
2. Restart LIME:

lime-stop; lime-start

Disabled Event Types

When a high volume user or asset amasses a large number of events of a certain event type, and that event type contributes to a large portion of the overall event count for that user (typically 10M+ events in a session) the event type is automatically disabled and listed here.

Note

You are also shown an indicator when Advanced Analytics determines that the event type is problematic and disables it for the entity. The affected User/Asset Risk Trend and Timeline accounts for the disabled event type by displaying statistics only for the remaining events.

Disabled event types are displayed on the System Optimization tab of the System Health page. You can see a list of all event types that have been disabled, along with the users and assets for which they have been disabled for.

The Disabled Event Type by Users and Assets table is sorted first alphabetically by event type, then sorted by latest update timestamp.

The table includes columns with the following categories:

• Event Type – The disabled event type.

• Count – Last recorded total number of events for this entity.

• Last Log Received – Date and time of the event that triggered the disabling of this event type for the specified entity.

• Disabled Time – Date and time for when the event type was disabled for this entity.

You can also view disabled event types for an entity in metadata_db. Here is a sample disabled event types collection:

> db.event_
db.event_count_stats_collection db.event_disabled_collection
mongos> db.event_disabled_collection.findOne()
{
"_id" : ObjectId("5ce346016885455be1648a0f"),
"entity" : "exa_kghko5dn",
"last_count" : NumberLong(53),
"disabled_time" : NumberLong("1558399201751"),
"last_event_time" : NumberLong("1503032400000"),
"is_entity_disabled" : true,
"sequence_type" : "session",
"disabled_event_types" : [
"Batch-logon",
"Remote-access",
"Service-logon",
"Kerberos-logon",
"Local-logon",
"Remote-logon"
]
}
Configure Thresholds for Disabling Event Types
Hardware and Virtual Deployments Only

When an entity's session has 10 million or more events in a session and an event type contributes to 70% or more of the events in that session, then that event type is disabled. If no single event type accounts for over 70% of the total event count in that session, then that entity is disabled. These thresholds, EventCountThreshold and EventDisablePercentage are configurable.

To configure the thresholds, navigate to the Container section located in the exabeam_default.conf file at /opt/exabeam/config/default/exabeam_default.conf.

All changes should be made to /opt/exabeam/config/custom/custom_exabeam_config.conf:

Container {
...
EventCountThreshold = 10000000 // Martini will record the user as top user once a session or a sequence has more than
// 10 million events
TopNumberOfEntities = 10 // reporting top 10 users and assets
EventDisablePercentage = 0.7 // a single event type that accounts for 70% of all event types for a disabled user
EventCountCheckPeriod = 60 minutes
}

Acceptable values for EventCountThreshold are 10,000,000 to 20,000,000 for a typical environment. EventDisablePercentage can be a percentage decimal value between 0.1 to 0.9.

Hardware and Virtual Deployments Only

You can opt to manually configure the system load redistribution by creating a manual config section in custom_exabeam_config.conf.

To configure manual redistribution:

1. Run the following query to get the current distribution in your database:

mongo metadata_db --eval

'db.event_category_partitioning_collection.findOne()'

2. Taking the returned object, create a manual config section in /opt/exabeam/config/custom/custom_exabeam_config.conf.

3. Edit the configuration to include all the hosts and event categories you want.

4. Choose which categories should be shared by certain hosts by using true or false parameter values after “Shared =”.

For example, the below section shows that host 2 and 3 both share the web event category. All other categories, which are marked as Shared = “false”, are owned solely by one host.

Partitioner {
Partitions {
exabeam-analytics-slave-host3 = [
{
EventCategory = "database",
Shared = "false"
},
{
EventCategory = "other",
Shared = "false"
},
{
EventCategory = "web",
Shared = "true"
},
{
EventCategory = "authentication",
Shared = "false"
},
{
EventCategory = "file",
Shared = "false"
}]
exabeam-analytics-slave-host2 = [
{
EventCategory = "network",
Shared = "false"
},
{
EventCategory = "app-events",
Shared = "false"
},
{
EventCategory = "endpoint",
Shared = "false"
},
{
Shared = "false"
},
{
EventCategory = "web",
Shared = "true"
}
]
}
}

Exabeam can automatically identify overloaded worker nodes, and then take corrective action by evenly redistributing the load across the cluster.

This redistribution is done by measuring and comparing job completion time. If one node finishes slower by 50% or more compared to the rest of the nodes, then a redistribution of load is needed. The load is then scheduled to be rebalanced by event categories.

You can enable automatic system load redistribution on the System Optimization tab of the System Health page. This option is enabled by default. Doing so allows the system to check the load distribution once a day.

Note

It is not recommended that you disable the system rebalancing option as it will result in uneven load distribution and adverse performance impacts to the system. However, if you choose to do so, you can configure manual redistribution to avoid such impacts.

You must restart the Exabeam Analytics Engine for any changes to System Rebalancing to take effect.

The System Load Redistribution tab shows an indicator when a redistribution of load is needed, is taking place, or has completed. The rebalancing process can take up to two hours. During this time you may experience slower processing and some data may not be available. However, the system resumes normal operations once redistribution is complete.

Automatic Shutdown

If the disk usage on an appliance reaches 98%, then the Analytics Engine and log fetching will shut down automatically. Only when log space has been cleared and usage is below 98% will it be possible to restart.

Users can restart either from the CLI or the Settings page.