Exabeam Advanced Analytics Administration Guide

Health Status Page

The Health Status page is an on-demand assessment of the Exabeam pipeline. It is broken down into three categories:

General Health - Tests that all of the back-end services are running - database storage, log feeds, snapshots, CPU, and memory.

Connectivity - Checks that Exabeam is able to connect to external systems, such as LDAP and SIEM.

Log Feeds - This section reports on the health of the DC, VPN, Security Alerts, Windows Servers, and session management logs.

In all of the above areas, GREEN indicates that the status is good, YELLOW indicates a warning, and RED indicates that the system is in a critical state.

The Proactive Health Checks located on the homepage alert administrators when:

  • Any of the core Exabeam services are not running

  • There is insufficient disk storage space

  • Exabeam has not been fetching logs from the SIEM for a configurable amount of time
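The SIEM log-fetch check above amounts to a staleness comparison against a configurable threshold. A minimal sketch, assuming a hypothetical siem_fetch_is_stale helper (not Exabeam code; the 2-hour default is an arbitrary example, since the real threshold lives in Exabeam's configuration):

```python
from datetime import datetime, timedelta

def siem_fetch_is_stale(last_fetch_time: datetime,
                        max_lag: timedelta = timedelta(hours=2)) -> bool:
    """Return True if no logs have been fetched from the SIEM within max_lag.

    max_lag stands in for the "configurable amount of time" described above;
    this helper is illustrative, not part of the product.
    """
    return datetime.utcnow() - last_fetch_time > max_lag

# A fetch 3 hours ago exceeds a 2-hour threshold and would raise an alert.
print(siem_fetch_is_stale(datetime.utcnow() - timedelta(hours=3)))  # True
```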

Proactive and On-Demand Health Checks

Advanced Analytics provides visibility on the backend data pipeline via Health Checks. Health checks can be either proactive or on-demand.

Proactive health checks run automatically and periodically in the background. When a proactive check is triggered, a notification message is displayed.

Email notifications are indicated by a bell icon on the homepage menu.

On-demand health checks can be initiated manually and run immediately. All newly gathered health check statuses and data are updated in the information panes on the page.

Health check categories are:

  • Service Availability – License expiration, database, disaster recovery, Web Common application engine, directory service, aggregators, and external connections.

  • Node Resources – Load, performance, and retention capacity.

  • Log Feeds – Session counts, alerts, and metrics.

  • System Health Information – Core data and operations processor metrics.

  • Service Availability (Incident Processors and Repositories) - IR, Hadoop, and Kafka performance metrics.

  • Elasticsearch Storage (Incident Responder) – Elasticsearch capacity and performance metrics.

All proactive and on-demand health checks are listed on the Health Checks page. Proactive health checks are visible to any Advanced Analytics user in your organization.

Note

Only users with administrator permissions can access the page.

Figure 3. System Health - Advanced Analytics Health Checks menu


Configure Alerts for Worker Node Lag

When processing current or historical logs, an alert is triggered when a worker node falls too far behind the master node. The lag threshold can be configured in /opt/config/exabeam/tequila/custom/health.conf. The parameters are defined below:

  • RTModeTimeLagHours - During real-time processing, the default setting is 6 hours.

  • HistoricalModeTimeLagHours - During historical processing, the default setting is 48 hours.

  • syslogIngestionDelayHour - If processing syslogs, the default setting is 1 hour.

slaveMasterLagCheck {

     printFormats = {

          json = "{ \"lagTimeHours\": \"$lagTimeHours\", \"masterRunDate\": \"$masterRunDate\", \"slaveRunDate\": \"$slaveRunDate\", \"isRealTimeMode\": \"$isRealTimeMode\"}"

          plainText = "Worker nodes processing lagging by more than $lagTimeHours hours. Is in real time: $isRealTimeMode"

     }

     RTModeTimeLagHours = 6

     HistoricalModeTimeLagHours = 48

}

limeCheck {

     syslogIngestionDelayHour = 1

}
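As an illustration of how these parameters are applied (a sketch, not Exabeam's implementation; the function name is hypothetical), the threshold that applies depends on whether the engine is in real-time or historical mode:

```python
def worker_lag_alert(lag_hours: float, real_time_mode: bool,
                     rt_threshold: float = 6,
                     historical_threshold: float = 48) -> bool:
    """Return True if a worker's lag behind the master exceeds the configured
    threshold (RTModeTimeLagHours or HistoricalModeTimeLagHours)."""
    threshold = rt_threshold if real_time_mode else historical_threshold
    return lag_hours > threshold

# A 10-hour lag alerts in real-time mode (10 > 6) but not in
# historical mode (10 <= 48).
print(worker_lag_alert(10, real_time_mode=True))   # True
print(worker_lag_alert(10, real_time_mode=False))  # False
```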

System Health Checks

Martini Service Check: Martini is the name Exabeam has given to its Analytics Engine. In a multi-node environment, Martini will be the Master node.

Tequila Service Check: Tequila is the name Exabeam has given to its User Interface layer.

Lime Service Check: LIME (Log Ingestion and Message Extraction) is the service within Exabeam that ingests logs from an organization's SIEM, parses and then stores them in HDFS. The main service mode parses message files and creates one message file per log file. This mode is used to create message files that will be consumed by the main node.

Mongo Service Check: MongoDB is Exabeam's chosen persistence database. A distributed MongoDB system contains three elements: shards, routers, and configuration servers (configsvr). The shards are where the data is stored; the routers distribute and collect the data from the different shards; and the configuration servers track where the various pieces of data are stored in the shards.

Zookeeper Service Check: Zookeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. In a distributed multi-node environment, we need the ability to make a value change inside one process on a machine and have that change be seen by a different process on a different machine. Zookeeper provides this service.

Hadoop Service Check (Master): Hadoop is Exabeam's distributed file system, where the raw logs and parsed events are stored. These files are available to all nodes.

Ganglia Service Check: Ganglia is a distributed monitoring system for computing systems. It allows us to view live or historical statistics for all the machines that are being monitored.

License Checks: The status of your Exabeam license will be reported in this section. This is where you will find the expiration date for your current license.

Alerts for Storage Use

Available on the System Health page, the Storage Usage tab provides details regarding the current data retention settings for your Advanced Analytics deployment. Advanced Analytics sends notifications when available storage capacity dwindles to a critical level. Admins have the option to enable and configure automatic data retention and data purging for both HDFS and MongoDB usage.

Default Data Retention Settings

The table below lists default retention periods for data elements in your Advanced Analytics deployment:

Element                                  | Description                                              | Default Retention Period
Logs                                     | Original logs                                            | 90 days
Events on Disk                           | Parsed events on disk                                    | 180 days
Events in Database                       | Parsed events in the database                            | 180 days
Triggered Rules and Sessions in Database | Container and triggered rule collections in the database | 365 days
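The retention rules above reduce to a simple age comparison. A hedged sketch (the dictionary keys and function are illustrative, not Exabeam's purge code):

```python
from datetime import datetime, timedelta

# Default retention periods from the table above, in days.
RETENTION_DAYS = {
    "logs": 90,
    "events_on_disk": 180,
    "events_in_database": 180,
    "triggered_rules_and_sessions": 365,
}

def is_purge_eligible(element: str, created: datetime, now: datetime) -> bool:
    """Return True if the element is older than its default retention period."""
    return now - created > timedelta(days=RETENTION_DAYS[element])

now = datetime(2024, 1, 1)
# 122 days old: past the 90-day log retention, within 180-day event retention.
print(is_purge_eligible("logs", datetime(2023, 9, 1), now))            # True
print(is_purge_eligible("events_on_disk", datetime(2023, 9, 1), now))  # False
```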

System Health Alerts for Disabled Parsers

Get notified automatically when Advanced Analytics disables a parser.

To protect your system from going down and ensure that it continues to process data in real time, Advanced Analytics detects when a parser is taking an abnormally long time to parse a log, then disables it.

When Advanced Analytics disables a parser, you receive a health alert that describes which parser was disabled, on which node, and recommended actions you could take. When Advanced Analytics disables additional parsers, it resolves the existing health alert and sends a new alert that lists all disabled parsers.

Advanced Analytics also resolves an existing health alert any time you restart the Log Ingestion and Messaging Engine (LIME) and re-enable a disabled parser. For example, if you change the threshold for when parsers get disabled, you must restart LIME, which re-enables any disabled parsers and prompts Advanced Analytics to resolve any existing health alerts. If a re-enabled parser continues to perform poorly, Advanced Analytics disables it again and sends another health alert about that parser.

View Storage Usage and Retention Settings

By default, Advanced Analytics implements data retention settings for logs and messages. This allows the system to automatically delete old data, which reduces chances of performance issues due to low repository space.

Available on the System Health page, the Storage Usage tab provides details regarding the current data retention settings for your Advanced Analytics deployment, including:

  • HDFS Data Retention – Number of days the data has been stored, the number of days remaining before it is automatically deleted, and a toggle to enable/disable auto-deletion of old data.

Note

If you enable or disable Auto-delete Old Data, you must also restart the Log Ingestion Engine for the change to take effect. For more information on restarting the Log Ingestion Engine, please refer to the Restart the Analytics Engine section.

A warning message appears if, at the average rate of HDFS usage growth, storage would be exhausted before the intended retention period is reached.

  • HDFS Usage – Total HDFS usage, including specific breakdowns for logs, events on disk, and events in the database.

  • MongoDB Retention – A dialog to set retention settings and a toggle to enable/disable disk-based deletion of old data.

    You can edit the MongoDB data retention settings by clicking the pencil icon.

    The capacity used percentage threshold is set to 85% by default, with a maximum value of 90%. It helps to prevent MongoDB from reaching capacity before hitting the default retention period threshold. As soon as MongoDB reaches the percentage on any node, the system starts purging events until usage is back below the capacity used threshold.

    In addition to the capacity used percentage, Advanced Analytics keeps 180 days of event data in MongoDB and 365 days of triggered rules and containers data by default.

    Note

    The Days for Triggered Rules & Sessions value cannot be less than the Days for Events value.

  • MongoDB Usage – Total MongoDB usage.
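The capacity-based purge described above can be sketched as follows. This is illustrative only: the 85% default and 90% maximum come from the text, while the purge loop and its parameters are hypothetical:

```python
def purge_until_below_threshold(used_gb: float, capacity_gb: float,
                                threshold: float = 0.85,
                                purge_step_gb: float = 1.0) -> float:
    """Purge oldest events in steps until usage is back below threshold.

    threshold mirrors the default capacity-used percentage (85%); the
    configurable value is capped at 90% (0.9) per the documentation.
    """
    if threshold > 0.9:
        raise ValueError("threshold cannot exceed the 90% maximum")
    while used_gb / capacity_gb >= threshold:
        used_gb -= purge_step_gb  # stand-in for deleting the oldest events
    return used_gb

# At 88% of a 100 GB volume, events are purged until usage drops below 85%.
print(purge_until_below_threshold(88, 100))  # 84.0
```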

Set Up System Optimization

This tab is a single aggregated page for auditing and viewing disabled data types (including models, event types, and parsers) and system load redistribution.

Disabled Models

When a model takes up too much memory, it is disabled. If you re-enable these models, the system may suffer performance issues.

Note

Exabeam disables models as a last resort to ensure system performance. We have other defensive measures to prevent models from using all of the system's memory, including enabling model aging and limiting the model bin size count. If these safeguards cannot prevent a model from consuming too much memory, the system disables the model so that it does not shut down as it runs out of memory.

Configure Model Maximum Bin Limit

Models are disabled once they reach their maximum bin limit. You can either set a global configuration for the maximum number of bins or an individual configuration for each model name.

This is done by setting a maximum number of bins, MaxNumberOfBins, for categorical models in exabeam_default.conf or in the model definition. The limit is 10 million bins, although some models, such as those where the feature is "Country", have a lower limit. We have put many guardrails in place to make sure models do not consume excessive memory and impact overall system health and performance. These include setting a maximum limit on bins, enabling aging for models, and verifying that the data that goes into models is valid. If a model still consumes excessive amounts of memory, we disable that model.

To globally configure the maximum bin limit, navigate to the Modeling section located in the exabeam_default.conf file at /opt/exabeam/config/default/exabeam_default.conf.

All changes should be made to /opt/exabeam/config/custom/custom_exabeam_config.conf:

Modeling {
      ...
      # To save space we limit the number of bins saved for a histogram. This defaults to 100 if not present
      # This parameter is only for internal research
      MaxSizeOfReducedModel = 100
      MaxPointsToCluster = 250
      ReclusteringPeriod = 24 hours
      MaxNumberOfBins = 1000000
}

Additionally, it is possible to define a specific bin limit on a per model basis via the model definition section in the models.conf file at /opt/exabeam/config/custom/models.conf. Here is an example model, where MaxNumberOfBins is set to 5,000,000:

WTC-OH {
      ModelTemplate = "Hosts on which scheduled tasks are created"
      Description = "Models the hosts on which scheduled tasks are created"
      Category = "Asset Activity Monitoring"
      IconName = ""
      ScopeType = "ORG"
      Scope = "org"
      Feature = "dest_host"
      FeatureName = "host"
      FeatureType = "asset"
      TrainIf = """count(dest_host,'task-created')=1"""
      ModelType = "CATEGORICAL"
      BinWidth = "5"
      MaxNumberOfBins = "5000000"
      AgingWindow = ""
      CutOff = "10"
      Alpha = "2"
      ConvergenceFilter = "confidence_factor>=0.8"
      HistogramEventTypes = [
      "task-created"
      ]
      Disabled = "FALSE"
}

Disabled Parsers

To protect the system from going down and ensure that it keeps processing data in real time, Advanced Analytics automatically identifies poor parser performance and disables such parsers.

Every five minutes, we compute the average parse time per event for each parser and compare it to a configurable threshold in lime.conf, ParserDisableThresholdInMills. We also divide each parser's parse time by the total time taken by all parsers in the same five-minute period and compare that share to a second configurable threshold in lime.conf, ParserDisableTimePercentage. If a parser's average parse time exceeds the first threshold and its share of the overall parse time exceeds the second, it becomes a candidate for disabling. We perform the same check during a second five-minute period, and if both conditions still hold, we disable the parser.

A slow parser is placed in a cache during the first parsing time period, and is then disabled during the second period if its average parsing time continues to meet the following conditions:

  • It is above the parsing time threshold, which is set at 7 ms by default. The parsing time threshold, ParserDisableThresholdInMills, is configurable.

  • It makes up 50% or more of the total parsing time of all parsers. The parsing time percentage, ParserDisableTimePercentage, is configurable.

When a parser meets these conditions on a Log Ingestion and Messaging Engine (LIME) node, it is disabled only on that node. If you have multiple LIME nodes, it is not automatically disabled on all nodes unless it meets these conditions on every node.
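The two-window disabling logic can be sketched like this (illustrative Python; the 7 ms and 50% defaults mirror ParserDisableThresholdInMills and ParserDisableTimePercentage, while the data structures are hypothetical):

```python
def is_disable_candidate(avg_ms: float, parser_total_ms: float,
                         all_total_ms: float,
                         threshold_ms: float = 7.0,
                         time_share: float = 0.5) -> bool:
    """One five-minute window: the parser is slow per event AND accounts
    for at least time_share of all parsing time in the window."""
    share = parser_total_ms / all_total_ms if all_total_ms else 0.0
    return avg_ms > threshold_ms and share >= time_share

def should_disable(window1: tuple, window2: tuple) -> bool:
    """Disable only if both consecutive five-minute windows qualify.
    Each window is (avg_ms, parser_total_ms, all_parsers_total_ms)."""
    return is_disable_candidate(*window1) and is_disable_candidate(*window2)

# Slow and dominant in both windows -> disabled on that LIME node.
print(should_disable((9.7, 600, 1000), (10.2, 700, 1100)))  # True
# A fast parser is never a candidate, regardless of its share.
print(should_disable((2.0, 600, 1000), (2.1, 700, 1100)))   # False
```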

When Advanced Analytics disables a parser on any node, you receive a system health alert.

Disabled parsers are also displayed on the System Optimization tab of the System Health page. You can see a list of all parsers that have been disabled.

Note

You are also shown an indicator when Advanced Analytics determines that a parser is problematic and disables it.

Figure 4. System Health - System Optimization menu


The Disabled Parsers table is sorted alphabetically by parser name.

The table includes columns with the following categories:

  • Parser Name – The name of the disabled parser.

  • Average Log Line Parse Time – Average time taken by the parser to parse each event.

  • Disabled Time – Date and time when the parser was disabled.

You can also view disabled parsers in disabled_parsers_db.current_collection. Here is a sample disabled parser collection:

{
    "_id" : ObjectId("5cf5bb9e094b83c6f7f89b86"),
    "parser_name" : "raw-4625",
    "average_parsing_time" : 9.774774774774775,
    "time" : NumberLong(1559608222854)
}
Configure Thresholds for Disabling Parsers
Hardware and Virtual Deployments Only

The parsing time period, OutputParsingTimePeriodInMinutes, is configurable.

Navigate to the LogParser section located in the lime_default.conf file at /opt/exabeam/config/default/lime_default.conf.

All changes should be made to /opt/exabeam/config/custom/custom_lime_config.conf:

LogParser{
    #Output parsing performance in debug mode, be cautious this might affect performance in parsing
    OutputParsingTime = true
    OutputParsingTimePeriodInMinutes = 5
    AllowDisableParser = true // If this is enabled, output parsing time will be enabled by default.
    ParserDisableThresholdInMills = 7 //If average parsing time pass this threshold, we will disable that parser
    ParserDisableTimePercentage = 0.5
}

Acceptable values for ParserDisableThresholdInMills include any integer value. ParserDisableTimePercentage can be a decimal value between 0.1 and 0.9.

Note

Setting a higher parsing time percentage identifies only the more severe (slower) parsers.

Disabled Event Types

When a high-volume user or asset amasses a large number of events of a certain event type, and that event type contributes a large portion of the overall event count for that user (typically 10M+ events in a session), the event type is automatically disabled and listed here.

Note

You are also shown an indicator when Advanced Analytics determines that the event type is problematic and disables it for the entity. The affected User/Asset Risk Trend and Timeline accounts for the disabled event type by displaying statistics only for the remaining events.

Disabled event types are displayed on the System Optimization tab of the System Health page. You can see a list of all event types that have been disabled, along with the users and assets for which they have been disabled.

Figure 5. System Health - System Optimization menu


The Disabled Event Type by Users and Assets table is sorted first alphabetically by event type, then sorted by latest update timestamp.

The table includes columns with the following categories:

  • Event Type – The disabled event type.

  • Count – Last recorded total number of events for this entity.

  • Last Log Received – Date and time of the event that triggered the disabling of this event type for the specified entity.

  • Disabled Time – Date and time for when the event type was disabled for this entity.

You can also view disabled event types for an entity in metadata_db. Here is a sample disabled event types collection:

mongos> db.event_disabled_collection.findOne()
{
    "_id" : ObjectId("5ce346016885455be1648a0f"),
    "entity" : "exa_kghko5dn",
    "last_count" : NumberLong(53),
    "disabled_time" : NumberLong("1558399201751"),
    "last_event_time" : NumberLong("1503032400000"),
    "is_entity_disabled" : true,
    "sequence_type" : "session",
    "disabled_event_types" : [
    "Dlp-email-alert-in",
    "Batch-logon",
    "Remote-access",
    "Service-logon",
    "Kerberos-logon",
    "Local-logon",
    "Remote-logon"
    ]
}
Configure Thresholds for Disabling Event Types
Hardware and Virtual Deployments Only

When an entity's session has 10 million or more events and a single event type contributes 70% or more of the events in that session, that event type is disabled. If no single event type accounts for over 70% of the total event count in the session, the entity itself is disabled. These thresholds, EventCountThreshold and EventDisablePercentage, are configurable.
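A sketch of this decision, assuming per-event-type counts for one session are available (the function is illustrative, not Exabeam code; the defaults mirror EventCountThreshold and EventDisablePercentage):

```python
def decide_disable(event_counts: dict,
                   count_threshold: int = 10_000_000,
                   disable_pct: float = 0.7) -> tuple:
    """Given per-event-type counts for one session, return
    ("event_type", name) if a single type dominates an oversized session,
    ("entity", None) if the session is oversized but no type dominates,
    or (None, None) if the session is below the count threshold."""
    total = sum(event_counts.values())
    if total < count_threshold:
        return (None, None)
    for event_type, count in event_counts.items():
        if count / total >= disable_pct:
            return ("event_type", event_type)
    return ("entity", None)

# One dominant type in an oversized session -> that type is disabled.
print(decide_disable({"remote-logon": 9_000_000, "local-logon": 2_000_000}))
# -> ('event_type', 'remote-logon')
```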

To configure the thresholds, navigate to the Container section located in the exabeam_default.conf file at /opt/exabeam/config/default/exabeam_default.conf.

All changes should be made to /opt/exabeam/config/custom/custom_exabeam_config.conf:

Container {
    ...
    EventCountThreshold = 10000000 // Martini will record the user as top user once a session or a sequence has more than
    // 10 million events
    TopNumberOfEntities = 10 // reporting top 10 users and assets
    EventDisablePercentage = 0.7 // a single event type that accounts for 70% of all event types for a disabled user
    EventCountCheckPeriod = 60 minutes
}

Acceptable values for EventCountThreshold are 10,000,000 to 20,000,000 for a typical environment. EventDisablePercentage can be a decimal value between 0.1 and 0.9.

Manually Redistribute System Load

Hardware and Virtual Deployments Only

You can opt to manually configure the system load redistribution by creating a manual config section in custom_exabeam_config.conf.

To configure manual redistribution:

  1. Run the following query to get the current distribution in your database:

    mongo metadata_db --eval 'db.event_category_partitioning_collection.findOne()'

  2. Taking the returned object, create a manual config section in /opt/exabeam/config/custom/custom_exabeam_config.conf.

  3. Edit the configuration to include all the hosts and event categories you want.

  4. Choose which categories should be shared by certain hosts by using true or false parameter values after “Shared =”.

    For example, the section below shows that hosts 2 and 3 both share the web event category. All other categories, which are marked as Shared = "false", are owned solely by one host.

Partitioner {
    Partitions {
        exabeam-analytics-slave-host3 = [
            {
                EventCategory = "database",
                Shared = "false"
            },
            {
                EventCategory = "other",
                Shared = "false"
            },
            {
                EventCategory = "web",
                Shared = "true"
            },
            {
                EventCategory = "authentication",
                Shared = "false"
            },
            {
                EventCategory = "file",
                Shared = "false"
            }]
        exabeam-analytics-slave-host2 = [
            {
                EventCategory = "network",
                Shared = "false"
            },
            {
                EventCategory = "app-events",
                Shared = "false"
            },
            {
                EventCategory = "endpoint",
                Shared = "false"
            },
            {
                EventCategory = "alerts",
                Shared = "false"
            },
            {
                EventCategory = "web",
                Shared = "true"
            }
        ]
    }
}

Automatically Redistribute System Load

Exabeam can automatically identify overloaded worker nodes, and then take corrective action by evenly redistributing the load across the cluster.

This redistribution is done by measuring and comparing job completion time. If one node finishes 50% or more slower than the rest of the nodes, a redistribution of load is needed. The load is then scheduled to be rebalanced by event categories.
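The laggard test can be sketched as follows (illustrative; the 50% figure comes from the text, and comparing each node against the average of the other nodes is an assumption):

```python
def needs_redistribution(job_times_sec: dict,
                         slow_factor: float = 1.5) -> bool:
    """Return True if any node's job completion time is 50% or more slower
    than the average completion time of the remaining nodes."""
    for node, elapsed in job_times_sec.items():
        others = [v for n, v in job_times_sec.items() if n != node]
        if others and elapsed >= slow_factor * (sum(others) / len(others)):
            return True
    return False

# host3 takes 160 s versus an average of 102.5 s for the others -> rebalance.
print(needs_redistribution({"host1": 100, "host2": 105, "host3": 160}))  # True
print(needs_redistribution({"host1": 100, "host2": 105, "host3": 110}))  # False
```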

You can enable automatic system load redistribution on the System Optimization tab of the System Health page; it is enabled by default. When enabled, the system checks the load distribution once a day.

Note

We do not recommend disabling the system rebalancing option, as doing so results in uneven load distribution and adverse performance impacts to the system. However, if you choose to disable it, you can configure manual redistribution to avoid such impacts.

You must restart the Exabeam Analytics Engine for any changes to System Rebalancing to take effect.

The System Load Redistribution tab shows an indicator when a redistribution of load is needed, is taking place, or has completed. The rebalancing process can take up to two hours. During this time you may experience slower processing and some data may not be available. However, the system resumes normal operations once redistribution is complete.

Automatic Shutdown

If disk usage on an appliance reaches 98%, the Analytics Engine and log fetching shut down automatically. A restart is possible only after log space has been cleared and usage is back below 98%.
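A minimal sketch of the shutdown condition (illustrative; 98% is the documented threshold, the function itself is hypothetical):

```python
def analytics_engine_may_run(used_bytes: int, total_bytes: int,
                             shutdown_pct: float = 0.98) -> bool:
    """The engine may run (or be restarted) only while disk usage is
    below the 98% automatic-shutdown threshold."""
    return used_bytes / total_bytes < shutdown_pct

print(analytics_engine_may_run(97, 100))  # True: below threshold
print(analytics_engine_may_run(99, 100))  # False: engine shuts down
```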

Users can restart either from the CLI or the Settings page.