HLA System Properties

The purpose of this document is to list and explain the HLA System Properties (sn_occ_system_settings). The default (out of the box) settings are generally sufficient; however, in some cases customers may require changes and/or optimisations to the core HLA System settings for some of the following reasons:

- Enhance system performance (reduce bottlenecks)
- Reduce noise
- Adjust behaviour
- Suit the customer's data

The properties below are grouped by subject; they are not in the order in which they appear in the System Properties table in HLA.

AGGREGATOR
Middle of the Data Ingestion pipeline. Responsible for grouping and storing metrics.

aggregator.bloom_filter_factor: The coefficient by which the Bloom Filter size is multiplied.
aggregator.concurrency_override: If specified, the value overrides the initial automatic allocation of resources to the aggregator.
aggregator.gauge.aggregation_type: Controls whether gauge metrics are tested using the average or the median (default). Accepted values: Average or Median.
aggregator.metrics_bloom_filter_fpp: Expected false-positive probability for the Bloom Filter. Used for monitoring purposes.
aggregator.min_non_null_values_for_stats: Minimum number of non-null values in a series required to calculate stats (moving average, stdev, ...).
aggregator.number_of_expected_metrics: Approximation of how many unique metrics the aggregator should handle.
aggregator.queue_size: Number of metrics that can be buffered in the aggregator before it starts blocking the processing pipe.
aggregator.resolution_seconds: The resolution of the time series, i.e. each data point represents the aggregation of data over this period of seconds.
aggregator.settle_seconds: How many seconds must pass without receiving data until the window is considered settled. Once a window is settled, the detective can start running its algorithms.
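To illustrate the settling behaviour that aggregator.settle_seconds controls, here is a minimal Python sketch, not HLA code: the class and method names are invented, but the rule matches the description above (a window is settled once settle_seconds have elapsed with no new data).

```python
SETTLE_SECONDS = 300  # illustrative value for aggregator.settle_seconds

class Window:
    """Hypothetical sketch of a settling aggregation window."""

    def __init__(self, settle_seconds=SETTLE_SECONDS):
        self.settle_seconds = settle_seconds
        self.last_data_ts = None  # epoch seconds of the most recent data point

    def receive(self, ts):
        # Every arriving data point resets the settling clock.
        self.last_data_ts = ts

    def is_settled(self, now):
        # Settled = no data received for at least settle_seconds.
        if self.last_data_ts is None:
            return False
        return (now - self.last_data_ts) >= self.settle_seconds

w = Window()
w.receive(1000)
w.is_settled(1200)  # False: only 200s of silence so far
w.is_settled(1300)  # True: 300s without data, detection may start
```

Lowering settle_seconds makes detection start sooner but risks detecting on incomplete windows when events arrive late.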
aggregator.window_max_quantity_in_period_hours: Circuit-breaker: maximum active time-span of a metric, in hours. Events with metrics whose timestamps span a wider time-span will not be aggregated.
aggregator.window_size_seconds: How many seconds are considered a window. The value must be a multiple of 60. Windows are the time frame on which detection tasks are executed.
aggregator.window_size_seconds_custom: How many seconds are considered a window for a CustomMetric. The value must be a multiple of 60. Windows are the time frame on which detection tasks are executed.
aggregator.workload_level: Workload level at which the Aggregator is considered stressed. Options: LOW, MEDIUM, HIGH.

ALERTS

alerts.annotation_property: Part of the alert settings that deals with checking annotations for correlations.
alerts.is_anomaly_baseline_reference_decrease_disabled: Indicates whether the 'anomaly_baseline_reference_decrease' alert is disabled.
alerts.max_alert_age_hours: Anomaly Detection is not applied to events older than this setting; it allows the system to identify and discard alerts that are considered 'too old'. If you are streaming real-time data and still see detection windows being dropped for age, this might indicate: 1) a delay in the processing pipeline (for example, a specific Data Input was stopped for a couple of hours, then started again), or 2) incorrect extraction of the timestamp field (for example, a wrong timezone: the timestamp being sent is supposed to be read as Eastern Standard Time, but is read as UTC because there is no indicator in the timestamp). If you are streaming historical data, this setting MUST be increased to include the dates of the historical data. (For example: if today is Jan 2021 and the historical data is being streamed from Jan 2020, make sure the value here is AT LEAST 8760 hours, i.e. 365 days * 24 hours/day.) Note that an additional setting should also be increased: broker.events.max_age_hours.
alerts.recent_events.max_size_bytes: The maximum size (bytes) allowed for recent events.
alerts.recent_events_for_timeless_gauge_period_seconds: Same as alerts.recent_events_period_seconds, but for timeless-gauge detections.
alerts.recent_events_period_seconds: Time period to look back from the point of the anomaly to fetch relevant events, which will be used for RCA. (Recommendation: do not exceed roughly 24 hours, in seconds.)

ALERTS CREATOR
The last part of the pipeline before alerts are created and populated in the incident list.

alerts_creator.queue_size: Number of detections that can be buffered in the alerts creator before it starts blocking the processing pipe.

ARCA (Automatic Root Cause Analysis)
This refers to the properties extracted via Source Types that are categorized as "ARC_only".

arca.entities_analyzer.max_days_lookback: To build the "meaningful entities" section of the RCA report, the AI engine goes back up to this number of days to analyze relevant events. (Recommendation: do not exceed 2 days.)
arca.entities_analyzer.max_entity_occurrences: For each entity presented in the root-cause section, the AI engine adds events surrounding the detection time. This setting controls the number of such events that will be added.
arca.highlights_analyzer.majority_vote: In the highlights analyzer, the minimum number of past matching events from the same host/day/hour to qualify as a highlight.
arca.highlights_analyzer.max_days_lookback: To build the "highlights" section of the RCA report, the AI engine goes back up to this number of days to analyze relevant events.
arca.highlights_analyzer.number_of_highlights: Number of highlights to be presented in the "highlights" section of the RCA.
arca.mf_analyzer.number_of_changes: Maximum number of changes to show in the "significant changes" section of the RCA.

BROKER
This is the start of the Data Ingestion Pipeline.
It is responsible for data integration and digestion, including parsing of the logs.

broker.concurrency_override: If specified, the value overrides the initial automatic allocation of resources to the event broker.
broker.events.max_age_hours: Events older than this number of hours will be dropped.
broker.headerdetection.vmware: List of VMware apps used by header detection to detect events as VMware.
broker.header_detection.detect_beaver: When on, the AI engine will attempt to detect and parse Beaver headers. Default is ON.
broker.header_detection.detect_syslog5424: When on, the AI engine will attempt to detect and parse Syslog5424 headers. Default is ON.
broker.queue_size: Number of events that can be buffered in the event broker before it starts blocking the processing pipe.
broker.workload_level: Workload level at which the Event Broker is considered stressed. Options: LOW, MEDIUM, HIGH.

CLOTHO

clotho.batch_size: Bulk size for persisting data points to Clotho.
clotho.duration_days: Clotho retention duration, in days.
clotho.sampling_interval_minutes: Clotho sampling interval, in minutes.

DATA INPUTS
Responsible for fetching or receiving logs from different mediums.

data_inputs.abstract_queue_size: Queue size of all data inputs.
data_inputs.examples_refresh_interval: Interval, in minutes, for updating the data-input examples in the database.
data_inputs.max_length_bytes_per_stream: Maximum size (in bytes) of a single request that can be handled by any data input.
data_inputs.preprocess.examples.buffer.size: Size of the buffer for preprocess examples.
data_input_mapping.max_examples: Defines the maximum number of samples to show on the Data Input Mapping screen, up to 500.

DETECTIVE
Towards the end of the Data Ingestion Pipeline. Responsible for spotting 'regular' and 'anomalous' behavior in the data by running multiple anomaly detection algorithms.
detective.alive_period_seconds_for_signal_dead: The minimum period, in seconds, that a signal has to be alive before "dropping dead" for a signal-dead alert to be fired. Additionally, if there was another "dead signal" of similar duration in this period, the current one will be disqualified.
detective.allowed_future_time_minutes: Acceptable futuristic detection period. If a detection was created with a source time further in the future, its handling will be delayed.
detective.amplitude_coefficient: This setting affects the overall sensitivity of the anomaly detection engine. The higher the number, the fewer alerts you will see.
detective.anomaly_detection.enabled: When set to false, the AI engine will not attempt anomaly detection.
detective.concurrency_override: If specified, the value overrides the initial automatic allocation of resources to the detective.
detective.detection_task_delay_seconds: Delay, in seconds, before starting a detection task after the corresponding window has settled.
detective.few_elements: A deep setting that affects the tolerance of weaker detection techniques.
detective.global.mute_disabled: When set to true, the mute or disable feedback applies to a specific metric across ALL application services.
detective.max_moments_in_memory.derivative: This setting affects the tolerance of the derivative algorithm.
detective.max_moments_in_memory.signal_alive: How many "similar-in-amplitude" bursts the signal-alive detector should allow in the preceding period. This setting is in effect when raising an alert.
detective.max_moments_in_memory.signal_dead: How many "dead periods" the signal-dead detector should allow. This setting is in effect when raising an alert.
detective.memory_in_days: The memory, in days, of the different anomaly detection models (baseline, derivative and others).
detective.min_events_per_window: The minimum number of events per window for a detection to be triggered.
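The signal-dead settings above (and detective.resolution.signal_dead below) can be pictured with a simplified Python sketch. This is not HLA's algorithm; the constants and function are invented, and it shows only the core rule: a signal counts as dead once it reports no data (or zero) for enough consecutive seconds.

```python
RESOLUTION_SECONDS = 60      # assumed spacing between samples, for illustration
SIGNAL_DEAD_SECONDS = 300    # illustrative value for detective.resolution.signal_dead

def is_signal_dead(samples):
    """samples: values at RESOLUTION_SECONDS spacing; None means no data."""
    dead = 0
    for v in samples:
        if v is None or v == 0:
            dead += RESOLUTION_SECONDS          # extend the dead streak
            if dead >= SIGNAL_DEAD_SECONDS:
                return True                      # streak long enough to alert
        else:
            dead = 0                             # any live value resets the streak
    return False

is_signal_dead([5, 0, 0, 0, 0, 0])  # True: five consecutive dead samples = 300s
is_signal_dead([5, 0, 0, 7, 0, 0])  # False: the streak is broken by a live value
```

In the real detector, detective.alive_period_seconds_for_signal_dead additionally requires the signal to have been alive for a minimum period beforehand, which this sketch omits.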
detective.of_custom_alert_concurrency_override: If specified, the value overrides the initial automatic allocation of resources to the customAlertDetective.
detective.points_in_timeless_trend: The number of samples to consider when testing for trend shifts in disperse metrics.
detective.queue_over_capacity_percent: Creates a system notification when the detective queue is above this percentage of capacity for over 5 minutes.
detective.queue_size: Number of detection tasks that can be buffered in the detective before it starts blocking the processing pipe.
detective.resolution.signal_dead: The number of seconds a metric's signal must be consecutively "dead" (no data, graph showing zero) for a signal-dead detection to be triggered for this metric. This setting can be configured per source.
detective.sigma_coefficient: The coefficient for the sigma-based anomaly detection.
detective.workload_level: Workload level at which the Detective is considered stressed. Options: LOW, MEDIUM, HIGH.

ELASTICSEARCH

elasticsearch.bulk_actions: Number of entities in one bulk request, used as a threshold for an Elastic bulk operation (together with the request size).
elasticsearch.bulk_concurrent_requests: Concurrency of the bulk write requests to Elastic.
elasticsearch.bulk_size_mb: Size of the bulk request in MB, used as a threshold for an Elastic bulk operation (together with the request entity-count threshold).
elasticsearch.client.connect_timeout_millis: Configures the timeout, in milliseconds, until a connection to Elasticsearch is established.
elasticsearch.client.io_threads: Configures the number of I/O dispatch threads used by the Elasticsearch client.
elasticsearch.client.socket_timeout_millis: Configures the socket timeout, in milliseconds, to Elasticsearch: the timeout for waiting for data or, put differently, the maximum period of inactivity between two consecutive data packets.
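The bulk settings above (elasticsearch.bulk_actions and elasticsearch.bulk_size_mb, together with elasticsearch.flush_interval_seconds) follow a common "whichever threshold is hit first" pattern. A hedged Python sketch of that logic, with invented names and example values:

```python
# Illustrative threshold values, not HLA defaults.
BULK_ACTIONS = 1000            # elasticsearch.bulk_actions
BULK_SIZE_MB = 5               # elasticsearch.bulk_size_mb
FLUSH_INTERVAL_SECONDS = 10    # elasticsearch.flush_interval_seconds

def should_flush(pending_actions, pending_bytes, seconds_since_last_flush):
    """Flush the pending bulk when any one threshold is reached."""
    return (
        pending_actions >= BULK_ACTIONS
        or pending_bytes >= BULK_SIZE_MB * 1024 * 1024
        or seconds_since_last_flush >= FLUSH_INTERVAL_SECONDS
    )

should_flush(1000, 0, 0)           # True: action-count threshold reached
should_flush(10, 6 * 1024**2, 0)   # True: size threshold reached
should_flush(10, 0, 12)           # True: flush interval elapsed
should_flush(10, 1024, 5)          # False: no threshold reached yet
```

Tuning any one of the three thresholds changes indexing latency and batch efficiency; the interval guarantees small trickles of events still get indexed promptly.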
elasticsearch.concurrency: How many threads should index to Elasticsearch.
elasticsearch.flush_interval_seconds: Bulk flush interval for indexing. The bulk executes sooner if either bulk_actions or bulk_size_mb has been reached.
elasticsearch.mapping_keyword_properties: By default, all string properties are indexed as 'keyword' (except message, rawMessage, stacktrace and the additional_string_properties, which are indexed as 'text'), which allows aggregation but no partial searches. Any field with the 'property' prefix (e.g. 'property.UUID' or 'property.srcIp') can also be indexed as 'keyword'. Note that a change only applies to newly created indices. Also note that when inserting the value you do not add the 'property' prefix.
elasticsearch.minimal_indexing: When true, properties classified as invalid will not be indexed.
elasticsearch.queue_size: Bulk size for indexing.
elasticsearch.subsampling_ratio: Use this to index only some of the events to Elastic. 1 -> index 1 out of 1 events; 2 -> index 1 out of 2 events; N -> index 1 out of N events.

EVENTS

events.keyword_extraction_non_patterned: Keyword extraction from the non-patterned message, when there is no pattern (i.e. no message label is assigned).
events.max_minutes_in_future: Events further in the future than this will be dropped. Note: if you see events being dropped due to future timestamps, double-check that your timestamps are in the correct timezone.

GLIDE

glide.datainput.max_errors_percentage_before_publish: Defines the maximum percentage of errors in a data input before publishing a notification.
glide.table.change_detection.interval_seconds: Interval, in seconds, for getting tables that changed in Glide.
grpc.port: Defines the Glide port.
health_log_analytics.use_case_export.enabled: Enables/disables the migration of data inputs and source types between instances.
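The "1 out of N" semantics of elasticsearch.subsampling_ratio can be sketched with a simple counter-based selection. This is an assumption about the selection strategy (HLA's actual choice of which event out of every N to keep is not documented here):

```python
SUBSAMPLING_RATIO = 3  # illustrative: index 1 out of every 3 events

def subsample(events, ratio=SUBSAMPLING_RATIO):
    """Keep every ratio-th event, starting with the first."""
    return [e for i, e in enumerate(events) if i % ratio == 0]

subsample(list(range(9)))        # -> [0, 3, 6]: one event kept per group of 3
subsample(["a", "b"], ratio=1)   # -> ["a", "b"]: ratio 1 keeps everything
```

A ratio of N reduces Elasticsearch storage and indexing load roughly by a factor of N, at the cost of search completeness over raw events.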
INCIDENTS

incidents.alerts.dilute_target: When the number of alerts in an incident is too high (see incidents.alerts.max_count), alerts are diluted (removed) until this number is reached.
incidents.alerts.max_count: Maximum number of alerts in an incident before the dilution process of excess alerts starts.
incidents.alert_interval_seconds: Time-span within which two alerts are considered related when correlating by occurrence time.
incidents.application_correlation: Whether the alert application should be taken into account when correlating alerts.
incidents.component_correlation: Whether the alert service should be taken into account when correlating alerts.
incidents.cooldown_period_minutes: Minutes to wait after the creation of an incident before sending a notification about it.
incidents.detection_time_correlation: Note that the time frame of the correlation window is defined by the incidents.alert_interval_seconds setting.
incidents.detection_type_correlation: Whether the alert detection type (anomaly, signal-dead, baseline, etc.) should be taken into account when correlating alerts.
incidents.entities: List of entities that will be used for correlation if two alerts share the same value. This setting should be managed from the correlation settings pages (global or individual per source).
incidents.host_correlation: Whether the alert host should be taken into account when correlating alerts.
incidents.min_correlation_score_for_aggregating: Sensitivity level for the correlation engine. The higher the number, the more the alerts need to have in common in order to be correlated.
incidents.pattern_text_correlation: Whether the alert pattern-text should be taken into account when correlating alerts.
incidents.period_seconds: Defines the time frame for the Alerts Smart-Correlations logic (an alert can only correlate with alerts created in the preceding T hours).

KEYWORDS

dictionaries.resource.directory: Name of the directory containing the dictionaries used in the keyword message-extraction process.
keywords.message.extractor: When set to false, the AI engine will not attempt to automatically extract the message from keyword-based alerts.
keywords.message.max.length: Messages over the specified maximum length will not be extracted.
keywords.message.stop.elsa.message: When set to true, Elsa will not attempt to automatically extract the message labels.

LICENSING

licensing.flush_interval_seconds: Time interval after which nodes are flushed to the Glide table.
licensing.max_map_size: Maximum number of nodes stored in memory before flushing to the Glide table.
licensing.monitoring.interval_seconds: Time interval after which the licensing monitoring service wakes up to check for new nodes.

LOGSOURCEINFO (CMDB)*

logsourceinfo.flush_interval_seconds: Time interval, in seconds, for collecting log source host data and forwarding it to the Log-based CI candidate table. (Default value = 3600)
logsourceinfo.max_map_size: Maximum number of data nodes to be stored before the data is forwarded to the Log-based CI candidates table. (Default value = 1000)
logsourceinfo.monitoring.interval_seconds: Time interval, in seconds, for scanning log events to discover host-related data. (Default value = 60)

*New system properties available starting from the HLA December Store release.

METRICATOR
Middle of the Data Ingestion pipeline. Responsible for storing and measuring unique metrics.
metricator.cache_eviction_factor: Number of raw metrics to evict from the cache when eviction is needed.
metricator.cache_size: Maximum number of raw metrics to hold in memory.
metricator.concurrency_override: If specified, the value overrides the initial automatic allocation of resources to the metricator.
metricator.exclude_extended_keyword_metrics: A list of regexes that exclude extended keyword metrics of type ERROR/EXCEPTION from being created. Example: .*love.*
metricator.min_severity: Minimum severity of an event for creating raw and pattern metrics.
metricator.new_pattern_min_severity: Minimum severity of an event for creating new pattern metrics.
metricator.queue_size: Number of events that can be buffered in the metricator before it starts blocking the processing pipe.
metricator.workload_level: Workload level at which the Metricator is considered stressed. Options: LOW, MEDIUM, HIGH.

NOTIFICATIONS

notifications.default.recipients.configuration_notifications: Default recipients of configuration-related notifications, such as JS errors, timestamp parsing, etc.
notifications.default.recipients.operational_notifications: Default recipients of operations-related notifications, such as crashes.

PATTERNATOR
Pattern recognition and analysis.

patternator.cache_eviction_factor: Number of patterns to evict from the cache when eviction is needed.
patternator.cache_size: Maximum number of patterns to hold in memory.
patternator.concurrency_override: If specified, the value overrides the initial automatic allocation of resources to the patternator.
patternator.gbp.bulk.queue_size: Maximum number of GBP statements pending to be written to the DB before updates start getting dropped.
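The cache_size / cache_eviction_factor pairing used by both the metricator and the patternator can be sketched as follows. This is a hedged illustration with invented names; in particular, the oldest-first eviction policy is an assumption, not a documented HLA behaviour:

```python
from collections import OrderedDict

CACHE_SIZE = 5        # illustrative stand-in for patternator.cache_size
EVICTION_FACTOR = 2   # illustrative stand-in for patternator.cache_eviction_factor

class PatternCache:
    """Hypothetical cache that evicts a batch of entries when full."""

    def __init__(self):
        self.entries = OrderedDict()  # insertion order = age

    def add(self, key, value):
        if key not in self.entries and len(self.entries) >= CACHE_SIZE:
            # Evict a batch (EVICTION_FACTOR oldest entries) rather than
            # one entry at a time, amortising the cost of eviction.
            for _ in range(EVICTION_FACTOR):
                self.entries.popitem(last=False)
        self.entries[key] = value

cache = PatternCache()
for i in range(6):
    cache.add(f"p{i}", i)
# Adding p5 to a full cache evicts p0 and p1; p2..p5 remain.
```

Evicting in batches trades a little extra cache-miss churn for fewer eviction passes under sustained load.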
patternator.gbp.bulk.size: Number of statements that triggers an update to the DB.
patternator.gbp.examples.until.greedy: Number of events used to learn which greedy replacements to use.
patternator.gbp.max.node.chars: Maximum number of characters in a GBP node.
patternator.queue_size: Number of events that can be buffered in the patternator before it starts blocking the processing pipe.
patternator.rate_limit: Maximum number of new patterns per second.
patternator.
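Several components above (broker, aggregator, metricator, patternator, detective, alerts creator) expose a *.queue_size property with the same semantics: a bounded buffer that blocks the upstream stage when full, rather than dropping events. A minimal sketch of that backpressure behaviour using Python's standard bounded queue (the tiny size is for illustration only):

```python
from queue import Queue, Full

QUEUE_SIZE = 3  # illustrative stand-in for e.g. broker.queue_size

q = Queue(maxsize=QUEUE_SIZE)
for event in ("e1", "e2", "e3"):
    q.put(event)  # fills the buffer

blocked = False
try:
    # With the buffer full, the producer blocks; here we time out quickly
    # to demonstrate that no slot is available.
    q.put("e4", timeout=0.1)
except Full:
    blocked = True

q.get()       # the downstream stage consumes one event...
q.put("e4")   # ...and the producer can proceed again
```

Raising a queue_size absorbs longer ingestion bursts at the cost of memory; the *.workload_level properties report when a stage is running near this kind of saturation.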