Data Anomalies

Data anomalies are unexpected values that differ significantly from past data. These deviations disrupt typical patterns and can arise from data integration issues, process changes, migration errors, human mistakes, or system failures.

Anomalies can lead to inaccurate analysis, decisions, and reports, potentially harming businesses. Detecting and fixing them is vital for good data quality management, ensuring data is accurate, complete, consistent, and valid.

OvalEdge identifies anomalies during profiling by comparing newly profiled data with previously profiled data. Any inconsistencies are flagged as anomalies.

A list of detected anomalies is available for review on the Data Anomalies page, which provides details such as the anomaly type, the person assigned to correct it, and other relevant information.

Data Anomaly Detection

  • Supported Objects: Schemas, Tables, and Table Columns (for RDBMS connectors only).

  • Algorithms: Deviation or Interquartile Range (IQR), configurable by admins in System Settings.

  • Calculated Statistics: The profiling statistics computed for each object type, and the anomaly types they produce under each algorithm, are listed below.

  • Schema
    • Profiling Statistic: Table Count
    • Deviation: Table Count Deviation
    • IQR: Table Count Deflection

  • Table
    • Profiling Statistic: Row Count
    • Deviation: Row Count Deviation
    • IQR: Row Count Deflection

  • Table Column
    • Profiling Statistics: Distinct Count, Null Count
    • Deviation: Distinct Count Deviation, Null Count Deviation
    • IQR: Distinct Count Deflection, Null Count Deflection

Data Anomaly Algorithm

OvalEdge offers two anomaly detection algorithms that are configurable through System Settings.

Interquartile Range (IQR)

The Interquartile Range (IQR) algorithm is a statistical measure used to assess the spread or dispersion of a dataset. It focuses on the middle 50% of the data, specifically the range between the first quartile (Q1) and the third quartile (Q3).

The steps below are involved in the calculation:

  • Sort the Data: Arrange the dataset in ascending order.

  • Calculate Quartiles:

    • Q1 (First Quartile): This is the median of the lower half of the dataset.

    • Q3 (Third Quartile): This is the median of the upper half of the dataset.

  • Calculate IQR:

    • IQR is determined by subtracting Q1 from Q3: IQR = Q3 - Q1.

  • Identify Outliers:

    • Any data points below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are considered potential outliers.

Example: Consider a scenario where a table has been profiled seven times, with the following row counts:

  • Profile 1: 40
  • Profile 2: 67
  • Profile 3: 87
  • Profile 4: 76
  • Profile 5: 85
  • Profile 6: 70
  • Profile 7: 80

The steps below are involved in the calculation:

  • Sort the Data: Arrange the dataset in ascending order.

    • 40, 67, 70, 76, 80, 85, 87

  • Calculate Quartiles:

    • Q2 (Second Quartile): 76 (Overall Median)

    • Q1 (First Quartile): Median of the First Half (40, 67, 70) = 67

    • Q3 (Third Quartile): Median of the Second Half (80, 85, 87) = 85

  • Calculate IQR:

    • IQR is determined by subtracting Q1 from Q3:

    • IQR = Q3 - Q1

    • IQR = 85 - 67 = 18

  • Identify Outliers:

    • Any data points below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are considered potential outliers.

    • Q1 - 1.5 * IQR = 67 - 1.5 * 18 = 40

    • Q3 + 1.5 * IQR = 85 + 1.5 * 18 = 112

  • Now assume an 8th profile is run:

    • Case 1: The Row Count is 150.

    • Case 2: The Row Count is 90.

    • In Case 1, the value is considered an anomaly and reported to the Data Anomalies because it falls outside the range of 40 - 112.

    • In Case 2, the value is considered normal and is not reported because it falls within the range of 40 - 112.
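The steps above can be sketched in a few lines of Python. This is a minimal illustration of the IQR check, not OvalEdge's actual implementation; function names are illustrative, and quartiles are taken as medians of the lower and upper halves (excluding the overall median for odd-length samples), matching the worked example.

```python
def iqr_bounds(values):
    """Return (lower, upper) Tukey fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    # For odd n, skip the overall median when splitting into halves.
    lower_half, upper_half = data[:half], data[half + (n % 2):]

    def median(xs):
        mid = len(xs) // 2
        return xs[mid] if len(xs) % 2 else (xs[mid - 1] + xs[mid]) / 2

    q1, q3 = median(lower_half), median(upper_half)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def is_anomaly(value, history):
    """Flag a newly profiled value that falls outside the IQR bounds."""
    low, high = iqr_bounds(history)
    return not (low <= value <= high)

history = [40, 67, 87, 76, 85, 70, 80]   # the seven profiled row counts
print(iqr_bounds(history))               # (40.0, 112.0)
print(is_anomaly(150, history))          # True  (Case 1: outside 40 - 112)
print(is_anomaly(90, history))           # False (Case 2: within 40 - 112)
```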

Deviation

The Deviation Algorithm assesses the deviation percentage in profiling metrics among supported data objects. The Last Profiled and the Recently Profiled data are considered for anomaly detection using the deviation algorithm. The algorithm computes the percentage increase or decrease by comparing the recent and last profiled data statistics.

Example: Consider a scenario where we are profiling a table with the following statistics:

  • Row Count in First Profiling: 50

  • Row Count in Second Profiling: 85

Now, let's compute the deviation percentage:

Deviation Percentage = ((Recent Profiled - Last Profiled) / Last Profiled) * 100

Deviation Percentage = ((85 - 50) / 50) * 100

= 70%

  • The deviation percentage between the first and second profiling is 70%.

    • Case 1: If the user sets the threshold value at 50%, we consider this an anomaly and report it to the Data Anomalies as a positive deviation.

    • Case 2: If the user sets the threshold value at 75%, we consider this a normal value and do not report it.
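The deviation check above reduces to a one-line percentage formula plus a threshold comparison. A minimal Python sketch (illustrative names, not OvalEdge's implementation; the absolute value covers both positive and negative deviations):

```python
def deviation_percent(last, recent):
    """Percentage change of the recently profiled value vs. the last one."""
    return (recent - last) / last * 100

def is_anomaly(last, recent, threshold):
    """Flag an anomaly when |deviation| exceeds the configured threshold %."""
    return abs(deviation_percent(last, recent)) > threshold

print(deviation_percent(50, 85))   # 70.0
print(is_anomaly(50, 85, 50))      # True  (Case 1: threshold 50%)
print(is_anomaly(50, 85, 75))      # False (Case 2: threshold 75%)
```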

Data Anomaly Detection Configuration

Levels:

  • Global: Apply settings across the entire application.

  • Connector: Configure settings for specific RDBMS connectors and override global settings for that connector.

  • Schema: Choose which schemas within an RDBMS connector to analyze for anomalies.

  • Object: Define custom settings for individual profiling statistics of each data object.

Global Level

Global settings are configured in Administration > System Settings and apply across the entire application.

  • Enable/Disable Anomaly Detection:

    • Key: anomaly detection analysis enabled

    • Default Value: TRUE

    • Description: TRUE enables the anomaly detection functionality; FALSE disables it. Note: Disabling this grays out the connector and data object level settings.

  • Supported Anomaly Detection Algorithms:

    • Key: anomaly.detection.analysis.algorithm.list

    • Default Value: Deviation, IQR

    • Description: Defines the list of algorithms available for anomaly detection. Enter algorithm names separated by commas.

  • Active Anomaly Detection Algorithm:

    • Key: anomaly.detection.analysis.algorithm.selection

    • Default Value: Deviation

    • Description: Selects the primary algorithm used for anomaly detection.

  • IQR Threshold:

    • Key: anomaly.detection.analysis.iqr.range

    • Default Value: 1-50

    • Description: Defines the threshold range (percentage change) the IQR algorithm uses to identify anomalies. Values outside this range are flagged.

  • Deviation Threshold:

    • Key: anomaly.detection.analysis.deviation.threshold

    • Default Value: 50

    • Description: Defines the threshold percentage (above or below) the Deviation algorithm uses to identify anomalies. Values exceeding this threshold are flagged.
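For quick reference, the dotted keys above and their defaults correspond to key/value pairs like the following. This is only an illustrative properties-style summary, not a file OvalEdge necessarily reads; confirm key spellings in System Settings on your instance.

```properties
# Illustrative defaults; key names as listed above.
anomaly.detection.analysis.algorithm.list=Deviation,IQR
anomaly.detection.analysis.algorithm.selection=Deviation
anomaly.detection.analysis.iqr.range=1-50
anomaly.detection.analysis.deviation.threshold=50
```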

Connector Level

Anomaly detection is only available for RDBMS connectors. Integration Admins can configure unique settings for each connector, overriding the global settings.

Configuration Steps:

  1. Navigate to Administration > Connectors.

  2. Select an RDBMS connector (e.g., SQL Server) and access its "Nine Dots" menu.

  3. Select "Settings" and go to the "Anomaly Detection Settings" tab.

  4. Choose "Custom Setting" and configure the following:

     • Enable/disable anomaly detection for this specific connector.

     • Preferred anomaly detection algorithm (Deviation Percent or Interquartile Range).

     • Threshold range/percentage for the chosen algorithm.

Schema Level

Integration Admins can choose which schemas within an RDBMS connector to analyze for anomalies.

Configuration Steps:

  1. Go to Administration > Connectors.

  2. Select an RDBMS connector.

  3. Click the "Crawl/Profile" button.

  4. Choose the schemas to include for anomaly detection.

  5. Select "Crawl/Profile" or "Profile" (profiling is required).

  6. Check "Execute Anomaly Detection."

  7. Click "Run" to start background jobs that calculate anomalies for the chosen schemas.

Notes:

  • Anomaly detection settings are only available for RDBMS connectors.

  • Profiling is mandatory for anomaly detection to function.

Object Level

The Object Level custom settings are configured specifically for the selected profiling statistics of each data object.

Example: Null Count and Distinct Count are the two available profiling statistics for table columns. To configure custom settings for a table column titled "CHARACTER_SET_NAME," select Null Count first and configure it accordingly. Settings applied to Null Count are not reflected in the Distinct Count statistics.

Configuration Steps:

  1. Navigate to Data Catalog. Select any supported object type (Schema, Table, Table Column).

  2. Click on the Anomaly Detection Settings from the Nine Dots.

  3. Select the Custom Setting Radio button.

  4. Choose the preferred algorithm for anomaly detection.

  5. Specify the threshold range/percentage for the selected algorithm.

  6. Use the Skip Anomaly Detection option to exclude anomaly detection for the selected profiling statistics of the particular data object (Schema, Table, Table Column).

Note: The Skip Anomaly Detection option is available in the object-level settings, offering the flexibility to include or exclude the selected data object from anomaly detection. The object-level settings take precedence over the connector and global-level settings for Anomaly Detection.
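The object > connector > global precedence described above amounts to a first-match lookup. A minimal sketch, assuming settings at each level are modeled as plain dictionaries (an assumption for illustration; OvalEdge stores them differently):

```python
def effective_setting(key, object_cfg, connector_cfg, global_cfg):
    """Resolve a setting with object > connector > global precedence.

    The first level that defines the key wins; lower levels are fallbacks.
    """
    for cfg in (object_cfg, connector_cfg, global_cfg):
        if key in cfg:
            return cfg[key]
    raise KeyError(f"setting {key!r} not configured at any level")

# Hypothetical values: connector overrides the global threshold,
# and the object level adds its own skip flag.
global_cfg = {"algorithm": "Deviation", "deviation.threshold": 50}
connector_cfg = {"deviation.threshold": 30}
object_cfg = {"skip": True}

print(effective_setting("algorithm", object_cfg, connector_cfg, global_cfg))            # Deviation
print(effective_setting("deviation.threshold", object_cfg, connector_cfg, global_cfg))  # 30
print(effective_setting("skip", object_cfg, connector_cfg, global_cfg))                 # True
```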

Viewing Data Anomalies

The Data Anomalies page provides a central location to review all detected anomalies and their details.

All

This tab lists all identified anomalies and offers various filters and sorting options for detailed analysis:

  • Anomaly ID: Displays the unique identifier for each anomaly.

  • Anomaly Type: Displays the category of the anomaly based on profiling statistics and the detection algorithm. Examples:

    • IQR: Table Count Deflection, Row Count Deflection, Distinct Count Deflection, Null Count Deflection

    • Deviation: Table Count Deviation, Row Count Deviation, Distinct Count Deviation, Null Count Deviation

  • Anomaly Description: Provides a concise explanation of the anomaly. Hovering over a description reveals the complete information, making it easier for the assignee to address the identified issue.

  • Detection Method: Displays the name of the algorithm used to detect the anomaly (Interquartile Range or Deviation).

  • Object Type: Displays the type of data object with the anomaly (Schema, Table, or Table Column).

  • Object Name: Displays the name of the data object with the anomaly.

  • Anomaly Value: Displays how the anomaly affects specific profiling statistics over time. Clicking this reveals a trend graph for further analysis.

  • Anomaly Remarks: Describes the change based on profiling statistics, where X, Y, and Z are variables:

    • For Deviation: "X% positive/negative deviation from the previous value of Y. Deviation threshold is Z%"

    • For IQR: "Detected value of X is outside the expected range of Y - Z"

  • Anomaly Status: Displays the current state of the anomaly.

    • New: The newly triggered anomaly has been reported but has not yet been acknowledged.

    • In Progress: The anomaly has been acknowledged, and the user has started working on fixing the issue.

    • Corrected: Remediation has been completed; the issue is resolved and no longer a concern.

    • Void: Classifies a reported entry as invalid; used when an anomaly was detected erroneously due to incorrect anomaly settings.

    • Allowed: Used when a value is a valid exception. When users designate selected value(s) as "Allowed," they are removed from the Data Anomalies.

    Note: When a user designates selected value(s) as "Corrected," "Void," or "Allowed," they are archived from the Data Anomalies as they have been acknowledged.

  • Assignee: Displays the name of the user assigned to investigate and resolve the anomaly. Data object custodians are typically assigned by default. Assignees receive notifications whenever the anomaly status changes or the assignee is updated. Only Author license users can be assignees.

  • Executed By: Displays the name of the user who triggered the anomaly detection on the specific object.

  • Updated By: Displays the name of the user who last modified the anomaly status.

  • Detection Date: Displays the date and time the anomaly was detected.

  • Profiled Date: Displays the date and time the data object was last profiled.

  • Updated On: Displays the date and time any changes were made to the anomaly.

  • Job ID: Displays the Job ID associated with detecting and reporting the anomaly.

  • Logs: Displays the job logs for the specific Job ID.

Summary

This tab provides an overview of anomalies reported for each data object:

  • Anomaly Type: Category of the anomaly based on profiling statistics and the detection algorithm. Examples:

    • IQR: Table Count Deflection, Row Count Deflection, Distinct Count Deflection, Null Count Deflection

    • Deviation: Table Count Deviation, Row Count Deviation, Distinct Count Deviation, Null Count Deviation

  • Object Type: Displays the type of object on which the anomaly was detected and reported.

  • Anomaly Count: Displays the number of reported anomalies for the specific object.

  • Object Name: Displays the name of the data object with anomalies.

  • Start Date: Displays the date when the first anomaly was detected for the object.

  • End Date: Displays the date when the last anomaly was detected for the object.
