Data Anomalies
Data anomalies are unexpected values that differ significantly from past data. These deviations disrupt typical patterns and can arise from data integration issues, process changes, migration errors, human mistakes, or system failures.
Anomalies can lead to inaccurate analysis, decisions, and reports, potentially harming businesses. Detecting and fixing them is vital for good data quality management, ensuring data is accurate, complete, consistent, and valid.
OvalEdge identifies anomalies during profiling by comparing newly profiled data with previously profiled data. Any inconsistencies are flagged as anomalies.
A list of detected anomalies is available for review in Data Anomalies. This section provides details like anomaly type, assigned person for correction, and other relevant information.
Data Anomaly Detection
Supported Objects: Schemas, Tables, and Table Columns (for RDBMS connectors only).
Algorithms: Deviation or IQR (configurable by admins in System Settings).
Calculated Statistics: The table below presents the computed statistics associated with each object type.
| Object Type | Profiling Statistics | Anomaly Type based on Algorithm |
| --- | --- | --- |
| Schema | Table Count | Deviation: Table Count Deviation. IQR: Table Count Deflection |
| Table | Row Count | Deviation: Row Count Deviation. IQR: Row Count Deflection |
| Table Column | Distinct Count, Null Count | Deviation: Distinct Count Deviation, Null Count Deviation. IQR: Distinct Count Deflection, Null Count Deflection |
Data Anomaly Algorithm
OvalEdge offers two anomaly detection algorithms that are configurable through System Settings.
Interquartile Range (IQR)
The Interquartile Range (IQR) algorithm is a statistical measure used to assess the spread or dispersion of a dataset. It focuses on the middle 50% of the data, specifically the range between the first quartile (Q1) and the third quartile (Q3).
The steps below are involved in the calculation:
Sort the Data:
Arrange the dataset in ascending order.
Calculate Quartiles:
Q1 (First Quartile): This is the median of the lower half of the dataset.
Q3 (Third Quartile): This is the median of the upper half of the dataset.
Calculate IQR:
IQR is determined by subtracting Q1 from Q3: IQR = Q3 - Q1.
Identify Outliers:
Any data points below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are considered potential outliers.
Example: Consider a scenario where a table has been profiled seven times, with the following row counts:
Profile 1 Row Count: 40
Profile 2 Row Count: 67
Profile 3 Row Count: 87
Profile 4 Row Count: 76
Profile 5 Row Count: 85
Profile 6 Row Count: 70
Profile 7 Row Count: 80
The steps below are involved in the calculation:
Sort the Data:
Arrange the dataset in ascending order.
40, 67, 70, 76, 80, 85, 87
Calculate Quartiles:
Q2 (Second Quartile): 76 (Overall Median)
Q1 (First Quartile): Median of the First Half (40, 67, 70) = 67
Q3 (Third Quartile): Median of the Second Half (80, 85, 87) = 85
Calculate IQR:
IQR is determined by subtracting Q1 from Q3:
IQR = Q3 - Q1
IQR = 85 - 67 = 18
Identify Outliers:
Any data points below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are considered potential outliers.
Q1 - 1.5 * IQR = 67 - 1.5 * 18 = 40
Q3 + 1.5 * IQR = 85 + 1.5 * 18 = 112
Now, assume an 8th profile run is completed:
Case 1: The Row Count is 150.
Case 2: The Row Count is 90.
In Case 1, the value falls outside the 40 - 112 range, so it is considered an anomaly and reported to Data Anomalies.
In Case 2, the value falls within the 40 - 112 range, so it is considered normal and not reported.
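The IQR check above can be sketched in Python. This is an illustrative implementation, not OvalEdge code; quartiles are computed with the median-exclusive method used in the worked example (the overall median is left out of both halves).

```python
from statistics import median

def iqr_bounds(values):
    """Return (lower, upper) outlier bounds from Q1/Q3 of the sorted data.

    Uses the median-exclusive quartile method from the worked example:
    for an odd-length dataset, the overall median is left out of both halves.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    q1 = median(data[:half])            # median of the lower half
    q3 = median(data[half + (n % 2):])  # median of the upper half
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def is_anomaly(values, new_value):
    """Flag a new profiled value that falls outside the IQR bounds."""
    lower, upper = iqr_bounds(values)
    return not (lower <= new_value <= upper)

history = [40, 67, 87, 76, 85, 70, 80]  # row counts from profiles 1-7
print(iqr_bounds(history))        # (40.0, 112.0)
print(is_anomaly(history, 150))   # True  (Case 1: reported)
print(is_anomaly(history, 90))    # False (Case 2: normal)
```

The bounds match the worked example: Q1 = 67, Q3 = 85, IQR = 18, giving the expected range 40 - 112.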
Deviation
The Deviation Algorithm assesses the deviation percentage in profiling metrics among supported data objects. The Last Profiled and the Recently Profiled data are considered for anomaly detection using the deviation algorithm. The algorithm computes the percentage increase or decrease by comparing the recent and last profiled data statistics.
Example: Consider a scenario where we are profiling a table with the following statistics:
Row Count in First Profiling: 50
Row Count in Second Profiling: 85
Now, let's compute the deviation percentage:
Deviation Percentage = ((Recent Profiled - Last Profiled) / Last Profiled) * 100
Deviation Percentage = ((85 - 50) / 50) * 100
= 70%
The deviation percentage between the first and second profiling is 70%.
Case 1: If the user sets the threshold value at 50%, we consider this an anomaly and report it to the Data Anomalies as a positive deviation.
Case 2: If the user sets the threshold value at 75%, we consider this a normal value and do not report it.
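The deviation calculation above is a straightforward percentage change. A minimal sketch (function names are illustrative, not OvalEdge APIs):

```python
def deviation_percent(last, recent):
    """Percentage change between the last and the most recent profiled value."""
    return (recent - last) / last * 100

def is_anomaly(last, recent, threshold):
    """Flag when the absolute deviation exceeds the configured threshold (%)."""
    return abs(deviation_percent(last, recent)) > threshold

print(deviation_percent(50, 85))   # 70.0
print(is_anomaly(50, 85, 50))      # True  (Case 1: 70% exceeds the 50% threshold)
print(is_anomaly(50, 85, 75))      # False (Case 2: 70% is within the 75% threshold)
```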
Data Anomaly Detection Configuration
Levels:
Global: Apply settings across the entire application.
Connector: Configure settings for specific RDBMS connectors and override global settings for that connector.
Schema: Choose which schemas within an RDBMS connector to analyze for anomalies.
Object: Define custom settings for individual profiling statistics of each data object.
Global Level
Global settings are configured in Administration > System Settings and apply across the entire application.
Enable/Disable Anomaly Detection:
Key: anomaly.detection.analysis.enabled
Default Value: TRUE
TRUE: Enables anomaly detection functionality.
FALSE: Disables anomaly detection functionality. Note: Disabling this grays out the connector and data object level settings.
Supported Anomaly Detection Algorithms:
Key: anomaly.detection.analysis.algorithm.list
Default Value: Deviation, IQR
Description: This setting defines the list of available algorithms for anomaly detection. Users can enter the algorithms' names separated by commas.
Active Anomaly Detection Algorithm:
Key: anomaly.detection.analysis.algorithm.selection
Default Value: Deviation
Description: This setting allows users to choose the primary algorithm used for anomaly detection.
IQR Threshold:
Key: anomaly.detection.analysis.iqr.range
Default Value: 1-50
Description: This setting defines the IQR algorithm's threshold range (percentage change) to identify anomalies. Values outside this range are flagged.
Deviation Threshold
Key: anomaly.detection.analysis.deviation.threshold
Default Value: 50
Description: This setting defines the threshold percentage (above or below) for the deviation algorithm to identify anomalies. Values exceeding this threshold are flagged.
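The global keys above behave like ordinary key/value settings. The sketch below shows how an application might resolve the active algorithm from them; the key names follow the dotted pattern used in this section, and the function and validation logic are assumptions, not OvalEdge internals.

```python
# Hypothetical in-memory view of the System Settings keys listed above.
settings = {
    "anomaly.detection.analysis.enabled": "TRUE",
    "anomaly.detection.analysis.algorithm.list": "Deviation, IQR",
    "anomaly.detection.analysis.algorithm.selection": "Deviation",
    "anomaly.detection.analysis.iqr.range": "1-50",
    "anomaly.detection.analysis.deviation.threshold": "50",
}

def active_algorithm(cfg):
    """Return the selected algorithm, or None if detection is disabled
    or the selection is not in the configured algorithm list."""
    if cfg["anomaly.detection.analysis.enabled"].upper() != "TRUE":
        return None
    allowed = [a.strip() for a in cfg["anomaly.detection.analysis.algorithm.list"].split(",")]
    selected = cfg["anomaly.detection.analysis.algorithm.selection"]
    return selected if selected in allowed else None

print(active_algorithm(settings))  # Deviation
```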
Connector Level
Anomaly detection is only available for RDBMS connectors. Integration Admins can configure unique settings for each connector, overriding the global settings.
Configuration Steps:
Navigate to Administration > Connectors.
Select an RDBMS connector (e.g., SQL Server) and access its "Nine Dots" menu.
Select “Settings” and go to the "Anomaly Detection Settings" tab.
Choose "Custom Setting" and configure the following:
Enable/Disable anomaly detection for this specific connector.
Preferred anomaly detection algorithm (Deviation Percent or InterQuartile Range)
Threshold Range/Percentage for the chosen algorithm.
Schema Level
Integration Admins can choose which schemas within an RDBMS connector to analyze for anomalies.
Configuration Steps:
Go to Administration > Connectors.
Select an RDBMS connector.
Click the "Crawl/Profile" button.
Choose the schemas to include for anomaly detection.
Select "Crawl/Profile" or "Profile" (profiling is required).
Check "Execute Anomaly Detection."
Click "Run" to start background jobs that calculate anomalies for the chosen schemas.
Notes:
Anomaly detection settings are only available for RDBMS connectors.
Profiling is mandatory for anomaly detection to function.
Object Level
The Object Level custom settings are configured specifically for the selected profiling statistics of each data object.
Example: Null Count and Distinct Count are the two available profiling statistics for table columns. To configure custom settings for a table column titled "CHARACTER_SET_NAME," Null Count is selected first, and configuration is done accordingly. However, settings applied for Null Count will not be reflected for Distinct Count statistics.
Configuration Steps:
Navigate to Data Catalog. Select any supported object type (Schema, Table, Table Column).
Click on the Anomaly Detection Settings from the Nine Dots.
Select the Custom Setting Radio button.
Choose the preferred algorithm for anomaly detection.
Specify the threshold range/percentage for the selected algorithm.
Use the Skip Anomaly Detection option to exclude anomaly detection for the selected profiling statistics of the particular data object (Schema, Table, Table Column).
Note: The Skip Anomaly Detection option is available in the object-level settings, offering the flexibility to include or exclude the selected data object from anomaly detection. The object-level settings take precedence over the connector and global-level settings for Anomaly Detection.
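The precedence described above (object over connector over global) can be sketched as a simple lookup chain. All names here are illustrative, not OvalEdge APIs:

```python
def effective_setting(key, object_cfg, connector_cfg, global_cfg):
    """Resolve a setting with object > connector > global precedence.

    Each cfg is a dict; a level without a custom value simply omits the key.
    """
    for level in (object_cfg, connector_cfg, global_cfg):
        if key in level:
            return level[key]
    return None

global_cfg = {"algorithm": "Deviation", "threshold": 50}
connector_cfg = {"threshold": 30}              # connector overrides the global threshold
object_cfg = {"skip_anomaly_detection": True}  # object-level skip flag

print(effective_setting("threshold", object_cfg, connector_cfg, global_cfg))  # 30
print(effective_setting("algorithm", object_cfg, connector_cfg, global_cfg))  # Deviation
```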
Viewing Data Anomalies
The Data Anomalies page provides a central location to review all detected anomalies and their details.
All
This tab lists all identified anomalies and offers various filters and sorting options for detailed analysis:
Field
Description
Anomaly ID
Displays the unique identifier for each anomaly.
Anomaly Type
Displays the category of the anomaly based on profiling statistics and detection algorithms.
Example:
IQR:
Table Count Deflection
Row Count Deflection
Distinct Count Deflection
Null Count Deflection
Deviation:
Table Count Deviation
Row Count Deviation
Distinct Count Deviation
Null Count Deviation
Anomaly Description
Provides a concise explanation of the anomaly. Hovering over a description reveals the complete text, making it easier for the assignee to address the identified issue.
Detection Method
Displays the algorithm name used to detect the anomaly (Interquartile Range or Deviation).
Object Type
Displays the data object type with the anomaly (Schema, Table, or Table Column).
Object Name
Displays the name of the data object with the anomaly.
Anomaly Value
Displays how the anomaly affects specific profiling statistics over time. Clicking this value reveals a trend graph for further analysis.
Anomaly Remarks
Describes the change based on profiling statistics (e.g., "X% positive deviation from the previous value of Y").
Example:
For Deviation, the Anomaly Remark follows the format: "X% positive/negative deviation from the previous value of Y. Deviation threshold is Z%."
For IQR, the Anomaly Remark follows the format: "Detected value of X is outside the expected range of Y - Z."
Note: X, Y, and Z are placeholders for the actual values.
Anomaly Status
Displays the current state of the anomaly
New - This signifies that the newly triggered anomaly has been reported but has not yet been acknowledged.
In Progress - This signifies that the triggered anomaly has been acknowledged, and the user has started working on fixing the issue.
Corrected - Corrected is applied when the remediation has been completed. It signifies that the issue has been resolved and is no longer a concern.
Void - Void is used to classify a reported entry as invalid. It is employed when an anomaly has been detected erroneously due to incorrect anomaly settings.
Allowed - This is used when a value is a valid exception and requires review. When users designate selected value(s) as "Allowed," they are removed from the Data Anomalies.
Note: When a user designates selected value(s) as “Corrected,” “Void,” or "Allowed," they are archived from the Data Anomalies as they have been acknowledged.
Assignee
Displays the name of the user who has been assigned to investigate and resolve the anomaly. Data object custodians are typically assigned by default. These assignees get notifications whenever there is a change in the anomaly status or whenever the assignee is updated. Only Author license users can be the assignee.
Executed By
Displays the name of the user who triggered the anomaly detection on the specific object.
Updated By
Displays the name of the user who last modified the anomaly status.
Detection Date
Displays the date and time the anomaly was detected.
Profiled Date
Displays the date and time the data object was last profiled.
Updated On
Displays the date and time any changes were made to the anomaly.
Job ID
Displays the Job ID associated with detecting and reporting the anomaly.
Logs
Displays the Job logs for the specific Job ID.
Summary
This tab provides an overview of anomalies reported for each data object:
Field
Description
Anomaly Type
Category of the anomaly based on profiling statistics and detection algorithm.
Example:
IQR:
Table Count Deflection
Row Count Deflection
Distinct Count Deflection
Null Count Deflection
Deviation:
Table Count Deviation
Row Count Deviation
Distinct Count Deviation
Null Count Deviation
Object Type
Displays the type of object on which the anomaly has been detected and reported.
Anomaly Count
Displays the count of reported anomalies for the specific object.
Object Name
Displays the Name of the data object with anomalies.
Start Date
Displays the Date when the first anomaly was detected for the object.
End Date
Displays the Date when the last anomaly was detected for the object.