
# Retrieval-Augmented Generation (RAG)

**Retrieval-Augmented Generation (RAG)** in askEdgi means answering a question by first retrieving trusted enterprise context and then generating the response from that context.

askEdgi is not a generic AI chatbot.

It does not rely on assumptions or general knowledge.

Instead, askEdgi:

* Understands business terms defined by the organization
* Knows what data exists and how it is governed
* Considers structural relationships between datasets
* Uses metadata, lineage, and contextual statistics
* Suggests relevant and trusted assets before execution
* Explains answers using enterprise context

This approach ensures that responses are grounded in how the organization understands and manages its data.

### Business Value of RAG

Business users frequently need answers to questions such as:

* Which dataset should be used for a metric?
* What does a business term mean?
* Where does a number originate?
* Why do different reports show different values?
* Which datasets should be combined?

These questions require understanding business meaning, structural relationships, and governance alignment.

Retrieval-Augmented Generation ensures:

* Accurate answers
* Consistent interpretation
* Connected datasets
* Trustworthy analysis

### Enterprise Context Used by RAG

The enterprise context retrieved by askEdgi includes the following elements.

<table><thead><tr><th width="267.3333740234375">Context Type</th><th>Description</th></tr></thead><tbody><tr><td>Business glossary</td><td>Approved business definitions and terminology</td></tr><tr><td>Curated datasets</td><td>Trusted and governed data assets</td></tr><tr><td>Governance information</td><td>Ownership, classification, and access controls</td></tr><tr><td>Metadata and documentation</td><td>Business and technical descriptions</td></tr><tr><td>Dataset relationships</td><td>Structural connections between assets</td></tr><tr><td>Lineage information</td><td>Data movement and dependency paths</td></tr><tr><td>Contextual statistics</td><td>Sample characteristics and value distributions</td></tr><tr><td>Top values</td><td>Frequently occurring values used for interpretation</td></tr></tbody></table>

{% hint style="info" %}
Sample statistics and top value summaries improve interpretation but remain secondary to governed metadata and definitions.
{% endhint %}

### askEdgi Modes and the Role of the Business Context Layer (RAG)

askEdgi operates in two clearly defined modes to support different user needs. Each mode has a clear purpose and boundary that ensures trust and predictability.

#### **Analysis Mode**

Analysis Mode is the default and primary mode. It supports the complete journey from understanding a question to generating insights.

**Purpose of Analysis Mode**

Analysis Mode supports the following activities:

* Understanding a question
* Identifying the correct data (tables and files)
* Validating how datasets relate
* Performing analysis
* Receiving business-aligned explanations

Analysis Mode supports the complete journey from discovery to insight without switching between guidance and execution modes.

**Understand RAG Usage in Analysis Mode**

In **Analysis Mode**, Retrieval-Augmented Generation performs the following functions:

* Interpret the business meaning behind a question
* Retrieve relevant glossary definitions (from both object and column level)
* Surface business and technical descriptions of objects
* Evaluate asset metadata and documentation
* Analyze relationships between objects
* Eliminate unrelated or disconnected objects
* Confirm that selected objects combine correctly

Retrieval-Augmented Generation ensures that reasoning happens in business context before execution begins.

**Understand Relationship-Aware Intelligence**

When a question spans multiple datasets, askEdgi performs structural validation.

askEdgi performs the following actions:

* Confirm that selected assets are structurally connected
* Avoid combining unrelated datasets
* Suggest additional related assets only when required
* Limit context expansion to what is necessary

This prevents:

* Incorrect joins
* Over-selection of irrelevant tables
* Misleading analysis
* Loss of trust

Datasets are validated as part of a connected data ecosystem rather than isolated objects.

**Understand Workspace-First Execution**

Analysis Mode respects the Workspace as the execution boundary.

Execution rules are as follows:

* If tables are pinned, analysis is restricted to the pinned tables
* If tables are not pinned, eligible workspace tables are considered
* Additional catalog assets are surfaced only when necessary
* Data outside the intended scope is not analyzed

This ensures controlled execution.
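The first two execution rules can be sketched as a small scope resolver. This is an illustrative sketch only, not askEdgi's implementation: the `WorkspaceTable` type, its `eligible` flag, and the `resolve_execution_scope` function are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkspaceTable:
    """Hypothetical stand-in for a table held in the Workspace."""
    name: str
    pinned: bool = False
    eligible: bool = True  # assumed flag: the table may be used for analysis


def resolve_execution_scope(tables: list[WorkspaceTable]) -> list[str]:
    """Return the names of the tables that analysis is allowed to touch."""
    pinned = [t.name for t in tables if t.pinned]
    if pinned:
        # Pinned tables restrict the scope to themselves.
        return pinned
    # With nothing pinned, every eligible workspace table is considered.
    return [t.name for t in tables if t.eligible]


workspace = [
    WorkspaceTable("orders", pinned=True),
    WorkspaceTable("customers"),
    WorkspaceTable("scratch_tmp", eligible=False),
]
scope = resolve_execution_scope(workspace)
print(scope)
```

With `orders` pinned, only `orders` is in scope. If no table were pinned, the scope would fall back to the eligible workspace tables.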

#### **Discovery Mode**

Discovery Mode supports structured exploration of the Data Catalog. This mode uses askEdgi’s Business Context Layer (RAG) and does not perform analysis execution.

**Purpose of Discovery Mode**

Discovery Mode supports:

* Asset browsing
* Data availability validation
* Metadata understanding
* Documentation review

**Understand Discovery Mode Behavior**

In Discovery Mode, askEdgi performs the following actions:

* Retrieve assets from the catalog
* Surface business descriptions and technical documentation
* Apply governance-aware filters
* Return metadata and definitions

RAG-based reasoning does not occur in this mode. Discovery Mode provides clarity without execution. This mode retrieves the following attributes for each object type:

<table><thead><tr><th width="143.7999267578125">Object Type</th><th>Attribute</th></tr></thead><tbody><tr><td>Table</td><td>Name, Title, Synonyms, Source Description, Table location (e.g., catalog.schema.table), Profile status, Data quality index, Curation score, Rating, Certification, Critical Data Element, Navigation URL, Governance Roles, Business Glossary Association, Tags, Data Products, Custom Fields, Entity Relationships, Column Details (Column name, Business description), and Sample Data.</td></tr><tr><td>File</td><td>Name, Title, Synonyms, File location, File type, Profile status, Data quality index, Curation score, Rating, Certification (certified by, type, date), Critical Data Element, Navigation URL, Governance Roles, Business Glossary Association, Tags Classification, Data Products, Custom Fields, Column Details, and Sample Data </td></tr><tr><td>Report</td><td><p>Name, Title, Synonyms, Report location, </p><p>Report type, Parent report name, Curation score, Rating, Certification, Critical Data Element, Navigation URL, Governance Roles, Business Glossary Association, Tags, Data Products, Custom Fields, and Column Details. 
</p></td></tr><tr><td>API</td><td>Name, Title, Synonyms, API location / path, URL, Method, Endpoint, Curation score, Rating, Certification, Critical Data Element, Navigation URL, Governance Roles, Business Glossary Association, Tags Classification, Data Products, Custom Fields, and Attribute Details </td></tr><tr><td>Code</td><td>Name, Title, Code location, Job type, Curation score, Rating, Certification (certified by, type, date), Critical Data Element, Navigation URL, Business Glossary Association, Tags Classification, and Custom Fields</td></tr><tr><td>Schema </td><td>Name, Title, Synonyms, Database name, Curation score, Rating, Certification, Critical Data Element, Total table count, Navigation URL, Governance Roles, Business Glossary Association, Tags Classification, Custom Fields</td></tr><tr><td>Term </td><td>Name, Domain, Category, Subcategory, Governance status, Curation score, Rating, Navigation URL, Tags, and Custom Fields</td></tr><tr><td>Tag</td><td>Name, Navigation URL, and Custom Fields</td></tr><tr><td>Data Products</td><td>Name, Data Domain, Parent Subdomain, Status (Draft / Published), Criticality, Sensitivity (Public / Internal / Confidential / Restricted), Curation score, Rating, Navigation URL, Tags, and Custom Fields</td></tr><tr><td>Data Story</td><td>Name, Story Zone, Parent Story, Child Stories, Author, Rating, Navigation URL, and Tags Classification</td></tr></tbody></table>

### How askEdgi Finds the Right Context in Analysis Mode

The following sequence describes how askEdgi retrieves context and prepares for execution.

**Step 1: Determine Workspace Dependency**

askEdgi evaluates whether the request requires existing workspace data.

Workspace data is required when the request:

* Reads existing tables
* Computes metrics from data
* References workspace objects
* Validates schemas

Workspace data is not required when the request:

* Requests an example
* Requests sample SQL or Python code
* Requires logical reasoning without data
* Creates new structures without referencing existing data

This separation improves clarity and efficiency.
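The two lists above amount to a simple rule: a request depends on the Workspace as soon as it touches existing data. The sketch below expresses that rule, assuming the request has already been reduced to a set of traits. The `Trait` enum and `needs_workspace` are hypothetical names, and how askEdgi derives such traits from a natural-language prompt is not covered here.

```python
from enum import Enum, auto


class Trait(Enum):
    """Hypothetical request traits, one per bullet in the lists above."""
    READS_EXISTING_TABLES = auto()
    COMPUTES_METRICS = auto()
    REFERENCES_WORKSPACE_OBJECTS = auto()
    VALIDATES_SCHEMAS = auto()
    ASKS_FOR_EXAMPLE = auto()
    ASKS_FOR_SAMPLE_CODE = auto()
    PURE_REASONING = auto()
    CREATES_NEW_STRUCTURE = auto()


# Traits that require existing workspace data.
WORKSPACE_TRAITS = {
    Trait.READS_EXISTING_TABLES,
    Trait.COMPUTES_METRICS,
    Trait.REFERENCES_WORKSPACE_OBJECTS,
    Trait.VALIDATES_SCHEMAS,
}


def needs_workspace(traits: set[Trait]) -> bool:
    """True if at least one trait touches existing workspace data."""
    return bool(traits & WORKSPACE_TRAITS)


print(needs_workspace({Trait.COMPUTES_METRICS}))
print(needs_workspace({Trait.ASKS_FOR_SAMPLE_CODE}))
```

A request that both asks for sample code and reads an existing table still counts as workspace-dependent, because a single data-touching trait is enough.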

**Step 2: Evaluate Existing Workspace Context**

askEdgi checks whether sufficient context already exists within the Workspace.

If sufficient context exists, search expansion does not occur.

**Step 3: Enrich Business Understanding**

When additional clarity is required, askEdgi retrieves:

* Glossary definitions
* Asset descriptions
* Metadata context

This ensures the correct interpretation of business intent.

**Step 4: Suggest Relevant Assets**

When necessary, askEdgi identifies additional datasets aligned with business intent.

Only governed and relevant assets are considered.

**Step 5: Validate Dataset Compatibility**

Before execution, askEdgi confirms:

* Required attributes exist
* Datasets are structurally connected
* Necessary relationships are available
* Required elements are complete

If validation fails, execution does not proceed.

{% hint style="warning" %}
askEdgi stops and requests clarification instead of executing with incomplete data.
{% endhint %}
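The compatibility checks in Step 5 can be pictured as two tests: every required attribute exists, and the selected datasets form one connected group. The following is a minimal sketch under stated assumptions. Relationships are modelled as undirected pairs between selected datasets, and `is_connected` and `validate` are hypothetical helpers, not askEdgi functions.

```python
from collections import deque


def is_connected(datasets: set[str], relationships: list[tuple[str, str]]) -> bool:
    """True if every selected dataset is reachable from any other one."""
    if not datasets:
        return False
    adjacency = {name: set() for name in datasets}
    for left, right in relationships:
        # Only relationships between selected datasets are considered.
        if left in adjacency and right in adjacency:
            adjacency[left].add(right)
            adjacency[right].add(left)
    start = next(iter(datasets))
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first walk over the relationship graph
        current = queue.popleft()
        for neighbour in adjacency[current] - seen:
            seen.add(neighbour)
            queue.append(neighbour)
    return seen == datasets


def validate(
    datasets: dict[str, set[str]],
    relationships: list[tuple[str, str]],
    required: dict[str, set[str]],
) -> list[str]:
    """Return the problems found; an empty list means execution may proceed."""
    problems = []
    for name, columns in required.items():
        missing = columns - datasets.get(name, set())
        if missing:
            problems.append(f"{name}: missing attributes {sorted(missing)}")
    if not is_connected(set(datasets), relationships):
        problems.append("datasets are not structurally connected")
    return problems


datasets = {
    "orders": {"order_id", "customer_id", "amount"},
    "customers": {"customer_id", "region"},
}
required = {"orders": {"amount"}, "customers": {"region"}}
print(validate(datasets, [("orders", "customers")], required))
print(validate(datasets, [], required))
```

The first call returns an empty list, so execution could proceed. The second call has no relationship between the two datasets and reports a connectivity problem, which corresponds to stopping and asking for clarification.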

**Step 6: Execute Analysis**

Execution occurs only after:

* Context is sufficient
* Relationships are confirmed
* Required data elements exist

Execution is intentional and validated.

### How askEdgi Handles Missing or Incomplete Information

askEdgi stops intentionally to ensure accurate and trustworthy results.

askEdgi stops when:

* Required data is missing
* Structural compatibility cannot be confirmed
* Business context is unclear
* Confidence is insufficient

askEdgi does not:

* Guess
* Partially execute
* Assume schema

This results in predictable and trustworthy outcomes.

### RAG Trust in Analysis Mode

The RAG framework in askEdgi relies on controlled enterprise grounding.

RAG uses the following information sources:

* Curated business descriptions
* Technical documentation
* Structured metadata
* Relationship context between assets
* Lineage information
* Contextual data statistics, such as sample data characteristics
* Top 50 values

{% hint style="info" %}
Contextual data statistics and the top 50 values improve relevance and interpretation. Governed metadata and business definitions remain the primary reference.
{% endhint %}

**askEdgi enforces the following controls:**

* Respect governance and access rules
* Validate structural compatibility before dataset combination
* Confirm required attributes before execution
* Stop when information remains incomplete
* Maintain clear separation between discovery and execution

This layered grounding ensures business-aligned, structurally valid, and explainable responses.

### RAG Limitations

Certain behaviors remain intentionally restricted to maintain trust.

askEdgi avoids:

* Guessing or fabricating answers
* Ignoring governance controls
* Combining unrelated datasets
* Excessive expansion across the data ecosystem
* Execution with incomplete schema validation

Restraint remains a core system principle.

### Business Impact

Organizations gain the following benefits:

* Faster and safer data discovery
* Reduced dependency on technical teams
* Fewer incorrect dataset combinations
* Strong structural validation before analysis
* Higher trust in analytics and reporting
* Streamlined workflows without mode confusion
* Better alignment between business and data teams

askEdgi serves as a reliable entry point to enterprise data knowledge and trusted analysis.

### Summary

RAG forms the foundation that makes askEdgi:

* Context-aware instead of generic
* Relationship-aware instead of isolated
* Schema-aware instead of assumptive
* Dependency-aware instead of speculative
* Trusted instead of uncertain
* Business-aligned instead of technically driven

Clear separation between Analysis Mode and Discovery Mode, combined with structured validation before execution, ensures intentional, explainable, and trustworthy interactions.

### Metadata-Aware Retrieval and Ranking

This section describes the enhanced Retrieval-Augmented Generation (RAG) capabilities in askEdgi, which introduce metadata-aware retrieval, governance-driven ranking, and intelligent embedding strategies to improve the relevance, trustworthiness, and explainability of results.

**Retrieval Architecture**

The askEdgi retrieval process follows a structured, multi-stage pipeline that balances semantic relevance with governance trust signals.

**Stage 1 – Relevance Retrieval**

* The user query is transformed into a semantic embedding.
* The system performs:
  * Vector similarity search to capture semantic intent
  * Keyword-based search to capture exact matches across metadata attributes
* Results from both approaches are combined to generate a candidate set of data assets (typically 20–30 objects).
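A rough sketch of Stage 1 follows. It assumes that "combined" means the union of the top results from each search, which this page does not specify. Toy two-dimensional vectors stand in for real embeddings, and `cosine`, `keyword_score`, and `candidate_set` are hypothetical helpers.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_score(query: str, text: str) -> float:
    """Fraction of query terms that appear verbatim in the metadata text."""
    terms = set(query.lower().split())
    words = set(text.lower().split())
    return len(terms & words) / len(terms) if terms else 0.0


def candidate_set(query: str, query_vec: list[float], assets: list[dict], k: int = 30) -> list[dict]:
    """Union of the top-k vector matches and the top-k keyword matches."""
    by_vector = sorted(assets, key=lambda a: cosine(query_vec, a["vector"]), reverse=True)[:k]
    by_keyword = sorted(assets, key=lambda a: keyword_score(query, a["metadata"]), reverse=True)[:k]
    # Drop keyword results that matched no query term at all.
    by_keyword = [a for a in by_keyword if keyword_score(query, a["metadata"]) > 0]
    merged = {a["name"]: a for a in by_vector + by_keyword}  # de-duplicate by name
    return list(merged.values())


assets = [
    {"name": "delayed_orders", "vector": [1.0, 0.0], "metadata": "orders shipped late"},
    {"name": "vendor_master", "vector": [0.0, 1.0], "metadata": "vendor delays reference"},
    {"name": "hr_payroll", "vector": [-1.0, 0.0], "metadata": "salary data"},
]
found = candidate_set("vendor delays", [0.9, 0.1], assets, k=1)
print([a["name"] for a in found])
```

With `k=1`, the vector search contributes `delayed_orders` and the keyword search contributes `vendor_master`, so both reach the candidate set even though neither search alone would return both.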

**Stage 2 – Governance Re-Ranking**

* Candidate assets are re-evaluated using governance and trust signals.
* Each asset is assigned a final ranking score based on relevance and governance strength.
* The top-ranked assets (typically 5–10) are selected and passed to downstream processing for response generation.

**Key Characteristics**

* Retrieval prioritizes both intent relevance and metadata quality.
* Governance signals influence ranking, ensuring that curated and trusted assets are preferred.
* Classification levels (e.g., Public, Internal, Restricted) are applied as access controls and do not influence ranking.

#### **Embedding Construction**

Each data asset is converted into a structured semantic representation prior to embedding generation.

**Context Construction**

Instead of treating metadata as independent attributes, the system constructs a business-context narrative that captures:

* Object title and description
* Business context and usage
* Glossary associations and hierarchy
* Tag hierarchy and classification
* Custom fields
* Governance indicators
* Data quality context

This contextual representation is then converted into a vector embedding.
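To illustrate the idea of a business-context narrative, the sketch below joins the populated metadata fields of an asset into one text string that could then be passed to an embedding model. The field names, the section labels, and the `build_context` function are assumptions made for this example, and the embedding call itself is left out.

```python
def build_context(asset: dict) -> str:
    """Join the populated metadata fields into one narrative string."""
    custom = asset.get("custom_fields", {})
    sections = [
        ("Title", asset.get("title")),
        ("Description", asset.get("description")),
        ("Business usage", asset.get("usage")),
        ("Glossary terms", " > ".join(asset.get("glossary_path", []))),
        ("Tags", ", ".join(asset.get("tags", []))),
        ("Custom fields", "; ".join(f"{k}={v}" for k, v in custom.items())),
        ("Governance", asset.get("governance")),
        ("Data quality", asset.get("data_quality")),
    ]
    # Skip empty sections so sparsely documented assets do not add noise.
    return ". ".join(f"{label}: {value}" for label, value in sections if value)


asset = {
    "title": "Delayed Orders",
    "description": "Orders shipped after the promised date",
    "glossary_path": ["Supply Chain", "Order Delay"],
    "tags": ["logistics"],
    "governance": "certified, CDE",
}
context = build_context(asset)
print(context)
```

Keeping the glossary hierarchy as a path (`Supply Chain > Order Delay`) is one way to preserve the parent-child relationship in plain text rather than flattening it into unrelated keywords.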

**Key Characteristics**

* Improves semantic understanding of data assets
* Enhances alignment between user intent and retrieved results
* Enables more accurate and context-aware retrieval

**Ranking and Scoring Model**

The final ranking of data assets is determined using a combination of relevance and governance signals.

**Final Ranking Formula**

FinalScore = (0.75 × SearchRelevanceScore) + (0.25 × GovernanceScore)

Where:

SearchRelevanceScore = (VectorSimilarity + KeywordScore) / 2
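The formula can be worked through in a few lines. The sketch below assumes that all input scores are normalized to the range 0 to 1, which this page does not state. The asset names and score values are invented for illustration.

```python
def search_relevance(vector_similarity: float, keyword_score: float) -> float:
    """Average of the two Stage 1 signals."""
    return (vector_similarity + keyword_score) / 2


def final_score(vector_similarity: float, keyword_score: float, governance: float) -> float:
    """0.75 x relevance + 0.25 x governance, following the formula above."""
    return 0.75 * search_relevance(vector_similarity, keyword_score) + 0.25 * governance


def rerank(candidates: list[dict], top_n: int = 10) -> list[str]:
    """Sort candidates by final score, highest first, and keep top_n names."""
    ranked = sorted(
        candidates,
        key=lambda c: final_score(c["vector"], c["keyword"], c["governance"]),
        reverse=True,
    )
    return [c["name"] for c in ranked[:top_n]]


candidates = [
    {"name": "orders_raw", "vector": 0.90, "keyword": 0.70, "governance": 0.10},
    {"name": "orders_certified", "vector": 0.80, "keyword": 0.60, "governance": 0.90},
]
print(rerank(candidates))
```

Here `orders_raw` has the higher relevance (0.80 against 0.70), but its final score is about 0.625 while `orders_certified` reaches about 0.75, so the better-governed asset ranks first.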

#### **Governance Score Calculation**

GovernanceScore represents the trustworthiness and curation level of a data asset.

**Contributing Signals**

* Critical Data Element (CDE) indicator
* Data Quality score
* Certification status
* Authoritative dataset flag

**Weight Distribution**

* CDE Indicator – 40%
* Data Quality Score – 30%
* Certification Status – 20%
* Authoritative Flag – 10%
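As a worked example of the weight distribution, the sketch below computes a governance score as a weighted sum. It assumes each signal is normalized to the range 0 to 1 (flags as 0 or 1, data quality as a fraction), which is an assumption of this example rather than documented behavior. The `WEIGHTS` keys and `governance_score` are hypothetical names.

```python
# Weights taken from the distribution listed above.
WEIGHTS = {
    "cde": 0.40,
    "data_quality": 0.30,
    "certification": 0.20,
    "authoritative": 0.10,
}


def governance_score(signals: dict[str, float]) -> float:
    """Weighted sum of signals; a missing signal contributes 0."""
    for name, value in signals.items():
        if name not in WEIGHTS:
            raise KeyError(f"unknown signal: {name}")
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return sum(weight * signals.get(name, 0.0) for name, weight in WEIGHTS.items())


score = governance_score({"cde": 1.0, "data_quality": 0.8, "certification": 1.0})
print(round(score, 2))
```

For a certified CDE with a data quality of 0.8 and no authoritative flag, the score is 0.40 + 0.24 + 0.20 + 0 = 0.84.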

**Key Characteristics**

* Prioritizes curated and enterprise-approved datasets
* Ensures that highly governed assets rank higher than unmanaged data
* Maintains balance between relevance and trust

#### **Metadata Curation Score Gate (Embedding Eligibility)**

A metadata curation score is used to determine whether a data asset is eligible for inclusion in the embedding and retrieval process.

**Functionality**

* Assets are evaluated based on a composite metadata curation score.
* Only assets meeting the configured threshold are:
  * Embedded into the vector index
  * Considered during analysis and retrieval
* Assets below the threshold:
  * Remain available in the catalog
  * Are excluded from analysis-driven retrieval

**Configuration**

* Threshold is configurable (range: 0–100)
* The default value allows the inclusion of all assets
* Designed to support gradual adoption based on governance maturity
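The gate can be pictured as a threshold filter over curation scores. In this sketch, a default threshold of 0 and an inclusive comparison (`>=`) are assumptions chosen so that the default includes every asset. The `split_by_curation` function and the sample scores are hypothetical.

```python
def split_by_curation(
    scores: dict[str, float], threshold: float = 0.0
) -> tuple[set[str], set[str]]:
    """Split assets into (eligible for embedding, catalog-only)."""
    if not 0 <= threshold <= 100:
        raise ValueError("threshold must be within 0-100")
    eligible = {name for name, score in scores.items() if score >= threshold}
    # Assets below the threshold stay in the catalog but are not embedded.
    return eligible, set(scores) - eligible


scores = {"orders": 82.0, "customers": 60.0, "scratch_tmp": 35.0}

eligible, catalog_only = split_by_curation(scores)  # default: everything passes
print(sorted(eligible), sorted(catalog_only))

eligible, catalog_only = split_by_curation(scores, threshold=60)
print(sorted(eligible), sorted(catalog_only))
```

Raising the threshold from the default to 60 moves `scratch_tmp` out of the embedding pool while leaving it available in the catalog, which is the gradual-adoption path described above.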

**Re-Evaluation**

* Eligibility is periodically re-evaluated
* Assets can automatically enter or exit the embedding pool based on updated metadata quality

**Steward Feedback**

* When assets fall below the threshold, stewards are provided with:
  * Current score
  * Required threshold
  * Key metadata gaps
  * Recommended improvements

#### **Custom Field Trust Signal Integration**

Custom fields are interpreted as governance signals to enhance trust scoring without requiring explicit configuration.

**Signal Identification**

* The system analyzes custom field names using predefined keyword patterns.
* Matching fields are mapped to governance signal categories such as:
  * CDE indicators
  * Data quality rules
  * Authoritative source indicators
  * Policy and compliance references

**Signal Processing**

* If a native governance signal is available, it is used directly.
* If not, matched custom fields act as proxy signals.

**Confidence Adjustment**

* Custom field–derived signals are applied with a confidence factor to ensure balanced scoring.

**Validation**

* Signals are applied only when associated with valid object types
* Prevents incorrect or irrelevant mappings
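The identification, fallback, and confidence steps can be sketched as follows. Everything specific in this example is an assumption: the keyword patterns, the confidence factor of 0.5, and the `classify_field` and `resolve_signal` helpers are invented for illustration, and the object-type validation step is omitted.

```python
import re
from typing import Optional

# Hypothetical keyword patterns; the actual patterns are not listed on this page.
PATTERNS = {
    "cde": re.compile(r"\b(cde|critical data)\b", re.IGNORECASE),
    "data_quality": re.compile(r"\b(dq|data quality)\b", re.IGNORECASE),
    "authoritative": re.compile(r"\b(authoritative|golden source|system of record)\b", re.IGNORECASE),
    "policy": re.compile(r"\b(policy|compliance)\b", re.IGNORECASE),
}
CONFIDENCE = 0.5  # assumed discount applied to custom-field-derived signals


def classify_field(field_name: str) -> Optional[str]:
    """Map a custom field name to a governance signal category, if any."""
    normalized = field_name.replace("_", " ")
    for signal, pattern in PATTERNS.items():
        if pattern.search(normalized):
            return signal
    return None


def resolve_signal(signal: str, native: dict[str, float], custom_fields: dict[str, float]) -> float:
    """Prefer the native signal; otherwise use a discounted custom-field proxy."""
    if signal in native:
        return native[signal]
    for field_name, value in custom_fields.items():
        if classify_field(field_name) == signal:
            return value * CONFIDENCE
    return 0.0


print(classify_field("Is_CDE_Flag"))
print(resolve_signal("cde", {"cde": 1.0}, {"Is_CDE_Flag": 0.0}))
print(resolve_signal("cde", {}, {"Is_CDE_Flag": 1.0}))
```

When a native CDE signal exists it is used as-is, even if a matching custom field disagrees. When only the custom field exists, its value is discounted by the confidence factor so that proxy signals weigh less than native ones.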

**Key Characteristics**

* Enables the use of governance information stored in custom fields
* Improves the accuracy of trust scoring across diverse implementations
* Reduces dependency on the strict standardization of metadata models

### Prompt Guidance for Next-Best Questions

askEdgi provides prompt guidance to assist users in continuing their analysis by suggesting relevant follow-up questions. This capability helps users refine or expand their queries without requiring manual prompt formulation.

After each user prompt is processed and a response is generated, the system evaluates:

* The user’s original question
* The intent identified by askEdgi
* The results returned (tables, insights, summaries)

Based on this, the system generates a set of relevant follow-up prompts that naturally extend the current analysis.

**How Prompt Guidance Works**

* Prompt suggestions are generated only after a response is produced
* Suggestions are derived from:
  * Current query intent
  * Result patterns (e.g., trends, anomalies, groupings)
  * Available metadata and context
* Each suggestion is:
  * Directly related to the current analysis
  * Focused on helping users go deeper or refine results
  * Written in simple, natural language

**Example flow:**

User asks:\
“Show delayed orders.”

askEdgi responds with results.

**Prompt Guidance suggests:**

* “Which vendors have the highest delays?”
* “Are delays increasing over time?”
* “Which regions are most affected?”

**Execution Boundaries**

Prompt Guidance does not participate in execution or data processing. Its role is limited to suggestion generation.

* It does not trigger:
  * Data retrieval
  * Query execution
  * Recipe execution
* It does not:
  * Modify the current results
  * Re-run analysis
  * Interfere with the main response

Suggestions are only recommendations, and execution happens only when the user selects a suggestion or enters a new prompt.

**User Interaction Model**

* Suggestions are displayed immediately after each response
* Users can:
  * Click on a suggested prompt to continue analysis
  * Ignore suggestions and enter their own query

When a suggestion is selected:

* It is treated as a new prompt
* askEdgi processes it through the standard flow (intent → execution → response)

***

Copyright © 2026, OvalEdge LLC, Peachtree Corners, GA USA

