# HDFS (Hadoop Distributed File System)

This article outlines the integration of the HDFS (Hadoop Distributed File System) connector, which enables streamlined metadata management through features such as crawling, data preview, and manual lineage building, and ensures secure authentication via Credential Manager.

<figure><img src="https://1813356899-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FhTnkoJQml0pok9awFDhx%2Fuploads%2FO3E19pJT4F8LH2JKpN3T%2Funknown.png?alt=media&#x26;token=2b083431-35bc-4732-b289-279bcaf5a606" alt=""><figcaption></figcaption></figure>

## Overview

### Connector Details

| Connector Category                                                       | File System             |
| ------------------------------------------------------------------------ | ----------------------- |
| OvalEdge Release Supported                                               | Release 4.0 to Release 7.2 |
| <p>Connectivity</p><p>\[How the connection is established with HDFS]</p> | Hadoop SDK              |
| Verified HDFS Version                                                    | v6.0+                   |

{% hint style="info" %}
The HDFS connector has been validated with the mentioned "Verified HDFS Versions" and is expected to be compatible with other supported HDFS versions. If there are any issues with validation or metadata crawling, please submit a support ticket for investigation and feedback.
{% endhint %}

### Connector Features

| Feature                                      | Availability |
| -------------------------------------------- | :----------: |
| Crawling                                     |       ✅      |
| Delta Crawling                               |       ❌      |
| Profiling                                    |       ❌      |
| Query Sheet                                  |       ❌      |
| Data Preview                                 |       ✅      |
| Auto Lineage                                 |       ❌      |
| Manual Lineage                               |       ✅      |
| Secure Authentication via Credential Manager |       ✅      |
| Data Quality                                 |       ❌      |
| DAM (Data Access Management)                 |       ❌      |
| Bridge                                       |       ✅      |

### Metadata Mapping

The following objects are crawled from HDFS and mapped to the corresponding UI assets.

<table><thead><tr><th width="145.25">HDFS Object</th><th width="148.75">HDFS Attribute</th><th width="176">OvalEdge Attribute</th><th width="174">OvalEdge Category</th><th width="148">OvalEdge Type</th></tr></thead><tbody><tr><td>File/Folder</td><td>Folder</td><td>Folder</td><td>Folder</td><td>Folder</td></tr><tr><td>File/Folder</td><td>File</td><td>File</td><td>File</td><td>File</td></tr><tr><td>File</td><td>XLSX with sheets</td><td>File (Subfile)</td><td>File (Subfile)</td><td>File</td></tr><tr><td>File</td><td>XLS with sheets</td><td>File (Subfile)</td><td>File (Subfile)</td><td>File</td></tr><tr><td>File</td><td>CSV</td><td>File</td><td>File</td><td>File</td></tr><tr><td>File</td><td>TXT</td><td>File</td><td>File</td><td>File</td></tr><tr><td>File</td><td>PARQUET</td><td>File</td><td>File</td><td>File</td></tr><tr><td>File</td><td>ORC</td><td>File</td><td>File</td><td>File</td></tr><tr><td>File</td><td>JSON</td><td>File</td><td>File</td><td>File</td></tr><tr><td>File</td><td>YAML</td><td>File</td><td>File</td><td>File</td></tr><tr><td>File</td><td>PIP</td><td>File</td><td>File</td><td>File</td></tr></tbody></table>

## Set up a Connection

### Prerequisites

The following are the prerequisites to establish a connection.

Ensure that the CSV files adhere to the required formatting standards for proper data processing and visibility. Refer to [CSV Format Requirements](https://docs.ovaledge.com/connectors/additional-requirements/csv-format-requirements-for-file-connectors).
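
The linked page defines the authoritative CSV requirements. As a quick, illustrative local sanity check before crawling (assuming comma-delimited files with a header row, which is not the full requirement set), consistent column counts can be verified like this:

```python
import csv
import io

def check_csv_consistency(text, delimiter=","):
    """Return True if every row has the same number of columns as the header row."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return False
    width = len(rows[0])
    # Skip fully empty rows, which csv.reader yields for blank lines.
    return all(len(row) == width for row in rows if row)

sample = "id,name,city\n1,Alice,Atlanta\n2,Bob,Austin\n"
print(check_csv_consistency(sample))  # → True
```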

### Service Account User Permissions

{% hint style="warning" %}
It is recommended to use a separate service account to establish the connection to the data source, configured with the following minimum set of permissions.
{% endhint %}

{% hint style="info" %}
👨‍💻**Who can provide these permissions?** These permissions are typically granted by the HDFS administrator, as users may not have the required access to assign them independently.
{% endhint %}

| Objects              | Operation            | Access Permissions         |
| -------------------- | -------------------- | -------------------------- |
| Connector Validation | Validate             | Read (r) on directory      |
| Crawling             | Crawling             | Read (r) on file           |
| Buckets              | Crawling & Profiling | Read (r) on file/directory |
| Folder               | Crawling & Profiling | Read (r) on file/directory |
| Files                | Crawling & Profiling | Read (r) on file/directory |
| Profile/Get Data     | View Data            | Read (r) on file/directory |
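
The read (r) permissions above follow HDFS's POSIX-style permission model. As an illustrative sketch (not part of the product), the read bit for a given category can be checked from a permission string such as the one printed by `hdfs dfs -ls`:

```python
def has_read_access(perm_string, category):
    """Check the read bit in an HDFS/POSIX permission string like 'drwxr-xr-x'.

    category: 'owner', 'group', or 'other'.
    """
    # Index 0 is the file-type character (d for directory, - for file);
    # each category then occupies three characters (rwx).
    offsets = {"owner": 1, "group": 4, "other": 7}
    return perm_string[offsets[category]] == "r"

print(has_read_access("drwxr-x---", "group"))  # → True
print(has_read_access("drwxr-x---", "other"))  # → False
```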

### Connection Configuration Steps

{% hint style="warning" %}
Users are required to have the Connector Creator role in order to configure a new connection.
{% endhint %}

1. Log into OvalEdge, go to Administration > Connectors, click + **(New Connector)**, search for **HDFS**, and complete the required parameters.

{% hint style="info" %}
Fields marked with an asterisk (\*) are mandatory for establishing a connection.
{% endhint %}

<table><thead><tr><th width="217.75">Field Name</th><th>Description</th></tr></thead><tbody><tr><td>Connector Type</td><td>By default, "HDFS" is displayed as the selected connector type.</td></tr><tr><td>Authentication</td><td><p>Select the authentication type from the drop-down list.</p><ul><li>Kerberos</li><li>Non-Kerberos</li></ul></td></tr></tbody></table>

{% tabs %}
{% tab title="Kerberos" %}

| Field Name                | Description                                                                                                                                                                                                                                                                                  |
| ------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Credential Manager\*      | <p>Select the desired credentials manager from the drop-down list. Relevant parameters will be displayed based on the selected option.</p><p>Supported Credential Managers:</p><ul><li>OE Credential Manager</li><li>AWS Secrets Manager</li><li>HashiCorp</li><li>Azure Key Vault</li></ul> |
| Connector Environment     | Select the environment (Example: PROD, STG) configured for the connector.                                                                                                                                                                                                                    |
| Connector Name\*          | Enter a unique name for the HDFS connection (Example: "HDFSdb").                                                           |
| Connector Description     | Enter a brief description of the connector.                                                                                |
| WebHdfs URL\*             | The endpoint URL of the Hadoop Distributed File System (HDFS) accessible via the WebHDFS REST API.                         |
| Keytab\*                  | A secure file that contains encrypted Kerberos principals and keys for authentication.                                     |
| Principal\*               | The Kerberos Principal name used for authentication to the Hadoop cluster.                                                 |
| Krb5-Configuration File\* | Kerberos configuration file (krb5.conf) that defines realms, KDCs (Key Distribution Centers), and other Kerberos settings. |

{% endtab %}

{% tab title="Non-Kerberos" %}

| Field Name            | Description                                                                                                                                                                                                                                                                                  |
| --------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Credential Manager\*  | <p>Select the desired credentials manager from the drop-down list. Relevant parameters will be displayed based on the selected option.</p><p>Supported Credential Managers:</p><ul><li>OE Credential Manager</li><li>AWS Secrets Manager</li><li>HashiCorp</li><li>Azure Key Vault</li></ul> |
| Connector Environment | Select the environment (Example: PROD, STG) configured for the connector.                                                                                                                                                                                                                    |
| Connector Name\*      | Enter a unique name for the HDFS connection (Example: "HDFSdb").                                   |
| Connector Description | Enter a brief description of the connector.                                                        |
| WebHdfs URL\*         | The endpoint URL of the Hadoop Distributed File System (HDFS) accessible via the WebHDFS REST API. |

{% endtab %}
{% endtabs %}
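
The WebHdfs URL parameter points at the NameNode's WebHDFS REST endpoint. As a hedged illustration, the sketch below builds the kind of REST URL such an endpoint serves; the host and port are hypothetical placeholders, while the `/webhdfs/v1` prefix and `op` query parameter follow the standard WebHDFS REST API:

```python
def webhdfs_url(base_url, hdfs_path, operation):
    """Build a WebHDFS REST URL. hdfs_path must start with a forward slash."""
    if not hdfs_path.startswith("/"):
        raise ValueError("HDFS paths must start with a forward slash")
    return f"{base_url.rstrip('/')}/webhdfs/v1{hdfs_path}?op={operation}"

# Hypothetical NameNode host and port; substitute your cluster's values.
print(webhdfs_url("http://namenode.example.com:9870", "/user/data", "LISTSTATUS"))
# → http://namenode.example.com:9870/webhdfs/v1/user/data?op=LISTSTATUS
```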

**Default Governance Roles**

<table data-header-hidden><thead><tr><th width="217.75"></th><th></th></tr></thead><tbody><tr><td>Default Governance Roles*</td><td>Select the appropriate users or teams for each governance role from the drop-down list. All users configured in the security settings are available for selection.</td></tr></tbody></table>

**Admin Roles**

<table data-header-hidden><thead><tr><th width="217.75"></th><th></th></tr></thead><tbody><tr><td>Admin Roles*</td><td><p>Select one or more users from the dropdown list for Integration Admin and Security &#x26; Governance Admin. All users configured</p><p>in the security settings are available for selection.</p></td></tr></tbody></table>

**No of Archive Objects**

<table data-header-hidden><thead><tr><th width="217.75"></th><th></th></tr></thead><tbody><tr><td>No Of Archive Objects*</td><td><p>This shows the number of recent metadata changes to a dataset at the source. By default, it is off. To enable it, toggle the Archive button and specify the number of objects to archive.</p><p>Example: Setting it to 4 retrieves the last four changes, displayed in the 'Version' column of the 'Metadata Changes' module.</p></td></tr></tbody></table>

**Bridge**

<table data-header-hidden><thead><tr><th width="217.75"></th><th></th></tr></thead><tbody><tr><td>Select Bridge*</td><td><p>If applicable, select the bridge from the drop-down list.</p><p>The drop-down list displays all active bridges that have been configured. These bridges facilitate communication between data sources and the system without requiring changes to firewall rules.</p></td></tr></tbody></table>

2. After entering all connection details, the following actions can be performed:
   * Click **Validate** to verify the connection.
   * Click **Save** to store the connection for future use.
   * Click **Save & Configure** to apply additional settings before saving.
3. The saved connection will appear on the Connectors home page.

## Manage Connector Operations

### Crawl

{% hint style="info" %}
To perform crawl operations, users must be assigned the Integration Admin role.
{% endhint %}

1. Navigate to the **Connectors** page and click **Crawl/Profile**.
2. This action initiates the metadata collection process from the data source and loads the retrieved metadata into the **File Manager**.
3. In the **File Manager**, select the specific folder(s) or file(s), then click **Catalog Files/Folders** from the Nine Dots menu.
4. The selected files or folders will be added to the **Data Catalog > Databases/Files/File Columns** tab.
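
Under the hood, WebHDFS reports directory contents as a JSON `LISTSTATUS` payload. The following sketch (with an abbreviated, illustrative sample payload) shows how folder and file entries can be separated from such a response:

```python
import json

# Illustrative, abbreviated sample of a WebHDFS LISTSTATUS response body.
sample_response = json.dumps({
    "FileStatuses": {"FileStatus": [
        {"pathSuffix": "reports", "type": "DIRECTORY", "length": 0},
        {"pathSuffix": "sales.csv", "type": "FILE", "length": 2048},
    ]}
})

def extract_entries(body):
    """Split a LISTSTATUS payload into (folders, files) name lists."""
    statuses = json.loads(body)["FileStatuses"]["FileStatus"]
    folders = [s["pathSuffix"] for s in statuses if s["type"] == "DIRECTORY"]
    files = [s["pathSuffix"] for s in statuses if s["type"] == "FILE"]
    return folders, files

print(extract_entries(sample_response))  # → (['reports'], ['sales.csv'])
```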

### Other Operations

The Connectors page provides a centralized view of all configured connectors, along with their health status.

**Managing connectors includes**:

* **Connectors Health**: Displays the current status of each connector, with a green icon for active connections and a red icon for inactive connections, helping monitor connectivity to data sources.
* **Viewing**: Click the Eye icon next to the connector name to view connector details, including databases, tables, columns, and codes.

**Nine Dots Menu Options:**

To view, edit, validate, build lineage, configure, or delete connectors, click on the Nine Dots menu.

* **Edit Connector**: Update and revalidate the data source.
* **Validate Connector**: Check the connection's integrity.
* **Settings**: Modify connector settings.
  * **Crawler**: Configure data extraction.
  * **Access Instructions**: Add notes on how data can be accessed.
  * **Business Glossary Settings**: Manage term associations at the connector level.
* **Delete Connector**: Remove a connector with confirmation.

## Connectivity Troubleshooting

If incorrect parameters are entered, error messages may appear. Ensure all inputs are accurate to resolve these issues. If issues persist, contact the assigned support team.

<table><thead><tr><th width="85.25">S.No.</th><th width="284.75">Error Message(s)</th><th>Error Description &#x26; Resolution</th></tr></thead><tbody><tr><td>1</td><td>Error while validating HDFS connection Error occured while validating the HDFS connection : Can't get Kerberos realm</td><td>Error Description: The HDFS connection validation failed because the Kerberos realm could not be determined or accessed. This occurs due to an incorrect Kerberos configuration, missing krb5.conf settings, or DNS issues.<br><br>Resolution: Verify that the Kerberos configuration file (krb5.conf) contains the correct realm and KDC details and is accessible to the application. Ensure proper DNS resolution, system time synchronization, and valid Kerberos credentials before retrying the connection.</td></tr><tr><td>2</td><td>Access Denied while attempting to list files in the specified HDFS folder.</td><td>Error Description: The file listing operation fails due to insufficient permissions, an invalid folder path, or connection issues. This occurs when the user lacks the required access rights, the specified path is invalid, or the connection to HDFS is not properly established.<br><br>Resolution: Ensure that the required permissions are available to access the folder. Verify that the folder path is correct and starts with a forward slash (e.g., /user/data). Confirm that the HDFS connection is active by testing the connection. If Kerberos authentication is used, validate that the principal has read permissions on the specified path. Attempt to list the root directory (/) to confirm basic access.</td></tr><tr><td>3</td><td>Error while uploading file to HDFS: Destination path does not exist.</td><td>Error Description: The file upload operation fails because the specified destination path does not exist in HDFS or is incorrectly formatted.
This can occur due to an invalid folder path, a missing directory, insufficient permissions, or typographical errors in the path.<br><br>Resolution: Verify that the destination folder path is correct and exists in HDFS. Ensure that the path starts with a forward slash (for example, /user/data/uploads). Confirm that write permissions are available for the specified location. If the folder does not exist, create it before uploading the file, or contact the administrator if folder creation is restricted. Check for any typographical errors in the path and correct them before retrying the upload.</td></tr><tr><td>4<br></td><td>Error while validating HDFS connection: Kerberos authentication failed due to keytab file issues.</td><td>Error Description: HDFS connection validation fails when Kerberos authentication is selected because the keytab file cannot be accessed or validated. This issue occurs due to an invalid or missing keytab file, incorrect file path, insufficient file permissions, or an expired keytab.<br><br>Resolution: Verify that the keytab file path is correct and that the file exists on the system. Ensure that the file is accessible to the application with appropriate read permissions. Confirm that the keytab file is valid and has not expired or been regenerated. Obtain a valid keytab file from the administrator if required. Ensure that the file path uses forward slashes (/) even on Windows systems.</td></tr><tr><td><br>5</td><td>Kerberos principal authentication failed.</td><td>Error Description: Authentication fails despite providing a Kerberos principal due to an incorrect principal format, a mismatch between the principal and keytab file, or expired or invalid credentials.<br><br>Resolution: Verify that the principal format follows the standard pattern (username@REALM.COM). Ensure that the principal matches the entry in the keytab file. Confirm that the krb5.conf file path is correct and accessible. Validate that the keytab file is valid and not expired. 
Check with the administrator to ensure the principal is active and has the required HDFS access permissions.</td></tr><tr><td><br>6</td><td>Timeout error while connecting to HDFS.</td><td><p>Error Description: The connection attempt to HDFS times out. This issue occurs because of network latency, cluster overload, or an incorrect HDFS URL.</p><p>Resolution: Check network connectivity to the HDFS cluster, ensure the HDFS URL is correct, and verify the cluster is operational. Verify that firewall or VPN connections are properly configured, if applicable. Allow time and retry if the cluster is experiencing high load. If the issue persists, review and increase timeout settings as needed or contact the system administrator.</p></td></tr><tr><td><br>7</td><td>Errors encountered due to special characters in folder names.</td><td><p>Error Description: The issue occurs when folder names contain special characters, particularly the equals sign (=), which HDFS restricts or filters for security reasons. As a result, such folders are skipped during processing or access.</p><p>Resolution: Ensure that folder and file names follow standard naming conventions by using only letters, numbers, dashes, and underscores. Avoid special characters, especially the equals sign (=). If access to these folders is required, contact the system administrator or rename the folders to remove unsupported characters.</p></td></tr><tr><td><br>8</td><td>Connection additional attributes cannot be null.</td><td><p>Error Description: The HDFS connection validation fails because required connection configuration fields are missing. This occurs when mandatory attributes such as the HDFS URL, authentication type, or Kerberos-related details are not provided.</p><p>Resolution: Ensure that all required fields are completed during connection setup. Verify that the HDFS URL is specified and a valid authentication type (Kerberos or Non-Kerberos) is selected. 
If Kerberos authentication is used, confirm that all required Kerberos fields are properly configured. Recreate the connection after providing all mandatory details.</p></td></tr><tr><td><br>9</td><td>Error related to file extensions encountered while processing files.</td><td><p>Error Description: File processing fails due to an unsupported file extension or a mismatch in the specified extension filter. This can occur when the file extension is invalid, does not match the actual file type, or when filtering criteria exclude valid files.</p><p>Resolution: Ensure the file has a valid extension (e.g., .csv, .txt, or .json) and that it matches the actual file type. Verify that the specified extension filter uses the correct format. Confirm that the file type is supported by the system. If necessary, remove the extension filter to validate file visibility and processing.</p></td></tr><tr><td><br>10</td><td>An authentication attribute is required.</td><td><p>Error Description: The connection validation fails because the authentication type is not specified. This issue occurs when no authentication method is selected in the connection configuration.</p><p>Resolution: Ensure that an authentication type is selected during connection setup. Choose either Kerberos Authentication or Non-Kerberos Authentication, as this field is mandatory. If modifying an existing connection, verify that the authentication type remains selected. Save the connection settings again after confirming the configuration.</p></td></tr><tr><td><br>11</td><td>Connection errors encountered while attempting to connect to the HDFS cluster; unable to determine whether the issue is related to configuration settings or the HDFS environment.</td><td><p>Error Description: The connection to the HDFS cluster fails, and the root cause is unclear. 
This issue may arise from incorrect configuration settings, network connectivity issues, or problems within the HDFS cluster itself.</p><p>Resolution: Verify with the administrator that the HDFS URL is correct. Test network connectivity to the HDFS cluster. Attempt a connection using a simple tool or command to isolate the issue. Check whether other users can connect to determine whether the issue is specific to the configuration. If the issue persists, contact the administrator and provide the exact error message for further investigation.</p></td></tr></tbody></table>
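
Several of the errors above (notably error 5) stem from a malformed Kerberos principal. A minimal, heuristic pre-check of the username@REALM.COM pattern can be sketched as follows; the regex is a simplification for illustration, not the full Kerberos naming grammar:

```python
import re

# Simplified pattern: user (optionally with a /host instance) @ upper-case realm.
PRINCIPAL_RE = re.compile(r"^[A-Za-z0-9._-]+(/[A-Za-z0-9._-]+)?@[A-Z0-9.-]+$")

def looks_like_principal(principal):
    """Heuristic check that a string resembles the username@REALM.COM pattern."""
    return bool(PRINCIPAL_RE.match(principal))

print(looks_like_principal("svc_ovaledge@EXAMPLE.COM"))  # → True
print(looks_like_principal("svc_ovaledge@"))             # → False
```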

## FAQs

<details>

<summary>Unsure which authentication type to select (Kerberos or Non-Kerberos)</summary>

Confirm the authentication method configured on the HDFS cluster with the HDFS administrator. Select Kerberos Authentication when Kerberos security is enabled in the environment, and choose Non-Kerberos Authentication for unsecured or test clusters. If the authentication type is unknown, begin with Non-Kerberos Authentication and switch to Kerberos if authentication errors occur.

</details>

<details>

<summary>Unable to view files or folders when browsing an HDFS directory</summary>

Ensure the required permissions are available to access the directory, verify that the folder path is correctly specified and begins with a forward slash (for example, /user/data), confirm that the HDFS connection is successful, validate that the Kerberos principal has read access to the specified path if authentication is enabled, and attempt to list the root directory (/) to confirm basic visibility.

</details>

<details>

<summary>Unable to locate a file in HDFS</summary>

Ensure the search is performed in the correct directory path; verify the file name for special characters or spaces that may affect the search; confirm that sufficient permissions are available to access the directory; browse the directory manually to view all files; and refresh the listing if the file was created recently.

</details>

<details>

<summary>File listing is slow when opening a folder with many files</summary>

This behavior is expected for directories containing a large number of files. To improve performance, use the search or filter options instead of loading the entire folder, navigate to a more specific subfolder where possible, and consult the administrator if the directory contains an excessive number of files that may require reorganization.

</details>

<details>

<summary>Unable to download a file from HDFS</summary>

Verify that the required read permissions are available for the file; ensure the specified file path is correct and complete; confirm that the HDFS connection is active and reconnect if necessary; allow sufficient time for large file downloads to complete; and validate that Kerberos authentication has not expired if security is enabled.

</details>

***

Copyright © 2026, OvalEdge LLC, Peachtree Corners, GA, USA.
