File Explorer

This document provides a step-by-step guide to the File Explorer in File Manager, which allows Authors to manage, organize, and analyze files and folders across various data lake systems, such as HDFS, S3, and NFS.

Getting Started

  • Connect to Data Lakes: Use connectors or upload tools to add data lakes like S3, HDFS, Google Drive, NFS, and more.

  • Catalog Data Lakes: Cataloging makes files and folders visible in the Data Catalog; levels beyond the first must be cataloged manually.

  • Explore Files and Folders: The File Explorer displays detailed information about files and folders within a data lake connection.

Key Features

  • Manage Files and Folders: Upload, download, delete, and organize files and folders within NFS data lake connections.

  • Cataloging: Categorize files and folders for better visibility and data profiling.

  • Supported File Formats: Upload and manage various file formats, such as CSV, JSON, Parquet, and more. Configure allowed upload formats as needed.

  • Data Lake OvalSight: Provides a high-level overview of a data lake's structure, size, and file distribution.

  • Folder OvalSight: Analyze folder contents, including file types, sizes, and overall structure. Accessible from the File Manager or Data Catalog.

  • List View: Navigate data folders and subfolders visually and access detailed Folder Analysis information.

  • Data Lake Search: Search for files and folders across an entire data lake connection using keywords.

  • System Settings: Administrators and Authors can set the maximum upload file size, define allowed file types for cataloging, and control the number of file entries shown per page.

Catalog Data Lakes

Cataloging data lakes before using the File Explorer is essential. It allows Authors to view all files and folders within the lake. There are various methods for adding data:

  • Connectors: OvalEdge integrates with data lake systems like Hadoop, Amazon S3, Google Drive, and more. These connectors are available on the "Connectors" page.

  • Upload Tools: Authors can also upload files and folders directly to NFS connections using the "Upload File" or "Upload Folder" tools.

Crawl with Connectors

  1. Navigate: Authors navigate to Administration > Crawler.

  2. Add Connection: Select the file system (NFS/S3/HDFS/Azure/Drive) and enter the database name.

  3. Provide Credentials: Enter and validate connection details in the "Manage Connection" window. Save.

  4. Crawl Data: Click "Crawl/Profile" to initiate the process. Upon successful completion, folders and files will appear in the File Explorer.

  • File Explorer shows all connected files and folders.

  • The Data Catalog displays only first-level folders and files.

  • Authors must manually catalog additional levels from File Explorer.

Example:

  • In S3, "Hospital" (Level 1) is automatically cataloged and visible in the Data Catalog.

  • Lower levels are not cataloged automatically; to view "Departments" (Level 2) or "General Medicine" (Level 3), catalog them manually from the File Explorer.

Upload Files via NFS

Authors can upload files and folders directly to the NFS data lake connection.

  1. Access Upload: In File Explorer, select the NFS data lake and click the 9-Dots icon to access the "Upload" option.

  2. Choose File/Folder: Select "File" or "Folder" on the upload page.

  3. Browse and Upload: Browse the computer directory to select the file or folder, then initiate the upload.

  4. Create Directory (Optional): Use the 9-Dots icon to create a new directory if needed.

  5. Verify and Finish: A successful upload will highlight the file in green. Click "Finish" to complete.

Supported File Formats for Upload

The File Explorer supports specific file types. Authors can configure these types through the "config.file.types.to.be.cataloged" setting in System Settings (OTHERS tab); only formats listed in this setting are valid for upload.
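For illustration, the setting might hold a comma-separated list of extensions matching the formats this guide covers. The value syntax shown here is an assumption; confirm the exact format expected on your System Settings page.

```
config.file.types.to.be.cataloged = csv,json,parquet,orc,xlsx,xls,avro,gz
```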

Supported File Formats

Once uploaded and cataloged in the Data Catalog, the following file formats can be profiled:

  • CSV (.csv): Comma Separated Values stores tabular data, with each line representing a record and commas separating fields.

  • JSON (.json): JavaScript Object Notation stores simple data structures for easy data interchange between applications and servers.

  • Parquet (.parquet): Apache Parquet is a columnar file format that is efficient for storing and processing large datasets.

  • ORC (.orc): Optimized Row Columnar files are used in the Hadoop ecosystem for structured data storage.

  • XLSX (.xlsx): Microsoft Excel Open XML Spreadsheet format.

  • XLS (.xls): Microsoft Office Excel spreadsheet format containing rows and columns of data.

  • Avro (.avro): Apache Avro is a data serialization framework for efficient data exchange with features like schema evolution.

  • Gzip (.gz): Compressed files using the gzip algorithm for reduced size and faster transmission.
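As a quick sanity check of some of the formats above, the following Python sketch (standard library only) parses a CSV record, a JSON payload, and a gzip round-trip; the sample values are illustrative, not taken from any OvalEdge system.

```python
import csv
import gzip
import io
import json

# CSV: each line is a record, with commas separating fields.
csv_text = "id,name\n1,Hospital\n2,Departments\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON: simple data structures for interchange between applications.
payload = json.loads('{"folder": "General Medicine", "level": 3}')

# Gzip: compressed bytes for reduced size and faster transmission.
compressed = gzip.compress(csv_text.encode("utf-8"))
restored = gzip.decompress(compressed).decode("utf-8")

print(rows[0]["name"])       # → Hospital
print(payload["level"])      # → 3
print(restored == csv_text)  # → True (round trip preserves content)
```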


Copyright © 2025, OvalEdge LLC, Peachtree Corners, GA USA
