Haohui Mai created HDFS-11588:
---------------------------------

             Summary: Output Avro format in the offline editlog viewer
                 Key: HDFS-11588
                 URL: https://issues.apache.org/jira/browse/HDFS-11588
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: Haohui Mai
            Assignee: Haohui Mai


We have found it handy to import the edit logs into query engines (e.g., 
Hive / Presto) to understand how the cluster is used. Example questions include:

* The size of the data and the number of files written into a directory.
* The distribution of operations across directories.
* The number of files created by a user.

The answers to these questions give insight into how the clusters are used 
and are valuable for capacity planning.

Importing the edit log into a query engine makes these questions simple to 
express and efficient to answer.

While the Offline Editlog Viewer (OEV) can output editlogs in XML, we found 
that converting that XML into formats that query engines recognize is 
time-consuming: both generating the XML and transforming it take a 
significant amount of time. In our environment it takes minutes to turn a 
100MB editlog file into the corresponding Parquet file.
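
For illustration only, the XML-based path described above can be sketched in 
Python as follows. The sample document and its element names (EDITS, RECORD, 
OPCODE, DATA, TXID, PATH) are assumptions about the OEV XML layout, and the 
helper names flatten and ops_per_directory are hypothetical. This is not the 
pipeline used in our environment, only a minimal picture of the parse and 
flatten step that a query engine import currently needs.

```python
import posixpath
import xml.etree.ElementTree as ET
from collections import Counter

# Assumed shape of OEV XML output; real files carry many more fields per op.
SAMPLE_XML = """<EDITS>
  <EDITS_VERSION>-63</EDITS_VERSION>
  <RECORD>
    <OPCODE>OP_START_LOG_SEGMENT</OPCODE>
    <DATA><TXID>1</TXID></DATA>
  </RECORD>
  <RECORD>
    <OPCODE>OP_ADD</OPCODE>
    <DATA><TXID>2</TXID><PATH>/data/a/f1</PATH></DATA>
  </RECORD>
  <RECORD>
    <OPCODE>OP_ADD</OPCODE>
    <DATA><TXID>3</TXID><PATH>/data/a/f2</PATH></DATA>
  </RECORD>
  <RECORD>
    <OPCODE>OP_DELETE</OPCODE>
    <DATA><TXID>4</TXID><PATH>/data/b/f3</PATH></DATA>
  </RECORD>
</EDITS>
"""


def flatten(xml_text):
    """Yield one flat dict per RECORD element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    for record in root.iter("RECORD"):
        data = record.find("DATA")
        yield {
            "txid": int(data.findtext("TXID")),
            "opcode": record.findtext("OPCODE"),
            # PATH is absent for ops that do not touch a file.
            "path": data.findtext("PATH"),
        }


def ops_per_directory(records):
    """Count operations keyed by (parent directory, opcode)."""
    counts = Counter()
    for rec in records:
        if rec["path"] is None:
            continue  # skip ops with no path, e.g. log segment markers
        counts[(posixpath.dirname(rec["path"]), rec["opcode"])] += 1
    return counts


if __name__ == "__main__":
    for key, n in sorted(ops_per_directory(flatten(SAMPLE_XML)).items()):
        print(key, n)
```

Every record is serialized to XML text by the OEV and then parsed back into 
fields before it can be written in a columnar format, which is the round trip 
this issue aims to avoid.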

This jira proposes extending the OEV to output Avro files to make this 
process efficient. As the output of an internal tool, the Avro format has 
certain pre-defined schemas but, like the XML output format, is not 
constrained to maintain backward compatibility.
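
To make the idea of a pre-defined schema concrete, here is a purely 
hypothetical Avro record schema, written as a Python dict and serialized with 
the standard json module. The record name, namespace, and fields (txid, 
opcode, path, user, length) are assumptions chosen to match the example 
questions above. They are not the schema this issue will ship.

```python
import json

# Hypothetical schema: one Avro record per edit log operation.
# Nullable fields use the Avro union ["null", T] with a null default.
EDIT_OP_SCHEMA = {
    "type": "record",
    "name": "EditLogOp",
    "namespace": "example.hdfs.oev",
    "fields": [
        {"name": "txid", "type": "long"},
        {"name": "opcode", "type": "string"},
        {"name": "path", "type": ["null", "string"], "default": None},
        {"name": "user", "type": ["null", "string"], "default": None},
        {"name": "length", "type": ["null", "long"], "default": None},
    ],
}


def schema_json():
    """Return the schema as the JSON text an Avro file header would embed."""
    return json.dumps(EDIT_OP_SCHEMA, indent=2)


if __name__ == "__main__":
    print(schema_json())
```

A flat schema of this kind could be declared directly as a table in Hive or 
Presto, which is what would make the intermediate XML step unnecessary.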







--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

