[
https://issues.apache.org/jira/browse/HUDI-431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17340578#comment-17340578
]
Vinoth Chandar commented on HUDI-431:
-------------------------------------
[~szhou] Not sure how read-ahead works with files backed on cloud storage (as
opposed on the local linux fs for e.g).
Hudi is designed around a log format that can encode changes to a base file as
"blocks". Blocks have headers, footers and a byte[] payload.
Blocks can be of different types:
* Command blocks: record rollbacks or other metadata-related commands
* Delete blocks: encode keys that are deleted from the base file
* Data blocks: encode new inserts/updates on top of the base file
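The three block types above could be sketched as a simple enum. This is only illustrative; Hudi's actual code models block types differently (see HoodieLogBlock in hudi-common):

```java
// Illustrative sketch of the block-type distinction described above;
// names are hypothetical, not Hudi's real identifiers.
public enum LogBlockType {
    COMMAND, // rollbacks and other metadata-related commands
    DELETE,  // keys deleted from the base file
    DATA     // new inserts/updates on top of the base file
}
```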
Currently data block payloads contain "avro" records written out as a single
byte[]. This is great for streaming ingest scenarios, since it's row-oriented
and fast. But for regular batch jobs, it's preferable to have data blocks in
"parquet", to do columnar reads during merging/compaction.
With this Jira, the idea is:
During writing: we will take a List<GenericRecord>, convert it to a byte[]
that is the same as what a parquet file with those records would contain on
storage, and store it as the payload for the data block.
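A minimal, self-contained sketch of that write path, assuming a hypothetical [magic][length][payload][magic] block layout (this is not Hudi's actual HoodieLogBlock wire format), with placeholder bytes standing in for the parquet-encoded byte[] since parquet-avro is not assumed on the classpath:

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical layout sketch: each data block in the outer "log file" is
// [magic][4-byte payload length][payload][magic]. The payload would be the
// parquet-encoded byte[] in the real design.
public class InlineBlockWriter {
    static final byte[] MAGIC = "#HUDI#".getBytes(StandardCharsets.UTF_8);

    // Appends one data block and returns the offset at which the payload
    // begins, so a reader can later address just that region.
    public static long appendDataBlock(ByteArrayOutputStream log, byte[] payload) {
        log.write(MAGIC, 0, MAGIC.length);                              // header marker
        byte[] len = ByteBuffer.allocate(4).putInt(payload.length).array();
        log.write(len, 0, len.length);                                  // payload length
        long payloadStart = log.size();
        log.write(payload, 0, payload.length);                          // inlined payload
        log.write(MAGIC, 0, MAGIC.length);                              // footer marker
        return payloadStart;
    }
}
```

The returned offset is the key piece: it is what lets the read path treat the payload region as a file of its own.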
During reading: we want to read just the payload part of the data block as a
parquet file, i.e. there is no need to read the entire content into memory, and
we can simply read the columns we are interested in. This is what the
InlineFileSystem code already helps with.
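The core idea behind InlineFileSystem can be sketched in a few lines: given the outer file's bytes plus the payload's offset and length, expose only that slice as a stream, as if it were a standalone file. (This is a stdlib stand-in; the real InlineFileSystem implements Hadoop's FileSystem API so a real parquet reader can be pointed at the slice.)

```java
import java.io.ByteArrayInputStream;

// Hypothetical sketch of the inline-read idea: expose a bounded view over
// the payload region of the outer file, without copying or reading the
// surrounding block headers/footers.
public class InlineSliceReader {
    public static ByteArrayInputStream openInline(byte[] outerFile, long offset, int length) {
        // ByteArrayInputStream natively supports bounded views over a region.
        return new ByteArrayInputStream(outerFile, (int) offset, length);
    }
}
```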
TestInLineFileSystemHFileInLining has an example of the same thing for the
HFile format.
Hope that helps.
> Design and develop parquet logging in Log file
> ----------------------------------------------
>
> Key: HUDI-431
> URL: https://issues.apache.org/jira/browse/HUDI-431
> Project: Apache Hudi
> Issue Type: New Feature
> Components: Storage Management
> Reporter: sivabalan narayanan
> Assignee: Vinoth Chandar
> Priority: Major
> Labels: help-requested
>
> We have a basic implementation of inline filesystem, to read a file format
> like Parquet, embedded "inline" into another file.
> [https://github.com/apache/hudi/blob/master/hudi-common/src/test/java/org/apache/hudi/common/fs/inline/TestInLineFileSystem.java]
> for sample usage.
> The idea here is to see if we can embed parquet/hfile formats into the Hudi
> log files, to get columnar reads on the delta log files as well. This helps
> us speed up query performance, given the log is row-based today. Once Inline
> FS is available, enable parquet logging support with HoodieLogFile. LogFile
> can expose a writer (essentially ParquetWriter) and users can write records
> as though writing to parquet files. Similarly on the read path, a reader
> (parquetReader) will be exposed which the user can use to read data out of
> it.
> This Jira tracks work to implement such parquet inlining into the log format
> and have the writer and reader use it.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)