Hello,
After archiving parquets into a HAR (Hadoop Archive) file, the archive has the
following layout:

foo.har/_masterindex   // stores hashes and offsets
foo.har/_index         // stores file statuses
foo.har/part-[0..n]    // stores the actual parquet files, appended one after another
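
For context, the archives were built with the standard hadoop archive tool, roughly like
this (the source directory /user/cyber/dataset/parquets is just illustrative):

hadoop archive -archiveName foo2.har -p /user/cyber/dataset/parquets tintin_milou.parquet tintin_milou2.parquet /user/cyber/dataset/HARFolder
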
So we can access a parquet file inside the HAR in two ways:
spark.read.parquet("hdfs:///user/cyber/dataset/HARFolder/foo.har/")
or:
spark.read.parquet("hdfs:///user/cyber/dataset/HARFolder/foo.har/part-0")
If the HAR file contains only one parquet, we can read it entirely. But if the HAR
file contains more than one parquet, we can read only the first parquet; the other
parquets inside the HAR are ignored. This is because the archiving process treats
the parquet files as plain binary files and simply appends them one after another
into the part-0 file. As a result, only the first parquet file's metadata ends up
where Spark expects a parquet file's metadata to be in part-0; the metadata of the
other parquet files lies somewhere in the middle of part-0. So when we use
spark.read.parquet("hdfs:///foo.har/part-0"), Spark picks up only the first
parquet's metadata and skips the rest.
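
A quick way to sanity-check the "appended into part-0" theory from a spark-shell is to
compare sizes through the Hadoop FileSystem API. This is only a sketch (not part of the
original observation), using the paths from this thread:

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = spark.sparkContext.hadoopConfiguration
// lengths of the archived parquets, as reported through the har:// filesystem
val har = FileSystem.get(new URI("har:///user/cyber/dataset/HARFolder/foo2.har"), conf)
val archivedTotal = har.listStatus(new Path("har:///user/cyber/dataset/HARFolder/foo2.har")).map(_.getLen).sum
// length of the raw part-0 file on HDFS
val part0Len = FileSystem.get(conf).getFileStatus(new Path("/user/cyber/dataset/HARFolder/foo2.har/part-0")).getLen
// if part-0 really is the parquets laid end to end, these numbers should be (almost) equal
println(s"archived parquet bytes = $archivedTotal, part-0 bytes = $part0Len")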

For example, if foo.har contains only tintin_milou.parquet, we can read it
successfully. But if foo2.har contains 2 parquets (tintin_milou.parquet and
tintin_milou2.parquet), we can read only tintin_milou.parquet and fail to read
tintin_milou2.parquet. Furthermore, if foo3.har contains 2 parquets with different
schemas (like tintin_milou.parquet and cdr.parquet), we cannot read either of them.
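
Concretely, the behaviour looks like this (just a sketch of what comes back; the count()
calls are only there to show which rows are returned):

spark.read.parquet("hdfs:///user/cyber/dataset/HARFolder/foo.har").count()   // all rows of tintin_milou.parquet
spark.read.parquet("hdfs:///user/cyber/dataset/HARFolder/foo2.har").count()  // only the rows of tintin_milou.parquet; tintin_milou2.parquet is ignored
spark.read.parquet("hdfs:///user/cyber/dataset/HARFolder/foo3.har").count()  // does not work, since tintin_milou.parquet and cdr.parquet have different schemas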

We CANNOT access the original parquets in either of these two ways:
spark.read.parquet("hdfs:///user/cyber/dataset/HARFolder/foo2.har/tintin_milou.parquet")
spark.read.parquet("har:///user/cyber/dataset/HARFolder/foo2.har/tintin_milou.parquet")
And yet we can list the original parquets with hadoop:
hadoop dfs -ls har:///user/cyber/dataset/HARFolder/foo2.har
Output (assuming tintin_milou.parquet and tintin_milou2.parquet are archived into foo2.har):
har:///user/cyber/dataset/HARFolder/foo2.har/tintin_milou.parquet
har:///user/cyber/dataset/HARFolder/foo2.har/tintin_milou2.parquet

So, does anyone know how to read multiple parquet files inside a HAR with Spark?
Thanks
