Spark SQL 1.3.1 "saveAsParquetFile" will output tachyon file with different block size

zhangxiongfei Mon, 13 Apr 2015 04:14:20 -0700

Hi experts,
I ran the code below in the Spark shell to access Parquet files in Tachyon.
1. First, I created a DataFrame by loading a set of Parquet files from Tachyon:
   val ta3 = sqlContext.parquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m")
2. Second, I set "fs.local.block.size" to 256 MB so that the output files in
   Tachyon would have a 256 MB block size:
   sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456)
3. Third, I saved the DataFrame as Parquet files stored in Tachyon:
   ta3.saveAsParquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test")
The code ran successfully and the output Parquet files were stored in Tachyon,
but the files do not all have the same block size. Below is the listing for
the path
"tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test":
File Name             Size       Block Size  In-Memory  Pin  Creation Time
_SUCCESS              0.00 B     256.00 MB   100%       NO   04-13-2015 17:48:23:519
_common_metadata      1088.00 B  256.00 MB   100%       NO   04-13-2015 17:48:23:741
_metadata             22.71 KB   256.00 MB   100%       NO   04-13-2015 17:48:23:646
part-r-00001.parquet  177.19 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:626
part-r-00002.parquet  177.21 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:636
part-r-00003.parquet  177.02 MB  32.00 MB    100%       NO   04-13-2015 17:46:45:439
part-r-00004.parquet  177.21 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:845
part-r-00005.parquet  177.40 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:638
part-r-00006.parquet  177.33 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:648
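
To make the difference in the listing above concrete, here is a small
stand-alone sketch of the block arithmetic. It assumes that the "MB" shown by
Tachyon means 2^20 bytes; the object name BlockMath and the helper blocksFor
are made up for illustration and are not part of Spark or Tachyon.

```scala
object BlockMath {
  // Assumption: 1 MB in the listing = 1024 * 1024 bytes.
  val MiB: Long = 1024L * 1024L

  // Number of blocks needed to hold a file of `fileBytes` bytes
  // at a given block size (ceiling division).
  def blocksFor(fileBytes: Long, blockBytes: Long): Long =
    (fileBytes + blockBytes - 1) / blockBytes

  def main(args: Array[String]): Unit = {
    val configured = 268435456L               // value passed to fs.local.block.size
    println(configured / MiB)                 // 256

    val partBytes = (177.19 * MiB).toLong     // size of part-r-00001.parquet
    println(blocksFor(partBytes, 32 * MiB))   // 6 blocks at the observed 32 MB
    println(blocksFor(partBytes, configured)) // 1 block at the intended 256 MB
  }
}
```

So each part file ends up split into 6 blocks instead of fitting in a single
256 MB block, while the metadata files written at the end of the job show the
intended block size.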


It seems that saveAsParquetFile does not distribute/broadcast the Hadoop
configuration to the executors the way other APIs such as saveAsTextFile do.
The "fs.local.block.size" setting only takes effect on the driver.
If I set that configuration before loading the Parquet files, the problem goes
away.
Could anyone help me verify this problem?
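
A minimal sketch of the ordering that avoids the problem, assuming the same
Spark 1.3.1 shell session (sc and sqlContext are the ones provided by the
shell) and the same paths as in the steps above. The only change is that the
block size is set before the Parquet files are loaded:

```scala
// Run inside spark-shell (Spark 1.3.x); `sc` and `sqlContext` come from the shell.

// 1. Set the block size first, before any Parquet file is loaded.
sc.hadoopConfiguration.setLong("fs.local.block.size", 256L * 1024 * 1024)

// 2. Load the source Parquet files from Tachyon.
val ta3 = sqlContext.parquetFile(
  "tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m")

// 3. Save the DataFrame back to Tachyon as Parquet.
ta3.saveAsParquetFile(
  "tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test")
```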

Thanks
Zhang Xiongfei
