Hi,

When creating a DataFrame from a partitioned file structure ( sqlContext.read.parquet("s3://bucket/path/to/partitioned/parquet/files") ), it takes a long time to list the files recursively from S3 when a large number of files is involved. To work around this, I wanted to override the FileStatusCache class in HadoopFsRelation to create a new Relation that can fetch the FileStatus list from a cached source (e.g., MySQL). Currently this is not possible, so my questions are:
1. Is there any other way to do what I want to do?
2. If not, could this extensibility be added by making FileStatusCache and the related variables protected instead of private?
3. If so, can I help?

Regards,
Ditesh Kumar