Hi,

When creating a DataFrame from a partitioned file structure ( sqlContext.read.parquet("s3://bucket/path/to/partitioned/parquet/files") ), it takes a long time to list the files recursively from S3 when a large number of files is involved. To work around this, I wanted to override the FileStatusCache class in HadoopFsRelation to create a new Relation that can fetch the FileStatus list from a cached source (e.g., MySQL). Currently this is not possible, so my questions are:
1. Is there any other way to do what I want to do?
2. If not, could this extensibility be added by making FileStatusCache and the related variables protected instead of private?
3. If so, can I help?

Regards,
Ditesh Kumar