Hi guys

Does anyone know how to stop Spark from opening all Parquet files before 
starting a job? This is quite a show stopper for me, since I have 5000 Parquet 
files on S3.

Recap of what I tried: 

1. Disable schema merging with: sqlContext.load("parquet", Map("mergeSchema" -> 
"false", "path" -> "s3://path/to/folder"))
    This opens most files in the folder (17 out of 21 in my small example). For 
5000 files on S3, sqlContext.load() takes 30 minutes to complete. 

2. Use the old API with: 
sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false")
    Now sqlContext.parquetFile() only opens a few files and prints the schema: 
so far so good! However, as soon as I run e.g. a count() on the dataframe, 
Spark still opens all files _before_ starting a job/stage. Effectively this 
moves the delay from load() to count() (or any other action I presume).

3. Run Spark 1.3.1-rc2.
    sqlContext.load() took about 30 minutes for 5000 Parquet files on S3, the 
same as 1.3.0.
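
One thing I may still try: parquetFile() takes varargs, so listing each day's directory explicitly might sidestep scanning the whole root folder. An untested sketch (bucket name and dates are made up, and I suspect the partition columns from the directory names would no longer be discovered):

```scala
// Sketch only: build one path per day instead of pointing at the root folder.
// "s3://mylogs/logs" and the date values are illustrative.
val days = Seq(1, 2, 3)
val paths = days.map(d => s"s3://mylogs/logs/yyyy=2015/mm=2/dd=$d")

// parquetFile accepts multiple paths (varargs), so expand the list with : _*
// (sqlContext as available in the shell; old Parquet code path enabled as in 2.)
val df = sqlContext.parquetFile(paths: _*)
df.count()
```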

Any help would be greatly appreciated!

Thanks a lot. 

Eric




> On 10 Apr 2015, at 16:46, Eric Eijkelenboom <eric.eijkelenb...@gmail.com> 
> wrote:
> 
> Hi Ted
> 
> Ah, I guess the term ‘source’ confused me :)
> 
> Doing:
> 
> sqlContext.load("parquet", Map("mergeSchema" -> "false", "path" -> "path to a 
> single day of logs")) 
> 
> for 1 directory with 21 files, Spark opens 17 files: 
> 
> 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening 
> 's3n://mylogs/logs/yyyy=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-000072' 
> for reading
> 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening key 
> 'logs/yyyy=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-000072' for 
> reading at position '261573524'
> 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening 
> 's3n://mylogs/logs/yyyy=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-000074' 
> for reading
> 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening 
> 's3n://mylogs/logs/yyyy=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-000077' 
> for reading
> 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening 
> 's3n://mylogs/logs/yyyy=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-000062' 
> for reading
> 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening key 
> 'logs/yyyy=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-000074' for 
> reading at position '259256807'
> 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening key 
> 'logs/yyyy=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-000077' for 
> reading at position '260002042'
> 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening key 
> 'logs/yyyy=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-000062' for 
> reading at position '260875275'
> etc.
> 
> I can’t seem to pass a comma-separated list of directories to load(), so in 
> order to load multiple days of logs, I have to point to the root folder and 
> depend on auto-partition discovery (unless there’s a smarter way). 
> 
> Doing: 
> 
> sqlContext.load("parquet", Map("mergeSchema" -> "false", "path" -> "path to 
> root log dir")) 
> 
> starts opening what seems like all files (I killed the process after a couple 
> of minutes).
> 
> Thanks for helping out. 
> Eric