Hi all,

I'm not sure whether this is a Spark issue or an AWS/Hadoop/S3 driver issue, but I get a very slow response when I run:

val files = sc.wholeTextFiles("s3://emr-test-dgp/testfiles/").count()

(which counts all the files in the directory), but an almost immediate response when I run the same command with a wildcard added to the end:

val files = sc.wholeTextFiles("s3://emr-test-dgp/testfiles/*").count()

The time difference is on the order of 1 extra minute per 1000 files listed from S3. The count returns the same value for each query. This is on 1000s of files, with no sub-directories to confuse things.

Has anyone seen anything similar?

Thanks,
Ewan
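For anyone who wants to reproduce the timings, here is a minimal sketch of how the two calls could be compared side by side. The helper names `timed` and `compareListings` are hypothetical (they are not part of Spark), and the counting function is passed in by the caller, so the snippet itself makes no assumptions about the Spark or Hadoop APIs.

```scala
// Hypothetical helper, not part of Spark: run a block and return its result
// together with the elapsed wall-clock time in milliseconds.
def timed[A](body: => A): (A, Long) = {
  val start = System.nanoTime()
  val result = body
  (result, (System.nanoTime() - start) / 1000000L)
}

// Count the same directory with and without a trailing wildcard.
// `countFiles` is supplied by the caller, so this compiles without Spark on the classpath.
// Returns ((count, ms) without the wildcard, (count, ms) with the wildcard).
def compareListings(countFiles: String => Long, dir: String): ((Long, Long), (Long, Long)) = {
  val base = if (dir.endsWith("/")) dir else dir + "/"
  val plain = timed(countFiles(base))        // e.g. s3://bucket/dir/
  val glob  = timed(countFiles(base + "*"))  // e.g. s3://bucket/dir/*
  (plain, glob)
}
```

In spark-shell, where `sc` is the SparkContext, it could be called as `compareListings(p => sc.wholeTextFiles(p).count(), "s3://emr-test-dgp/testfiles/")`. Both counts should match; only the elapsed times should differ if the behaviour described above is present.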
15/06/05 10:31:58 INFO cluster.YarnClusterSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000(ms)
15/06/05 10:31:58 INFO cluster.YarnClusterScheduler: YarnClusterScheduler.postStartHook done
15/06/05 10:32:00 INFO metrics.MetricsSaver: Saved 3:61 records to /mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:32:02 INFO fs.EmrFileSystem: Consistency enabled, using com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2 as filesystem implementation
15/06/05 10:32:03 INFO storage.MemoryStore: ensureFreeSpace(257941) called with curMem=0, maxMem=280248975
15/06/05 10:32:03 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 251.9 KB, free 267.0 MB)
15/06/05 10:32:03 INFO storage.MemoryStore: ensureFreeSpace(19668) called with curMem=257941, maxMem=280248975
15/06/05 10:32:03 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 19.2 KB, free 267.0 MB)
15/06/05 10:32:03 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on ip-10-111-0-34.eu-west-1.compute.internal:50494 (size: 19.2 KB, free: 267.2 MB)
15/06/05 10:32:03 INFO storage.BlockManagerMaster: Updated info of block broadcast_0_piece0
15/06/05 10:32:03 INFO spark.SparkContext: Created broadcast 0 from main at NativeMethodAccessorImpl.java:-2
15/06/05 10:32:10 INFO metrics.MetricsSaver: 1 aggregated AmazonDynamoDBv2GetItemDelay 59 raw values into 6 aggregated values, total 6
15/06/05 10:32:30 INFO metrics.MetricsSaver: Saved 17:286 records to /mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:33:00 INFO metrics.MetricsSaver: Saved 11:169 records to /mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:33:30 INFO metrics.MetricsSaver: Saved 11:175 records to /mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:33:44 INFO metrics.MetricsSaver: 101 aggregated DdbReadUnitGetItem 53 raw values into 2 aggregated values, total 14
15/06/05 10:34:00 INFO metrics.MetricsSaver: Saved 11:172 records to /mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:34:30 INFO metrics.MetricsSaver: Saved 11:178 records to /mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:35:00 INFO metrics.MetricsSaver: Saved 11:273 records to /mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:35:13 INFO metrics.MetricsSaver: 201 aggregated DdbReadUnitGetItem 50 raw values into 2 aggregated values, total 13
15/06/05 10:35:30 INFO metrics.MetricsSaver: Saved 11:172 records to /mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:35:35 INFO input.FileInputFormat: Total input paths to process : 5001
15/06/05 10:36:00 INFO metrics.MetricsSaver: Saved 14:263 records to /mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:36:30 INFO metrics.MetricsSaver: Saved 11:322 records to /mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:36:33 INFO metrics.MetricsSaver: 301 aggregated AmazonS3GetObjectMetadataDelay 80 raw values into 3 aggregated values, total 3
15/06/05 10:37:00 INFO metrics.MetricsSaver: Saved 11:250 records to /mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:37:30 INFO metrics.MetricsSaver: Saved 11:265 records to /mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:37:46 INFO metrics.MetricsSaver: 401 aggregated AmazonDynamoDBv2GetItemDelay 75 raw values into 2 aggregated values, total 16
15/06/05 10:38:00 INFO metrics.MetricsSaver: Saved 11:214 records to /mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:38:18 INFO input.FileInputFormat: Total input paths to process : 5001
15/06/05 10:38:19 INFO input.CombineFileInputFormat: DEBUG: Terminated node allocation with : CompletedNodes: 1, size left: 0
15/06/05 10:38:21 INFO spark.SparkContext: Starting job: isEmpty at JsonRDD.scala:51
15/06/05 10:38:21 INFO scheduler.DAGScheduler: Got job 0 (isEmpty at JsonRDD.scala:51) with 1 output partitions (allowLocal=true)
15/06/05 10:38:21 INFO scheduler.DAGScheduler: Final stage: Stage 0(isEmpty at JsonRDD.scala:51)
15/06/05 10:38:21 INFO scheduler.DAGScheduler: Parents of final stage: List()
15/06/05 10:38:21 INFO scheduler.DAGScheduler: Missing parents: List()
15/06/05 10:38:21 INFO scheduler.DAGScheduler: Submitting Stage 0 (CoalescedRDD[3] at main at NativeMethodAccessorImpl.java:-2), which has no missing parents