Hi all,

I'm not sure if this is a Spark issue, or an AWS/Hadoop/S3 driver issue, but 
I've noticed that I get a very slow response when I run:

val files = sc.wholeTextFiles("s3://emr-test-dgp/testfiles/").count()

(which will count all the files in the directory)

But an almost immediate response if I run this command with a wildcard added to 
the end:

val files = sc.wholeTextFiles("s3://emr-test-dgp/testfiles/*").count()

The difference is on the order of 1 extra minute per 1000 files listed from 
S3. Both calls return the same count.
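
For anyone wanting to reproduce the comparison, here is a minimal sketch of 
how the two calls could be timed side by side. The helpers `time` and 
`compareListing` are hypothetical names introduced only for illustration; 
they are not part of Spark. A single wall-clock measurement like this is 
only a rough indication.

```scala
// Hypothetical helper: runs a block once and returns (result, elapsed seconds).
def time[A](block: => A): (A, Double) = {
  val start = System.nanoTime()
  val result = block
  (result, (System.nanoTime() - start) / 1e9)
}

// Hypothetical helper: times the same counting function on a directory path
// with and without a trailing wildcard, prints both, and returns the counts.
def compareListing(countFiles: String => Long, dir: String): (Long, Long) = {
  val (plainCount, plainSecs) = time(countFiles(dir))
  val (globCount, globSecs) = time(countFiles(dir + "*"))
  println(s"no wildcard: $plainCount files in $plainSecs s")
  println(s"wildcard:    $globCount files in $globSecs s")
  (plainCount, globCount)
}
```

In spark-shell, where `sc` is already defined, this could be called as 
compareListing(p => sc.wholeTextFiles(p).count(), "s3://emr-test-dgp/testfiles/").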

This is with thousands of files and no sub-directories to confuse things. Has 
anyone seen anything similar?

Thanks,
Ewan

15/06/05 10:31:58 INFO cluster.YarnClusterSchedulerBackend: SchedulerBackend is 
ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 
30000(ms)
15/06/05 10:31:58 INFO cluster.YarnClusterScheduler: 
YarnClusterScheduler.postStartHook done
15/06/05 10:32:00 INFO metrics.MetricsSaver: Saved 3:61 records to 
/mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:32:02 INFO fs.EmrFileSystem: Consistency enabled, using 
com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2 as filesystem 
implementation
15/06/05 10:32:03 INFO storage.MemoryStore: ensureFreeSpace(257941) called with 
curMem=0, maxMem=280248975
15/06/05 10:32:03 INFO storage.MemoryStore: Block broadcast_0 stored as values 
in memory (estimated size 251.9 KB, free 267.0 MB)
15/06/05 10:32:03 INFO storage.MemoryStore: ensureFreeSpace(19668) called with 
curMem=257941, maxMem=280248975
15/06/05 10:32:03 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as 
bytes in memory (estimated size 19.2 KB, free 267.0 MB)
15/06/05 10:32:03 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in 
memory on ip-10-111-0-34.eu-west-1.compute.internal:50494 (size: 19.2 KB, free: 
267.2 MB)
15/06/05 10:32:03 INFO storage.BlockManagerMaster: Updated info of block 
broadcast_0_piece0
15/06/05 10:32:03 INFO spark.SparkContext: Created broadcast 0 from main at 
NativeMethodAccessorImpl.java:-2
15/06/05 10:32:10 INFO metrics.MetricsSaver: 1 aggregated 
AmazonDynamoDBv2GetItemDelay 59 raw values into 6 aggregated values, total 6
15/06/05 10:32:30 INFO metrics.MetricsSaver: Saved 17:286 records to 
/mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:33:00 INFO metrics.MetricsSaver: Saved 11:169 records to 
/mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:33:30 INFO metrics.MetricsSaver: Saved 11:175 records to 
/mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:33:44 INFO metrics.MetricsSaver: 101 aggregated DdbReadUnitGetItem 
53 raw values into 2 aggregated values, total 14
15/06/05 10:34:00 INFO metrics.MetricsSaver: Saved 11:172 records to 
/mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:34:30 INFO metrics.MetricsSaver: Saved 11:178 records to 
/mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:35:00 INFO metrics.MetricsSaver: Saved 11:273 records to 
/mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:35:13 INFO metrics.MetricsSaver: 201 aggregated DdbReadUnitGetItem 
50 raw values into 2 aggregated values, total 13
15/06/05 10:35:30 INFO metrics.MetricsSaver: Saved 11:172 records to 
/mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:35:35 INFO input.FileInputFormat: Total input paths to process : 
5001
15/06/05 10:36:00 INFO metrics.MetricsSaver: Saved 14:263 records to 
/mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:36:30 INFO metrics.MetricsSaver: Saved 11:322 records to 
/mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:36:33 INFO metrics.MetricsSaver: 301 aggregated 
AmazonS3GetObjectMetadataDelay 80 raw values into 3 aggregated values, total 3
15/06/05 10:37:00 INFO metrics.MetricsSaver: Saved 11:250 records to 
/mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:37:30 INFO metrics.MetricsSaver: Saved 11:265 records to 
/mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:37:46 INFO metrics.MetricsSaver: 401 aggregated 
AmazonDynamoDBv2GetItemDelay 75 raw values into 2 aggregated values, total 16
15/06/05 10:38:00 INFO metrics.MetricsSaver: Saved 11:214 records to 
/mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:38:18 INFO input.FileInputFormat: Total input paths to process : 
5001
15/06/05 10:38:19 INFO input.CombineFileInputFormat: DEBUG: Terminated node 
allocation with : CompletedNodes: 1, size left: 0
15/06/05 10:38:21 INFO spark.SparkContext: Starting job: isEmpty at 
JsonRDD.scala:51
15/06/05 10:38:21 INFO scheduler.DAGScheduler: Got job 0 (isEmpty at 
JsonRDD.scala:51) with 1 output partitions (allowLocal=true)
15/06/05 10:38:21 INFO scheduler.DAGScheduler: Final stage: Stage 0(isEmpty at 
JsonRDD.scala:51)
15/06/05 10:38:21 INFO scheduler.DAGScheduler: Parents of final stage: List()
15/06/05 10:38:21 INFO scheduler.DAGScheduler: Missing parents: List()
15/06/05 10:38:21 INFO scheduler.DAGScheduler: Submitting Stage 0 
(CoalescedRDD[3] at main at NativeMethodAccessorImpl.java:-2), which has no 
missing parents