Hi all,
I'm not sure whether this is a Spark issue or an AWS/Hadoop/S3 driver issue,
but I get a very slow response when I run:

    val files = sc.wholeTextFiles("s3://emr-test-dgp/testfiles/").count()

(which counts all the files in the directory), yet an almost immediate
response when I run the same command with a wildcard added to the end of the
path:

    val files = sc.wholeTextFiles("s3://emr-test-dgp/testfiles/*").count()

The time difference is on the order of 1 extra minute per 1,000 files listed
from S3. The count returns the same value for both queries.
This is on thousands of files, with no sub-directories to confuse things. Has
anyone seen anything similar?
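
For anyone who wants to reproduce the comparison, here is a minimal sketch of how the two calls could be timed side by side. It assumes a spark-shell session (so `sc` is already defined) with access to the bucket; the `time` helper is a hypothetical convenience and not part of Spark.

```scala
// Sketch for spark-shell, where `sc` is the predefined SparkContext.
// `time` is a hypothetical helper introduced for illustration only.
def time[A](label: String)(body: => A): A = {
  val start = System.nanoTime()
  val result = body                                  // run the block being measured
  val elapsedSec = (System.nanoTime() - start) / 1e9
  println(f"$label%s took $elapsedSec%.1f s")
  result
}

val dir = "s3://emr-test-dgp/testfiles/"

// Plain directory path: the slow case described above.
val countDir = time("directory path")(sc.wholeTextFiles(dir).count())

// Same directory with a trailing wildcard: the fast case.
val countGlob = time("wildcard path")(sc.wholeTextFiles(dir + "*").count())

// Both counts are expected to match.
println(s"directory=$countDir wildcard=$countGlob")
```

Each measurement covers both the S3 listing and the reading of the files, so the gap between the two numbers, not either absolute value, is what reflects the listing overhead.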
Thanks,
Ewan
15/06/05 10:31:58 INFO cluster.YarnClusterSchedulerBackend: SchedulerBackend is
ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime:
30000(ms)
15/06/05 10:31:58 INFO cluster.YarnClusterScheduler:
YarnClusterScheduler.postStartHook done
15/06/05 10:32:00 INFO metrics.MetricsSaver: Saved 3:61 records to
/mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:32:02 INFO fs.EmrFileSystem: Consistency enabled, using
com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2 as filesystem
implementation
15/06/05 10:32:03 INFO storage.MemoryStore: ensureFreeSpace(257941) called with
curMem=0, maxMem=280248975
15/06/05 10:32:03 INFO storage.MemoryStore: Block broadcast_0 stored as values
in memory (estimated size 251.9 KB, free 267.0 MB)
15/06/05 10:32:03 INFO storage.MemoryStore: ensureFreeSpace(19668) called with
curMem=257941, maxMem=280248975
15/06/05 10:32:03 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as
bytes in memory (estimated size 19.2 KB, free 267.0 MB)
15/06/05 10:32:03 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in
memory on ip-10-111-0-34.eu-west-1.compute.internal:50494 (size: 19.2 KB, free:
267.2 MB)
15/06/05 10:32:03 INFO storage.BlockManagerMaster: Updated info of block
broadcast_0_piece0
15/06/05 10:32:03 INFO spark.SparkContext: Created broadcast 0 from main at
NativeMethodAccessorImpl.java:-2
15/06/05 10:32:10 INFO metrics.MetricsSaver: 1 aggregated
AmazonDynamoDBv2GetItemDelay 59 raw values into 6 aggregated values, total 6
15/06/05 10:32:30 INFO metrics.MetricsSaver: Saved 17:286 records to
/mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:33:00 INFO metrics.MetricsSaver: Saved 11:169 records to
/mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:33:30 INFO metrics.MetricsSaver: Saved 11:175 records to
/mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:33:44 INFO metrics.MetricsSaver: 101 aggregated DdbReadUnitGetItem
53 raw values into 2 aggregated values, total 14
15/06/05 10:34:00 INFO metrics.MetricsSaver: Saved 11:172 records to
/mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:34:30 INFO metrics.MetricsSaver: Saved 11:178 records to
/mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:35:00 INFO metrics.MetricsSaver: Saved 11:273 records to
/mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:35:13 INFO metrics.MetricsSaver: 201 aggregated DdbReadUnitGetItem
50 raw values into 2 aggregated values, total 13
15/06/05 10:35:30 INFO metrics.MetricsSaver: Saved 11:172 records to
/mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:35:35 INFO input.FileInputFormat: Total input paths to process :
5001
15/06/05 10:36:00 INFO metrics.MetricsSaver: Saved 14:263 records to
/mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:36:30 INFO metrics.MetricsSaver: Saved 11:322 records to
/mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:36:33 INFO metrics.MetricsSaver: 301 aggregated
AmazonS3GetObjectMetadataDelay 80 raw values into 3 aggregated values, total 3
15/06/05 10:37:00 INFO metrics.MetricsSaver: Saved 11:250 records to
/mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:37:30 INFO metrics.MetricsSaver: Saved 11:265 records to
/mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:37:46 INFO metrics.MetricsSaver: 401 aggregated
AmazonDynamoDBv2GetItemDelay 75 raw values into 2 aggregated values, total 16
15/06/05 10:38:00 INFO metrics.MetricsSaver: Saved 11:214 records to
/mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:38:18 INFO input.FileInputFormat: Total input paths to process :
5001
15/06/05 10:38:19 INFO input.CombineFileInputFormat: DEBUG: Terminated node
allocation with : CompletedNodes: 1, size left: 0
15/06/05 10:38:21 INFO spark.SparkContext: Starting job: isEmpty at
JsonRDD.scala:51
15/06/05 10:38:21 INFO scheduler.DAGScheduler: Got job 0 (isEmpty at
JsonRDD.scala:51) with 1 output partitions (allowLocal=true)
15/06/05 10:38:21 INFO scheduler.DAGScheduler: Final stage: Stage 0(isEmpty at
JsonRDD.scala:51)
15/06/05 10:38:21 INFO scheduler.DAGScheduler: Parents of final stage: List()
15/06/05 10:38:21 INFO scheduler.DAGScheduler: Missing parents: List()
15/06/05 10:38:21 INFO scheduler.DAGScheduler: Submitting Stage 0
(CoalescedRDD[3] at main at NativeMethodAccessorImpl.java:-2), which has no
missing parents