Interesting ... could you provide the code and the dataset you're using to reproduce the issue?
On Mon, Oct 14, 2024, 16:52 Olivier Sannier <obo...@free.fr> wrote:

> Hello,
>
> I have a small dataset that I use as a test ground and which is working fine
> with Spark 2.1.1 to do some CART or RandomForest regression.
> It takes around 40 seconds to go from training to evaluation on a Spark
> context started locally via spark-submit.
>
> I'm now in the process of validating my migration to Spark 3.5.2 and while
> the base functionality is there, I'm facing a strange situation where the
> process mentioned above takes the same amount of time as with 2.1.1, except
> for rare occasions where it takes upwards of 3 minutes to complete.
>
> In the log, when it takes 40 seconds for the entire process, I see these
> timings from DecisionTree:
>
> INFO RandomForest: Internal timing for DecisionTree:
> INFO RandomForest:   init: 0.0094236 total: 4.1664414 findBestSplits: 4.1466146 chooseSplits: 4.1240084
>
> But when it is extremely slow, I see this:
>
> INFO RandomForest: Internal timing for DecisionTree:
> INFO RandomForest:   init: 0.004392 total: 138.2320968 findBestSplits: 138.2209867 chooseSplits: 138.2152113
>
> or this:
>
> INFO RandomForest: Internal timing for DecisionTree:
> INFO RandomForest:   init: 0.0032708 total: 218.9888818 findBestSplits: 218.9759717 chooseSplits: 218.9569924
>
> In the first case, there is an additional error message that is apparently ignored:
>
> ERROR Inbox: Ignoring error
> java.lang.NullPointerException
>         at org.apache.spark.storage.BlockManagerMasterEndpoint.org$apache$spark$storage$BlockManagerMasterEndpoint$$register(BlockManagerMasterEndpoint.scala:677)
>         at org.apache.spark.storage.BlockManagerMasterEndpoint$$anonfun$receiveAndReply$1.applyOrElse(BlockManagerMasterEndpoint.scala:133)
>         at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:103)
>         at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
>         at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>         at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
>         at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
>         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.base/java.lang.Thread.run(Thread.java:829)
>
> But in the second "slow" case, I can't see any exception, but I see quite
> a few more blocks of "collectAsMap" jobs from RandomForest, like so:
>
> INFO SparkContext: Created broadcast 20 from broadcast at RandomForest.scala:622
> INFO SparkContext: Starting job: collectAsMap at RandomForest.scala:663
> INFO DAGScheduler: Registering RDD 31 (mapPartitions at RandomForest.scala:644) as input to shuffle 6
> INFO DAGScheduler: Got job 8 (collectAsMap at RandomForest.scala:663) with 1 output partitions
> INFO DAGScheduler: Final stage: ResultStage 15 (collectAsMap at RandomForest.scala:663)
> INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 14)
> INFO DAGScheduler: Missing parents: List(ShuffleMapStage 14)
> INFO DAGScheduler: Submitting ShuffleMapStage 14 (MapPartitionsRDD[31] at mapPartitions at RandomForest.scala:644), which has no missing parents
> WARN DAGScheduler: Broadcasting large task binary with size 1295.7 KiB
> INFO MemoryStore: Block broadcast_21 stored as values in memory (estimated size 1295.7 KiB, free 4.6 GiB)
> INFO MemoryStore: Block broadcast_21_piece0 stored as bytes in memory (estimated size 615.3 KiB, free 4.6 GiB)
> INFO BlockManagerInfo: Added broadcast_21_piece0 in memory on computer.some.domain:62524 (size: 615.3 KiB, free: 4.6 GiB)
> INFO SparkContext: Created broadcast 21 from broadcast at DAGScheduler.scala:1585
> INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 14 (MapPartitionsRDD[31] at mapPartitions at RandomForest.scala:644) (first 15 tasks are for partitions Vector(0))
> INFO TaskSchedulerImpl: Adding task set 14.0 with 1 tasks resource profile 0
> INFO TaskSetManager: Starting task 0.0 in stage 14.0 (TID 14) (computer.some.domain, executor driver, partition 0, PROCESS_LOCAL, 9149 bytes)
> INFO Executor: Running task 0.0 in stage 14.0 (TID 14)
> INFO BlockManager: Found block rdd_15_0 locally
> INFO Executor: Finished task 0.0 in stage 14.0 (TID 14). 1659 bytes result sent to driver
> INFO TaskSetManager: Finished task 0.0 in stage 14.0 (TID 14) in 17493 ms on computer.some.domain (executor driver) (1/1)
> INFO TaskSchedulerImpl: Removed TaskSet 14.0, whose tasks have all completed, from pool
> INFO DAGScheduler: ShuffleMapStage 14 (mapPartitions at RandomForest.scala:644) finished in 17,524 s
> INFO DAGScheduler: looking for newly runnable stages
> INFO DAGScheduler: running: HashSet()
> INFO DAGScheduler: waiting: HashSet(ResultStage 15)
> INFO DAGScheduler: failed: HashSet()
>
> Basically, the "Starting job: collectAsMap at RandomForest.scala:663" log
> line is repeated 10 times instead of 5 as in the "fast" case.
>
> This is a bit of a "shot in the dark" message, in the hope that someone will
> have an "I know this" moment that could put me on the track of a potential
> culprit.
>
> Many thanks for any suggestion.
>
> Olivier
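
For reference while the actual code and dataset are pending, the kind of training-to-evaluation workflow described in the quoted message might look roughly like the minimal Scala sketch below, using the spark.ml RandomForestRegressor. The CSV path, column names, and tree parameters here are placeholders and assumptions, not the poster's actual setup.

    // Minimal sketch (assumed setup, not the reported code): train and evaluate
    // a RandomForest regression model on a small dataset with a local Spark context.
    import org.apache.spark.ml.evaluation.RegressionEvaluator
    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.ml.regression.RandomForestRegressor
    import org.apache.spark.sql.SparkSession

    object RandomForestRepro {
      def main(args: Array[String]): Unit = {
        // Master is expected to come from spark-submit (e.g. --master local[*]).
        val spark = SparkSession.builder()
          .appName("rf-regression-repro")
          .getOrCreate()

        // Hypothetical small dataset: numeric feature columns plus a "label" column.
        val data = spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("data/small_dataset.csv")

        // Assemble all non-label columns into a single feature vector.
        val featureCols = data.columns.filter(_ != "label")
        val assembled = new VectorAssembler()
          .setInputCols(featureCols)
          .setOutputCol("features")
          .transform(data)

        val Array(train, test) = assembled.randomSplit(Array(0.8, 0.2), seed = 42L)

        // Placeholder tree parameters; the real ones are unknown.
        val rf = new RandomForestRegressor()
          .setLabelCol("label")
          .setFeaturesCol("features")
          .setNumTrees(50)
          .setMaxDepth(10)
          .setMaxBins(32)

        val t0 = System.nanoTime()
        // The "Internal timing for DecisionTree" lines quoted above are logged during fit().
        val model = rf.fit(train)
        val predictions = model.transform(test)

        val rmse = new RegressionEvaluator()
          .setLabelCol("label")
          .setPredictionCol("prediction")
          .setMetricName("rmse")
          .evaluate(predictions)

        println(s"RMSE = $rmse, training + evaluation took ${(System.nanoTime() - t0) / 1e9} s")
        spark.stop()
      }
    }

Run locally with something like "spark-submit --master local[*] --class RandomForestRepro repro.jar"; the RandomForest and DAGScheduler log lines quoted above then appear in the driver log while fit() is running, which makes it easy to compare a fast run against a slow one.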