[
https://issues.apache.org/jira/browse/SPARK-17777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ameen Tayyebi updated SPARK-17777:
----------------------------------
Description:
We've identified a problem with Spark scheduling: the scheduler hangs
indefinitely when an RDD calls SparkContext.parallelize (and runs an action on
the result) within its getPartitions method. This seemingly "recursive" job
submission causes the hang. We have a repro case that can easily be run.
Please advise on what the issue might be and how we can work around it in the
meantime.
I've attached repro.scala, which can simply be pasted into spark-shell to
reproduce the problem.
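For reference, here is a minimal sketch of the pattern; the class and
partition names are illustrative only, and the attached repro.scala is the
authoritative repro:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Trivial partition type so getPartitions can build its result.
case class SimplePartition(index: Int) extends Partition

class NestedJobRDD(sc: SparkContext) extends RDD[Int](sc, Nil) {
  override def getPartitions: Array[Partition] = {
    // The nested action below is what hangs the scheduler.
    val n = sparkContext.parallelize(1 to 4).count().toInt
    Array.tabulate[Partition](n)(SimplePartition(_))
  }
  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    Iterator(split.index)
}

// Pasting the above into spark-shell and running an action never returns:
//   new NestedJobRDD(sc).count()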
Why are we calling sc.parallelize in production within getPartitions? We have
an RDD that is composed of several thousand Parquet files. To compute the
partitioning strategy for this RDD, we create an RDD that reads all file sizes
from S3 in parallel, so that we can quickly determine the proper partitions.
We do this to avoid fetching the sizes serially from the master node, which
can be significantly slower. Pseudo-code:
val splitInfo = sc.parallelize(filePaths).map(f => (f, s3.getObjectSummary)).collect()
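Fleshed out, that pseudo-code looks roughly like the sketch below; the bucket
name, the AWS SDK v1 AmazonS3ClientBuilder call, and the use of object
metadata for file sizes are illustrative assumptions, not our exact production
code:

import com.amazonaws.services.s3.AmazonS3ClientBuilder

val bucket = "example-bucket" // hypothetical stand-in for the real bucket

val splitInfo = sc.parallelize(filePaths)
  .mapPartitions { keys =>
    // Build the client on the executor; S3 clients are not serializable.
    val s3 = AmazonS3ClientBuilder.defaultClient()
    keys.map(key => (key, s3.getObjectMetadata(bucket, key).getContentLength))
  }
  .collect()

Calling collect() on this inner RDD from inside the outer RDD's getPartitions
is exactly the nested job submission that triggers the hang.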
Spark itself uses similar logic for DataFrames:
https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L902
Thanks,
-Ameen
was:
We've identified a problem with Spark scheduling: the scheduler hangs
indefinitely when an RDD calls SparkContext.parallelize (and runs an action on
the result) within its getPartitions method. This seemingly "recursive" job
submission causes the hang. We have a repro case that can easily be run.
Please advise on what the issue might be and how we can work around it in the
meantime.
Thanks,
-Ameen
> Spark Scheduler Hangs Indefinitely
> ----------------------------------
>
> Key: SPARK-17777
> URL: https://issues.apache.org/jira/browse/SPARK-17777
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.6.0
> Environment: AWS EMR 4.3, can also be reproduced locally
> Reporter: Ameen Tayyebi
> Attachments: repro.scala
>
>
> We've identified a problem with Spark scheduling: the scheduler hangs
> indefinitely when an RDD calls SparkContext.parallelize (and runs an action
> on the result) within its getPartitions method. This seemingly "recursive"
> job submission causes the hang. We have a repro case that can easily be run.
> Please advise on what the issue might be and how we can work around it in
> the meantime.
> I've attached repro.scala, which can simply be pasted into spark-shell to
> reproduce the problem.
> Why are we calling sc.parallelize in production within getPartitions? We
> have an RDD that is composed of several thousand Parquet files. To compute
> the partitioning strategy for this RDD, we create an RDD that reads all file
> sizes from S3 in parallel, so that we can quickly determine the proper
> partitions. We do this to avoid fetching the sizes serially from the master
> node, which can be significantly slower. Pseudo-code:
> val splitInfo = sc.parallelize(filePaths).map(f => (f, s3.getObjectSummary)).collect()
> Spark itself uses similar logic for DataFrames:
> https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L902
>
> Thanks,
> -Ameen
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]