I can’t reproduce your issue with len=10000 in local mode.
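
For reference, this is roughly what I ran (a sketch only; the session call, master URL, and app name are my assumptions, and the rest is your pipeline unchanged apart from len):

    library(SparkR)
    library(magrittr)

    # Assumed local-mode session; adjust master/appName to match your setup.
    sparkR.session(master = "local[*]", appName = "dapply-repartition-repro")

    len <- 10000  # well above the ~5000 threshold you report
    # ...followed by your data.frame %>% as.DataFrame %>% repartition(10L) %>%
    #    dapply(..., schema) %>% summarize(...) %>% collect pipeline exactly as posted
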
Could you give more information about your environment?
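For example, the output of the following would help (sessionInfo() and Sys.getenv() are base R; sparkR.conf() should be there in 2.0, but if your build lacks it, the relevant spark-defaults.conf entries or spark-submit options are just as useful):

    # R version, OS, and loaded package versions
    sessionInfo()

    # Where Spark and the JVM are being picked up from
    Sys.getenv(c("SPARK_HOME", "JAVA_HOME"))

    # Effective Spark configuration for the running session
    sparkR.conf()
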
> On Aug 9, 2016, at 11:35, Shane Lee <shane_y_...@yahoo.com.INVALID> wrote:
> 
> Hi All,
> 
> I am trying out SparkR 2.0 and have run into an issue with repartition. 
> 
> Here is the R code (essentially a port of the pi-calculating Scala example in
> the Spark package) that can reproduce the behavior:
> 
> schema <- structType(structField("input", "integer"),
>                      structField("output", "integer"))
> 
> library(magrittr)
> 
> len <- 3000
> data.frame(n = 1:len) %>%
>     as.DataFrame %>%
>     SparkR:::repartition(10L) %>%
>     dapply(., function (df) {
>         library(plyr)
>         ddply(df, .(n), function (y) {
>             data.frame(z = {
>                 x1 <- runif(1) * 2 - 1
>                 y1 <- runif(1) * 2 - 1
>                 z <- x1 * x1 + y1 * y1
>                 if (z < 1) 1L else 0L
>             })
>         })
>     }, schema) %>%
>     SparkR:::summarize(total = sum(.$output)) %>% collect * 4 / len
> 
> For me it runs fine as long as len is less than 5000; otherwise it errors out
> with the following message:
> 
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
>   org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 56.0
>   failed 4 times, most recent failure: Lost task 6.3 in stage 56.0 (TID 899, LARBGDV-VM02):
>   org.apache.spark.SparkException: R computation failed with
>  Error in readBin(con, raw(), stringLen, endian = "big") :
>   invalid 'n' argument
> Calls: <Anonymous> -> readBin
> Execution halted
>       at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
>       at org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(MapPartitionsRWrapper.scala:59)
>       at org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(MapPartitionsRWrapper.scala:29)
>       at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:178)
>       at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:175)
>       at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>       at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$
> 
> If the repartition call is removed, it runs fine again, even with very large 
> len.
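> 
> For completeness, the working variant is the same pipeline with only the
> repartition step dropped, e.g.:
> 
>     data.frame(n = 1:len) %>%
>         as.DataFrame %>%
>         dapply(., function (df) { ... }, schema) %>%  # same dapply function as above
>         SparkR:::summarize(total = sum(.$output)) %>% collect * 4 / len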
> 
> After looking through the documentation and searching the web, I can't seem
> to find any clues on how to fix this. Has anybody seen a similar problem?
> 
> Thanks in advance for your help.
> 
> Shane
> 
