Re: Size exceeds Integer.MAX_VALUE issue with RandomForest

2017-09-18 Thread Pulluru Ranjith
Hi, here are the commands that were used:
> spark.default.parallelism=1000
> sparkR.session()
Java ref type org.apache.spark.sql.SparkSession id 1
> sql("use test")
SparkDataFrame[]
> mydata <- sql("select c1, p1, rt1, c2, p2, rt2, avt, avn from test_temp2 where vdr = 'TEST31X'")
> nrow(myda…

Re: Size exceeds Integer.MAX_VALUE issue with RandomForest

2017-09-16 Thread Akhil Das
What are the parameters you passed to the classifier, and what is the size of your training data? You are hitting that issue because one of the block sizes is over 2GB; repartitioning the data will help. On Fri, Sep 15, 2017 at 7:55 PM, rpulluru wrote: > Hi, I am using the sparkR randomForest function…
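The thread above uses SparkR, but the same workaround is a one-line change before training in the Scala API. A minimal sketch, assuming a training DataFrame trainDF with "label" and "features" columns; the partition count and tree parameters are illustrative only:

    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.sql.DataFrame

    def trainWithMorePartitions(trainDF: DataFrame) = {
      // Spread the training data over more, smaller partitions so no single
      // cached or shuffled block grows past the 2GB limit (SPARK-6235).
      val repartitioned = trainDF.repartition(1000)

      val rf = new RandomForestClassifier()
        .setLabelCol("label")
        .setFeaturesCol("features")
        .setNumTrees(100)   // illustrative values
        .setMaxDepth(10)

      rf.fit(repartitioned)
    }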

Re: Size exceeds Integer.MAX_VALUE

2016-07-24 Thread Andrew Ehrlich
You can use the .repartition() function on the RDD or DataFrame to set the number of partitions higher, and .partitions.length to get the current number of partitions (Scala API). Andrew > On Jul 24, 2016, at 4:30 PM, Ascot Moss wrote: > the data set is the training data set for random for…
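A short sketch of that check-and-raise pattern (the multiplier is arbitrary; a DataFrame works the same way via df.rdd.getNumPartitions and df.repartition(n)):

    import org.apache.spark.rdd.RDD

    def widen[T](rdd: RDD[T]): RDD[T] = {
      val current = rdd.partitions.length   // current number of partitions
      println(s"current partitions: $current")
      // More partitions means smaller blocks per partition, keeping each
      // block safely under the 2GB limit.
      rdd.repartition(current * 4)
    }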

Re: Size exceeds Integer.MAX_VALUE

2016-07-24 Thread Ascot Moss
The data set is the training data for random forest training, about 36,500 records; any idea how to partition it further? On Sun, Jul 24, 2016 at 12:31 PM, Andrew Ehrlich wrote: > It may be this issue: https://issues.apache.org/jira/browse/SPARK-6235 which limits the size of the blocks in th…

Re: Size exceeds Integer.MAX_VALUE

2016-07-23 Thread Andrew Ehrlich
It may be this issue: https://issues.apache.org/jira/browse/SPARK-6235, which limits the size of the blocks in the file being written to disk to 2GB. If so, the solution is for you to try tuning for smaller tasks. Try increasing the number of partitions…
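In the RDD-based MLlib API used in this thread's time frame, that suggestion amounts to repartitioning the input before training. A sketch with placeholder parameters only (class count, tree count, and partition count are illustrative):

    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.RandomForest
    import org.apache.spark.rdd.RDD

    def train(data: RDD[LabeledPoint]) = {
      // Smaller partitions mean smaller per-task blocks written to disk.
      val input = data.repartition(200)
      RandomForest.trainClassifier(
        input,
        numClasses = 2,
        categoricalFeaturesInfo = Map[Int, Int](),
        numTrees = 100,
        featureSubsetStrategy = "auto",
        impurity = "gini",
        maxDepth = 10,
        maxBins = 32)
    }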

RE: Size exceeds Integer.MAX_VALUE on EMR 4.0.0 Spark 1.4.1

2015-11-16 Thread Ewan Leith
How big do you expect the file to be? Spark has issues with single blocks over 2GB (see https://issues.apache.org/jira/browse/SPARK-1476 and https://issues.apache.org/jira/browse/SPARK-6235 for example). If you don’t know, try running df.repartition(100).write.format… to get an idea of how big…

Re: Size exceeds Integer.MAX_VALUE on EMR 4.0.0 Spark 1.4.1

2015-11-16 Thread Sabarish Sasidharan
You can try increasing the number of partitions before writing it out. Regards, Sab. On Mon, Nov 16, 2015 at 3:46 PM, Zhang, Jingyu wrote: > I am using spark-csv to save files to S3, and it shows the "Size exceeds" error. Please let me know how to fix it. Thanks. > df.write() > .format("com.databricks.s…
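A sketch of what both replies suggest: repartition before writing so each output task handles a smaller block. The format string assumes the spark-csv data source mentioned in the quoted message; the bucket path and partition count are placeholders:

    import org.apache.spark.sql.DataFrame

    def writeInSmallerBlocks(df: DataFrame): Unit = {
      df.repartition(100)                        // many smaller output partitions
        .write
        .format("com.databricks.spark.csv")      // spark-csv package (Spark 1.x era)
        .option("header", "true")
        .save("s3n://some-bucket/some-prefix/")  // illustrative path
    }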

Re: Size exceeds Integer.MAX_VALUE (SparkSQL$TreeNodeException: sort, tree) on EMR 4.0.0 Spark 1.4.1

2015-11-16 Thread Zhang, Jingyu
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: sort, tree:
Sort [net_site#50 ASC,device#6 ASC], true
 Exchange (RangePartitioning 200)
  Project [net_site#50,device#6,total_count#105L,adblock_count#106L,noanalytics_count#107L,unique_nk_count#109L]
   HashOuterJoin [net_site#50,dev…

Re: Size exceeds Integer.MAX_VALUE exception when broadcasting large variable

2015-02-13 Thread Soila Pertet Kavulya
Thanks Sean and Imran, I'll try splitting the broadcast variable into smaller ones. I had tried a regular join but it was failing due to high garbage collection overhead during the shuffle. One of the RDDs is very large and has a skewed distribution where a handful of keys account for 90% of the…
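A hedged sketch of that workaround, splitting one oversized lookup map into several smaller broadcast variables; the key type, chunk count, and lookup helper are all illustrative:

    import org.apache.spark.SparkContext
    import org.apache.spark.broadcast.Broadcast

    // Break one huge map into numChunks smaller maps so that no single
    // serialized broadcast exceeds the ~2GB byte-array limit (SPARK-1476).
    def broadcastInChunks[V](sc: SparkContext, big: Map[Long, V],
                             numChunks: Int = 4): Seq[Broadcast[Map[Long, V]]] = {
      val chunks = big.groupBy { case (k, _) => (((k % numChunks) + numChunks) % numChunks).toInt }
      (0 until numChunks).map(i => sc.broadcast(chunks.getOrElse(i, Map.empty[Long, V])))
    }

    // Look a key up across the smaller broadcasts.
    def lookup[V](bcs: Seq[Broadcast[Map[Long, V]]], key: Long): Option[V] =
      bcs.flatMap(_.value.get(key)).headOption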

Re: Size exceeds Integer.MAX_VALUE exception when broadcasting large variable

2015-02-13 Thread Imran Rashid
Unfortunately this is a known issue: https://issues.apache.org/jira/browse/SPARK-1476. As Sean suggested, you need to think of some other way of doing the same thing, even if it's just breaking your one big broadcast var into a few smaller ones. On Fri, Feb 13, 2015 at 12:30 PM, Sean Owen wrote: > …

Re: Size exceeds Integer.MAX_VALUE exception when broadcasting large variable

2015-02-13 Thread Sean Owen
I think you've hit the nail on the head. Since the serialization ultimately creates a byte array, and arrays can have at most ~2 billion elements in the JVM, the broadcast can be at most ~2GB. At that scale, you might consider whether you really have to broadcast these values, or want to handle th…
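Given that hard ceiling, a quick sanity check before broadcasting can save a failed job. A sketch using Spark's SizeEstimator; note it estimates in-memory size, which is only a rough proxy for the serialized byte-array size that actually hits the limit:

    import org.apache.spark.util.SizeEstimator

    def checkBroadcastSize(value: AnyRef): Unit = {
      // Rough in-memory size estimate of the object graph, in bytes.
      val estimatedBytes = SizeEstimator.estimate(value)
      println(f"estimated size: ${estimatedBytes / 1e9}%.2f GB")
      if (estimatedBytes > Int.MaxValue.toLong) {
        println("warning: likely too large for a single broadcast; consider splitting it")
      }
    }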

Re: Size exceeds Integer.MAX_VALUE in BlockFetcherIterator

2014-09-17 Thread Nicholas Chammas
Which appears in turn to be caused by SPARK-1476. On Wed, Sep 17, 2014 at 9:14 PM, francisco wrote: > Looks like this is a known issue: https://issues.apache.org/jira/browse/SPARK-1353 …

Re: Size exceeds Integer.MAX_VALUE in BlockFetcherIterator

2014-09-17 Thread francisco
Looks like this is a known issue: https://issues.apache.org/jira/browse/SPARK-1353

Re: Size exceeds Integer.MAX_VALUE in BlockFetcherIterator

2014-09-17 Thread Burak Yavuz
Hi, could you try repartitioning the data via .repartition(# of cores on the machine), or, while reading the data, supplying the minimum number of partitions, as in sc.textFile(path, # of cores on the machine)? It may be that the whole data set is stored in one block. If it is billions of rows, then the indexing…
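A sketch of both suggestions in one place (the path and core count are placeholders):

    import org.apache.spark.SparkContext

    def readWithMorePartitions(sc: SparkContext, path: String): Unit = {
      val cores = 8  // illustrative: # of cores on the machine

      // Ask for at least `cores` partitions up front so a single large file
      // is not read as one giant block.
      val lines = sc.textFile(path, minPartitions = cores)

      // Or repartition an existing RDD after the fact.
      val widened = lines.repartition(cores)
      println(s"partitions: ${widened.partitions.length}")
    }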