Re: Spark SQL sort by and collect by in multiple partitions

2015-09-02 Thread Vishnu Kumar
Hi, yes, this is intended behavior. "ORDER BY" guarantees a total order in the output, while "SORT BY" only guarantees the order within each partition. Vishnu
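A minimal sketch of the distinction, using a local spark-shell style setup and a hypothetical temp table named nums (nothing below is taken from the thread itself):

```scala
// Hedged illustration only: the table name "nums" and the local master are made up.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("sort-vs-order").setMaster("local[4]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// 12 values spread across 4 partitions.
sc.parallelize(1 to 12, 4).map(Tuple1(_)).toDF("n").registerTempTable("nums")

// ORDER BY: one globally sorted result across all partitions.
sqlContext.sql("SELECT n FROM nums ORDER BY n DESC").show()

// SORT BY: each partition is sorted independently, so the collected output
// can look "sorted partition by partition" rather than globally.
sqlContext.sql("SELECT n FROM nums SORT BY n DESC").show()
```

With 4 partitions, the SORT BY query's collected rows would typically be descending within each partition's chunk but not globally, which matches the behavior described in the message below.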

Spark SQL sort by and collect by in multiple partitions

2015-09-02 Thread Niranda Perera
Hi all, I have been using sort by and order by in Spark SQL and observed the following: when using SORT BY and collecting the results, the results are sorted partition by partition. Example: if we have 1, 2, ..., 12 in 4 partitions and I want to sort in descending order, partition 0 (p0) ...

Re: taking an n number of rows from an RDD starting from an index

2015-09-02 Thread Niranda Perera
Hi all, thank you for your response. After taking a look at the implementation of rdd.collect(), I thought of using the SparkContext.runJob(...) method: for (int i = 0; i < dataFrame.rdd().partitions().length; i++) { dataFrame.sqlContext().sparkContext().runJob(data.rdd(), some function
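A rough Scala sketch of that idea, assuming SparkContext.runJob is called one partition at a time; the helper name takeRows and its parameters startIndex/n are hypothetical and not from the original message:

```scala
// Hedged sketch (not the poster's actual code): walk partitions one at a time
// with SparkContext.runJob until n rows starting at startIndex are collected.
import org.apache.spark.sql.{DataFrame, Row}
import scala.collection.mutable.ArrayBuffer

def takeRows(df: DataFrame, startIndex: Int, n: Int): Array[Row] = {
  val rdd = df.rdd
  val sc = df.sqlContext.sparkContext
  var skipped = 0                      // rows skipped so far, before startIndex
  val collected = ArrayBuffer[Row]()

  var p = 0
  while (p < rdd.partitions.length && collected.size < n) {
    // Run a job on just this one partition and pull its rows to the driver.
    val Array(rows) = sc.runJob(rdd, (it: Iterator[Row]) => it.toArray, Seq(p))
    for (row <- rows if collected.size < n) {
      if (skipped >= startIndex) collected += row else skipped += 1
    }
    p += 1
  }
  collected.toArray
}
```

Running a separate job per partition avoids pulling every partition to the driver at once, at the cost of extra job-scheduling overhead.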

[HELP] Spark 1.4.1 tasks take a ridiculously long time to complete

2015-09-02 Thread lankaz
Hi, this is an image of the screenshot; some tasks take seconds to execute, some take hours.

Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

2015-09-02 Thread Sean Owen
- As usual the license and signatures are OK
- No blockers, check
- 9 "Critical" bugs for 1.5.0 are listed below just for everyone's reference (48 total issues still targeted for 1.5.0)
- Under Java 7 + Ubuntu 15, I only had one consistent test failure, but obviously it's not failing in Jenkins
- I ...

Re: OOM in spark driver

2015-09-02 Thread Mike Hynes
Just a thought; this has worked for me before in standalone client mode with a similar OOM error in a driver thread. Try setting export SPARK_DAEMON_MEMORY=4G (or whatever size you can afford on your machine) in your environment's spark-env.sh before running spark-submit. Mike

Harmonic centrality in GraphX

2015-09-02 Thread Pavel Gladkov
Hi, what do you think about this algorithm? https://github.com/webgeist/spark-centrality/blob/master/src/main/scala/cc/p2k/spark/graphx/lib/HarmonicCentrality.scala It is an implementation of the Harmonic Centrality algorithm: http://infoscience.epfl.ch/record/200525/files/%5BEN%5DASNA09.pdf ...
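For readers unfamiliar with the measure, here is a hedged sketch of what harmonic centrality computes on an unweighted graph, H(v) = sum over u != v of 1/d(u, v), written against GraphX's built-in ShortestPaths; it illustrates only the definition, not the linked implementation:

```scala
// Hedged sketch of H(v) = sum over u != v of 1 / d(u, v) on an unweighted graph,
// via GraphX's ShortestPaths. NOT the linked implementation; setup is made up.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.Graph
import org.apache.spark.graphx.lib.ShortestPaths
import org.apache.spark.graphx.util.GraphGenerators

val sc = new SparkContext(new SparkConf().setAppName("harmonic").setMaster("local[4]"))

// A small random graph, for demonstration only.
val graph: Graph[Long, Int] = GraphGenerators.logNormalGraph(sc, numVertices = 20)

// Hop-count shortest paths from every vertex to every landmark; using all
// vertices as landmarks is only feasible on a toy graph.
val landmarks = graph.vertices.map(_._1).collect().toSeq
val shortestPaths = ShortestPaths.run(graph, landmarks)

// Harmonic centrality of v: sum 1 / d(u, v) over the vertices u that reach v.
val harmonic = shortestPaths.vertices
  .flatMap { case (u, dists) =>
    dists.collect { case (v, d) if v != u && d > 0 => (v, 1.0 / d) }
  }
  .reduceByKey(_ + _)

harmonic.collect().foreach { case (v, h) => println(s"vertex $v: $h") }
```

Using every vertex as a landmark is quadratic in work and only reasonable on small graphs, which is where a dedicated implementation like the one linked above becomes interesting.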

Re: [ compress in-memory column storage used in sparksql cache table ]

2015-09-02 Thread Cheng Lian
Yeah, two of the reasons why the built-in in-memory columnar storage doesn't achieve a compression ratio comparable to Parquet's are:
1. The in-memory columnar representation doesn't handle nested types, so array/map/struct values are not compressed.
2. Parquet may use more than one kind of compression ...

Re: [ compress in-memory column storage used in sparksql cache table ]

2015-09-02 Thread Nitin Goyal
I think Spark SQL's in-memory columnar cache already does compression. Check out the classes under the following path: https://github.com/apache/spark/tree/master/sql/core/src/main/scala/org/apache/spark/sql/columnar/compression The compression ratio is not as good as Parquet's, though. Thanks, Nitin
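For reference, a small hedged sketch of exercising that cache; spark.sql.inMemoryColumnarStorage.compressed and batchSize are the relevant settings, and the table name nums is made up for illustration:

```scala
// Hedged sketch of the compressed in-memory columnar cache.
// spark.sql.inMemoryColumnarStorage.compressed defaults to true; it is set
// explicitly here only to make the knob visible. Table name "nums" is made up.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("columnar-cache").setMaster("local[2]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "10000")

sc.parallelize(1 to 100000).map(i => (i, s"row_$i")).toDF("id", "payload")
  .registerTempTable("nums")

sqlContext.cacheTable("nums")                       // columnar buffers are built lazily
sqlContext.sql("SELECT COUNT(*) FROM nums").show()  // materializes the cached columns
```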

Re: taking an n number of rows from an RDD starting from an index

2015-09-02 Thread Juan Rodríguez Hortalá
Hi, maybe you could use zipWithIndex and filter to skip the first elements. For example, starting from: scala> sc.parallelize(100 to 120, 4).zipWithIndex.collect res12: Array[(Int, Long)] = Array((100,0), (101,1), (102,2), (103,3), (104,4), (105,5), (106,6), (107,7), (108,8), (109,9), (110,10), (11...
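A hedged sketch of completing that idea in the spark-shell (sc already defined); the startIndex and n values are arbitrary and only for illustration:

```scala
// Hedged sketch: zipWithIndex pairs each element with its global position,
// then a filter on that index keeps n elements starting at startIndex.
val rdd = sc.parallelize(100 to 120, 4)

val startIndex = 5L   // skip the first 5 elements
val n = 8L            // then take 8 elements

val slice = rdd.zipWithIndex()
  .filter { case (_, idx) => idx >= startIndex && idx < startIndex + n }
  .map { case (value, _) => value }

slice.collect()   // Array(105, 106, 107, 108, 109, 110, 111, 112)
```

Note that zipWithIndex triggers an extra Spark job to count the elements in each partition whenever the RDD has more than one partition.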