Are there any plans to run Spark on top of Succinct

2015-01-22 Thread Mick Davies

http://succinct.cs.berkeley.edu/wp/wordpress/

Looks like a really interesting piece of work that could dovetail well with
Spark.

I have recently been trying to optimize some queries I have running on Spark
on top of Parquet, but Parquet's support for predicate push-down, especially
for dictionary-encoded columns, is a bit limiting. I am not sure, but from a
cursory look it seems this format may help in this area.
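For context, a minimal sketch of the kind of query involved, assuming the
Spark 1.2-era SQLContext API and the spark.sql.parquet.filterPushdown setting;
the path, table, and column names below are made up:

// Hedged sketch: assumes spark-shell (sc available), a Spark ~1.2 build, and an
// existing Parquet dataset; /data/events.parquet and "country" are illustrative.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Ask Spark SQL to hand filters down to the Parquet reader
// (reportedly not on by default in that era).
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

val events = sqlContext.parquetFile("/data/events.parquet")
events.registerTempTable("events")

// The equality predicate is the kind of filter one would hope gets evaluated
// inside the Parquet scan (skipping row groups via statistics) rather than in Spark,
// which is where the dictionary-encoded-column limitation shows up.
sqlContext.sql("SELECT count(*) FROM events WHERE country = 'GB'").collect()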

Mick







Re: KNN for large data set

2015-01-22 Thread DEVAN M.S.
Thanks Xiangrui Meng, will try this.

Also, I found this: https://github.com/kaushikranjan/knnJoin.
Will this work with double data? Can we find out the z-value of
*Vector(10.3, 4.5, 3, 5)*?
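
A rough sketch (not from the knnJoin project; the scale factor and bit width are
arbitrary assumptions) of one way to get a z-value for a vector of doubles:
quantize each coordinate to a bounded non-negative integer, then interleave the bits.

def zValue(v: Array[Double], bitsPerDim: Int = 16, scale: Double = 1000.0): BigInt = {
  val maxVal = (1L << bitsPerDim) - 1
  // Quantize each coordinate; clip to the representable range.
  // Negative or very large values would need a domain-aware mapping first.
  val quantized = v.map { x =>
    val q = math.round(x * scale)
    math.min(math.max(q, 0L), maxVal)
  }
  // Interleave: take bit b of every dimension, from the most significant bit down.
  var z = BigInt(0)
  for (bit <- (bitsPerDim - 1) to 0 by -1; d <- quantized.indices) {
    z = (z << 1) | BigInt((quantized(d) >> bit) & 1L)
  }
  z
}

// e.g. the vector from the question above:
zValue(Array(10.3, 4.5, 3.0, 5.0))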

On Thu, Jan 22, 2015 at 12:25 AM, Xiangrui Meng  wrote:

> For large datasets, you need hashing in order to compute k-nearest
> neighbors locally. You can start with LSH + k-nearest in Google
> scholar: http://scholar.google.com/scholar?q=lsh+k+nearest -Xiangrui
>
> On Tue, Jan 20, 2015 at 9:55 PM, DEVAN M.S.  wrote:
> > Hi all,
> >
> > Please help me to find out best way for K-nearest neighbor using spark
> for
> > large data sets.
> >
>
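
A rough illustration of the LSH-then-local-kNN idea suggested above (not from the
thread; the random-hyperplane hashing, Euclidean distance, and tiny inline data set
are all assumptions), written spark-shell style:

import scala.util.Random

val dim = 4
val numPlanes = 8
val k = 2
val rng = new Random(42)
// Random hyperplanes; points on the same side of all of them share a bucket.
val planes = Array.fill(numPlanes)(Array.fill(dim)(rng.nextGaussian()))

def dot(a: Array[Double], b: Array[Double]): Double =
  a.indices.map(i => a(i) * b(i)).sum

def dist(a: Array[Double], b: Array[Double]): Double =
  math.sqrt(a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum)

// Bucket id: one bit per hyperplane, set when the point lies on its positive side.
def bucket(v: Array[Double]): Int =
  planes.zipWithIndex.map { case (p, i) => if (dot(p, v) >= 0) 1 << i else 0 }.sum

val points = sc.parallelize(Seq(
  Array(10.3, 4.5, 3.0, 5.0),
  Array(10.1, 4.4, 2.9, 5.2),
  Array(0.5, 7.0, 1.0, 9.0),
  Array(0.4, 7.2, 1.1, 8.8)
))

// Approximate kNN: only points that land in the same bucket are compared.
val approxKnn = points
  .keyBy(bucket)
  .groupByKey()
  .flatMapValues { pts =>
    val arr = pts.toArray
    arr.map(p => (p.toSeq, arr.filter(_ ne p).sortBy(q => dist(p, q)).take(k).map(_.toSeq)))
  }

approxKnn.collect().foreach(println)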


Re: Are there any plans to run Spark on top of Succinct

2015-01-22 Thread Dean Wampler
Interesting. I was wondering recently if anyone has explored working with
compressed data directly.

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition (O'Reilly)
Typesafe 
@deanwampler 
http://polyglotprogramming.com

On Thu, Jan 22, 2015 at 2:59 AM, Mick Davies 
wrote:

>
> http://succinct.cs.berkeley.edu/wp/wordpress/
>
> Looks like a really interesting piece of work that could dovetail well with
> Spark.
>
> I have recently been trying to optimize some queries I have running on Spark
> on top of Parquet, but Parquet's support for predicate push-down, especially
> for dictionary-encoded columns, is a bit limiting. I am not sure, but from a
> cursory look it seems this format may help in this area.
>
> Mick
>
>
>
>
>


query planner design doc?

2015-01-22 Thread Nicholas Murphy
Hi-

Quick question: is there a design doc (or something more than “look at the 
code”) for the query planner for Spark SQL (i.e., the component that 
takes…Catalyst?…operator trees and translates them into Spark operations)?

Thanks,
Nick



Re: query planner design doc?

2015-01-22 Thread Michael Armbrust
Here is the initial design document for Catalyst:
https://docs.google.com/document/d/1Hc_Ehtr0G8SQUg69cmViZsMi55_Kf3tISD9GPGU5M1Y/edit

Strategies (many of which are in SparkStrategies.scala) are the part that
creates the physical operators from a Catalyst logical plan. These
operators have execute() methods that actually call RDD operations.
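
As a rough illustration (assuming the Spark 1.2-era SQLContext in spark-shell,
with made-up table and column names), the planner's input and output can be
inspected through queryExecution:

case class Record(id: Int, value: String)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD

val records = sc.parallelize(Seq(Record(1, "a"), Record(2, "b")))
records.registerTempTable("records")

val query = sqlContext.sql("SELECT value FROM records WHERE id = 1")
println(query.queryExecution.analyzed)      // Catalyst logical plan after analysis
println(query.queryExecution.optimizedPlan) // after the Catalyst optimizer
println(query.queryExecution.executedPlan)  // physical operators chosen by the strategies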

On Thu, Jan 22, 2015 at 3:19 PM, Nicholas Murphy 
wrote:

> Hi-
>
> Quick question: is there a design doc (or something more than “look at the
> code”) for the query planner for Spark SQL (i.e., the component that
> takes…Catalyst?…operator trees and translates them into Spark operations)?
>
> Thanks,
> Nick
>
>


Re: spark 1.1.0 (w/ hadoop 2.4) vs aws java sdk 1.7.2

2015-01-22 Thread William-Smith
I have had the same issue while using HttpClient from Spark Streaming on AWS EMR
to post to a Node.js server.

I have found, using
ClassLoader.getResource("org/apache/http/client/HttpClient"), that the class
is being loaded from the spark-assembly-1.1.0-hadoop2.4.0.jar.

That in itself is not the issue, because the version is 4.2.5, the same
version I am using successfully on my local machine with Hadoop CDH 5.



The issue is that HttpClient relies on HttpCore, and there is an old
commons-httpcore-1.3.jar as well as httpcore-4.5.2 in the spark-assembly
jar.

It looks like the old one is getting loaded first.
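
For what it's worth, here is a small sketch (plain JVM reflection, nothing
Spark-specific; the class names are just the ones I would check) of how to
confirm which jar each class is actually served from:

val classesToCheck = Seq(
  "org.apache.http.client.HttpClient", // from httpclient
  "org.apache.http.HttpEntity"         // from httpcore
)
classesToCheck.foreach { name =>
  val resource = name.replace('.', '/') + ".class"
  val url = Thread.currentThread().getContextClassLoader.getResource(resource)
  // Prints a jar:file:...!/... URL that names the jar the class is loaded from.
  println(name + " -> " + url)
}
// To check on the executors rather than the driver, run the same lookup inside
// an RDD operation, e.g. sc.parallelize(1 to 1).map(...).collect().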

So the fix might be to build the Spark jar myself without the httpcore-1.3
and replace it on bootstrap.
I will keep you posted on the outcome.








Spark performance gains for small queries

2015-01-22 Thread Saumitra Shahapure (Vizury)
Hello,

We were comparing the performance of some of our production Hive queries
between Hive and Spark. We compared Hive (0.13) + Hadoop (1.2.1) against both
Spark 0.9 and 1.1, and could see that the performance gains in Spark have been
good.

We tried a very simple query,
select count(*) from T where col3=123
in both Spark SQL and Hive (with hive.map.aggr=true) and found that Spark's
performance was 2x better than Hive's (120 sec for Hive vs 60 sec for Spark).
Table T is stored in S3 and consists of a single 600 MB GZIP file.
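
As an aside, a minimal sketch of how the Spark SQL side of such a comparison
could be run, assuming Spark 1.1's HiveContext in spark-shell, with T and col3
as described above:

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
// HiveContext parses the query with the HiveQL dialect against the metastore table T.
val counts = hiveContext.sql("select count(*) from T where col3=123")
counts.collect().foreach(println)
// The single 600 MB GZIP file is not splittable, so the scan runs as one task here too.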

My question is: why is Spark faster than Hive here? In both cases, the file
will be downloaded and uncompressed, and the lines will be counted by a single
process. In the Hive case, the reducer will be an identity function, since
hive.map.aggr is true.

Note that disk spills and network I/O are very low in Hive's case as well.
--
Regards,
Saumitra Shahapure