Spark SQL JDBC Connectivity

2014-05-28 Thread Venkat Subramanian
We are planning to use the latest Spark SQL on RDDs. If a third party application wants to connect to Spark via JDBC, does Spark SQL have support? (We want to avoid going through the Shark/Hive JDBC layer as we need good performance). BTW, we also want to do the same for Spark Streaming - With Spark SQ

Re: Integration issue between Apache Shark-0.9.1 (with in-house hive-0.11) and pre-existing CDH4.6 HIVE-0.10 server

2014-05-28 Thread bijoy deb
Hi, My shark-env.sh is already pointing to the hadoop2 cluster: export HADOOP_HOME="/opt/cloudera/parcels/CDH-4.6.0-1.cdh4.6.0.p0.26/lib/hadoop" Both the hadoop cluster as well as the embedded hadoop jars within Shark are of version 2.0.0. Any more suggestions please? Thanks On Wed, May 28, 2

Re: Spark on an HPC setup

2014-05-28 Thread Jeremy Freeman
Hi Sid, We are successfully running Spark on an HPC, it works great. Here's info on our setup / approach. We have a cluster with 256 nodes running Scientific Linux 6.3 and scheduled by Univa Grid Engine. The environment also has a DDN GridScalar running GPFS and several EMC Isilon clusters se

Spark Stand-alone mode job not starting (akka Connection refused)

2014-05-28 Thread T.J. Alumbaugh
I've been trying for several days now to get a Spark application running in stand-alone mode, as described here: http://spark.apache.org/docs/latest/spark-standalone.html I'm using pyspark, so I've been following the example here: http://spark.apache.org/docs/0.9.1/quick-start.html#a-standalone-

Re: Python, Spark and HBase

2014-05-28 Thread twizansk
The code which causes the error is: sc = SparkContext("local", "My App") rdd = sc.newAPIHadoopFile( name, 'org.apache.hadoop.hbase.mapreduce.TableInputFormat', 'org.apache.hadoop.hbase.io.ImmutableBytesWritable', 'org.apache.hadoop.hbase.client

Re: Python, Spark and HBase

2014-05-28 Thread twizansk
In my code I am not referencing PythonRDD or PythonRDDnewAPIHadoopFile at all. I am calling SparkContext.newAPIHadoopFile with: inputformat_class='org.apache.hadoop.hbase.mapreduce.TableInputFormat' key_class='org.apache.hadoop.hbase.io.ImmutableBytesWritable', value_class='org.apache.hadoop.hba

Re: Checking spark cache percentage programatically. And how to clear cache.

2014-05-28 Thread Matei Zaharia
You can remove cached RDDs by calling unpersist() on them. You can also use SparkContext.getRDDStorageInfo to get info on cache usage, though this is a developer API so it may change in future versions. We will add a standard API eventually but this is just very closely tied to framework intern
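
A rough sketch of the two calls Matei mentions, assuming the Spark 1.0 Scala API (the RDD and sizes are illustrative; getRDDStorageInfo is the developer API referred to above):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("cache-check"))
    val rdd = sc.parallelize(1 to 1000000).cache()
    rdd.count()  // materialize so the cache actually fills

    // Fraction of this RDD's partitions currently held in the cache
    val fractionCached = sc.getRDDStorageInfo
      .find(_.id == rdd.id)
      .map(info => info.numCachedPartitions.toDouble / info.numPartitions)
      .getOrElse(0.0)

    rdd.unpersist()  // evict it before caching a different RDD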

Re: Python, Spark and HBase

2014-05-28 Thread Matei Zaharia
It sounds like you made a typo in the code — perhaps you’re trying to call self._jvm.PythonRDDnewAPIHadoopFile instead of self._jvm.PythonRDD.newAPIHadoopFile? There should be a dot before the new. Matei On May 28, 2014, at 5:25 PM, twizansk wrote: > Hi Nick, > > I finally got around to do

Checking spark cache percentage programatically. And how to clear cache.

2014-05-28 Thread Sung Hwan Chung
Hi, Is there a programmatic way of checking whether an RDD has been 100% cached or not? I'd like to do this to have two different code paths. Additionally, how do you clear the cache (e.g. if you want to cache different RDDs and you'd like to clear an existing cached RDD)? Thanks!

Re: Python, Spark and HBase

2014-05-28 Thread twizansk
Hi Nick, I finally got around to downloading and building the patch. I pulled the code from https://github.com/MLnick/spark-1/tree/pyspark-inputformats I am running on a CDH5 node. While the code in the CDH branch is different from spark master, I do believe that I have resolved any inconsist

Re: GraphX partition problem

2014-05-28 Thread Ankur Dave
I've been trying to reproduce this but I haven't succeeded so far. For example, on the web-Google graph, I get the expected results both on v0.9.1-handle-empty-partitions and on master: // Load web-Google and run connected components import org.apache
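
For reference, a minimal sketch of that kind of check, assuming the GraphX API (the edge-list path is a placeholder):

    import org.apache.spark.SparkContext._
    import org.apache.spark.graphx.GraphLoader

    // Load web-Google as an edge list and run connected components
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/web-Google.txt")
    val cc = graph.connectedComponents().vertices

    // Component sizes, e.g. to compare results across runs
    val componentSizes = cc.map { case (_, comp) => (comp, 1L) }.reduceByKey(_ + _)
    println(componentSizes.count() + " components")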

Re: Spark Memory Bounds

2014-05-28 Thread Keith Simmons
Thanks! Sounds like my rough understanding was roughly right :) Definitely understand cached RDDs can add to the memory requirements. Luckily, like you mentioned, you can configure spark to flush that to disk and bound its total size in memory via spark.storage.memoryFraction, so I have a pretty
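
A small sketch of the configuration being described, assuming the Scala API (the fraction and path are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setAppName("memory-bounds")
      .set("spark.storage.memoryFraction", "0.4")  // cap cached RDDs at ~40% of the heap

    val sc = new SparkContext(conf)
    val data = sc.textFile("hdfs:///data/big-input.txt")
    data.persist(StorageLevel.MEMORY_AND_DISK)  // overflow spills to disk instead of failing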

Re: Spark 1.0: slf4j version conflicts with pig

2014-05-28 Thread Ryan Compton
posted a JIRA https://issues.apache.org/jira/browse/SPARK-1952 On Wed, May 28, 2014 at 1:14 PM, Ryan Compton wrote: > Remark, just including the jar built by sbt will produce the same > error, i.e. this pig script will fail: > > REGISTER > /usr/share/osi1/spark-1.0.0/assembly/target/scala-2.10/s

Re: Comprehensive Port Configuration reference?

2014-05-28 Thread Andrew Ash
Hmm, those do look like 4 listening ports to me. PID 3404 is an executor and PID 4762 is a worker? This is a standalone cluster? On Wed, May 28, 2014 at 8:22 AM, Jacob Eisinger wrote: > Howdy Andrew, > > Here is what I ran before an application context was created (other > services have been

Re: Invalid Class Exception

2014-05-28 Thread Suman Somasundar
On 5/27/2014 1:28 PM, Marcelo Vanzin wrote: On Tue, May 27, 2014 at 1:05 PM, Suman Somasundar wrote: I am running this on a Solaris machine with logical partitions. All the partitions (workers) access the same Spark folder. Can you check whether you have multiple versions of the offending cla

Re: Spark 1.0: slf4j version conflicts with pig

2014-05-28 Thread Ryan Compton
Remark, just including the jar built by sbt will produce the same error, i.e. this pig script will fail: REGISTER /usr/share/osi1/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar; edgeList0 = LOAD '/user/rfcompton/twitter-mention-networks/bidirectional

Re: Spark 1.0: slf4j version conflicts with pig

2014-05-28 Thread Ryan Compton
It appears to be Spark 1.0 related. I made a pom.xml with a single dependency on Spark, registering the resulting jar created the error. Spark 1.0 was compiled via $ SPARK_HADOOP_VERSION=0.20.2-cdh3u4 sbt/sbt assembly The pom.xml, as well as some other information, is below. The only thing that s

Re: K-NN by efficient sparse matrix product

2014-05-28 Thread Christian Jauvin
Thank you for your answer. Would you have by any chance some example code (even fragmentary) that I could study? On 28 May 2014 14:04, Tom Vacek wrote: > Maybe I should add: if you can hold the entire matrix in memory, then this > is embarrassingly parallel. If not, then the complications arise.

A Standalone App in Scala: Standalone mode issues

2014-05-28 Thread jaranda
During the last few days I've been trying to deploy a Scala job to a standalone cluster (master + 4 workers) without much success, although it worked perfectly when launching it from the spark shell, that is, using the Scala REPL (pretty strange, this would mean my cluster config was actually corre

Re: Re: spark table to hive table

2014-05-28 Thread Michael Armbrust
On Tue, May 27, 2014 at 6:08 PM, JaeBoo Jung wrote: > I already tried HiveContext as well as SqlContext. > > But it seems that Spark's HiveContext is not completely same as Apache > Hive. > > For example, SQL like 'SELECT RANK() OVER(ORDER BY VAL1 ASC) FROM TEST > LIMIT 10' works fine in Apache
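
For context, a minimal sketch of the HiveContext entry point being compared, assuming Spark 1.0 (the table and query are placeholders; the RANK() OVER query quoted above is the part reported as behaving differently from Apache Hive):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    // Plain queries run as expected through HiveContext
    hiveContext.hql("SELECT val1 FROM test ORDER BY val1 LIMIT 10").collect()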

Re: Persist and unpersist

2014-05-28 Thread Ankur Dave
Oh, I see. Another idea would be to provide something like sc.prune(a, b, c) that traverses the dependency graph of RDDs a, b, c and unpersists any cached RDDs not referenced by any other RDD. In this case you could store the return value of blowUp and call prune on it after line 9. Ankur

Re: K-NN by efficient sparse matrix product

2014-05-28 Thread Tom Vacek
The problem with matrix multiplication is that the amount of data blows up between the mapper and the reducer, and the shuffle operation is very slow. I have not ever tried this, but the shuffle can be avoided by making use of the broadcast. Say we have M = L*R. We do a column decomposition on R
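
A minimal sketch of the broadcast idea (not Tom's exact decomposition): keep the small factor L on the driver, broadcast it, and let each task compute its own columns of M = L*R locally, so no shuffle is needed. The matrices here are toy placeholders:

    import org.apache.spark.SparkContext._

    // L is a small m x k dense matrix kept on the driver
    val L: Array[Array[Double]] = Array(Array(1.0, 2.0), Array(3.0, 4.0))
    val Lb = sc.broadcast(L)

    // R is distributed as (columnIndex, column) pairs, each column of length k
    val R = sc.parallelize(Seq((0, Array(1.0, 0.0)), (1, Array(0.0, 1.0))))

    // Each task computes one column of M = L * R using only broadcast data
    val M = R.mapValues { col =>
      Lb.value.map(row => row.zip(col).map { case (a, b) => a * b }.sum)
    }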

Re: Problem using Spark with Hbase

2014-05-28 Thread Mayur Rustagi
Try this.. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Wed, May 28, 2014 at 7:40 PM, Vibhor Banga wrote: > Any one who has used spark this way or has faced similar issue, please > help. > > Thanks, > -Vibhor > > > O

Re: K-NN by efficient sparse matrix product

2014-05-28 Thread Tom Vacek
Maybe I should add: if you can hold the entire matrix in memory, then this is embarrassingly parallel. If not, then the complications arise. On Wed, May 28, 2014 at 1:00 PM, Tom Vacek wrote: > The problem with matrix multiplication is that the amount of data blows up > between the mapper and t

Re: Integration issue between Apache Shark-0.9.1 (with in-house hive-0.11) and pre-existing CDH4.6 HIVE-0.10 server

2014-05-28 Thread Andrew Ash
IPC version 7 vs 4 is Hadoop2 vs Hadoop1. I'm guessing your Hadoop cluster is on a different version than the .jars you're using in Shark. http://stackoverflow.com/questions/16491547/pig-to-hadoop-issue-server-ipc-version-7-cannot-communicate-with-client-version Can you try finding matching jars

Re: Spark Streaming RDD to Shark table

2014-05-28 Thread Chang Lim
OK...I needed to set the JVM class.path for the worker to find the fb class: env.put("SPARK_JAVA_OPTS", "-Djava.class.path=/home/myInc/hive-0.9.0-bin/lib/libfb303.jar"); Now I am seeing the following "spark.httpBroadcast.uri" error. What am I missing? java.util.NoSuchElementException: spark.http

K-NN by efficient sparse matrix product

2014-05-28 Thread Christian Jauvin
Hi, I'm new to Spark and Hadoop, and I'd like to know if the following problem is solvable in terms of Spark's primitives. To compute the K-nearest neighbours of a N-dimensional dataset, I can multiply my very large normalized sparse matrix by its transpose. As this yields all pairwise distance v

Re: rdd ordering gets scrambled

2014-05-28 Thread Michael Malak
Mohit Jaggi: A workaround is to use zipWithIndex (to appear in Spark 1.0, but if you're still on 0.9x you can swipe the code from  https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/ZippedWithIndexRDD.scala  ), map it to (x => (x._2,x._1)) and then sortByKey. Sp
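
In sketch form, the workaround looks roughly like this, assuming Spark 1.0 where zipWithIndex is built in (the input path is a placeholder):

    import org.apache.spark.SparkContext._

    val rdd = sc.textFile("hdfs:///data/input.txt")
    // Pair each element with its original position...
    val indexed = rdd.zipWithIndex().map { case (x, i) => (i, x) }
    // ...so the original order can be restored after any shuffle
    val restored = indexed.sortByKey().values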

Integration issue between Apache Shark-0.9.1 (with in-house hive-0.11) and pre-existing CDH4.6 HIVE-0.10 server

2014-05-28 Thread bijoy deb
Hi all, I have installed Apache Shark 0.9.1 on my machine, which comes bundled with the hive-0.11 version of the Hive jars. I am trying to integrate this with my pre-existing CDH-4.6 version of the Hive server, which is version 0.10. On pointing HIVE_HOME in spark-env.sh to the cloudera version of the hive

Re: Java RDD structure for Matrix predict?

2014-05-28 Thread Sandeep Parikh
Wisely, is mapToPair in Spark 0.9.1 or 1.0? I'm running the former and didn't see that method available. I think the issue is that predict() is expecting an RDD containing a tuple of ints and not Integers. So if I use JavaPairRDD with my original code snippet, things seem to at least compile for n
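
For reference, a Scala sketch of the call being discussed, assuming MLlib's ALS (ratings and ids are toy values); predict() takes an RDD of (user, product) integer pairs:

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    val ratings = sc.parallelize(Seq(Rating(1, 10, 4.0), Rating(2, 10, 3.0)))
    val model = ALS.train(ratings, 5, 10)  // rank = 5, iterations = 10

    val usersProducts = sc.parallelize(Seq((1, 10), (2, 10)))
    val predictions = model.predict(usersProducts)  // RDD[Rating]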

Re: Comprehensive Port Configuration reference?

2014-05-28 Thread Jacob Eisinger
Howdy Andrew, Here is what I ran before an application context was created (other services have been deleted): # netstat -l -t tcp -p --numeric-ports Active Internet connections (only servers) Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name

Spark on an HPC setup

2014-05-28 Thread Sidharth Kashyap
Hi, Has anyone tried to get Spark working on an HPC setup? If yes, can you please share your learnings and how you went about doing it? An HPC setup typically comes bundled with a dynamically allocated cluster and a very efficient scheduler. Configuring Spark standalone in this mode of operation is

RE: GraphX partition problem

2014-05-28 Thread Zhicharevich, Alex
Hi Ankur, We've built it from the git link you sent, and we don't get the exception anymore. However, we've been seeing strange nondeterministic behavior from GraphX. We compute connected components on a graph of ~900K edges. We ran the spark job several times on the same input graph and got

Re: Problem using Spark with Hbase

2014-05-28 Thread Vibhor Banga
Any one who has used spark this way or has faced similar issue, please help. Thanks, -Vibhor On Wed, May 28, 2014 at 6:03 PM, Vibhor Banga wrote: > Hi all, > > I am facing issues while using spark with HBase. I am getting > NullPointerException at org.apache.hadoop.hbase.TableName.valueOf > (Ta

Re: Reading bz2 files that do not end with .bz2

2014-05-28 Thread Mayur Rustagi
You can use the Hadoop API & provide an input/output reader & a Hadoop configuration file to read the data. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Wed, May 28, 2014 at 7:22 PM, Laurent T wrote: > Hi, > > I
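
One possible sketch of that approach: list the files on the driver and decompress each one through Hadoop's BZip2Codec explicitly, instead of relying on sc.textFile's extension-based codec detection. The directory is a placeholder, and each file is read whole inside a single task, so this suits smallish files:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.io.compress.BZip2Codec
    import org.apache.hadoop.util.ReflectionUtils
    import scala.io.Source

    val paths = FileSystem.get(new Configuration())
      .listStatus(new Path("/data/bz2-without-extension"))
      .map(_.getPath.toString)

    val lines = sc.parallelize(paths).flatMap { p =>
      val conf = new Configuration()
      val codec = ReflectionUtils.newInstance(classOf[BZip2Codec], conf)
      val in = codec.createInputStream(FileSystem.get(conf).open(new Path(p)))
      Source.fromInputStream(in).getLines().toList
    }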

Reading bz2 files that do not end with .bz2

2014-05-28 Thread Laurent T
Hi, I have a bunch of files that are bz2 compressed but do not have the extension .bz2. Is there any way to force Spark to read them as bz2 files using sc.textFile? FYI, if I add the .bz2 extension to the file it works fine, but the process that creates those files can't do that and I'd like to fin

Re: Spark Memory Bounds

2014-05-28 Thread Christopher Nguyen
Keith, please see inline. -- Christopher T. Nguyen Co-founder & CEO, Adatao linkedin.com/in/ctnguyen On Tue, May 27, 2014 at 7:22 PM, Keith Simmons wrote: > A dash of both. I want to know enough that I can "reason about", rather > than "strictly control", the amount of me

Re: Writing RDDs from Python Spark program (pyspark) to HBase

2014-05-28 Thread Nick Pentreath
It's not possible currently to write anything other than text (or pickle files I think in 1.0.0 or if not then in 1.0.1) from PySpark. I have an outstanding pull request to add READING any InputFormat from PySpark, and after that is in I will look into OutputFormat too. What does your data look l

Re: Akka Connection refused - standalone cluster using spark-0.9.0

2014-05-28 Thread Gino Bustelo
I've been playing with the amplab docker scripts and I needed to set spark.driver.host to the driver host ip. One that all spark processes can get to. > On May 28, 2014, at 4:35 AM, jaranda wrote: > > Same here, got stuck at this point. Any hints on what might be going on? > > > > -- > Vie
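
A minimal sketch of what that looks like when building the context (the master URL and address are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("spark://master-host:7077")
      .setAppName("driver-host-example")
      .set("spark.driver.host", "10.0.0.5")  // an address every worker container can reach
    val sc = new SparkContext(conf)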

Writing RDDs from Python Spark program (pyspark) to HBase

2014-05-28 Thread gaurav.dasgupta
Hi, I am unable to understand how to write data directly to an HBase table from a Spark (pyspark) Python program. Is this possible in the current Spark releases? If so, can someone provide an example code snippet to do this? Thanks in advance. Regards, Gaurav

Inter and Intra Cluster Density in KMeans

2014-05-28 Thread Stuti Awasthi
Hi, I wanted to calculate the inter-cluster density and intra-cluster density from the clusters generated by KMeans. How can I achieve that? Is there any existing code/API to use for this purpose? Thanks Stuti Awasthi
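
MLlib does not expose inter/intra-cluster density directly at this point, but a rough sketch of computing the intra-cluster part by hand from the fitted centers, assuming Spark 1.0's MLlib API (toy data, k = 2):

    import org.apache.spark.SparkContext._
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.2)))

    val model = KMeans.train(points, 2, 20)

    // Mean distance of each point to its own cluster centre (intra-cluster density)
    val intraClusterDensity = points.map { p =>
      val c = model.clusterCenters(model.predict(p))
      math.sqrt(p.toArray.zip(c.toArray).map { case (a, b) => (a - b) * (a - b) }.sum)
    }.mean()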

Problem using Spark with Hbase

2014-05-28 Thread Vibhor Banga
Hi all, I am facing issues while using Spark with HBase. I am getting a NullPointerException at org.apache.hadoop.hbase.TableName.valueOf (TableName.java:288). Can someone please help resolve this issue? What am I missing? I am using the following snippet of code - Configuration config =
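
For comparison, the usual Scala shape of an HBase read, assuming the standard TableInputFormat (the table name is a placeholder); one common cause of an NPE in TableName.valueOf is the input table never being set on the configuration:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat

    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")

    val rows = sc.newAPIHadoopRDD(
      hbaseConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    println(rows.count())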

Re: Persist and unpersist

2014-05-28 Thread Daniel Darabos
On Wed, May 28, 2014 at 12:08 AM, Ankur Dave wrote: > I think what's desired here is for input to be unpersisted automatically > as soon as result is materialized. I don't think there's currently a way > to do this, but the usual workaround is to force result to be > materialized immediately and
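
In sketch form (names and paths are illustrative):

    val input = sc.textFile("hdfs:///data/raw.txt").cache()
    val result = input.map(_.length).cache()

    result.count()     // force result to materialize while input is still cached
    input.unpersist()  // input's cache space can now be reclaimed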

RE: K-nearest neighbors search in Spark

2014-05-28 Thread Carter
Hi Andrew, Thank you for your info. I will have a look at these links. Thanks, Carter Date: Tue, 27 May 2014 09:06:02 -0700 From: ml-node+s1001560n6436...@n3.nabble.com To: gyz...@hotmail.com Subject: Re: K-nearest neighbors search in Spark Hi Carter, In Spark 1.0 there will be an

RE: K-nearest neighbors search in Spark

2014-05-28 Thread Carter
Hi Krishna, Thank you very much for your code. I will use it as a good start point. Thanks, Carter Date: Tue, 27 May 2014 16:42:39 -0700 From: ml-node+s1001560n6455...@n3.nabble.com To: gyz...@hotmail.com Subject: Re: K-nearest neighbors search in Spark Carter, Just as a quick &

Re: Akka Connection refused - standalone cluster using spark-0.9.0

2014-05-28 Thread jaranda
Same here, got stuck at this point. Any hints on what might be going on?