RE: Using Pandas/Scikit-learn in PySpark

2015-05-09 Thread Felix C
Your Python job runs in a Python process that interacts with the JVM. You do need a matching Python version, and the same dependent packages, on the driver and all worker nodes if you run in YARN mode. --- Original Message --- From: "Bin Wang" Sent: May 8, 2015 9:56 PM To: "Apache.Spark.User" Subject: Usin

Re: Spark and binary files

2015-05-09 Thread ayan guha
Spark uses whatever InputFormat you specify, and the number of splits equals the number of RDD partitions. You may want to take a deeper look at SparkContext.newAPIHadoopRDD to load your data. On Sat, May 9, 2015 at 4:48 PM, tog wrote: > Hi > > I have an application that currently runs using MR. It currently starts
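
For reference, a minimal sketch of the suggested approach, reusing an existing Hadoop InputFormat through SparkContext.newAPIHadoopFile (the path-based sibling of newAPIHadoopRDD). MyBinaryInputFormat, the path, and the key/value types are placeholders for whatever the existing MR job uses:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{BytesWritable, LongWritable}

// Hypothetical InputFormat: substitute the class the existing MR job uses.
val conf = new Configuration()
val rdd = sc.newAPIHadoopFile(
  "hdfs:///data/binary",            // assumed input path
  classOf[MyBinaryInputFormat],     // your existing InputFormat
  classOf[LongWritable],            // its key type
  classOf[BytesWritable],           // its value type
  conf)
// One RDD partition is created per input split produced by the InputFormat.
println(rdd.partitions.length)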

Re: Submit Spark application in cluster mode and supervised

2015-05-09 Thread James King
Many thanks Silvio. What I found out later is that if there is a catastrophic failure, and all the daemons fail at the same time before any fail-over takes place, then when you bring the cluster back up the job resumes only on the Master it was last running on before the failure. Otherwis

Re: Duplicate entries in output of mllib column similarities

2015-05-09 Thread Richard Bolkey
Hi Reza, After a bit of digging, it turns out I had my previous issue slightly wrong. We're not getting duplicate (i,j) entries, but we are getting transposed entries (i,j) and (j,i) with potentially different scores. We assumed the output would be a triangular matrix. Still, let me know if that's expected

Re: Spark SQL and Hive interoperability

2015-05-09 Thread barge.nilesh
Hi, try your first method, but create an external table in Hive, like: hive -e "CREATE EXTERNAL TABLE people (name STRING, age INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';" -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-and-Hive-interop
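
For reference, once the external table exists in the metastore it is visible to Spark SQL through a HiveContext; a minimal sketch (table and columns are the ones from the thread):

import org.apache.spark.sql.hive.HiveContext

// Data loaded into the external table's location by either Hive or Spark
// is queryable from both sides, since the definition lives in the metastore.
val hiveContext = new HiveContext(sc)
hiveContext.sql("SELECT name, age FROM people WHERE age > 21").collect().foreach(println)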

Re: Spark SQL: STDDEV working in Spark Shell but not in a standalone app

2015-05-09 Thread Michael Armbrust
Are you perhaps using a HiveContext in the shell but a SQLContext in your app? I don't think we natively implement stddev until 1.4.0. On Fri, May 8, 2015 at 4:44 PM, barmaley wrote: > Given a registered table from data frame, I'm able to execute queries like > sqlContext.sql("SELECT STDDEV(col1
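
For reference, a sketch of the distinction (table and column names are assumptions): spark-shell built with Hive support exposes a HiveContext under the name sqlContext, which resolves Hive UDAFs such as stddev, while a standalone app constructing a plain SQLContext cannot resolve them before 1.4.0:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
hiveContext.sql("SELECT STDDEV(col1) FROM myTable")      // resolves via Hive's UDAFs
// val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// sqlContext.sql("SELECT STDDEV(col1) FROM myTable")    // fails before 1.4.0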

Re: Hash Partitioning and Dataframes

2015-05-09 Thread Michael Armbrust
Ah, unfortunately that is not possible today, as Catalyst has a logical notion of partitioning that is different from the one exposed by the RDD. A researcher at Databricks is considering allowing this kind of optimization for in-memory cached relations, though. Here is a WIP patch: https://github.com

How to implement an Evaluator for an ML pipeline?

2015-05-09 Thread Stefan H.
Hello everyone, I am stuck with the (experimental, I think) API for machine learning pipelines. I have a pipeline with just one estimator (ALS) and I want it to try different values for the regularization parameter. Therefore I need to supply an Evaluator that returns a value of type Double. I gue
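
For reference, a hedged sketch of a custom Evaluator, assuming the 1.3-era abstract method evaluate(dataset: DataFrame, paramMap: ParamMap): Double (check the signature in your Spark version) and assuming "rating" and "prediction" column names, which are not from the thread:

import org.apache.spark.ml.Evaluator
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.DataFrame

// Returns negated RMSE so that "larger is better", which is what
// CrossValidator expects when picking the best parameter set.
class RmseEvaluator extends Evaluator {
  override def evaluate(dataset: DataFrame, paramMap: ParamMap): Double = {
    val mse = dataset
      .selectExpr("CAST(rating AS DOUBLE)", "CAST(prediction AS DOUBLE)")
      .map(row => { val d = row.getDouble(0) - row.getDouble(1); d * d })
      .mean()
    -math.sqrt(mse)
  }
}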

Spark cannot access jar from HDFS!!

2015-05-09 Thread Ravindra
Hi All, I am trying to create custom UDFs with hiveContext as given below - scala> hiveContext.sql("CREATE TEMPORARY FUNCTION sample_to_upper AS 'com.abc.api.udf.MyUpper' USING JAR 'hdfs:///users/ravindra/customUDF2.jar'") I have put the UDF jar in HDFS at the path given above. The same comm

Re: Spark cannot access jar from HDFS!!

2015-05-09 Thread Michael Armbrust
That code path is entirely delegated to Hive. Does Hive support this? You might try instead using sparkContext.addJar. On Sat, May 9, 2015 at 12:32 PM, Ravindra wrote: > Hi All, > > I am trying to create custom UDFs with hiveContext as given below - > scala> hiveContext.sql("CREATE TEMPORARY
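
For reference, a sketch of the suggested workaround: ship the jar with sparkContext.addJar, then register the function without the USING JAR clause. The jar path and class name come from the original post; whether the driver-side classloader then resolves the class is version-dependent, and the usage query below assumes a hypothetical table named people:

sc.addJar("hdfs:///users/ravindra/customUDF2.jar")
hiveContext.sql("CREATE TEMPORARY FUNCTION sample_to_upper AS 'com.abc.api.udf.MyUpper'")
// Hypothetical usage:
hiveContext.sql("SELECT sample_to_upper(name) FROM people").show()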

Spark Streaming closes with Cassandra Connector

2015-05-09 Thread Sergio Jiménez Barrio
I am trying to save some data to Cassandra in an app with Spark Streaming: Messages.foreachRDD { . . . CassandraRDD.saveToCassandra("test","test") } When I run it, the app closes when I receive data, or it cannot connect to Cassandra. Any ideas? Thanks -- Atte. Sergio Jiménez
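
Hard to diagnose without the stack trace (see Gerard's reply below), but for reference, a minimal working pattern with the spark-cassandra-connector. The keyspace and table names come from the post; the host, the source stream, and the column names are assumptions:

import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// The Cassandra host must be on the SparkConf before the context is built;
// a wrong or missing host is a common cause of the app dying on startup.
val conf = new SparkConf()
  .setAppName("cassandra-streaming")
  .set("spark.cassandra.connection.host", "127.0.0.1")

val ssc = new StreamingContext(conf, Seconds(5))
val messages = ssc.socketTextStream("localhost", 9999)  // stand-in source

messages.foreachRDD { rdd =>
  // The RDD element type must line up with the listed columns.
  rdd.map(m => (m, 1)).saveToCassandra("test", "test", SomeColumns("word", "count"))
}

ssc.start()
ssc.awaitTermination()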

custom join using complex keys

2015-05-09 Thread Mathieu D
Hi folks, I need to join RDDs having composite keys like this: (K1, K2 ... Kn). The joining rule looks like this: * if left.K1 == right.K1, then we have a "true equality", and all K2 ... Kn are also equal. * if left.K1 != right.K1 but left.K2 == right.K2, I have a partial equality, and I also wa

Re: Duplicate entries in output of mllib column similarities

2015-05-09 Thread Reza Zadeh
Hi Richard, One reason that could be happening is that the rows of your matrix are SparseVectors whose entries aren't sorted by index. Is that the case?
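
For reference, what "sorted by index" means concretely when building the rows (a minimal sketch):

import org.apache.spark.mllib.linalg.Vectors

// Indices passed to Vectors.sparse should be strictly increasing; some
// versions do not validate this, and unsorted indices can silently
// corrupt downstream computations such as columnSimilarities.
val bad  = Vectors.sparse(5, Array(3, 1), Array(0.5, 2.0)) // unsorted: avoid
val good = Vectors.sparse(5, Array(1, 3), Array(2.0, 0.5)) // sorted by index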

Re: custom join using complex keys

2015-05-09 Thread Stéphane Verlet
Create a custom key class, implement the equals method, and make sure the hashCode method is consistent with it. Use that key to map and join your rows. On Sat, May 9, 2015 at 4:02 PM, Mathieu D wrote: > Hi folks, > > I need to join RDDs having composite keys like this: (K1, K2 ... Kn). > > The joining ru
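
For reference, a minimal sketch of such a key class, with String fields assumed. Note that equals and hashCode must agree, so a key whose equality falls back from K1 to K2 cannot have a consistent hashCode; the variant below compares on K1 only, and the K2-fallback case is usually handled as a second join:

// hashCode hashes exactly what equals compares, as required for
// hash-partitioned joins to bring matching keys together.
class CompositeKey(val k1: String, val k2: String) extends Serializable {
  override def equals(other: Any): Boolean = other match {
    case that: CompositeKey => this.k1 == that.k1
    case _ => false
  }
  override def hashCode: Int = k1.hashCode
}

// Usage sketch:
// left.map(r => (new CompositeKey(r.k1, r.k2), r))
//   .join(right.map(r => (new CompositeKey(r.k1, r.k2), r)))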

Re: Spark Streaming closes with Cassandra Connector

2015-05-09 Thread Gerard Maas
Hi Sergio, It would help if you added the error message + stack trace. -kr, Gerard. On Sat, May 9, 2015 at 11:32 PM, Sergio Jiménez Barrio < drarse.a...@gmail.com> wrote: > I am trying to save some data to Cassandra in an app with Spark Streaming: > > Messages.foreachRDD { > . . . > CassandraRDD.s

Re: SparkR: filter() function?

2015-05-09 Thread Shivaram Venkataraman
I replied on the SO post - the bug you ran into is a slightly different one, in the `show` method on RDDs. I've opened a PR to fix this at https://github.com/apache/spark/pull/6035 Thanks Shivaram On Wed, May 6, 2015 at 1:55 AM, himaeda wrote: > Has this issue re-appeared? > > I posted

Spark SQL and java.lang.RuntimeException

2015-05-09 Thread Nick Travers
I'm getting the following error when reading a table from Hive; note the misspelling 'Primitve' in the stack trace. I can't seem to find it anywhere else online, and it seems to occur only with this one particular table I am reading from. Occasionally the task will completely fail, other times it

Is the AMP lab done next February?

2015-05-09 Thread Justin Pihony
From my StackOverflow question: Is there a way to track whether Berkeley's AMP lab will indeed shut down next year? From their about site: The AMPLab is a five-year collaborative effort at UC Berkeley and

Re: Spark + Kinesis

2015-05-09 Thread Chris Fregly
hey vadim- sorry for the delay. if you're interested in trying to get Kinesis working one-on-one, shoot me a direct email and we'll get it going off-list. we can circle back and summarize our findings here. lots of people are using Spark Streaming+Kinesis successfully. would love to help you t

Re: JavaKinesisWordCountASLYARN Example not working on EMR

2015-05-09 Thread Chris Fregly
Ankur- can you confirm that you got the stock JavaKinesisWordCountASL example working on EMR per Chris' suggestion? i want to stay ahead of any issues that you may encounter with the Kinesis + Spark Streaming + EMR integration as this is a popular stack. Thanks! -Chris On Fri, Mar 27, 2015 at

Re: Spark + Kinesis

2015-05-09 Thread Vadim Bichutskiy
Thanks Chris! I was just looking to get back to Spark + Kinesis integration. Will be in touch shortly. Vadim On Sun, May 10, 2015 at 12:14 AM, Chris Fregly wrote: > hey vadim- > > sorry for the delay. > > if you're interested in trying to get Kinesis working one-on-one, shoot me > a direct em

Find KNN in Spark SQL

2015-05-09 Thread Dong Li
Hello experts, I'm new to Spark, and I want to find the K nearest neighbors in a huge, high-dimensional point dataset in a very short time. The scenario: the dataset contains more than 10 million points, each of dimension 200. I'm building a web service, to receive one new point at each request
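
For reference, the brute-force baseline for a single query point (a sketch; the RDD element type and all names are assumptions). Scanning 10 million 200-d points per request is unlikely to meet web-service latency, so an approximate index such as LSH is usually the next step:

import org.apache.spark.rdd.RDD

// Score every point against the query and keep the k smallest
// squared distances; takeOrdered avoids a full sort of the dataset.
def knn(points: RDD[Array[Double]], query: Array[Double], k: Int): Array[(Double, Array[Double])] = {
  val q = points.sparkContext.broadcast(query)
  points.map { p =>
    var d = 0.0
    var i = 0
    while (i < p.length) { val diff = p(i) - q.value(i); d += diff * diff; i += 1 }
    (d, p)
  }.takeOrdered(k)(Ordering.by((pair: (Double, Array[Double])) => pair._1))
}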

Does Spark not support NullWritable?

2015-05-09 Thread donhoff_h
Hi experts, I wrote a Spark program to write a SequenceFile. I found that if I used NullWritable as the key class of the SequenceFile, the program reported exceptions, but if I used BytesWritable or Text as the key class, it did not. Does Spark not support
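
For reference, one commonly suggested workaround (a sketch; the output path and data are assumptions): build the keys per-record with NullWritable.get() and write through saveAsNewAPIHadoopFile instead of saveAsSequenceFile:

import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat

val data = sc.parallelize(Seq(Array[Byte](1, 2), Array[Byte](3, 4)))
data
  .map(bytes => (NullWritable.get(), new BytesWritable(bytes)))  // keys built inside the task
  .saveAsNewAPIHadoopFile(
    "hdfs:///tmp/seqfile-out",                                   // assumed path
    classOf[NullWritable],
    classOf[BytesWritable],
    classOf[SequenceFileOutputFormat[NullWritable, BytesWritable]])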

Re: custom join using complex keys

2015-05-09 Thread ayan guha
This should work: base1 = sc.parallelize(setupRow(10),1) base2 = sc.parallelize(setupRow(10),1) df1 = ssc.createDataFrame(base1) df2 = ssc.createDataFrame(base2) df1.show() df2.show() df1.registerTempTable("df1") df2.registerTempTable("df2") j = ssc.sql("select df1