How to convert non-RDD data to an RDD

2014-10-11 Thread rapelly kartheek
Hi, I am trying to write a String that is not an RDD to HDFS. This data is a variable in the Spark scheduler code. None of the Spark file operations work because my data is not an RDD. So I tried using SparkContext.parallelize(data), but it throws an error: [error] /home/karthik/spark-1.0.0/core/s
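
A minimal sketch of the usual fix, assuming sc is the live SparkContext and the HDFS path is hypothetical; parallelize expects a Seq, so a bare String needs wrapping:

    val data: String = "some scheduler state"          // the non-RDD variable
    val rdd = sc.parallelize(Seq(data))                // one-element RDD[String]
    rdd.saveAsTextFile("hdfs:///tmp/scheduler-debug")  // write it out to HDFS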

Re: where are my python lambda functions run in yarn-client mode?

2014-10-11 Thread Davies Liu
Created JIRA for this: https://issues.apache.org/jira/browse/SPARK-3915 On Sat, Oct 11, 2014 at 12:40 PM, Evan Samanas wrote: > It's true that it is an implementation detail, but it's a very important one > to document because it has the possibility of changing results depending on > when I use t

Re: How To Implement More Than One Subquery in Scala/Spark

2014-10-11 Thread Ilya Ganelin
Because of how closures work in Scala, there is no support for nested map/RDD-based operations. Specifically, if you have Context a { Context b { } }, operations within context b, when distributed across nodes, will no longer have visibility of variables specific to context a because that
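
A short sketch of the restriction and a common workaround (sc is the SparkContext; the data is illustrative):

    val a = sc.parallelize(1 to 100)
    val b = sc.parallelize(50 to 60)
    // a.map(x => b.filter(_ == x).count())  // NOT supported: nests RDD ops
    // If b is small, bring it to the driver and broadcast it instead:
    val bSet = sc.broadcast(b.collect().toSet)
    val matches = a.filter(x => bSet.value.contains(x))
    // For two large datasets, key both RDDs and join rather than nesting.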

Re: Blog post: An Absolutely Unofficial Way to Connect Tableau to SparkSQL (Spark 1.1)

2014-10-11 Thread Matei Zaharia
Very cool Denny, thanks for sharing this! Matei On Oct 11, 2014, at 9:46 AM, Denny Lee wrote: > https://www.concur.com/blog/en-us/connect-tableau-to-sparksql > If you're wondering how to connect Tableau to SparkSQL - here are the steps > to connect Tableau to SparkSQL. > Enjoy!

Re: where are my python lambda functions run in yarn-client mode?

2014-10-11 Thread Evan Samanas
It's true that it is an implementation detail, but it's a very important one to document because it has the possibility of changing results depending on when I use take or collect. The issue I was running into was when the executor had a different operating system than the driver, and I was using
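
The locality detail is observable from Scala as well; a hedged sketch (hostnames will vary, and in Spark 1.x take() may evaluate the first partition locally in the driver JVM, while collect() always ships the closure to executors):

    import java.net.InetAddress
    val tagged = sc.parallelize(1 to 4, 4)
      .map(x => (x, InetAddress.getLocalHost.getHostName))
    println(tagged.take(1).mkString(","))    // may report the driver's host
    println(tagged.collect().mkString(","))  // reports executor hosts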

Streams: How do RDDs get Aggregated?

2014-10-11 Thread jay vyas
Hi Spark! I don't yet understand the semantics of RDDs in a streaming context very well. Are there any examples of how to implement CustomInputDStreams, with corresponding Receivers, in the docs? I've hacked together a custom stream, which is being opened and is consuming data internal
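
A minimal custom receiver sketch against the Spark 1.x streaming API; the generated data is a stand-in for a real source:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class CounterReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
      def onStart(): Unit = {
        // Spawn a thread so onStart() returns promptly, as the API expects.
        new Thread("counter-receiver") {
          override def run(): Unit = {
            var i = 0
            while (!isStopped()) { store("event-" + i); i += 1; Thread.sleep(100) }
          }
        }.start()
      }
      def onStop(): Unit = {}  // the isStopped() check above ends the loop
    }
    // Usage: ssc.receiverStream(new CounterReceiver) yields a DStream[String];
    // each batch interval's RDD aggregates whatever store() delivered in it.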

RE: Spark SQL parser bug?

2014-10-11 Thread Mohammed Guller
I tried even without the “T” and it still returns an empty result: scala> val sRdd = sqlContext.sql("select a from x where ts >= '2012-01-01 00:00:00';") sRdd: org.apache.spark.sql.SchemaRDD = SchemaRDD[35] at RDD at SchemaRDD.scala:103 == Query Plan == == Physical Plan == Project [a#0] ExistingR
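
An untested workaround sketch, if the 1.1 parser accepts an explicit cast of the literal (so the timestamp column is not compared against a plain string):

    val sRdd = sqlContext.sql(
      "select a from x where ts >= cast('2012-01-01 00:00:00' as timestamp)")
    sRdd.collect().foreach(println)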

How To Implement More Than One Subquery in Scala/Spark

2014-10-11 Thread arthur.hk.c...@gmail.com
Hi, My Spark version is v1.1.0 and Hive is 0.12.0. I need to use more than one subquery in my Spark SQL; below are my sample table structures and a SQL query that contains more than one subquery. Question 1: How to load a Hive table into Scala/Spark? Question 2: How to implement a SQL_WITH_MORE_THAN_O
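
A sketch for Question 1 under the usual assumptions (a configured Hive metastore; table and column names here are illustrative). In 1.1, HiveContext.sql() parses HiveQL, which also allows subqueries as aliased inline views in the FROM clause:

    import org.apache.spark.sql.hive.HiveContext
    val hiveContext = new HiveContext(sc)
    val orders = hiveContext.sql("SELECT * FROM orders")  // load a Hive table
    val result = hiveContext.sql(
      """SELECT t1.id, t2.total
        |FROM (SELECT id FROM orders WHERE status = 'OPEN') t1
        |JOIN (SELECT id, SUM(amount) AS total FROM line_items GROUP BY id) t2
        |ON t1.id = t2.id""".stripMargin)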

Re: how to find the sources for spark-project

2014-10-11 Thread Ted Yu
I found this on computer where I built Spark: $ jar tvf /homes/hortonzy/.m2/repository//org/spark-project/hive/hive-exec/0.13.1/hive-exec-0.13.1.jar | grep ParquetHiveSerDe 2228 Mon Jun 02 12:50:16 UTC 2014 org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe$1.class 1442 Mon Jun 02 12:

Fwd: how to find the sources for spark-project

2014-10-11 Thread Sadhan Sood
-- Forwarded message -- From: Sadhan Sood Date: Sat, Oct 11, 2014 at 10:26 AM Subject: Re: how to find the sources for spark-project To: Stephen Boesch Thanks, I still didn't find it - is it under some particular branch? More specifically, I am looking to modify the file: Parqu

Re: return probability \ confidence instead of actual class

2014-10-11 Thread Adamantios Corais
Thank you Sean. I'll try to do it externally as you suggested; however, can you please give me some hints on how to do that? In fact, where can I find the 1.2 implementation you just mentioned? Thanks! On Wed, Oct 8, 2014 at 12:58 PM, Sean Owen wrote: > Plain old SVMs don't produce an estimat
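
A sketch of the MLlib 1.x raw-score route, assuming trainingData and testData are RDD[LabeledPoint]; per the thread's caveat, SVM margins are not calibrated probabilities:

    import org.apache.spark.mllib.classification.SVMWithSGD
    val model = SVMWithSGD.train(trainingData, 100)
    model.clearThreshold()   // predict() now returns the raw margin
    val scores = testData.map(p => (model.predict(p.features), p.label))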

Re: RDD size in memory - Array[String] vs. case classes

2014-10-11 Thread Sean Owen
Yes of course. If your number is "123456", then this takes 4 bytes as an int. But as a String in a 64-bit JVM you have an 8-byte reference, 4-byte object overhead, a char count of 4 bytes, and 6 2-byte chars. Maybe more I'm not thinking of. On Sat, Oct 11, 2014 at 6:29 AM, Liam Clarke-Hutchinson w
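
A rough way to see the difference empirically, as a hedged sketch (assumes a running SparkContext sc; per-RDD sizes appear on the web UI's Storage tab once both are cached):

    val asInts    = sc.parallelize(1 to 1000000).cache()
    val asStrings = sc.parallelize(1 to 1000000).map(_.toString).cache()
    asInts.count(); asStrings.count()  // force materialization
    // Per-element accounting for "123456" from the thread:
    //   8 (reference) + 4 (object overhead) + 4 (char count) + 6*2 (chars)
    //   = 28+ bytes as a String, versus 4 bytes as an int; the char[] header
    //   is likely part of the "maybe more".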

Re: spark-sql failing for some tables in hive

2014-10-11 Thread Cheng Lian
Hmm, the details of the error didn't show in your mail... On 10/10/14 12:25 AM, sadhan wrote: We have a Hive deployment on which we tried running spark-sql. When we try to do a describe for some of the tables, spark-sql fails with this: while it works for some of the other tables. Confused and

Re: Spark SQL - Exception only when using cacheTable

2014-10-11 Thread Cheng Lian
How was the table created? Would you mind sharing the related code? It seems that the underlying type of the |customer_id| field is actually long, but the schema says it's integer; basically, it's a type mismatch error. The first query succeeds because |SchemaRDD.count()| is translated to somethi
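
One way the mismatch can arise and be fixed, as a sketch against Spark 1.1's applySchema (names are illustrative): declare the field as LongType when the underlying values are longs, and convert accordingly when building the rows.

    import org.apache.spark.sql._
    val schema = StructType(Seq(
      StructField("customer_id", LongType, nullable = false),
      StructField("name", StringType, nullable = true)))
    val rowRDD = sc.textFile("customers.csv").map(_.split(","))
      .map(a => Row(a(0).toLong, a(1)))  // toLong matches LongType
    val customers = sqlContext.applySchema(rowRDD, schema)
    customers.registerTempTable("customers")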

Re: Window comparison matching using the sliding window functionality: feasibility

2014-10-11 Thread Sean Owen
I suspect you do not actually need to change the number of partitions dynamically. Do you just have groupings of data to process? Use an RDD of (K,V) pairs and things like groupByKey. If you really have only 1000 unique keys, yes, only half of the 2000 workers would get data in a phase that groups by
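
A small sketch of the keyed approach (key and value types are illustrative):

    val records = sc.parallelize(Seq(("k1", 1.0), ("k2", 2.0), ("k1", 3.0)))
    val grouped = records.groupByKey()        // RDD[(String, Iterable[Double])]
    val perKey  = grouped.mapValues(_.sum)    // process each group independently
    // With ~1000 distinct keys, at most 1000 tasks in the grouped stage carry
    // data, regardless of cluster size, as the reply notes.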