[POWERED BY] Radius Intelligence

2015-02-10 Thread Alexis Roos
Also long overdue given our usage of Spark. Radius Intelligence: URL: radius.com Description: Spark, MLlib. Using Scala, Spark and MLlib for the Radius marketing and sales intelligence platform, including data aggregation, data processing, data clustering, data analysis and predictive modeling of all

Re: ImportError: No module named pyspark, when running pi.py

2015-02-10 Thread Felix C
Agree. PySpark would call spark-submit. Check out the command line there. --- Original Message --- From: "Mohit Singh" Sent: February 9, 2015 11:26 PM To: "Ashish Kumar" Cc: user@spark.apache.org Subject: Re: ImportError: No module named pyspark, when running pi.py I think you have to run that

Re: Executor Lost with StorageLevel.MEMORY_AND_DISK_SER

2015-02-10 Thread Marius Soutier
Unfortunately no. I just removed the persist statements to get the job to run, but now it sometimes fails with Job aborted due to stage failure: Task 162 in stage 2.1 failed 4 times, most recent failure: Lost task 162.3 in stage 2.1 (TID 1105, xxx.compute.internal): java.io.FileNotFoundExceptio

Re: OutofMemoryError: Java heap space

2015-02-10 Thread Akhil Das
You could try increasing the driver memory. Also, can you be more specific about the data volume? Thanks Best Regards On Mon, Feb 9, 2015 at 3:30 PM, Yifan LI wrote: > Hi, > > I just found the following errors during computation(graphx), anyone has > ideas on this? thanks so much! > > (I think

How to efficiently utilize all cores?

2015-02-10 Thread matha.harika
Hi, I have a cluster setup with three slaves, 4 cores each (12 cores in total). When I try to run multiple applications, each using 4 cores, only the first application runs (with 2, 1, 1 cores used on the corresponding slaves). Every other application goes to the WAITING state. Following the solution p

Re: Shuffle write increases in spark 1.2

2015-02-10 Thread chris
Hello, as the original message never got accepted to the mailing list, I quote it here completely: Kevin Jung wrote > Hi all, > The size of shuffle write showing in the Spark web UI is much different when I > execute the same Spark job on the same input data (100GB) in both Spark 1.1 and > Spark 1.2. > At the

Re: Custom streaming receiver slow on YARN

2015-02-10 Thread Akhil Das
Not quite sure, but one assumption would be that you do not have sufficient memory to hold that much data, so the process gets busy collecting garbage, and that could be why it works when you set MEMORY_AND_DISK_SER_2. Thanks Best Regards On Mon, Feb 9, 2015 at 8:38 PM, Jong Wook K

map distribuited matrix (rowMatrix)

2015-02-10 Thread Donbeo
I have a RowMatrix x and I would like to apply a function to each element of x. I was thinking of something like x map(u => exp(-u*u)). How can I do something like that? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/map-distribuited-matrix-rowMatrix-tp215
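
A minimal sketch of one way to do this, assuming MLlib's RowMatrix (whose rows field is an RDD[Vector]); x is the existing matrix and the element-wise function is the exp(-u*u) from the question:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // x is assumed to be an existing RowMatrix; map over its RDD of rows
    // and rebuild a RowMatrix from the transformed vectors.
    val transformed = new RowMatrix(
      x.rows.map(v => Vectors.dense(v.toArray.map(u => math.exp(-u * u))))
    )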

Re: Shuffle write increases in spark 1.2

2015-02-10 Thread chris
Hello, as the original message from Kevin Jung never got accepted to the mailing list, I quote it here completely: Kevin Jung wrote > Hi all, > The size of shuffle write showing in the Spark web UI is much different when I > execute the same Spark job on the same input data (100GB) in both Spark 1.1 and > spa

Re: OutofMemoryError: Java heap space

2015-02-10 Thread Yifan LI
Hi Akhil, Excuse me, I am trying a random-walk algorithm over a not-that-large graph (~1GB raw dataset, including ~5 million vertices and ~60 million edges) on a cluster which has 20 machines. And, the property of each vertex in the graph is a hash map, whose size will increase dramatically during

Kafka Version Update 0.8.2 status?

2015-02-10 Thread critikaled
When can we expect the latest kafka and scala 2.11 support in spark streaming? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Version-Update-0-8-2-status-tp21573.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --

Re: OutofMemoryError: Java heap space

2015-02-10 Thread Akhil Das
Did you have a chance to look at this doc http://spark.apache.org/docs/1.2.0/tuning.html Thanks Best Regards On Tue, Feb 10, 2015 at 4:13 PM, Yifan LI wrote: > Hi Akhil, > > Excuse me, I am trying a random-walk algorithm over a not that large > graph(~1GB raw dataset, including ~5million vertic

Re: OutofMemoryError: Java heap space

2015-02-10 Thread Yifan LI
Yes, I have read it, and am trying to find some way to do that… Thanks :) Best, Yifan LI > On 10 Feb 2015, at 12:06, Akhil Das wrote: > > Did you have a chance to look at this doc > http://spark.apache.org/docs/1.2.0/tuning.html > > > Than

Re: Kafka Version Update 0.8.2 status?

2015-02-10 Thread Sean Owen
I believe Kafka 0.8.2 is imminent for master / 1.4.0 -- see SPARK-2808. Anecdotally I have used Spark 1.2 with Kafka 0.8.2 without problem already. On Feb 10, 2015 10:53 AM, "critikaled" wrote: > When can we expect the latest kafka and scala 2.11 support in spark > streaming? > > > > -- > View th

Re: How to efficiently utilize all cores?

2015-02-10 Thread Akhil Das
You can look at http://spark.apache.org/docs/1.2.0/job-scheduling.html I would go with mesos http://spark.apache.org/docs/1.2.0/running-on-mesos.html Thanks Best Regards On Tue, Feb 10, 2015 at 2:59 PM, matha.harika wrote: > Hi, > > I have a cluster setup with three slaves, 4 cores each(12 cor
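
If the cluster is standalone rather than Mesos or YARN, the scheduling doc linked above also describes capping each application's share with spark.cores.max so that several applications can run at once; a hedged sketch (the value is illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    // Cap this application at 4 of the 12 cores so the remaining
    // applications are not starved while it runs.
    val conf = new SparkConf()
      .setAppName("app-one")
      .set("spark.cores.max", "4")
    val sc = new SparkContext(conf)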

Re: Spark nature of file split

2015-02-10 Thread Brad
Have you been able to confirm this behaviour since posting? Have you tried this out on multiple workers and viewed their memory consumption? I'm new to Spark and don't have a cluster to play with at present, and want to do similar loading from NFS files. My understanding is that calls to SparkC

[spark sql] "add file" doesn't work

2015-02-10 Thread wangzhenhua (G)
Hi all, I'm testing the Spark SQL module, and I found a problem with one of the test cases. I think the main problem is that the "add file" command in Spark SQL (Hive?) doesn't work, since an additional test that gives the path to the file directly produces the right answer. The tests

Spark SQL - Column name including a colon in a SELECT clause

2015-02-10 Thread presence2001
Hi list, I have some data with a field name of f:price (it's actually part of a JSON structure loaded from ElasticSearch via elasticsearch-hadoop connector, but I don't think that's significant here). I'm struggling to figure out how to express that in a Spark SQL SELECT statement without generati

Re: Shuffle write increases in spark 1.2

2015-02-10 Thread Xuefeng Wu
It looks like it is because of a different snappy version; if you disable compression or switch to lz4, there is no size difference. Yours respectfully, Xuefeng Wu (吴雪峰) > On 10 Feb 2015, at 6:13 PM, chris wrote: > > Hello, > > as the original message from Kevin Jung never got accepted to the > mailinglist, I quote it here com
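
A hedged sketch of the two knobs Xuefeng mentions, using the standard Spark configuration properties (pick one or the other):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("shuffle-size-check")
      .set("spark.io.compression.codec", "lz4")   // switch block/shuffle compression from snappy to lz4
      // .set("spark.shuffle.compress", "false")  // ...or disable shuffle compression entirely
    val sc = new SparkContext(conf)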

Cumulative moving average of stream

2015-02-10 Thread Laeeq Ahmed
Hi, I found the windowed mean as follows: val counts = myStream.map(x => (x.toDouble,1)).reduceByWindow((a, b) => (a._1 + b._1, a._2 + b._2), (a, b) => (a._1 - b._1, a._2 - b._2), Seconds(2), Seconds(2)) val windowMean = counts.map(x => (x._1.toFloat/x._2)) Now I want to find the cumulative moving avera
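
One hedged sketch of a cumulative (running) mean: keep a running (sum, count) pair per key with updateStateByKey. The single "all" key and the numeric string stream are assumptions, and updateStateByKey requires a checkpoint directory:

    import org.apache.spark.streaming.StreamingContext._ // pair-DStream operations on Spark 1.2

    // ssc.checkpoint("hdfs:///tmp/checkpoints")  // required for updateStateByKey

    val cumulativeMean = myStream
      .map(x => ("all", (x.toDouble, 1L)))
      .updateStateByKey[(Double, Long)] { (batch: Seq[(Double, Long)], state: Option[(Double, Long)]) =>
        val (sum, count) = state.getOrElse((0.0, 0L))
        Some((sum + batch.map(_._1).sum, count + batch.map(_._2).sum))
      }
      .map { case (_, (sum, count)) => sum / count }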

hadoopConfiguration for StreamingContext

2015-02-10 Thread Marc Limotte
I see that SparkContext has a hadoopConfiguration() method, which can be used like this sample I found: sc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "XX"); > sc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "XX"); But StreamingContext doesn't have the same thing. I w

Re: Kafka Version Update 0.8.2 status?

2015-02-10 Thread Ted Yu
Compiling Spark master branch against Kafka 0.8.2, I got: [WARNING] /home/hbase/spark/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala:62: no valid targets for annotation on value ssc_ - it is discarded unused. You may specify targets with meta-annotat

Re: Kafka Version Update 0.8.2 status?

2015-02-10 Thread Sean Owen
Yes, did you see the PR for SPARK-2808? https://github.com/apache/spark/pull/3631/files It requires more than just changing the version. On Tue, Feb 10, 2015 at 3:11 PM, Ted Yu wrote: > Compiling Spark master branch against Kafka 0.8.2, I got: > > [WARNING] > /home/hbase/spark/external/kafka/src

Re: hadoopConfiguration for StreamingContext

2015-02-10 Thread Akhil Das
Try the following: 1. Set the access key and secret key in the sparkContext: ssc.sparkContext.hadoopConfiguration.set("AWS_ACCESS_KEY_ID",yourAccessKey) ssc.sparkContext.hadoopConfiguration.set("AWS_SECRET_ACCESS_KEY",yourSecretKey) 2. Set the access key and secret key in the environment befo
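
A hedged sketch of option 1, using the fs.s3-style property names from the original question rather than environment-variable names; whether the streaming file input actually picks these up depends on the Spark version, as the rest of this thread shows:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("s3-streaming")
    val ssc = new StreamingContext(conf, Seconds(30))

    // The StreamingContext exposes its underlying SparkContext; set the keys on its Hadoop config.
    ssc.sparkContext.hadoopConfiguration.set("fs.s3.awsAccessKeyId", "YOUR_ACCESS_KEY")
    ssc.sparkContext.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", "YOUR_SECRET_KEY")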

Re: hadoopConfiguration for StreamingContext

2015-02-10 Thread Marc Limotte
Thanks, Akhil. I had high hopes for #2, but tried all and no luck. I was looking at the source and found something interesting. The Stack Trace (below) directs me to FileInputDStream.scala (line 141). This is version 1.1.1, btw. Line 141 has: private def fs: FileSystem = { > if (fs_ ==

Exception when trying to use EShadoop connector and writing rdd to ES

2015-02-10 Thread shahid
INFO scheduler.TaskSetManager: Starting task 2.1 in stage 2.0 (TID 9, ip-10-80-98-118.ec2.internal, PROCESS_LOCAL, 1025 bytes) 15/02/10 15:54:08 INFO scheduler.TaskSetManager: Lost task 1.0 in stage 2.0 (TID 6) on executor ip-10-80-15-145.ec2.internal: org.apache.spark.SparkException (Data of type

Re: Kafka Version Update 0.8.2 status?

2015-02-10 Thread Cody Koeninger
That PR hasn't been updated since the new kafka streaming stuff (including KafkaCluster) got merged to master, it will require more changes than what's in there currently. On Tue, Feb 10, 2015 at 9:25 AM, Sean Owen wrote: > Yes, did you see the PR for SPARK-2808? > https://github.com/apache/spar

Resource allocation in yarn-cluster mode

2015-02-10 Thread Zsolt Tóth
Hi, I'm using Spark in yarn-cluster mode and submit the jobs programmatically from the client in Java. I ran into a few issues when tried to set the resource allocation properties. 1. It looks like setting spark.executor.memory, spark.executor.cores and spark.executor.instances have no effect bec

Re: Spark SQL - Column name including a colon in a SELECT clause

2015-02-10 Thread Yin Huai
Can you try using backticks to quote the field name? Like `f:price`. On Tue, Feb 10, 2015 at 5:47 AM, presence2001 wrote: > Hi list, > > I have some data with a field name of f:price (it's actually part of a JSON > structure loaded from ElasticSearch via elasticsearch-hadoop connector, but > I d
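
For reference, a minimal sketch of the backtick quoting Yin suggests; the context and table name are assumptions:

    // Backticks let the Spark SQL parser accept a column name containing a colon.
    val prices = sqlContext.sql("SELECT `f:price` FROM documents")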

Re: Resource allocation in yarn-cluster mode

2015-02-10 Thread Zsolt Tóth
One more question: Is there a reason why Spark throws an error when requesting too much memory instead of capping it at the maximum value (as YARN would do by default)? Thanks! 2015-02-10 17:32 GMT+01:00 Zsolt Tóth : > Hi, > > I'm using Spark in yarn-cluster mode and submit the jobs programmatical

Re: Spark SQL - Column name including a colon in a SELECT clause

2015-02-10 Thread Neil Andrassy (The Filter)
Thanks Yin. That worked for me. Much appreciated! On 10 February 2015 at 16:34, Yin Huai wrote: > Can you try using backticks to quote the field name? Like `f:price`. > > On Tue, Feb 10, 2015 at 5:47 AM, presence2001 > wrote: > >> Hi list, >> >> I have some data with a field name of f:price (it

Need a partner

2015-02-10 Thread King sami
Hi, As I'm a beginner in Spark, I'm looking for someone who is also a beginner, so we can learn and train on Spark together. Please contact me if interested. Cordially,

Stepsize with Linear Regression

2015-02-10 Thread Rishi Yadav
Are there any rules of thumb for how to set the step size with gradient descent? I am using it for linear regression, but I am sure it applies to gradient descent in general. At present I am deriving a number that fits the training data set's response variable values most closely. I am sure there is a better way to
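
In the absence of a firm rule of thumb, a common approach is to scale the features and then search a small grid of step sizes on the training error; a hedged sketch assuming an RDD[LabeledPoint] named training:

    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionModel, LinearRegressionWithSGD}
    import org.apache.spark.rdd.RDD

    // Mean squared error of a model on a data set.
    def trainingMSE(model: LinearRegressionModel, data: RDD[LabeledPoint]): Double =
      data.map { p => val err = model.predict(p.features) - p.label; err * err }.mean()

    // Try step sizes spanning a few orders of magnitude and keep the best one.
    val (bestStep, bestMSE) = Seq(0.001, 0.01, 0.1, 1.0).map { step =>
      val model = LinearRegressionWithSGD.train(training, 100, step)
      (step, trainingMSE(model, training))
    }.minBy(_._2)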

RE: hadoopConfiguration for StreamingContext

2015-02-10 Thread Andrew Lee
It looks like this is related to the underlying Hadoop configuration. Try to deploy the Hadoop configuration with your job with --files and --driver-class-path, or to the default /etc/hadoop/conf core-site.xml. If that is not an option (depending on how your Hadoop cluster is setup), then hard co

does updateStateByKey return Seq() ordered?

2015-02-10 Thread Adrian Mocanu
I was looking at the updateStateByKey documentation. It passes in a Seq of values that have the same key. I would like to know if there is any ordering to these values. My feeling is that there is no ordering, but maybe it preserves RDD ordering. Example: RDD[ (a,2), (a,3), (a

Re: Exception when trying to use EShadoop connector and writing rdd to ES

2015-02-10 Thread Costin Leau
First off, I'd recommend using the latest es-hadoop beta (2.1.0.Beta3) or, even better, the dev build [1]. Second, use the native Java/Scala API [2], since the configuration and performance are both easier. Third, when you are using JSON input, tell es-hadoop/Spark that. The connector can work
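
A hedged sketch of the native Scala API Costin refers to, assuming the elasticsearch-spark artifact is on the classpath and es.nodes points at the cluster; saveJsonToEs tells the connector the RDD already holds JSON documents:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.elasticsearch.spark._

    val conf = new SparkConf()
      .setAppName("es-json-write")
      .set("es.nodes", "localhost:9200") // illustrative Elasticsearch endpoint
    val sc = new SparkContext(conf)

    // An RDD of raw JSON strings; the connector indexes them as-is instead of re-serializing.
    val docs = sc.parallelize(Seq("""{"user":"alice","event":"login"}"""))
    docs.saveJsonToEs("logs/events") // "index/type" target is illustrative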

Re: Exception when trying to use EShadoop connector and writing rdd to ES

2015-02-10 Thread shahid ashraf
Hi Costin, I upgraded the es-hadoop connector, and at this point I can't use Scala, but I am still getting the same error. On Tue, Feb 10, 2015 at 10:34 PM, Costin Leau wrote: > Hi shahid, > > I've sent the reply to the group - for some reason I replied to your > address instead of the mailing list. > Let

Re: Can we execute "create table" and "load data" commands against Hive inside HiveContext?

2015-02-10 Thread Yin Huai
org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdConfOnlyAuthorizerFactory was introduced in Hive 0.14 and Spark SQL only supports Hive 0.12 and 0.13.1. Can you change the setting of hive.security.authorization.manager to someone accepted by 0.12 or 0.13.1? On Thu, Feb 5, 2015

Re: Resource allocation in yarn-cluster mode

2015-02-10 Thread Sandy Ryza
Hi Zsolt, spark.executor.memory, spark.executor.cores, and spark.executor.instances are only honored when launching through spark-submit. Marcelo is working on a Spark launcher (SPARK-4924) that will enable using these programmatically. That's correct that the error comes up when yarn.scheduler.

Re: Need a partner

2015-02-10 Thread Kartik Mehta
Hi Sami and fellow Spark friends, I too am looking for joint learning, online. I have set up Spark but need to do it on multiple nodes on my home server. Can we form a group and do group learning? Thanks, Kartik On Feb 10, 2015 11:52 AM, "King sami" wrote: > Hi, > > As I'm beginner in Spark, I'm lo

Re: Need a partner

2015-02-10 Thread Mohit Singh
I would be interested too. On Tue, Feb 10, 2015 at 9:41 AM, Kartik Mehta wrote: > Hi Sami and fellow Spark friends, > > I too am looking for joint learning, online. > > I have set up spark but need to do on multi nodes on my home server. We > can form a group and do group learning? > > Thanks, >

Re: How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-10 Thread Jon Gregg
I'm still getting an error. Here's my code, which works successfully when tested using spark-shell: val badIPs = sc.textFile("/user/sb/badfullIPs.csv").collect val badIpSet = badIPs.toSet val badIPsBC = sc.broadcast(badIpSet) The job looks OK from my end: 15/02/07 18:59:58 IN

Re: OutofMemoryError: Java heap space

2015-02-10 Thread Kelvin Chu
Since the stack trace shows Kryo is being used, maybe you could also try increasing spark.kryoserializer.buffer.max.mb. Hope this helps. Kelvin On Tue, Feb 10, 2015 at 1:26 AM, Akhil Das wrote: > You could try increasing the driver memory. Also, can you be more specific > about the data volume?

Re: How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-10 Thread Sandy Ryza
Is the SparkContext you're using the same one that the StreamingContext wraps? If not, I don't think using two is supported. -Sandy On Tue, Feb 10, 2015 at 9:58 AM, Jon Gregg wrote: > I'm still getting an error. Here's my code, which works successfully when > tested using spark-shell: > >

Re: How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-10 Thread Jon Gregg
They're separate in my code, how can I combine them? Here's what I have: val sparkConf = new SparkConf() val ssc = new StreamingContext(sparkConf, Seconds(bucketSecs)) val sc = new SparkContext() On Tue, Feb 10, 2015 at 1:02 PM, Sandy Ryza wrote: > Is the SparkContext you'r

Re: How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-10 Thread Sandy Ryza
You should be able to replace that second line with val sc = ssc.sparkContext On Tue, Feb 10, 2015 at 10:04 AM, Jon Gregg wrote: > They're separate in my code, how can I combine them? Here's what I have: > > val sparkConf = new SparkConf() > val ssc = new StreamingContext(sparkCon
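
Putting the pieces together, a sketch of the setup with a single context; bucketSecs and the CSV path are carried over from the thread and the value is illustrative:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val bucketSecs = 60 // illustrative; whatever interval the job already uses
    val sparkConf = new SparkConf()
    val ssc = new StreamingContext(sparkConf, Seconds(bucketSecs))

    // Reuse the SparkContext the StreamingContext already wraps instead of constructing a second one.
    val sc = ssc.sparkContext
    val badIPsBC = sc.broadcast(sc.textFile("/user/sb/badfullIPs.csv").collect.toSet)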

FYI: Prof John Canny is giving a talk on "Machine Learning at the limit" in SF Big Analytics Meetup

2015-02-10 Thread Chester Chen
Just in case you are in San Francisco, we are having a meetup by Prof John Canny http://www.meetup.com/SF-Big-Analytics/events/220427049/ Chester

Re: Re: Exception when trying to use EShadoop connector and writing rdd to ES

2015-02-10 Thread Costin Leau
What's the signature of your RDD? It looks to be a List which can't be mapped automatically to a document - you are probably thinking of a tuple or better yet a PairRDD. Convert your RDD to a Pair and use that instead. This is a guess - a gist with a simple test/code would make it easier to dia

pyspark: Java null pointer exception when accessing broadcast variables

2015-02-10 Thread rok
I'm trying to use a broadcasted dictionary inside a map function and am consistently getting Java null pointer exceptions. This is inside an IPython session connected to a standalone spark cluster. I seem to recall being able to do this before but at the moment I am at a loss as to what to try next

Re: How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-10 Thread Jon Gregg
OK that worked and getting close here ... the job ran successfully for a bit and I got output for the first couple buckets before getting a "java.lang.Exception: Could not compute split, block input-0-1423593163000 not found" error. So I bumped up the memory at the command line from 2 gb to 5 gb,

Spark streaming job throwing ClassNotFound exception when recovering from checkpointing

2015-02-10 Thread Conor Fennell
I am getting the following error when I kill the spark driver and restart the job: 15/02/10 17:31:05 INFO CheckpointReader: Attempting to load checkpoint from > file > hdfs://hdfs-namenode.vagrant:9000/reporting/SampleJob$0.1.0/checkpoint-142358910.bk > 15/02/10 17:31:05 WARN CheckpointReader:

Re: Similar code in Java

2015-02-10 Thread Ted Yu
Please take a look at: examples/scala-2.10/src/main/java/org/apache/spark/examples/streaming/JavaDirectKafkaWordCount.java which was checked in yesterday. On Sat, Feb 7, 2015 at 10:53 AM, Eduardo Costa Alfaia < e.costaalf...@unibs.it> wrote: > Hi Ted, > > I’ve seen the codes, I am using JavaKafk

Re: hadoopConfiguration for StreamingContext

2015-02-10 Thread Marc Limotte
Looks like the latest version 1.2.1 actually does use the configured hadoop conf. I tested it out and that does resolve my problem. thanks, marc On Tue, Feb 10, 2015 at 10:57 AM, Marc Limotte wrote: > Thanks, Akhil. I had high hopes for #2, but tried all and no luck. > > I was looking at the

SparkSQL + Tableau Connector

2015-02-10 Thread Todd Nist
Hi, I'm trying to understand how and what the Tableau connector to SparkSQL is able to access. My understanding is it needs to connect to the thriftserver and I am not sure how or if it exposes parquet, json, schemaRDDs, or does it only expose schemas defined in the metastore / hive. For exampl

Spark streaming job throwing ClassNotFound exception when recovering from checkpointing

2015-02-10 Thread conor
I am getting the following error when I kill the spark driver and restart the job: 15/02/10 17:31:05 INFO CheckpointReader: Attempting to load checkpoint from file hdfs://hdfs-namenode.vagrant:9000/reporting/SampleJob$0.1.0/checkpoint-142358910.bk 15/02/10 17:31:05 WARN CheckpointReader: Error

Spark on very small files, appropriate use case?

2015-02-10 Thread soupacabana
Hi all, I have the following use case: One job consists of reading from 500-2000 small bzipped logs that are on an NFS. (Small means that the zipped logs are between 0 and 100KB; the average file size is 20KB.) We read the log lines, do some transformations, and write them to one output file. When we

Re: Can spark job server be used to visualize streaming data?

2015-02-10 Thread Kelvin Chu
Hi Su, Out of the box, no. But, I know people integrate it with Spark Streaming to do real-time visualization. It will take some work though. Kelvin On Mon, Feb 9, 2015 at 5:04 PM, Su She wrote: > Hello Everyone, > > I was reading this blog post: > http://homes.esat.kuleuven.be/~bioiuser/blog/

spark python exception

2015-02-10 Thread Kane Kim
sometimes I'm getting this exception: Traceback (most recent call last): File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/daemon.py", line 162, in manager code = worker(sock) File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/daemon.py", line 64, in worker outfile.flush() IOError:

Re: Spark on very small files, appropriate use case?

2015-02-10 Thread Davies Liu
Spark is a framework that makes doing things in parallel very easy; it will definitely help in your case. def read_file(path): lines = open(path).readlines() # bzip2 return lines filesRDD = sc.parallelize(path_to_files, N) lines = filesRDD.flatMap(read_file) Then you could do other transforms on li
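
Davies's snippet is Python; the same idea in Scala for comparison, with paths and parallelism as placeholders (decompression of the bzipped files is elided here, as in the original sketch):

    import scala.io.Source

    // Distribute the list of paths, then read each file locally on the executors.
    // The NFS mount must be visible from every worker node.
    val paths = Seq("/mnt/nfs/logs/a.log", "/mnt/nfs/logs/b.log")
    val filesRDD = sc.parallelize(paths, 4)
    val lines = filesRDD.flatMap(path => Source.fromFile(path).getLines())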

Re: spark python exception

2015-02-10 Thread Davies Liu
No, it's a feature to help debugging (it may be muted in the future). On Tue, Feb 10, 2015 at 12:45 PM, Kane Kim wrote: > sometimes I'm getting this exception: > > Traceback (most recent call last): > File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/daemon.py", line 162, > in manager > code = w

Re: pyspark: Java null pointer exception when accessing broadcast variables

2015-02-10 Thread Davies Liu
Could you paste the NPE stack trace here? It will better to create a JIRA for it, thanks! On Tue, Feb 10, 2015 at 10:42 AM, rok wrote: > I'm trying to use a broadcasted dictionary inside a map function and am > consistently getting Java null pointer exceptions. This is inside an IPython > session

Why can't Spark find the classes in this Jar?

2015-02-10 Thread Abe Handler
I am new to spark. I am trying to compile and run a spark application that requires classes from an (external) jar file on my local machine. If I open the jar (on ~/Desktop) I can see the missing class in the local jar but when I run spark I get NoClassDefFoundError: edu/stanford/nlp/ie/AbstractSe

Open file limit settings for Spark on Yarn job

2015-02-10 Thread Arun Luthra
Hi, I'm running Spark on YARN from an edge node, and the tasks run on the Data Nodes. My job fails with the "Too many open files" error once it gets to groupByKey(). Alternatively, I can make it fail immediately if I repartition the data when I create the RDD. Where do I need to make sure that uli

Re: Spark on very small files, appropriate use case?

2015-02-10 Thread Kelvin Chu
I had a similar use case before. I found: 1. textFile() produced one partition per file. It can result in many partitions. I found that calling coalesce() without shuffle helped. 2. If you used persist(), count() will do I/O and put the result into cache. Transformations later did computation out

spark sql registerFunction with 1.2.1

2015-02-10 Thread Mohnish Kodnani
Hi, I am trying a very simple registerFunction and it is giving me errors. I have a parquet file which I register as temp table. Then I define a UDF. def toSeconds(timestamp: Long): Long = timestamp/10 sqlContext.registerFunction("toSeconds", toSeconds _) val result = sqlContext.sql("select

Re: Will Spark serialize an entire Object or just the method referred in an object?

2015-02-10 Thread Marcelo Vanzin
Hi Yitong, It's not as simple as that. In your very simple example, the only things referenced by the closure are (i) the input arguments and (ii) a Scala "object". So there are no external references to serialize in that case, just the closure instance itself - see, there is still something bein

Re: pyspark: Java null pointer exception when accessing broadcast variables

2015-02-10 Thread Rok Roskar
I get this in the driver log: java.lang.NullPointerException at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:590) at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:233) at org.apache.spark

Re: ZeroMQ and pyspark.streaming

2015-02-10 Thread Arush Kharbanda
No, the ZeroMQ API is not supported in Python as of now. On 5 Feb 2015 21:27, "Sasha Kacanski" wrote: > Does pyspark support zeroMQ? > I see that Java does it, but I am not sure for Python? > regards > > -- > Aleksandar Kacanski >

Re: SparkSQL + Tableau Connector

2015-02-10 Thread Arush Kharbanda
1. Can the connector fetch or query schemaRDD's saved to Parquet or JSON files? NO 2. Do I need to do something to expose these via hive / metastore other than creating a table in hive? Create a table in spark sql to expose via spark sql 3. Does the thriftserver need to be configured to expose t

Re: can we insert and update with spark sql

2015-02-10 Thread Debasish Das
Hi Michael, I want to cache a RDD and define get() and set() operators on it. Basically like memcached. Is it possible to build a memcached like distributed cache using Spark SQL ? If not what do you suggest we should use for such operations... Thanks. Deb On Fri, Jul 18, 2014 at 1:00 PM, Michae

Re: can we insert and update with spark sql

2015-02-10 Thread Michael Armbrust
You should look at https://github.com/amplab/spark-indexedrdd On Tue, Feb 10, 2015 at 2:27 PM, Debasish Das wrote: > Hi Michael, > > I want to cache a RDD and define get() and set() operators on it. > Basically like memcached. Is it possible to build a memcached like > distributed cache using Sp

Re: spark sql registerFunction with 1.2.1

2015-02-10 Thread Michael Armbrust
The simple SQL parser doesn't yet support UDFs. Try using a HiveContext. On Tue, Feb 10, 2015 at 1:44 PM, Mohnish Kodnani wrote: > Hi, > I am trying a very simple registerFunction and it is giving me errors. > > I have a parquet file which I register as temp table. > Then I define a UDF. > > de
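
A hedged sketch of Michael's suggestion applied to the thread's example; the divisor in toSeconds depends on the timestamp unit and is an assumption:

    import org.apache.spark.sql.hive.HiveContext

    // sc is an existing SparkContext; the parquet file is assumed to be registered
    // as the temporary table "blah" on this same context.
    val hiveContext = new HiveContext(sc)

    def toSeconds(timestamp: Long): Long = timestamp / 1000L // assumed: timestamps in milliseconds
    hiveContext.registerFunction("toSeconds", toSeconds _)

    val result = hiveContext.sql(
      "SELECT toSeconds(`timestamp`) AS t, count(rid) AS qps FROM blah GROUP BY toSeconds(`timestamp`)")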

Re: Beginner in Spark

2015-02-10 Thread King sami
2015-02-06 17:28 GMT+00:00 King sami : > The purpose is to build a data processing system for door events. An event > will describe a door unlocking > with a badge system. This event will differentiate unlocking by somebody > from the inside and by somebody > from the outside. > > *Producing the e

Re: SparkSQL + Tableau Connector

2015-02-10 Thread Silvio Fiorito
Hi Todd, What you could do is run some SparkSQL commands immediately after the Thrift server starts up. Or does Tableau have some init SQL commands you could run? You can actually load data using SQL, such as: create temporary table people using org.apache.spark.sql.json options (path 'exampl

Re: Need a partner

2015-02-10 Thread Joanne Contact
me too. I don't have enough nodes. Maybe we can set up a cluster together? On Tue, Feb 10, 2015 at 9:51 AM, Mohit Singh wrote: > I would be interested too. > > On Tue, Feb 10, 2015 at 9:41 AM, Kartik Mehta > wrote: > >> Hi Sami and fellow Spark friends, >> >> I too am looking for joint learning

Re: spark sql registerFunction with 1.2.1

2015-02-10 Thread Mohnish Kodnani
Actually, I tried it in the spark shell, got the same error, and then for some reason I tried backticking "timestamp" and it worked. val result = sqlContext.sql("select toSeconds(`timestamp`) as t, count(rid) as qps from blah group by toSeconds(`timestamp`),qi.clientName") So, it seems SQLContext is su

Writable serialization from InputFormat losing fields

2015-02-10 Thread Corey Nolet
I am using an input format to load data from Accumulo [1] in to a Spark RDD. It looks like something is happening in the serialization of my output writable between the time it is emitted from the InputFormat and the time it reaches its destination on the driver. What's happening is that the resul

Spark Summit East - March 18-19 - NYC

2015-02-10 Thread Scott walent
The inaugural Spark Summit East, an event to bring the Apache Spark community together, will be in New York City on March 18, 2015. We are excited about the growth of Spark and to bring the event to the east coast. At Spark Summit East you can look forward to hearing from Matei Zaharia, Databricks

Re: pyspark: Java null pointer exception when accessing broadcast variables

2015-02-10 Thread Davies Liu
It's brave to broadcast 8G of pickled data; it will take more than 15G of memory in each Python worker. How much memory do you have in the executor and driver? Do you see any other exceptions in the driver and executors? Something related to serialization in the JVM. On Tue, Feb 10, 2015 at 2:16 PM, Rok Roskar

Spark installation

2015-02-10 Thread King sami
Hi, I'm new to Spark. I want to install it on my local machine (Ubuntu 12.04). Could you please help me install Spark step by step on my machine and run some Scala programs? Thanks

Re: can we insert and update with spark sql

2015-02-10 Thread Debasish Das
Thanks...this is what I was looking for... It will be great if Ankur can give brief details about it...Basically how does it contrast with memcached for example... On Tue, Feb 10, 2015 at 2:32 PM, Michael Armbrust wrote: > You should look at https://github.com/amplab/spark-indexedrdd > > On Tue

Re: SparkSQL + Tableau Connector

2015-02-10 Thread Todd Nist
Arush, Thank you will take a look at that approach in the morning. I sort of figured the answer to #1 was NO and that I would need to do 2 and 3 thanks for clarifying it for me. -Todd On Tue, Feb 10, 2015 at 5:24 PM, Arush Kharbanda wrote: > 1. Can the connector fetch or query schemaRDD's sa

Re: can we insert and update with spark sql

2015-02-10 Thread Debasish Das
Also I wanted to run get() and set() from mapPartitions (from spark workers and not master)... To be able to do that I think I have to create a separate spark context for the cache... But I am not sure how SparkContext from job1 can access SparkContext from job2 ! On Tue, Feb 10, 2015 at 3:25 P

Re: Spark installation

2015-02-10 Thread Mohit Singh
For a local machine, I don't think there is anything to install. Just unzip, go to $SPARK_DIR/bin/spark-shell, and that will open up a REPL... On Tue, Feb 10, 2015 at 3:25 PM, King sami wrote: > Hi, > > I'm new to Spark. I want to install it on my local machine (Ubuntu 12.04) > Could you help me plea

Re: SparkSQL + Tableau Connector

2015-02-10 Thread Todd Nist
Hi Silvio, Ah, I like that; there is a section in Tableau for "Initial SQL" to be executed upon connecting, and this would fit well there. I guess I will need to issue a collect(), coalesce(1,true).saveAsTextFile(...), or use repartition(1), as the file currently is being broken into multiple parts.

Re: SparkSQL + Tableau Connector

2015-02-10 Thread Todd Nist
Arush, As for #2, do you mean something like this from the docs: // sc is an existing SparkContext. val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resour

Re: Open file limit settings for Spark on Yarn job

2015-02-10 Thread Sandy Ryza
Hi Arun, The limit for the YARN user on the cluster nodes should be all that matters. What version of Spark are you using? If you can turn on sort-based shuffle it should solve this problem. -Sandy On Tue, Feb 10, 2015 at 1:16 PM, Arun Luthra wrote: > Hi, > > I'm running Spark on Yarn from a
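
A hedged sketch of enabling the sort-based shuffle Sandy mentions; spark.shuffle.manager is a plain configuration property, so it can equally go in spark-defaults.conf or on the spark-submit command line:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("groupByKey-job")
      .set("spark.shuffle.manager", "sort") // sort-based shuffle keeps far fewer files open than hash-based shuffle
    val sc = new SparkContext(conf)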

Re: SparkSQL + Tableau Connector

2015-02-10 Thread Silvio Fiorito
Todd, I just tried it in bin/spark-sql shell. I created a folder json and just put 2 copies of the same people.json file This is what I ran: spark-sql> create temporary table people > using org.apache.spark.sql.json > options (path 'examples/src/main/resources/json/*')

Re: Writable serialization from InputFormat losing fields

2015-02-10 Thread Corey Nolet
I am able to get around the problem by doing a map and getting the Event out of the EventWritable before I do my collect. I think I'll do that for now. On Tue, Feb 10, 2015 at 6:04 PM, Corey Nolet wrote: > I am using an input format to load data from Accumulo [1] in to a Spark > RDD. It looks l

spark, reading from s3

2015-02-10 Thread Kane Kim
I'm getting this warning when using s3 input: 15/02/11 00:58:37 WARN RestStorageService: Adjusted time offset in response to RequestTimeTooSkewed error. Local machine and S3 server disagree on the time by approximately 0 seconds. Retrying connection. After that there are tons of 403/forbidden erro

Bug in ElasticSearch and Spark SQL: Using SQL to query out data from JSON documents is totally wrong!

2015-02-10 Thread Aris
I'm using ElasticSearch with elasticsearch-spark-BUILD-SNAPSHOT and Spark/SparkSQL 1.2.0, from Costin Leau's advice. I want to query ElasticSearch for a bunch of JSON documents from within SparkSQL, and then use a SQL query to simply query for a column, which is actually a JSON key -- normal thing

Re: [spark sql] "add file" doesn't work

2015-02-10 Thread wangzhenhua (G)
[Additional info] I was using the master branch of 9 Feb 2015, the latest commit in "git info" is: commit 0793ee1b4dea1f4b0df749e8ad7c1ab70b512faf Author: Sandy Ryza Date: Mon Feb 9 10:12:12 2015 + SPARK-2149. [MLLIB] Univariate kernel density estimation Author: Sandy Ryza

Re: SparkSQL + Tableau Connector

2015-02-10 Thread Todd Nist
Hi Silvio, So the "Initial SQL" is executing now, I did not have the "*" added that and it worked fine. FWIW the "*" is not needed for the parquet files: create temporary table test using org.apache.spark.sql.json options (path '/data/out/*') ; cache table test; select count(1) from test; Unfor

Why does spark write huge file into temporary local disk even without on-disk persist or checkpoint?

2015-02-10 Thread Peng Cheng
I'm running a small job on a cluster with 15G of memory and 8G of disk per machine. The job always gets into a deadlock where the last error message is: java.io.IOException: No space left on device at java.io.FileOutputStream.writeBytes(Native Method) at java.io.FileOutputStream.write

Re: Lost task - connection closed

2015-02-10 Thread Tianshuo Deng
I have seen the same problem. It causes some tasks to fail, but not the whole job. Hope someone can shed some light on what the cause of this could be. On Mon, Jan 26, 2015 at 9:49 AM, Aaron Davidson wrote: > It looks like something weird is going on with your object serialization, > p

Build spark failed with maven

2015-02-10 Thread Yi Tian
Hi, all I got an ERROR when I build spark master branch with maven (commit: |2d1e916730492f5d61b97da6c483d3223ca44315|) |[INFO] [INFO] [INFO] Building Spark Project Catalyst 1.3.0-SNAPSHOT [INFO] -

Writing to HDFS from spark Streaming

2015-02-10 Thread Bahubali Jain
Hi, I am facing issues while writing data from a streaming RDD to HDFS. JavaPairDstream temp; ... ... temp.saveAsHadoopFiles("DailyCSV",".txt", String.class, String.class, TextOutputFormat.class); I see compilation issues as below... The method saveAsHadoopFiles(String, String, Class, Class, Cla

Something about the cluster upgrade

2015-02-10 Thread qiaou
Hi, I need to upgrade my Spark cluster from version 1.1.0 to 1.2.1. Is there a convenient way to do this, something like ./start-dfs.sh -upgrade in Hadoop? Best wishes, THX -- qiaou Sent with Sparrow

org.apache.spark.SparkException: Job aborted due to stage failure: All masters are unresponsive! Giving up.

2015-02-10 Thread lakewood
Hi, I'm new to Spark. I have built a small Spark-on-YARN cluster, which contains 1 master (20GB RAM, 8 cores) and 3 workers (4GB RAM, 4 cores). When trying to run the command sc.parallelize(1 to 1000).count() through $SPARK_HOME/bin/spark-shell, sometimes the command submits a job successfully, sometimes i
