Re: Spark with HBase

2014-07-04 Thread 田毅
Hi, I met this issue before. The reason is that the HBase client used in Spark is 0.94.6, while your server is 0.96.1.1. To fix this issue, you can choose one of these ways: a) deploy an HBase cluster with version 0.94.6, or b) rebuild the Spark code. Step 1: change the HBase version in pom.xml to 0.96.1.1
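For reference, a minimal sketch of reading an HBase table from Spark once the client and server versions match. It assumes sc is the SparkContext from the shell and that "my_table" is a placeholder table name:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat

    // The HBase client jar on the Spark classpath must match the server version,
    // otherwise the RPC fails as described above.
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")   // placeholder table name

    val hbaseRDD = sc.newAPIHadoopRDD(hbaseConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    println(hbaseRDD.count())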

Spark streaming kafka cost long time at "take at DStream.scala:586"

2014-07-04 Thread xiemeilong
I am using: Kafka 0.8.1 and spark-streaming-kafka_2.10-0.9.0-cdh5.0.2. My analysis is simple, so I am confused about why it spends so long at "take at DStream.scala:586"; it takes 2 to 8 minutes or longer. I don't know how to find the reason. Hoping for your help. Sorry for my poor English.

classnotfound error due to groupByKey

2014-07-04 Thread Joe L
Hi, When I run the following piece of code, it throws a ClassNotFoundException. Any suggestion would be appreciated. I wanted to group an RDD by key: val t = rdd.groupByKey() Error message: java.lang.ClassNotFoundException: org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$ Thanks

Java sample for using cassandra-driver-spark

2014-07-04 Thread M Singh
Hi: Is there a Java sample fragment for using cassandra-driver-spark? Thanks

Re: [mllib] strange/buggy results with RidgeRegressionWithSGD

2014-07-04 Thread Thomas Robert
Hi all, I too am having some issues with *RegressionWithSGD algorithms. Concerning your issue, Eustache, this could be due to the fact that these regression algorithms use a fixed step (that is divided by sqrt(iteration)). During my tests, quite often, the algorithm diverged to an infinite cost, I g
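For anyone following along, a small sketch of passing an explicit (smaller) step size to RidgeRegressionWithSGD; the data here is made up, and in practice scaling the features also helps keep the fixed step from diverging:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, RidgeRegressionWithSGD}

    // Tiny, made-up training set just to show the call.
    val data = sc.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(0.5, 1.2)),
      LabeledPoint(2.0, Vectors.dense(1.1, 0.3))))

    val numIterations = 100
    val stepSize = 0.01          // smaller fixed step; internally divided by sqrt(iteration)
    val regParam = 0.1
    val miniBatchFraction = 1.0

    val model = RidgeRegressionWithSGD.train(data, numIterations, stepSize, regParam, miniBatchFraction)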

Re: reading compress lzo files

2014-07-04 Thread Gurvinder Singh
An update on this issue: Spark is now able to read the LZO file, recognize that it has an index, and start multiple map tasks. You need to use the following function instead of textFile: csv = sc.newAPIHadoopFile(opts.input,"com.hadoop.mapreduce.LzoTextInputFormat","org.apache.hadoop.io.LongWritable","
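The snippet above appears to be PySpark; a rough Scala equivalent, assuming hadoop-lzo (which provides com.hadoop.mapreduce.LzoTextInputFormat) is on the classpath and the path is a placeholder, would be:

    import com.hadoop.mapreduce.LzoTextInputFormat
    import org.apache.hadoop.io.{LongWritable, Text}

    // With an .lzo.index file next to the data, the input format can split the
    // file and Spark starts multiple map tasks, as noted above.
    val csv = sc
      .newAPIHadoopFile[LongWritable, Text, LzoTextInputFormat]("hdfs:///data/input.lzo")
      .map { case (_, line) => line.toString }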

RE: How to use groupByKey and CqlPagingInputFormat

2014-07-04 Thread Mohammed Guller
As far as I know, there is not much difference, except that the outer parentheses are redundant. The problem with your original code was that there was a mismatch between the opening and closing parentheses. Sometimes the error messages are misleading :-) Do you see any performance difference with the

Re: SQL FIlter of tweets (json) running on Disk

2014-07-04 Thread Abel Coronado Iruegas
Thank you, DataBricks Rules On Fri, Jul 4, 2014 at 1:58 PM, Michael Armbrust wrote: > sqlContext.jsonFile("data.json") < Is this already available in the >> master branch??? >> > > Yes, and it will be available in the soon to come 1.0.1 release. > > >> But the question about the use

Re: LIMIT with offset in SQL queries

2014-07-04 Thread Michael Armbrust
Though I'll note that window functions are not yet supported in Spark SQL. https://issues.apache.org/jira/browse/SPARK-1442 On Fri, Jul 4, 2014 at 6:59 AM, Mayur Rustagi wrote: > What I typically do is use row_number & subquery to filter based on that. > It works out pretty well, reduces the it

Re: SQL FIlter of tweets (json) running on Disk

2014-07-04 Thread Michael Armbrust
> > sqlContext.jsonFile("data.json") < Is this already available in the > master branch??? > Yes, and it will be available in the soon to come 1.0.1 release. > But the question about the use a combination of resources (Memory > processing & Disk processing) still remains. > This code shoul
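As a rough sketch of the workflow being discussed (assuming a build that already ships SQLContext.jsonFile, and with a made-up path and tweet fields):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    // jsonFile infers the schema by scanning the JSON documents.
    val tweets = sqlContext.jsonFile("hdfs:///data/tweets.json")   // placeholder path
    tweets.registerAsTable("tweets")

    // "lang" and "text" are hypothetical fields of the tweet documents.
    val filtered = sqlContext.sql("SELECT text FROM tweets WHERE lang = 'en'")
    filtered.take(10).foreach(println)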

Re: Spark SQL user defined functions

2014-07-04 Thread Michael Armbrust
> Sweet. Any idea about when this will be merged into master? > It is probably going to be a couple of weeks. There is a fair amount of cleanup that needs to be done. It works, though, and we used it in most of the demos at the Spark Summit. Mostly I just need to add tests and move it out of Hive

Re: window analysis with Spark and Spark streaming

2014-07-04 Thread M Singh
The windowing capabilities of Spark Streaming determine the events in the RDD created for that time window. If the duration is 1s, then all the events received in a particular 1s window will be part of the RDD created for that window for that stream. On Friday, July 4, 2014 1:28 PM, alessan

pyspark + yarn: how everything works.

2014-07-04 Thread Egor Pahomov
Hi, I want to use PySpark with YARN, but the documentation doesn't give me a full understanding of what's going on, and I simply don't understand the code. So: 1) How is Python shipped to the cluster? Should machines in the cluster already have Python? 2) What happens when I write some Python code in a "map" function -

Re: window analysis with Spark and Spark streaming

2014-07-04 Thread alessandro finamore
Thanks for the replies. What is not completely clear to me is how time is managed. I can create a DStream from a file. But if I set the window property, that will be bound to the application time, right? If I got it right, with a receiver I can control the way DStreams are created. But, how can appl

Re: window analysis with Spark and Spark streaming

2014-07-04 Thread M Singh
Another alternative could be to use Spark Streaming's textFileStream with windowing capabilities. On Friday, July 4, 2014 9:52 AM, Gianluca Privitera wrote: You should think about a custom receiver, in order to solve the problem of the “already collected” data. http://spark.apache.org/docs
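A minimal sketch of that approach, assuming sc is the SparkContext, a 60-second batch interval, and a placeholder input directory that new log files land in:

    import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    val ssc = new StreamingContext(sc, Seconds(60))
    val lines = ssc.textFileStream("hdfs:///incoming/logs")   // placeholder directory

    // Aggregate over the last 10 minutes, sliding every minute.
    val windowedCounts = lines
      .map(line => (line.split(" ")(0), 1L))
      .reduceByKeyAndWindow(_ + _, Minutes(10), Minutes(1))

    windowedCounts.print()
    ssc.start()
    ssc.awaitTermination()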

Re: DynamoDB input source

2014-07-04 Thread Nick Pentreath
Interesting - I would have thought they would make that available publicly. Unfortunately, unless you can use Spark on EMR, I guess your options are to hack it by spinning up an EMR cluster and getting the JAR, or maybe fall back to using boto and rolling your own :( On Fri, Jul 4, 2014 at 9:28

Re: DynamoDB input source

2014-07-04 Thread Ian Wilkinson
Trying to discover source for the DynamoDBInputFormat. Not appearing in: - https://github.com/aws/aws-sdk-java - https://github.com/apache/hive Then came across http://stackoverflow.com/questions/1704/jar-containing-org-apache-hadoop-hive-dynamodb. Unsure whether this represents the latest s

Re: DynamoDB input source

2014-07-04 Thread Nick Pentreath
I should qualify by saying there is boto support for dynamodb - but not for the inputFormat. You could roll your own python-based connection but this involves figuring out how to split the data in dynamo - inputFormat takes care of this so should be the easier approach — Sent from Mailbox On Fr

Re: DynamoDB input source

2014-07-04 Thread Ian Wilkinson
Excellent. Let me get browsing on this. Huge thanks, ian On 4 Jul 2014, at 16:47, Nick Pentreath wrote: > No boto support for that. > > In master there is Python support for loading Hadoop inputFormat. Not sure if > it will be in 1.0.1 or 1.1 > > I master docs under the programming guide a

Re: DynamoDB input source

2014-07-04 Thread Nick Pentreath
No boto support for that. In master there is Python support for loading Hadoop InputFormats; not sure if it will be in 1.0.1 or 1.1. In the master docs, under the programming guide, there are instructions, and under the examples project there are PySpark examples of using Cassandra and HBase. These should h

Re: DynamoDB input source

2014-07-04 Thread Ian Wilkinson
Hi Nick, I’m going to be working with python primarily. Are you aware of comparable boto support? ian On 4 Jul 2014, at 16:32, Nick Pentreath wrote: > You should be able to use DynamoDBInputFormat (I think this should be part of > AWS libraries for Java) and create a HadoopRDD from that. > >

Re: matchError:null in ALS.train

2014-07-04 Thread Nick Pentreath
Do you mind posting a little more detail about what your code looks like? It appears you might be trying to reference another RDD from within your RDD in the foreach. On Fri, Jul 4, 2014 at 2:28 AM, Honey Joshi wrote: > Original Message ---
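In case it helps illustrate what is being described here, a sketch of the pattern and one common workaround (the data and names are made up):

    import org.apache.spark.SparkContext._
    import org.apache.spark.mllib.recommendation.Rating

    val ratings = sc.parallelize(Seq(Rating(1, 10, 4.0), Rating(2, 20, 3.0)))
    val userNames = sc.parallelize(Seq((1, "alice"), (2, "bob")))

    // Anti-pattern: using userNames (another RDD) inside ratings.foreach fails,
    // because an RDD cannot be referenced from within another RDD's tasks.
    // ratings.foreach { r => userNames.lookup(r.user) }

    // Workaround: collect the small side to the driver and broadcast it.
    val namesByUser = sc.broadcast(userNames.collectAsMap())
    ratings.foreach { r =>
      println(s"${namesByUser.value.get(r.user)} rated ${r.product}: ${r.rating}")
    }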

Re: DynamoDB input source

2014-07-04 Thread Nick Pentreath
You should be able to use DynamoDBInputFormat (I think this should be part of AWS libraries for Java) and create a HadoopRDD from that. On Fri, Jul 4, 2014 at 8:28 AM, Ian Wilkinson wrote: > Hi, > > I noticed mention of DynamoDB as input source in > > http://ampcamp.berkeley.edu/wp-content/uplo
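A very rough sketch of what that could look like; the DynamoDB input format class, writable class, and property names below are assumptions based on the connector jar shipped with EMR and may differ, so treat them as placeholders:

    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapred.JobConf

    val jobConf = new JobConf(sc.hadoopConfiguration)
    jobConf.set("dynamodb.input.tableName", "my-table")                   // assumed property, placeholder table
    jobConf.set("dynamodb.endpoint", "dynamodb.us-east-1.amazonaws.com")  // assumed property

    // Class names assumed from the EMR hive/dynamodb connector jar.
    val dynamoRDD = sc.hadoopRDD(
      jobConf,
      classOf[org.apache.hadoop.dynamodb.read.DynamoDBInputFormat],
      classOf[Text],
      classOf[org.apache.hadoop.dynamodb.DynamoDBItemWritable])

    println(dynamoRDD.count())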

DynamoDB input source

2014-07-04 Thread Ian Wilkinson
Hi, I noticed mention of DynamoDB as input source in http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf. Unfortunately, Google is not coming to my rescue on finding further mention for this support. Any pointers would be well received. Big than

Re: Spark job tracker.

2014-07-04 Thread Mayur Rustagi
The application server doesn't provide a JSON API, unlike the cluster interface (8080). If you are okay with patching Spark, you can use our patch to create a JSON API, or you can use the SparkListener interface in your application to get that info out. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalyt
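A small sketch of the SparkListener route, run from inside the application; exact event and field names can vary a bit between Spark versions:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

    // Report stage completion from within the driver program instead of scraping the UI.
    class ProgressListener extends SparkListener {
      override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
        val info = stageCompleted.stageInfo
        println(s"Stage ${info.stageId} (${info.name}) completed")
      }
    }

    sc.addSparkListener(new ProgressListener)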

Re: Visualize task distribution in cluster

2014-07-04 Thread Mayur Rustagi
You'll get most of that information from the Mesos interface. You may not get data-transfer information in particular. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Thu, Jul 3, 2014 at 6:28 AM, Tobias Pfeiffer wrote: >

Re: Distribute data from Kafka evenly on cluster

2014-07-04 Thread Tobias Pfeiffer
Hi, unfortunately, when I follow the above approach, I run into this problem: http://mail-archives.apache.org/mod_mbox/kafka-users/201401.mbox/%3ccabtfevyxvtaqvnmvwmh7yscfgxpw5kmrnw_gnq72cy4oa1b...@mail.gmail.com%3E That is, a NoNode error in ZooKeeper when rebalancing. The Kafka receiver will retry

Re: SQL FIlter of tweets (json) running on Disk

2014-07-04 Thread Abel Coronado Iruegas
Ok, I found these slides by Yin Huai ( http://spark-summit.org/wp-content/uploads/2014/07/Easy-json-Data-Manipulation-Yin-Huai.pdf ). To read a JSON file the code seems pretty simple: sqlContext.jsonFile("data.json") <- Is this already available in the master branch??? But the question about the

SQL FIlter of tweets (json) running on Disk

2014-07-04 Thread Abel Coronado Iruegas
Hi everybody, can someone tell me if it is possible to read and filter a 60 GB file of tweets (JSON docs) in a standalone Spark deployment that runs on a single machine with 40 GB RAM and 8 cores??? I mean, is it possible to configure Spark to work with some amount of memory (20 GB) and the rest o

Re: LIMIT with offset in SQL queries

2014-07-04 Thread Mayur Rustagi
What I typically do is use row_number & a subquery to filter based on that. It works out pretty well and reduces the iteration. I think an offset solution based on windowing directly would be useful. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi
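Where window functions are not available, one way to emulate OFFSET/LIMIT on an already-ordered RDD is to number the rows and filter on the index; a sketch with made-up data:

    val offset = 100L
    val limit = 20L

    val rows = sc.parallelize(1 to 1000)   // assumed to already be in the desired order
    val page = rows
      .zipWithIndex()
      .filter { case (_, idx) => idx >= offset && idx < offset + limit }
      .map { case (row, _) => row }

    page.collect().foreach(println)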

Re: window analysis with Spark and Spark streaming

2014-07-04 Thread Gianluca Privitera
You should think about a custom receiver, in order to solve the problem of the “already collected” data. http://spark.apache.org/docs/latest/streaming-custom-receivers.html Gianluca On 04 Jul 2014, at 15:46, alessandro finamore wrote: Hi, I have a large

window analysis with Spark and Spark streaming

2014-07-04 Thread alessandro finamore
Hi, I have a large dataset of text log files on which I need to implement "window analysis": say, extract per-minute data and do aggregated stats on the last X minutes. I have to implement the windowing analysis with Spark. This is the workflow I'm currently using: - read a file and create a new RD

Re: Spark memory optimization

2014-07-04 Thread Surendranauth Hiraman
When using DISK_ONLY, keep in mind that disk I/O is pretty high. Make sure you are writing to multiple disks for best operation. And even with DISK_ONLY, we've found that there is a minimum threshold for executor ram (spark.executor.memory), which for us seemed to be around 8 GB. If you find that,

Re: Spark memory optimization

2014-07-04 Thread Mayur Rustagi
I would go with Spark only if you are certain that you are going to scale out in the near future. You can change the default storage level of the RDD to DISK_ONLY; that might remove issues around any RDD leveraging memory. There are some functions, particularly sortByKey, that require data to fit in memory to wo
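For reference, switching an RDD to disk-backed storage is just a persist call; a sketch with a placeholder path:

    import org.apache.spark.storage.StorageLevel

    val lines = sc.textFile("hdfs:///data/big-input")           // placeholder path
    val parsed = lines.map(_.split("\t")).persist(StorageLevel.DISK_ONLY)

    println(parsed.count())   // the first action materializes the cached blocks on local disk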

RE: No FileSystem for scheme: hdfs

2014-07-04 Thread Steven Cox
Thanks for the help folks. Adding the config files was necessary but not sufficient. I also had Hadoop 1.0.4 classes on the classpath because of a bad jar: spark-0.9.1/jars/spark-assembly-0.9.1-hadoop1.0.4.jar was in my spark executor tar.gz (stored in HDFS). I believe this was due to a bit of

Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-04 Thread Konstantin Kudryavtsev
Hi all, I am stuck on an issue with running the Spark Pi example on HDP 2.0. I downloaded the Spark 1.0 pre-built package from http://spark.apache.org/downloads.html (for HDP2), then ran the example from the Spark website: ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --

Re: Spark SQL user defined functions

2014-07-04 Thread Martin Gammelsæter
On Fri, Jul 4, 2014 at 11:39 AM, Michael Armbrust wrote: > On Fri, Jul 4, 2014 at 1:59 AM, Martin Gammelsæter > wrote: >> >> is there any way to write user defined functions for Spark SQL? > This is coming in Spark 1.1. There is a work in progress PR here: > https://github.com/apache/spark/pull/

Spark stdout and stderr

2014-07-04 Thread aminn_524
I am running spark-1.0.0 by connecting to a Spark standalone cluster which has one master and two slaves. I ran wordcount.py with spark-submit; it reads data from HDFS and also writes the results into HDFS. So far everything is fine and the results will correctly be writ

Re: Spark SQL user defined functions

2014-07-04 Thread Michael Armbrust
On Fri, Jul 4, 2014 at 1:59 AM, Martin Gammelsæter < martingammelsae...@gmail.com> wrote: > is there any way to write user defined functions for Spark SQL? This is coming in Spark 1.1. There is a work in progress PR here: https://github.com/apache/spark/pull/1063 If you have a hive context, yo
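Until that PR lands, one hedged sketch of the Hive route is to register an existing, compiled Hive UDF and call it from the query; the function and class names below are purely hypothetical:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    // parse_xml / com.example.hive.ParseXmlUDF are hypothetical; the class must be a
    // Hive UDF available on the classpath.
    hiveContext.hql("CREATE TEMPORARY FUNCTION parse_xml AS 'com.example.hive.ParseXmlUDF'")
    val result = hiveContext.hql("SELECT parse_xml(payload) FROM my_table")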

Re: Spark SQL user defined functions

2014-07-04 Thread Takuya UESHIN
Ah, sorry for misreading. I don't think there is a way to use a UDF in your SQL with Spark SQL alone. You might be able to use one with SparkHive, but I'm sorry, I don't know it well. I think you should apply the function before converting to a SchemaRDD if you can. Thanks. 2014-07-04 18:16 GMT+09:00 Martin

matchError:null in ALS.train

2014-07-04 Thread Honey Joshi
Original Message Subject: matchError:null in ALS.train From:"Honey Joshi" Date:Thu, July 3, 2014 8:12 am To: user@spark.apache.org -- Hi All, We are usin

Re: issue with running example code

2014-07-04 Thread Gurvinder Singh
In the end it turns out that the issue was caused by a config setting in spark-defaults.conf. After removing this setting (spark.files.userClassPathFirst true) things are back to normal. Just reporting in case someone has the same issue. - Gurvinder On 07/03/2014 06:49 PM, Gurvinder Sin

Re: Spark SQL user defined functions

2014-07-04 Thread Martin Gammelsæter
Takuya, thanks for your reply :) I am already doing that, and it is working well. My question is, can I define arbitrary functions to be used in these queries? On Fri, Jul 4, 2014 at 11:12 AM, Takuya UESHIN wrote: > Hi, > > You can convert standard RDD of Product class (e.g. case class) to Schema

Re: OFF_HEAP storage level

2014-07-04 Thread Ajay Srivastava
Thanks Jerry. It looks like a good option, will try it. Regards, Ajay On Friday, July 4, 2014 2:18 PM, "Shao, Saisai" wrote: Hi Ajay, StorageLevel OFF_HEAP means you can cache your RDD in Tachyon; the prerequisite is that you deploy Tachyon alongside Spark. Yes, it can alleviate

Re: Spark SQL user defined functions

2014-07-04 Thread Takuya UESHIN
Hi, You can convert a standard RDD of a Product class (e.g. a case class) to a SchemaRDD with SQLContext. Load data from Cassandra into an RDD of the case class, convert it to a SchemaRDD and register it, then you can use it in your SQL. http://spark.apache.org/docs/latest/sql-programming-guide.html#running-sql-on
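A minimal sketch of that conversion, with made-up data standing in for the rows loaded from Cassandra:

    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Int)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD   // implicit conversion RDD[<case class>] -> SchemaRDD

    val people = sc.parallelize(Seq(Person("alice", 30), Person("bob", 25)))
    people.registerAsTable("people")

    sqlContext.sql("SELECT name FROM people WHERE age >= 18").collect().foreach(println)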

Fwd: Graphx traversal and merge interesting edges

2014-07-04 Thread H Bolke
Hello Gurus, Pardon me, I am a noob @ Spark & GraphX (& Scala) and I seek your wisdom here. I want to know how to do a graph traversal and do a selective merge on edges... Thanks to the documentation :-) I could create a simple graph of employees & their colleagues. The structure of the Graph is below

Spark memory optimization

2014-07-04 Thread Igor Pernek
Hi all! I have a folder with 150 GB of txt files (around 700 files, on average each 200 MB). I'm using Scala to process the files and calculate some aggregate statistics in the end. I see two possible approaches to do that: - manually loop through all the files, do the calculations per file and me

spark and mesos issue

2014-07-04 Thread Gurvinder Singh
We are getting this issue when we are running jobs with close to 1000 workers. Spark is the GitHub version and Mesos is 0.19.0. ERROR storage.BlockManagerMasterActor: Got two different block manager registrations on 201407031041-1227224054-5050-24004-0 Googling about it, it seems that Mesos is st

Graphx traversal and merge interesting edges

2014-07-04 Thread HHB
Hello Gurus, Pardon me, I am a noob @ Spark & GraphX (& Scala) and I seek your wisdom here. I want to know how to do a graph traversal and do a selective merge on edges... Thanks to the documentation :-) I could create a simple graph of employees & their colleagues. The structure of the Graph is be
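For anyone reading along, a tiny, made-up employee/colleague graph to anchor the discussion; filtering the triplets is one building block for the kind of selective edge handling asked about (the actual merge logic is left open):

    import org.apache.spark.graphx.{Edge, Graph}

    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, "colleague"), Edge(2L, 3L, "colleague")))

    val graph = Graph(vertices, edges)

    // Inspect edges together with their endpoint attributes.
    graph.triplets
      .filter(_.attr == "colleague")
      .collect()
      .foreach(t => println(s"${t.srcAttr} -> ${t.dstAttr}"))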

Spark SQL user defined functions

2014-07-04 Thread Martin Gammelsæter
Hi! I have a Spark cluster running on top of a Cassandra cluster, using Datastax's new driver, and one of the fields of my RDDs is an XML string. In a normal Scala Spark job, parsing that data is no problem, but I would like to also make that information available through Spark SQL. So, is there any

RE: OFF_HEAP storage level

2014-07-04 Thread Shao, Saisai
Hi Ajay, StorageLevel OFF_HEAP means you can cache your RDD in Tachyon; the prerequisite is that you deploy Tachyon alongside Spark. Yes, it can alleviate GC, since you offload JVM memory into system-managed memory. You can use rdd.persist(...) to use this level; details can be checked in
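A sketch of using that level, assuming Tachyon is already deployed alongside Spark and spark.tachyonStore.url points at its master (the input path is a placeholder):

    import org.apache.spark.storage.StorageLevel

    val data = sc.textFile("hdfs:///data/big-input")   // placeholder path
    data.persist(StorageLevel.OFF_HEAP)
    println(data.count())   // blocks are cached off-heap in Tachyon rather than on the JVM heap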

Re: No FileSystem for scheme: hdfs

2014-07-04 Thread Juan Rodríguez Hortalá
Hi, To cope with the issue with META-INF that Sean is pointing out, my solution is replacing maven-assembly-plugin with maven-shade-plugin, using the ServicesResourceTransformer ( http://maven.apache.org/plugins/maven-shade-plugin/examples/resource-transformers.html#ServicesResourceTransformer) "t

RE: Spark with HBase

2014-07-04 Thread N . Venkata Naga Ravi
Hi, Any update on the solution? We are still facing this issue... We are able to connect to HBase with independent code, but are getting an issue with the Spark integration. Thx, Ravi From: nvn_r...@hotmail.com To: u...@spark.incubator.apache.org; user@spark.apache.org Subject: RE: Spark with HBase Date

Re: No FileSystem for scheme: hdfs

2014-07-04 Thread Sean Owen
"No file system for scheme", in the past for me, has meant that files in META-INF/services have collided when building an uber jar. There's a sort-of-obscure mechanism in Java for registering implementations of a service's interface, and Hadoop uses it for FileSystem. It consists of listing classes

Re: Spark Streaming on top of Cassandra?

2014-07-04 Thread Cesar Arevalo
Hi Zarzyk: If I were you, just to start, I would look at the following: https://groups.google.com/forum/#!topic/spark-users/htQQA3KidEQ http://www.slideshare.net/planetcassandra/south-bay-cassandrealtime-analytics-using-cassandra-spark-and-shark-at-ooyala http://spark-summit.org/2014/talk/using-s

Re: write event logs with YARN

2014-07-04 Thread Christophe Préaud
Hi Andrew, Thanks for your explanation, I confirm that the entries show up in the history server UI when I create empty APPLICATION_COMPLETE files for each of them. Christophe. On 03/07/2014 18:27, Andrew Or wrote: Hi Christophe, another Andrew speaking. Your configuration looks fine to me. Fr

Re: How to use groupByKey and CqlPagingInputFormat

2014-07-04 Thread Martin Gammelsæter
On Thu, Jul 3, 2014 at 10:29 PM, Mohammed Guller wrote: > Martin, > > 1) The first map contains the columns in the primary key, which could be a > compound primary key containing multiple columns, and the second map > contains all the non-key columns. Ah, thank you, that makes sense. > 2) try

Re: Spark Streaming on top of Cassandra?

2014-07-04 Thread zarzyk
Hi, I bump this thread as I'm also interested in the answer. Can anyone help or point to the information on how to do Spark Streaming from/to Cassandra? Thanks! Zarzyk