Re: Saving to Cassandra from Spark Streaming

2014-10-29 Thread Akhil Das
You need to add the following jar (the cassandra-connector) to the classpath, like: ssc.sparkContext.addJar("/path/to/spark-cassandra-connector_2.10-1.1.0-alp
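A minimal sketch of shipping the connector jar to the executors from a streaming job (the jar path and version below are placeholders, not the exact file from the thread; --jars on spark-submit achieves the same thing):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("CassandraSink")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Ship the connector jar to the executors
    ssc.sparkContext.addJar("/path/to/spark-cassandra-connector_2.10-<version>.jar")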

Re: Submitting Spark job on Unix cluster from dev environment (Windows)

2014-10-29 Thread Akhil Das
What are you trying to do? Connecting to a remote cluster from your local Windows Eclipse environment? Just make sure you meet the following: 1. Set spark.driver.host to your local IP (where you run your code; it should be accessible from the cluster). 2. Make sure no firewall/router configura
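A sketch of the driver-side configuration being described; the master URL, IP, and port are placeholders for your own environment:

    import org.apache.spark.{SparkConf, SparkContext}

    // The driver runs on the Windows machine; tell the cluster how to reach it back.
    val conf = new SparkConf()
      .setAppName("RemoteDriverExample")
      .setMaster("spark://unix-master:7077")      // placeholder master URL
      .set("spark.driver.host", "192.168.1.50")   // local IP reachable from the cluster
      .set("spark.driver.port", "51000")          // pin the port if a firewall is in the way
    val sc = new SparkContext(conf)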

Re: unsubscribe

2014-10-29 Thread Akhil Das
Send it to user-unsubscr...@spark.apache.org. Read the community page for more info Thanks Best Regards On Wed, Oct 29, 2014 at 3:32 AM, Ricky Thomas wrote: > >

Re: Batch of updates

2014-10-29 Thread Sean Owen
I don't think accumulators come into play here. Use foreachPartition, not mapPartitions. On Wed, Oct 29, 2014 at 12:43 AM, Flavio Pompermaier wrote: > Sorry but I wasn't able to code my stuff using accumulators as you suggested > :( > In my use case I have to add elements to an array/list and
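A rough sketch of the foreachPartition pattern Sean is pointing at, batching writes once per partition; ConnectionPool, toRow, and insertBatch are stand-ins for whatever store is being written to:

    rdd.foreachPartition { partition =>
      // One connection per partition rather than per element (ConnectionPool is hypothetical)
      val conn = ConnectionPool.get()
      val batch = partition.map(toRow).toList   // toRow is a hypothetical record mapper
      conn.insertBatch(batch)                   // hypothetical batch-write call
      conn.close()
    }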

Re: Spark Streaming from Kafka

2014-10-29 Thread Akhil Das
Looks like the kafka jar that you are using isn't compatible with your scala version. Thanks Best Regards On Wed, Oct 29, 2014 at 11:50 AM, Harold Nguyen wrote: > Hi, > > Just wondering if you've seen the following error when reading from Kafka: > > ERROR ReceiverTracker: Deregistered receiver

Re: Use RDD like a Iterator

2014-10-29 Thread Sean Owen
Call RDD.toLocalIterator()? https://spark.apache.org/docs/latest/api/java/org/apache/spark/rdd/RDD.html On Wed, Oct 29, 2014 at 4:15 AM, Dai, Kevin wrote: > Hi, ALL > > > > I have an RDD[T], can I use it like an iterator? > > That means I can compute every element of this RDD lazily. > > > > Best
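A small usage example; toLocalIterator pulls one partition at a time to the driver, so elements are produced lazily with respect to partitions:

    val rdd = sc.parallelize(1 to 1000000, 100)

    // Each partition is still computed on the cluster, but only one at a time
    // is shipped to the driver as the iterator is consumed.
    val it: Iterator[Int] = rdd.toLocalIterator
    it.take(5).foreach(println)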

Re: Spark Streaming from Kafka

2014-10-29 Thread harold
Thanks! How do I find out which Kafka jar to use for scala 2.10.4? — Sent from Mailbox On Wed, Oct 29, 2014 at 12:26 AM, Akhil Das wrote: > Looks like the kafka jar that you are using isn't compatible with your > scala version. > Thanks > Best Regards > On Wed, Oct 29, 2014 at 11:50 AM, Harold

Re: Spark Streaming from Kafka

2014-10-29 Thread harold
I'm using kafka_2.10-1.1.0.jar on spark 1.1.0 — Sent from Mailbox On Wed, Oct 29, 2014 at 12:31 AM, null wrote: > Thanks! How do I find out which Kafka jar to use for scala 2.10.4? > — > Sent from Mailbox > On Wed, Oct 29, 2014 at 12:26 AM, Akhil Das > wrote: >> Looks like the kafka jar that you

Spark streaming and save to cassandra and elastic search

2014-10-29 Thread aarthi
Hi, I've written a Spark Streaming job which streams data from Flume to Kafka, which is received by Spark. I've used updateStateByKey and then for each RDD I'm saving them into Cassandra and Elasticsearch (by calling 2 different methods). The above parts are working fine when the streaming job is sub

Re: pySpark - convert log/txt files into sequenceFile

2014-10-29 Thread Csaba Ragany
Thank you Holden, it works! infile = sc.wholeTextFiles(sys.argv[1]) rdd = sc.parallelize(infile.collect()) rdd.saveAsSequenceFile(sys.argv[2]) Csaba 2014-10-28 17:56 GMT+01:00 Holden Karau : > Hi Csaba, > > It sounds like the API you are looking for is sc.wholeTextFiles :) > > Cheers, > > Hold

Re: Use RDD like a Iterator

2014-10-29 Thread Yanbo Liang
RDD.toLocalIterator() is the suitable solution. But I doubt whether it conforms to the design principle of Spark and RDDs: all RDD transformations are computed lazily until they end with some action. 2014-10-29 15:28 GMT+08:00 Sean Owen : > Call RDD.toLocalIterator()? > > https://spark.apache.org/docs/la

sbt/sbt compile error [FATAL]

2014-10-29 Thread HansPeterS
Hi, I have cloned Spark as: git clone g...@github.com:apache/spark.git cd spark sbt/sbt compile Apparently http://repo.maven.apache.org/maven2 is no longer valid. See the error further below. Is this correct? And what should it be changed to? Everything seems to go smoothly until: [in

Re: sbt/sbt compile error [FATAL]

2014-10-29 Thread Soumya Simanta
Are you trying to compile the master branch ? Can you try branch-1.1 ? On Wed, Oct 29, 2014 at 6:55 AM, HansPeterS wrote: > Hi, > > I have cloned sparked as: > git clone g...@github.com:apache/spark.git > cd spark > sbt/sbt compile > > Apparently http://repo.maven.apache.org/maven2 is no longer

Spark 1.1.0 on Hive 0.13.1

2014-10-29 Thread arthur.hk.c...@gmail.com
Hi, My Hive is 0.13.1, how can I make Spark 1.1.0 run on Hive 0.13? Please advise. Or, any news about when Spark 1.1.0 on Hive 0.13.1 will be available? Regards Arthur

RE: FileNotFoundException in appcache shuffle files

2014-10-29 Thread Ganelin, Ilya
Hi Ryan - I've been fighting the exact same issue for well over a month now. I initially saw the issue in 1.0.2 but it persists in 1.1. Jerry - I believe you are correct that this happens during a pause on long-running jobs on a large data set. Are there any parameters that you suggest tuning to

Re: Spark 1.1.0 on Hive 0.13.1

2014-10-29 Thread Cheng Lian
Spark 1.1.0 doesn't support Hive 0.13.1. We plan to support it in 1.2.0, and related PRs are already merged or being merged to the master branch. On 10/29/14 7:43 PM, arthur.hk.c...@gmail.com wrote: Hi, My Hive is 0.13.1, how to make Spark 1.1.0 run on Hive 0.13? Please advise. Or, any news

Re: Spark 1.1.0 on Hive 0.13.1

2014-10-29 Thread arthur.hk.c...@gmail.com
Hi, Thanks for your update. Any idea when will Spark 1.2 be GA? Regards Arthur On 29 Oct, 2014, at 8:22 pm, Cheng Lian wrote: > Spark 1.1.0 doesn't support Hive 0.13.1. We plan to support it in 1.2.0, and > related PRs are already merged or being merged to the master branch. > > On 10/29/14

Re: Spark streaming and save to cassandra and elastic search

2014-10-29 Thread aarthi
Yes, the data is getting processed. I printed the data and the RDD count. The point where the data gets saved is never invoked. I am using the same class and jar for submitting by both methods. The only difference is that I am launching via Tomcat here and directly via PuTTY there.

Streaming Question regarding lazy calculations

2014-10-29 Thread sivarani
Hi All I am using spark streaming with kafka streaming for 24/7 My Code is something like JavaDStream data = messages.map(new MapData()); JavaPairDStream> records = data.mapToPair(new dataPair()).groupByKey(100); records.print(); JavaPairDStream result = records.mapValues(new Sum()).updateState

"CANNOT FIND ADDRESS"

2014-10-29 Thread akhandeshi
The Spark application UI shows "Cannot find Address" for one of the executors. Aggregated Metrics by Executor: Executor ID Address Task Time Total Tasks Failed Tasks Succeeded Tasks Input Shuffle Read Shuffle

Re: Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-10-29 Thread Chester At Work
We used both Spray and Akka. To avoid compatibility issues, we used Spark's shaded Akka version. It works for us. This is the 1.1.0 branch; I have not tried the master branch. Chester Sent from my iPad On Oct 28, 2014, at 11:48 PM, Prashant Sharma wrote: > Yes we shade akka to change its protobuf

Re: problem with start-slaves.sh

2014-10-29 Thread Yana Kadiyska
I see this when I start a worker and then try to start it again forgetting it's already running (I don't use start-slaves, I start the slaves individually with start-slave.sh). All this is telling you is that there is already a running process on that machine. You can see it if you do a ps -aef|gre

Spark Performance

2014-10-29 Thread akhandeshi
I am relatively new to Spark processing. I am using the Spark Java API to process data. I am having trouble processing a data set that I don't think is significantly large. It joins datasets that are around 3-4 GB each (around 12 GB of data). The workflow is: x=RDD1.KeyBy(x).partitionBy(new HashP
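A sketch of co-partitioning both sides before the join so the join itself avoids a full shuffle; rdd1, rdd2, the id field, and the partition count are all illustrative placeholders:

    import org.apache.spark.HashPartitioner

    val partitioner = new HashPartitioner(64)   // illustrative partition count

    // keyBy + partitionBy on both sides, cached so the partitioning is reused
    val left  = rdd1.keyBy(r => r.id).partitionBy(partitioner).cache()
    val right = rdd2.keyBy(r => r.id).partitionBy(partitioner).cache()

    // Pair RDDs that share a partitioner join without reshuffling both sides
    val joined = left.join(right)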

Re: "CANNOT FIND ADDRESS"

2014-10-29 Thread Yana Kadiyska
"CANNOT FIND ADDRESS" occurs when your executor has crashed. I'll look further down where it shows each task and see if you see any tasks failed. Then you can examine the error log of that executor and see why it died. On Wed, Oct 29, 2014 at 9:35 AM, akhandeshi wrote: > SparkApplication UI show

Re: Java api overhead?

2014-10-29 Thread Koert Kuipers
since spark holds data structures on the heap (and by default tries to work with all data in memory) and it's written in Scala, seeing lots of Scala Tuple2 instances is not unexpected. how do these numbers relate to your data size? On Oct 27, 2014 2:26 PM, "Sonal Goyal" wrote: > Hi, > > I wanted to understand wh

Re: Spark 1.1.0 on Hive 0.13.1

2014-10-29 Thread Mark Hamstra
Sometime after Nov. 15: https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage On Wed, Oct 29, 2014 at 5:28 AM, arthur.hk.c...@gmail.com < arthur.hk.c...@gmail.com> wrote: > Hi, > > Thanks for your update. Any idea when will Spark 1.2 be GA? > > Regards > Arthur > > > On 29 Oct, 2014, at

Re: "CANNOT FIND ADDRESS"

2014-10-29 Thread akhandeshi
Thanks... hmm, it seems to be a timeout issue perhaps? Not sure what is causing it or how to debug it. I see the following error message... 14/10/29 13:26:04 ERROR ContextCleaner: Error cleaning broadcast 9 akka.pattern.AskTimeoutException: Timed out at akka.pattern.PromiseActorRef$$anonf

XML Utilities for Apache Spark

2014-10-29 Thread Darin McBeath
I developed the spark-xml-utils library because we have a large amount of XML in big datasets and I felt this data could be better served by providing some helpful xml utilities. This includes the ability to filter documents based on an xpath/xquery expression, return specific nodes for an xpath

Re: pySpark - convert log/txt files into sequenceFile

2014-10-29 Thread Davies Liu
Without the second line, it will be much faster: infile = sc.wholeTextFiles(sys.argv[1]) infile.saveAsSequenceFile(sys.argv[2]) On Wed, Oct 29, 2014 at 3:31 AM, Csaba Ragany wrote: > Thank you Holden, it works! > > infile = sc.wholeTextFiles(sys.argv[1]) > rdd = sc.parallelize(infile.collect()

PySpark and Cassandra 2.1 Examples

2014-10-29 Thread Mike Sukmanowsky
Hey all, Just thought I'd share this with the list in case any one else would benefit. Currently working on a proper integration of PySpark and DataStax's new Cassandra-Spark connector, but that's on going. In the meanwhile, I've basically updated the cassandra_inputformat.py and cassandra_outpu

Re: PySpark and Cassandra 2.1 Examples

2014-10-29 Thread Helena Edelson
Nice! - Helena @helenaedelson On Oct 29, 2014, at 12:01 PM, Mike Sukmanowsky wrote: > Hey all, > > Just thought I'd share this with the list in case any one else would benefit. > Currently working on a proper integration of PySpark and DataStax's new > Cassandra-Spark connector, but that'

Spark Streaming with Kinesis

2014-10-29 Thread Harold Nguyen
Hi all, I followed the guide here: http://spark.apache.org/docs/latest/streaming-kinesis-integration.html But got this error: Exception in thread "main" java.lang.NoClassDefFoundError: com/amazonaws/auth/AWSCredentialsProvider Would you happen to know what dependency or jar is needed ? Harold

Re: Unit Testing (JUnit) with Spark

2014-10-29 Thread touchdown
add these to your dependencies: "io.netty" % "netty" % "3.6.6.Final" exclude("io.netty", "netty-all") at the end of the spark and hadoop dependencies. Reference: https://spark-project.atlassian.net/browse/SPARK-1138 I am using Spark 1.1 so the akka issue is already fixed

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-29 Thread Gerard Maas
Hi TD, Thanks a lot for the comprehensive answer. I think this explanation deserves some place in the Spark Streaming tuning guide. -kr, Gerard. On Thu, Oct 23, 2014 at 11:41 PM, Tathagata Das wrote: > Hey Gerard, > > This is a very good question! > > *TL;DR: *The performance should be same,

Re: Spark Streaming with Kinesis

2014-10-29 Thread Harold Nguyen
Hi again, After getting through several dependencies, I finally got to this non-dependency type error: Exception in thread "main" java.lang.NoSuchMethodError: org.apache.http.impl.conn.DefaultClientConnectionOperator.<init>(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)

Spark SQL and confused about number of partitions/tasks to do a simple join.

2014-10-29 Thread Darin McBeath
I have a SchemaRDD with 100 records in 1 partition.  We'll call this baseline. I have a SchemaRDD with 11 records in 1 partition.  We'll call this daily. After a fairly basic join of these two tables JavaSchemaRDD results = sqlContext.sql("SELECT id, action, daily.epoch, daily.version FROM baselin

RE: how to retrieve the value of a column of type date/timestamp from a Spark SQL Row

2014-10-29 Thread Mohammed Guller
Thanks, guys. Michael Armbrust also suggested the same two approaches. I believe “getAs[Date]” is available only in 1.2 branch and I have Spark 1.1, so I am using row(i).asInstanceOf[Date], which works. Mohammed From: Shixiong Zhu [mailto:zsxw...@gmail.com] Sent: Tuesday, October 28, 2014 10:23
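For reference, the two accessors mentioned look like this (the column index i and the java.sql.Date type follow the thread; getAs is only on the 1.2 branch):

    // Spark 1.1: plain cast of the column value
    val created = row(i).asInstanceOf[java.sql.Date]

    // Spark 1.2+: typed accessor
    val created2 = row.getAs[java.sql.Date](i)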

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-29 Thread Tathagata Das
Good idea, will do for 1.2 release. On Oct 29, 2014 9:50 AM, "Gerard Maas" wrote: > Hi TD, > > Thanks a lot for the comprehensive answer. > > I think this explanation deserves some place in the Spark Streaming tuning > guide. > > -kr, Gerard. > > On Thu, Oct 23, 2014 at 11:41 PM, Tathagata Das <

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-29 Thread jay vyas
Hi Tathagata. I actually had a few minor improvements to Spark Streaming in SPARK-4040. Possibly I could weave this in with my PR? On Wed, Oct 29, 2014 at 1:59 PM, Tathagata Das wrote: > Good idea, will do for 1.2 release. > On Oct 29, 2014 9:50 AM, "Gerard Maas" wrote: > >> Hi TD, >> >> Thank

winutils

2014-10-29 Thread Ron Ayoub
Apparently Spark does require Hadoop even if you do not intend to use Hadoop. Is there a workaround for the below error I get when creating the SparkContext in Scala? I will note that I didn't have this problem yesterday when creating the Spark context in Java as part of the getting started App.

Re: winutils

2014-10-29 Thread Denny Lee
QQ - did you download the Spark 1.1 binaries that included the Hadoop one? Does this happen if you're using the Spark 1.1 binaries that do not include the Hadoop jars? On Wed, Oct 29, 2014 at 11:31 AM, Ron Ayoub wrote: > Apparently Spark does require Hadoop even if you do not intend to use > Had

Re: "CANNOT FIND ADDRESS"

2014-10-29 Thread Akhil Das
Can you try setting the following while creating the sparkContext and see if the issue still exists? .set("spark.core.connection.ack.wait.timeout","900") .set("spark.akka.frameSize","50") .set("spark.akka.timeout","900") ​Looks like your executor is stuck on GC Pause.​ Thanks Best Reg

Re: Spark SQL and confused about number of partitions/tasks to do a simple join.

2014-10-29 Thread Darin McBeath
OK. After reading some documentation, it would appear the issue is the default number of partitions for a join (200). After doing something like the following, I was able to change the value. From: Darin McBeath To: User Sent: Wednesday, October 29, 2014 1:55 PM Subject: Spark SQL and

Re: Spark SQL and confused about number of partitions/tasks to do a simple join.

2014-10-29 Thread Darin McBeath
Sorry, hit the send key a bit too early. Anyway, this is the code I set. sqlContext.sql("set spark.sql.shuffle.partitions=10"); From: Darin McBeath To: Darin McBeath ; User Sent: Wednesday, October 29, 2014 2:47 PM Subject: Re: Spark SQL and confused about number of partitions/tasks to
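For reference, the same setting both ways; the value 10 is just what suits this small join, and the setConf form assumes the SQLContext.setConf method available in 1.1:

    // Via SQL, as above
    sqlContext.sql("set spark.sql.shuffle.partitions=10")

    // Or programmatically on the SQLContext
    sqlContext.setConf("spark.sql.shuffle.partitions", "10")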

RE: winutils

2014-10-29 Thread Ron Ayoub
Well, I got past this problem and the answer was in my own email. I did download the one with Hadoop, since it was among the only ones you don't have to compile from source, along with CDH and Map. It worked yesterday because I added 1.1.0 as a maven dependency from the repository. I just did the

Questions about serialization and SparkConf

2014-10-29 Thread Steve Lewis
Assume in my executor I say SparkConf sparkConf = new SparkConf(); sparkConf.set("spark.kryo.registrator", "com.lordjoe.distributed.hydra.HydraKryoSerializer"); sparkConf.set("mysparc.data", "Some user Data"); sparkConf.setAppName("Some App"); Now 1) Are there defa

Re: Spark Streaming with Kinesis

2014-10-29 Thread Matt Chu
I haven't tried this myself yet, but this sounds relevant: https://github.com/apache/spark/pull/2535 Will be giving this a try today or so, will report back. On Wednesday, October 29, 2014, Harold Nguyen wrote: > Hi again, > > After getting through several dependencies, I finally got to this >

Re: Selecting Based on Nested Values using Language Integrated Query Syntax

2014-10-29 Thread Michael Armbrust
We are working on more helpful error messages, but in the meantime let me explain how to read this output. org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: 'p.name,'p.age, tree: Project ['p.name,'p.age] Filter ('location.number = 2300) Join Inner, Some((lo

RE: Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-10-29 Thread Mohammed Guller
I am not sure about that. Can you try a Spray version built with 2.2.x along with Spark 1.1 and include the Akka dependencies in your project’s sbt file? Mohammed From: Jianshi Huang [mailto:jianshi.hu...@gmail.com] Sent: Tuesday, October 28, 2014 8:58 PM To: Mohammed Guller Cc: user Subject: R

Spark SQL - how to query dates stored as millis?

2014-10-29 Thread bkarels
I have been searching and have not found a solution as to how one might query on dates stored as UTC milliseconds from the epoch. The schema I have pulled in from a NoSQL datasource (JSON from MongoDB) has the target date as: |-- dateCreated: struct (nullable = true) ||-- $date: long (nulla

Re: winutils

2014-10-29 Thread Sean Owen
cf. https://issues.apache.org/jira/browse/SPARK-2356 On Wed, Oct 29, 2014 at 7:31 PM, Ron Ayoub wrote: > Apparently Spark does require Hadoop even if you do not intend to use > Hadoop. Is there a workaround for the below error I get when creating the > SparkContext in Scala? > > I will note that

difference between --jars and --driver-class-path

2014-10-29 Thread freedafeng
I have a Python program that gets an exception when using --jars, but works fine when using --driver-class-path (with the same examples.jar). What's the difference between these two args? Thanks!

Convert DStream to String

2014-10-29 Thread Harold Nguyen
Hi all, How do I convert a DStream to a string ? For instance, I want to be able to: val myword = words.filter(word => word.startsWith("blah")) And use "myword" in other places, like tacking it onto (key, value) pairs, like so: val pairs = words.map(word => (myword+"_"+word, 1)) Thanks for an

what does DStream.union() do?

2014-10-29 Thread spr
The documentation at https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.streaming.dstream.DStream describes the union() method as "Return a new DStream by unifying data of another DStream with this DStream." Can somebody provide a clear definition of what "unifying" means i

BUG: when running as "extends App", closures don't capture variables

2014-10-29 Thread Michael Albert
Greetings! This might be a documentation issue as opposed to a coding issue, in that perhaps the correct answer is "don't do that", but as this is not obvious, I am writing. The following code produces output most would not expect: package misc import org.apache.spark.SparkConf import org.apache.s

Re: what does DStream.union() do?

2014-10-29 Thread Holden Karau
The union function simply returns a DStream with the elements from both. This is the same behavior as when we call union on RDDs :) (You can think of union as similar to the union operator on sets except without the unique element restrictions). On Wed, Oct 29, 2014 at 3:15 PM, spr wrote: > The
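A small example with two DStreams of the same element type (assuming an existing StreamingContext ssc; hosts and ports are placeholders):

    import org.apache.spark.streaming.dstream.DStream

    val streamA: DStream[String] = ssc.socketTextStream("hostA", 9999)
    val streamB: DStream[String] = ssc.socketTextStream("hostB", 9999)

    // One DStream containing the elements of both, batch by batch
    val combined: DStream[String] = streamA.union(streamB)
    combined.print()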

how to extract/combine elements of an Array in DStream element?

2014-10-29 Thread spr
I am processing a log file, from each line of which I want to extract the zeroth and 4th elements (and an integer 1 for counting) into a tuple. I had hoped to be able to index the Array for elements 0 and 4, but Arrays appear not to support vector indexing. I'm not finding a way to extract and co

Re: how to extract/combine elements of an Array in DStream element?

2014-10-29 Thread Holden Karau
On Wed, Oct 29, 2014 at 3:29 PM, spr wrote: > I am processing a log file, from each line of which I want to extract the > zeroth and 4th elements (and an integer 1 for counting) into a tuple. I > had > hoped to be able to index the Array for elements 0 and 4, but Arrays appear > not to support v
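For what it's worth, Scala Arrays do support positional indexing via apply, so pulling fields 0 and 4 out of each line can look like the following; the whitespace delimiter and the assumption of at least five fields per line are guesses about the log format:

    // Each line -> (field0, field4, 1); adjust the split for the actual delimiter
    val tuples = lines.map { line =>
      val f = line.split("\\s+")   // assumes at least 5 fields per line
      (f(0), f(4), 1)
    }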

Spark related meet up on Nov 6th in SF

2014-10-29 Thread Alexis Roos
Hi all, We’re organizing a meet up on Nov 6th in our office downtown SF that might be of interest to the Spark community. We will be discussing our experience building our first production Spark based application. More details and sign up info here: https://www.eventbrite.com/e/from-hadoop-to-sp

Re: Convert DStream to String

2014-10-29 Thread Sean Owen
What would it mean to make a DStream into a String? it's inherently a sequence of things over time, each of which might be a string but which are usually RDDs of things. On Wed, Oct 29, 2014 at 11:15 PM, Harold Nguyen wrote: > Hi all, > > How do I convert a DStream to a string ? > > For instance,

Re: what does DStream.union() do?

2014-10-29 Thread spr
I need more precision to understand. If the elements of one DStream/RDD are (String) and the elements of the other are (Time, Int), what does "union" mean? I'm hoping for (String, Time, Int) but that appears optimistic. :) Do the elements have to be of homogeneous type? Holden Karau wrote >

Re: Convert DStream to String

2014-10-29 Thread Harold Nguyen
Hi Sean, I'd just like to take the first "word" of every line, and use it as a variable for later. Is there a way to do that? Here's the gist of what I want to do: val lines = KafkaUtils.createStream(ssc, "localhost:2181", "test", Map("test" -> 10)).map(_._2) val words = lines.flatMap(_.spli

Re: what does DStream.union() do?

2014-10-29 Thread Holden Karau
On Wed, Oct 29, 2014 at 3:39 PM, spr wrote: > I need more precision to understand. If the elements of one DStream/RDD > are > (String) and the elements of the other are (Time, Int), what does "union" > mean? I'm hoping for (String, Time, Int) but that appears optimistic. :) > It won't compile.

Re: BUG: when running as "extends App", closures don't capture variables

2014-10-29 Thread Matei Zaharia
Good catch! If you'd like, you can send a pull request changing the files in docs/ to do this (see https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark ), otherwise maybe open an issue on https://issues.
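The usual workaround (and what the documentation examples do) is an explicit main method rather than extends App, so that values captured by closures are plain locals; a minimal sketch:

    import org.apache.spark.{SparkConf, SparkContext}

    object Misc {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("misc"))

        // With def main this is a local variable and closures capture it normally;
        // under "extends App" it would be a field initialized via DelayedInit.
        val threshold = 42
        println(sc.parallelize(1 to 100).filter(_ > threshold).count())

        sc.stop()
      }
    }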

Re: Convert DStream to String

2014-10-29 Thread Sean Owen
Sure, that code looks like it does sort of what you describe but it's mixed up in a few ways. It looks like you only want to operate on words that start with SECRETWORD, but then you are prepending acct and _ in the code but expecting something appending in the result. You also seem like you want t
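One way to get the effect Harold describes without collecting anything to the driver is to compute the first word per line and carry it along inside the same transformation; the split on spaces is an assumption about the input:

    // For each non-empty line, pair its first word with every word on that line
    val pairs = lines.filter(_.nonEmpty).flatMap { line =>
      val words = line.split(" ")
      val first = words.head
      words.map(word => (first + "_" + word, 1))
    }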

Spark with HLists

2014-10-29 Thread Simon Hafner
I tried using shapeless HLists as data storage for data inside spark. Unsurprisingly, it failed. The deserialization isn't well-defined because of all the implicits used by shapeless. How could I make it work? Sample Code: /* SimpleApp.scala */ import org.apache.spark.SparkContext import org.apac

How does custom partitioning in PySpark work?

2014-10-29 Thread Def_Os
I want several RDDs (which are the result of my program's operations on existing RDDs) to match the partitioning of an existing RDD, since they will be joined together in the end. Do I understand correctly that I would benefit from using a custom partitioner that would be applied to all RDDs? Seco

Re: Spark with HLists

2014-10-29 Thread Koert Kuipers
looks like a missing class issue? what makes you think it's serialization? shapeless does indeed have a lot of helper classes that get sucked in and are not serializable. see here: https://groups.google.com/forum/#!topic/shapeless-dev/05_DXnoVnI4 and for a project that uses shapeless in spark see

use additional ebs volumes for hdfs storage with spark-ec2

2014-10-29 Thread Daniel Mahler
I started my ec2 spark cluster with ./ec2/spark-ec2 --ebs-vol-{size=100,num=8,type=gp2} -t m3.xlarge -s 10 launch mycluster I see the additional volumes attached but they do not seem to be set up for hdfs. How can I check if they are being utilized on all workers, and how can I get all workers to

SparkSQL: Nested Query error

2014-10-29 Thread SK
Hi, I am using Spark 1.1.0. I have the following SQL statement where I am trying to count the number of UIDs that are in the "tusers" table but not in the "device" table. val users_with_no_device = sql_cxt.sql("SELECT COUNT (u_uid) FROM tusers WHERE tusers.u_uid NOT IN (SELECT d_uid FROM device)"

Re: How to incorporate the new data in the MLlib-NaiveBayes model along with predicting?

2014-10-29 Thread Chris Fregly
jira created with comments/references to this discussion: https://issues.apache.org/jira/browse/SPARK-4144 On Tue, Aug 19, 2014 at 4:47 PM, Xiangrui Meng wrote: > No. Please create one but it won't be able to catch the v1.1 train. > -Xiangrui > > On Tue, Aug 19, 2014 at 4:22 PM, Chris Fregly

Re: Submitting Spark job on Unix cluster from dev environment (Windows)

2014-10-29 Thread Shailesh Birari
Thanks. By setting the driver host to the Windows machine and specifying some ports (driver, fileserver, broadcast, etc.) it worked perfectly. I need to specify those ports as not all ports are open on my machine. For the driver host name, I was assuming Spark should get it, as in the case of Linux we are not settin

Task Size Increases when using loops

2014-10-29 Thread nsareen
Hi, I'm new to Spark and am facing a peculiar problem. I'm writing a simple Java driver program where I'm creating a Key/Value data structure and collecting it once created. The problem I'm facing is that when I increase the iterations of a for loop which creates the ArrayList of Long values wh

GC Issues with randomSplit on large dataset

2014-10-29 Thread Ganelin, Ilya
Hey all – not writing to necessarily get a fix but more to get an understanding of what’s going on internally here. I wish to take a cross-product of two very large RDDs (using cartesian), the product of which is well in excess of what can be stored on disk. Clearly that is intractable, thus m

spark-submit results in NoClassDefFoundError

2014-10-29 Thread Tobias Pfeiffer
Hi, I am trying to get my Spark application to run on YARN and by now I have managed to build a fat jar as described on < http://markmail.org/message/c6no2nyaqjdujnkq> (which is the only really usable manual on how to get such a jar file). My code runs fine using "sbt test" and "sbt run", but whe

Re: spark-submit results in NoClassDefFoundError

2014-10-29 Thread Tobias Pfeiffer
Hi again, On Thu, Oct 30, 2014 at 11:50 AM, Tobias Pfeiffer wrote: > > Spark assembly has been built with Hive, including Datanucleus jars on > classpath > Exception in thread "main" java.lang.NoClassDefFoundError: > com/typesafe/scalalogging/slf4j/Logger > It turned out scalalogging was not inc

Re: Questions about serialization and SparkConf

2014-10-29 Thread Ilya Ganelin
Hello Steve. 1) When you call new SparkConf you should get an object with the default config values. You can reference the Spark configuration and tuning pages for details on what those are. 2) Yes. Properties set in this configuration will be pushed down to worker nodes actually executing the s

RE: problem with start-slaves.sh

2014-10-29 Thread Pagliari, Roberto
Hi Yana, in my case I did not start any Spark worker. However, Shark was definitely running. Do you think that might be a problem? I will take a look. Thank you, From: Yana Kadiyska [yana.kadiy...@gmail.com] Sent: Wednesday, October 29, 2014 9:45 AM To: Pagliari,

Re: SparkSQL: Nested Query error

2014-10-29 Thread Sanjiv Mittal
You may use - select count(u_uid) from tusers a left outer join device b on (a.u_uid = b.d_uid) where b.d_uid is null On Wed, Oct 29, 2014 at 5:45 PM, SK wrote: > Hi, > > I am using Spark 1.1.0. I have the following SQL statement where I am > trying > to count the number of UIDs that are in th
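Wrapped in Spark SQL that would look roughly like the following (table and column names as in the thread; note the follow-up later in this digest reports a query-plan error with the plain SQLContext parser in 1.1, so a HiveContext may be needed for the outer join):

    val usersWithNoDevice = sqlContext.sql(
      """SELECT COUNT(a.u_uid)
        |FROM tusers a LEFT OUTER JOIN device b ON (a.u_uid = b.d_uid)
        |WHERE b.d_uid IS NULL""".stripMargin)
    usersWithNoDevice.collect().foreach(println)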

Algebird using spark-shell

2014-10-29 Thread bdev
I'm running into this error when I attempt to launch spark-shell passing in the algebird-core jar: ~~ $ ./bin/spark-shell --jars algebird-core_2.9.2-0.1.11.jar scala> import com.twitter.algebird._ import com.twitter.algebird._ scala> import HyperLogLog._ import HyperLogLog._ scala

MLLib: libsvm - default value initialization

2014-10-29 Thread Sameer Tilak
Hi All, I have my sparse data in libsvm format. val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "mllib/data/sample_libsvm_data.txt") I am running linear regression. Let us say that my data has the following entry: 1 1:0 4:1 I think it will assume 0 for indices 2 and 3, right? I would li
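A short illustration of the loader and of how absent indices behave; in LIBSVM format, indices not listed on a line are simply zero in the resulting sparse vector (the file path is the sample one from the question):

    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.util.MLUtils
    import org.apache.spark.rdd.RDD

    // A line such as "1 1:0 4:1" becomes label 1.0 with a sparse feature vector
    // where (1-based) index 1 is 0.0, index 4 is 1.0, and indices 2 and 3,
    // being absent, are treated as 0.0.
    val examples: RDD[LabeledPoint] =
      MLUtils.loadLibSVMFile(sc, "mllib/data/sample_libsvm_data.txt")

    examples.take(1).foreach(p => println(p.features))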

Re: Java api overhead?

2014-10-29 Thread Sonal Goyal
Thanks Koert. These numbers indeed tie back to our data and algorithms. Would going the scala route save some memory, as the java API creates wrapper Tuple2 for all pair functions? On Wednesday, October 29, 2014, Koert Kuipers wrote: > since spark holds data structures on heap (and by default tr

Re: Spark Worker node accessing Hive metastore

2014-10-29 Thread ken
Thanks Akhil. So the Spark worker node doesn't need access to the metastore to run Hive queries? If yes, which component accesses the metastore? For Hive, the Hive CLI accesses the metastore before submitting M/R jobs. Thanks, Ken

Re: SparkSQL: Nested Query error

2014-10-29 Thread SK
Hi, I am getting an error in the Query Plan when I use the SQL statement exactly as you have suggested. Is that the exact SQL statement I should be using (I am not very familiar with SQL syntax)? I also tried using the SchemaRDD's subtract method to perform this query. usersRDD.subtract(deviceR

Spark Debugging

2014-10-29 Thread Naveen Kumar Pokala
Hi, I have installed a 2-node Hadoop cluster (for example, on Unix machines A and B; A is the master node and a data node, B is a data node). I am submitting my driver programs through Spark 1.1.0 with bin/spark-submit from a PuTTY client on my Windows machine. I want to debug my program from Eclipse on

Re: Spark Debugging

2014-10-29 Thread Akhil Das
It's called remote debugging. You can read this article for more information. You will have to make sure that your cluster and the Windows machine can reach each other over the network. Thanks Best Regards
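A sketch of the usual setup: start the driver JVM with the JDWP agent and attach Eclipse to that port; the class name, jar, and port are placeholders:

    bin/spark-submit \
      --driver-java-options "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005" \
      --class com.example.MyDriver \
      myapp.jar

In Eclipse, a "Remote Java Application" debug configuration pointing at the machine running the driver and port 5005 would then attach to it.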

Re: Algebird using spark-shell

2014-10-29 Thread Akhil Das
Which version of Scala are you using? Looks like the jar you are submitting is for Scala version 2.9.x. Thanks Best Regards On Thu, Oct 30, 2014 at 9:49 AM, bdev wrote: > I'm running into this error when I attempt to launch spark-shell passing in > the algebird-core jar: > > ~~ > $ ./bin/spa
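With Spark 1.1 built against Scala 2.10, the fix would be to pass a 2.10 build of algebird-core instead (version placeholder):

    ./bin/spark-shell --jars algebird-core_2.10-<version>.jar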

Re: use additional ebs volumes for hdfs storage with spark-ec2

2014-10-29 Thread Akhil Das
I think you can check the core-site.xml or hdfs-site.xml file under /root/ephemeral-hdfs/etc/hadoop/, where you can see the data node dir property, which will be a comma-separated list of volumes. Thanks Best Regards On Thu, Oct 30, 2014 at 5:21 AM, Daniel Mahler wrote: > I started my ec2 spark cl