Re: reduceByKeyAndWindow, but using log timestamps instead of clock seconds

2015-07-02 Thread Tobias Pfeiffer
Hi, On Thu, Jan 29, 2015 at 9:52 AM, Tobias Pfeiffer wrote: > Hi, > > On Thu, Jan 29, 2015 at 1:54 AM, YaoPau wrote: >> >> My thinking is to maintain state in an RDD and update it and persist it >> with >> each 2-second pass, but this also seems like it cou

Re: Spark shell never leaves ACCEPTED state in YARN CDH5

2015-03-25 Thread Tobias Pfeiffer
Hi, On Thu, Mar 26, 2015 at 4:08 AM, Khandeshi, Ami < ami.khande...@fmr.com.invalid> wrote: > I am seeing the same behavior. I have enough resources….. > CPU *and* memory are sufficient? No previous (unfinished) jobs eating them? Tobias

Re: "Timed out while stopping the job generator" plus subsequent failures

2015-03-11 Thread Tobias Pfeiffer
Hi, I discovered what caused my issue when running on YARN and was able to work around it. On Wed, Mar 11, 2015 at 7:43 PM, Tobias Pfeiffer wrote: > The processing itself is complete, i.e., the batch currently processed at > the time of stop() is finished and no further batches are pro

Re: "Timed out while stopping the job generator" plus subsequent failures

2015-03-11 Thread Tobias Pfeiffer
Sean, On Wed, Mar 11, 2015 at 7:43 PM, Tobias Pfeiffer wrote: > > it seems like I am unable to shut down my StreamingContext properly, both > in local[n] and yarn-cluster mode. In addition, (only) in yarn-cluster > mode, subsequent use of a new StreamingContext wil

Re: SQL with Spark Streaming

2015-03-11 Thread Tobias Pfeiffer
Hi, On Thu, Mar 12, 2015 at 12:08 AM, Huang, Jie wrote: > > According to my understanding, your approach is to register a series of > tables by using transformWith, right? And then, you can get a new Dstream > (i.e., SchemaDstream), which consists of lots of SchemaRDDs. > Yep, it's basically the

Re: SchemaRDD: SQL Queries vs Language Integrated Queries

2015-03-11 Thread Tobias Pfeiffer
Hi, On Wed, Mar 11, 2015 at 11:05 PM, Cesar Flores wrote: > > Thanks for both answers. One final question. *This registerTempTable is > not an extra process that the SQL queries need to do that may decrease > performance over the language integrated method calls? * > As far as I know, registerTe

"Timed out while stopping the job generator" plus subsequent failures

2015-03-11 Thread Tobias Pfeiffer
Hi, it seems like I am unable to shut down my StreamingContext properly, both in local[n] and yarn-cluster mode. In addition, (only) in yarn-cluster mode, subsequent use of a new StreamingContext will raise an InvalidActorNameException. I use logger.info("stoppingStreamingContext") staticStr

Re: SchemaRDD: SQL Queries vs Language Integrated Queries

2015-03-10 Thread Tobias Pfeiffer
Hi, On Tue, Mar 10, 2015 at 2:13 PM, Cesar Flores wrote: > I am new to the SchemaRDD class, and I am trying to decide in using SQL > queries or Language Integrated Queries ( > https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD > ). > > Can someone tell me wha

Re: SQL with Spark Streaming

2015-03-10 Thread Tobias Pfeiffer
Hi, On Wed, Mar 11, 2015 at 9:33 AM, Cheng, Hao wrote: > Intel has a prototype for doing this, SaiSai and Jason are the authors. > Probably you can ask them for some materials. > The github repository is here: https://github.com/intel-spark/stream-sql Also, what I did is writing a wrapper cla

Re: Spark with data on NFS v HDFS

2015-03-05 Thread Tobias Pfeiffer
Hi, On Thu, Mar 5, 2015 at 10:58 PM, Ashish Mukherjee < ashish.mukher...@gmail.com> wrote: > > I understand Spark can be used with Hadoop or standalone. I have certain > questions related to use of the correct FS for Spark data. > > What is the efficiency trade-off in feeding data to Spark from NF

Re: scala.Double vs java.lang.Double in RDD

2015-03-04 Thread Tobias Pfeiffer
Hi, On Thu, Mar 5, 2015 at 12:20 AM, Imran Rashid wrote: > This doesn't involve spark at all, I think this is entirely an issue with > how scala deals w/ primitives and boxing. Often it can hide the details > for you, but IMO it just leads to far more confusing errors when things > don't work o

scala.Double vs java.lang.Double in RDD

2015-03-04 Thread Tobias Pfeiffer
Hi, I have a function with signature def aggFun1(rdd: RDD[(Long, (Long, Double))]): RDD[(Long, Any)] and one with def aggFun2[_Key: ClassTag, _Index](rdd: RDD[(_Key, (_Index, Double))]): RDD[(_Key, Double)] where all "Double" classes involved are "scala.Double" classes (according t

Re: Issue with yarn cluster - hangs in accepted state.

2015-03-03 Thread Tobias Pfeiffer
Hi, On Wed, Mar 4, 2015 at 6:20 AM, Zhan Zhang wrote: > Do you have enough resource in your cluster? You can check your resource > manager to see the usage. > Yep, I can confirm that this is a very annoying issue. If there is not enough memory or VCPUs available, your app will just stay in ACC

Re: Spark sql results can't be printed out to system console from spark streaming application

2015-03-03 Thread Tobias Pfeiffer
Hi, can you explain how you copied that into your *streaming* application? Like, how do you issue the SQL, what data do you operate on, how do you view the logs etc.? Tobias On Wed, Mar 4, 2015 at 8:55 AM, Cui Lin wrote: > > > >Dear all, > > > >I found the below sample code can be printed out

Re: Best practices for query creation in Spark SQL.

2015-03-02 Thread Tobias Pfeiffer
Hi, I think your chances for a satisfying answer would increase dramatically if you elaborated a bit more on what you actually want to know. (Holds for any of your last four questions about Spark SQL...) Tobias

Re: Dealing with 'smaller' data

2015-02-26 Thread Tobias Pfeiffer
On Fri, Feb 27, 2015 at 10:57 AM, Gary Malouf wrote: > So when deciding whether to take on installing/configuring Spark, the size > of the data does not automatically make that decision in your mind. > You got me there ;-) Tobias

Re: Dealing with 'smaller' data

2015-02-26 Thread Tobias Pfeiffer
Hi On Fri, Feb 27, 2015 at 10:50 AM, Gary Malouf wrote: > The honest answer is that it is unclear to me at this point. I guess what > I am really wondering is if there are cases where one would find it > beneficial to use Spark against one or more RDBs? > Well, RDBs are all about *storage*, wh

Re: Dealing with 'smaller' data

2015-02-26 Thread Tobias Pfeiffer
Gary, On Fri, Feb 27, 2015 at 8:40 AM, Gary Malouf wrote: > I'm considering whether or not it is worth introducing Spark at my new > company. The data is no-where near Hadoop size at this point (it sits in > an RDS Postgres cluster). > Will it ever become "Hadoop size"? Looking at the overhead

Re: Spark Streaming - Collecting RDDs into array in the driver program

2015-02-25 Thread Tobias Pfeiffer
Hi, On Thu, Feb 26, 2015 at 11:24 AM, Thanigai Vellore < thanigai.vell...@gmail.com> wrote: > It appears that the function immediately returns even before the > foreachrdd stage is executed. Is that possible? > Sure, that's exactly what happens. foreachRDD() schedules a computation, it does not p

Re: Why must the dstream.foreachRDD(...) parameter be serializable?

2015-02-23 Thread Tobias Pfeiffer
Hi, On Tue, Feb 24, 2015 at 4:34 AM, Tathagata Das wrote: > There are different kinds of checkpointing going on. updateStateByKey > requires RDD checkpointing which can be enabled only by called > sparkContext.setCheckpointDirectory. But that does not enable Spark > Streaming driver checkpoints,

Re: Why must the dstream.foreachRDD(...) parameter be serializable?

2015-02-23 Thread Tobias Pfeiffer
Sean, thanks for your message! On Mon, Feb 23, 2015 at 6:03 PM, Sean Owen wrote: > > What I haven't investigated is whether you can enable checkpointing > for the state in updateStateByKey separately from this mechanism, > which is exactly your question. What happens if you set a checkpoint > di

Re: Use Spark Streaming for Batch?

2015-02-22 Thread Tobias Pfeiffer
Hi, On Sat, Feb 21, 2015 at 1:05 AM, craigv wrote: > > /Might it be possible to perform "large batches" processing on HDFS time > > series data using Spark Streaming?/ > > > > 1.I understand that there is not currently an InputDStream that could do > > what's needed. I would have to create such

Re: spark challenge: zip with next???

2015-01-29 Thread Tobias Pfeiffer
Hi, On Fri, Jan 30, 2015 at 6:32 AM, Ganelin, Ilya wrote: > Make a copy of your RDD with an extra entry in the beginning to offset. > The you can zip the two RDDs and run a map to generate an RDD of > differences. > Does that work? I recently tried something to compute differences between each
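A minimal sketch of one way to get "zip with next", using zipWithIndex plus a self-join rather than the offset-copy idea quoted above (assumes a spark-shell session where sc is defined; sample values are made up):

    val values = sc.parallelize(Seq(3.0, 5.0, 9.0, 10.0))
    val indexed = values.zipWithIndex().map(_.swap)             // (index, value)
    val successors = indexed.map { case (i, v) => (i - 1, v) }  // each value keyed by its predecessor's index
    val diffs = indexed.join(successors)                        // (index, (current, next)); the last element drops out
      .map { case (_, (current, next)) => next - current }
    diffs.collect().foreach(println)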

Re: SQL query over (Long, JSON string) tuples

2015-01-29 Thread Tobias Pfeiffer
Hi Ayoub, thanks for your mail! On Thu, Jan 29, 2015 at 6:23 PM, Ayoub wrote: > > SQLContext and hiveContext have a "jsonRDD" method which accepts an > RDD[String] where the string is a JSON String and returns a SchemaRDD, it > extends RDD[Row] which is the type you want. > > Afterwards you should be
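A minimal sketch of the jsonRDD approach described in the reply (Spark 1.x API, assuming a spark-shell session; the sample JSON is made up, and note that this alone drops the Long timestamp, which is the part the original question is really about):

    import org.apache.spark.sql.SQLContext
    val sqlContext = new SQLContext(sc)
    val data = sc.parallelize(Seq((1L, """{"name": "a", "value": 1}"""),
                                  (2L, """{"name": "b", "value": 2}""")))
    val jsonSchemaRdd = sqlContext.jsonRDD(data.map(_._2))   // infer the schema from the JSON strings
    jsonSchemaRdd.registerTempTable("events")
    sqlContext.sql("SELECT name, value FROM events").collect().foreach(println)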

SQL query over (Long, JSON string) tuples

2015-01-29 Thread Tobias Pfeiffer
Hi, I have data as RDD[(Long, String)], where the Long is a timestamp and the String is a JSON-encoded string. I want to infer the schema of the JSON and then do a SQL statement on the data (no aggregates, just column selection and UDF application), but still have the timestamp associated with eac

Re: reduceByKeyAndWindow, but using log timestamps instead of clock seconds

2015-01-28 Thread Tobias Pfeiffer
Hi, On Thu, Jan 29, 2015 at 1:54 AM, YaoPau wrote: > > My thinking is to maintain state in an RDD and update it and persist it with > each 2-second pass, but this also seems like it could get messy. Any > thoughts or examples that might help me? > I have just implemented some timestamp-based win

Re: Error reporting/collecting for users

2015-01-27 Thread Tobias Pfeiffer
Hi, On Wed, Jan 28, 2015 at 1:45 PM, Soumitra Kumar wrote: > It is a Streaming application, so how/when do you plan to access the > accumulator on driver? > Well... maybe there would be some user command or web interface showing the errors that have happened during processing...? Thanks Tobias

Re: Error reporting/collecting for users

2015-01-27 Thread Tobias Pfeiffer
Hi, thanks for your mail! On Wed, Jan 28, 2015 at 11:44 AM, Tathagata Das wrote: > That seems reasonable to me. Are you having any problems doing it this way? > Well, actually I haven't done that yet. The idea of using accumulators to collect errors just came while writing the email, but I tho

Re: Why must the dstream.foreachRDD(...) parameter be serializable?

2015-01-27 Thread Tobias Pfeiffer
Hi, thanks for the answers! On Wed, Jan 28, 2015 at 11:31 AM, Shao, Saisai wrote: > > Also this `foreachFunc` is more like an action function of RDD, thinking > of rdd.foreach(func), in which `func` need to be serializable. So maybe I > think your way of use it is not a normal way :). > Yeah I

Why must the dstream.foreachRDD(...) parameter be serializable?

2015-01-27 Thread Tobias Pfeiffer
Hi, I want to do something like dstream.foreachRDD(rdd => if (someCondition) ssc.stop()) so in particular the function does not touch any element in the RDD and runs completely within the driver. However, this fails with a NotSerializableException because $outer is not serializable etc. The DStr

Error reporting/collecting for users

2015-01-27 Thread Tobias Pfeiffer
Hi, in my Spark Streaming application, computations depend on users' input in terms of * user-defined functions * computation rules * etc. that can throw exceptions in various cases (think: exception in UDF, division by zero, invalid access by key etc.). Now I am wondering about what is a good
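A minimal sketch of the accumulator idea floated later in this thread (Spark 1.x API, assuming sc from a spark-shell session; riskyUdf is a made-up stand-in for a user-defined function that may throw):

    def riskyUdf(i: Int): Int =
      if (i % 10 == 0) throw new RuntimeException("bad input") else i * 2

    val errorCount = sc.accumulator(0L, "udf errors")
    val results = sc.parallelize(1 to 100).flatMap { item =>
      try Some(riskyUdf(item))
      catch { case e: Exception => errorCount += 1L; None }   // count the error, drop the item
    }
    results.count()                                  // force evaluation
    println("errors so far: " + errorCount.value)    // the value is only readable on the driver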

Re: [SQL] Conflicts in inferred Json Schemas

2015-01-25 Thread Tobias Pfeiffer
Hi, On Thu, Jan 22, 2015 at 2:26 AM, Corey Nolet wrote: > Let's say I have 2 formats for json objects in the same file > schema1 = { "location": "12345 My Lane" } > schema2 = { "location":{"houseAddres":"1234 My Lane"} } > > From my tests, it looks like the current inferSchema() function will en

Re: spark streaming with checkpoint

2015-01-25 Thread Tobias Pfeiffer
Hi, On Tue, Jan 20, 2015 at 8:16 PM, balu.naren wrote: > I am a beginner to spark streaming. So have a basic doubt regarding > checkpoints. My use case is to calculate the no of unique users by day. I > am using reduce by key and window for this. Where my window duration is 24 > hours and slide

Re: Pairwise Processing of a List

2015-01-25 Thread Tobias Pfeiffer
Sean, On Mon, Jan 26, 2015 at 10:28 AM, Sean Owen wrote: > Note that RDDs don't really guarantee anything about ordering though, > so this only makes sense if you've already sorted some upstream RDD by > a timestamp or sequence number. > Speaking of order, is there some reading on guarantees an

Re: Serializability: for vs. while loops

2015-01-25 Thread Tobias Pfeiffer
Aaron, On Thu, Jan 15, 2015 at 5:05 PM, Aaron Davidson wrote: > Scala for-loops are implemented as closures using anonymous inner classes > which are instantiated once and invoked many times. This means, though, > that the code inside the loop is actually sitting inside a class, which > confuses

Re: Pairwise Processing of a List

2015-01-25 Thread Tobias Pfeiffer
Hi, On Mon, Jan 26, 2015 at 9:32 AM, Steve Nunez wrote: > I’ve got a list of points: List[(Float, Float)]) that represent (x,y) > coordinate pairs and need to sum the distance. It’s easy enough to compute > the distance: > Are you saying you want all combinations (N^2) of distances? That shoul
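If the intent is the sum of distances between consecutive points (not all pairs), a minimal sketch on a plain Scala list, assuming Euclidean distance:

    val points: List[(Float, Float)] = List((0f, 0f), (3f, 4f), (3f, 8f))
    val totalDistance = points.sliding(2).collect {
      case List((x1, y1), (x2, y2)) =>
        math.sqrt(math.pow(x2 - x1, 2) + math.pow(y2 - y1, 2))
    }.sum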

Re: Closing over a var with changing value in Streaming application

2015-01-21 Thread Tobias Pfeiffer
Hi, On Wed, Jan 21, 2015 at 9:13 PM, Bob Tiernay wrote: > Maybe I'm misunderstanding something here, but couldn't this be done with > broadcast variables? I there is the following caveat from the docs: > > "In addition, the object v should not be modified after it is broadcast > in order to ensu

Re: Closing over a var with changing value in Streaming application

2015-01-21 Thread Tobias Pfeiffer
Hi again, On Wed, Jan 21, 2015 at 4:53 PM, Tobias Pfeiffer wrote: > > On Wed, Jan 21, 2015 at 4:46 PM, Akhil Das > wrote: > >> How about using accumulators >> <http://spark.apache.org/docs/1.2.0/programming-guide.html#accumulators>? >> > > As far a

Re: Closing over a var with changing value in Streaming application

2015-01-20 Thread Tobias Pfeiffer
Hi, On Wed, Jan 21, 2015 at 4:46 PM, Akhil Das wrote: > How about using accumulators > <http://spark.apache.org/docs/1.2.0/programming-guide.html#accumulators>? > As far as I understand, they solve the part of the problem that I am not worried about, namely increasing the counter. I was more wo

Closing over a var with changing value in Streaming application

2015-01-20 Thread Tobias Pfeiffer
Hi, I am developing a Spark Streaming application where I want every item in my stream to be assigned a unique, strictly increasing Long. My input data already has RDD-local integers (from 0 to N-1) assigned, so I am doing the following: var totalNumberOfItems = 0L // update the keys of the s
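A rough sketch of that pattern, freezing the driver-side counter into a local val for each batch so the map closure does not reference the changing var (names and types are illustrative; whether this plays well with checkpointing is a separate question):

    // localIndexedStream: DStream[(Long, String)] with per-batch indices 0..N-1 already assigned
    var totalNumberOfItems = 0L
    val globallyIndexed = localIndexedStream.transform { rdd =>
      val offset = totalNumberOfItems                 // capture the current value for this batch
      val result = rdd.map { case (localIndex, item) => (offset + localIndex, item) }
      totalNumberOfItems += rdd.count()               // runs on the driver, once per batch
      result
    }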

Re: MatchError in JsonRDD.toLong

2015-01-19 Thread Tobias Pfeiffer
Hi, On Fri, Jan 16, 2015 at 6:14 PM, Wang, Daoyuan wrote: > The second parameter of jsonRDD is the sampling ratio when we infer schema. > OK, I was aware of this, but I guess I understand the problem now. My sampling ratio is so low that I only see the Long values of data items and infer it's a

Re: If an RDD appeared twice in a DAG, of which calculation is triggered by a single action, will this RDD be calculated twice?

2015-01-19 Thread Tobias Pfeiffer
Hi, On Sat, Jan 17, 2015 at 3:37 AM, Peng Cheng wrote: > I'm talking about RDD1 (not persisted or checkpointed) in this situation: > > ...(somewhere) -> RDD1 -> RDD2 > || > V V >

Re: How to get the master URL at runtime inside driver program?

2015-01-19 Thread Tobias Pfeiffer
Hi, On Sun, Jan 18, 2015 at 11:08 AM, guxiaobo1982 wrote: > > Driver programs submitted by the spark-submit script will get the runtime > spark master URL, but how it get the URL inside the main method when > creating the SparkConf object? > The master will be stored in the spark.master property
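For illustration, reading it back at runtime (sketch):

    val master = sc.getConf.get("spark.master")       // from an existing SparkContext
    val masterProp = sys.props.get("spark.master")    // before creating one, if spark-submit set the property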

Re: Determine number of running executors

2015-01-19 Thread Tobias Pfeiffer
Hi, On Sat, Jan 17, 2015 at 3:05 AM, Shuai Zheng wrote: > > Can you share more information about how do you do that? I also have > similar question about this. > Not very proud about it ;-), but here you go: // find the number of workers available to us. val _runCmd = scala.util.Properties.prop
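A rough reconstruction of that hack (hedged: it simply greps the JVM launch command for --num-executors, and the flag name and fallback value here are assumptions):

    val runCmd = scala.util.Properties.propOrElse("sun.java.command", "")
    val numExecutors = "--num-executors\\s+(\\d+)".r
      .findFirstMatchIn(runCmd)
      .map(_.group(1).toInt)
      .getOrElse(1)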

Re: MatchError in JsonRDD.toLong

2015-01-16 Thread Tobias Pfeiffer
Hi, On Fri, Jan 16, 2015 at 5:55 PM, Wang, Daoyuan wrote: > > Can you provide how you create the JsonRDD? > This should be reproducible in the Spark shell: import org.apache.spark.sql._ val sqlc = new SQLContext(sc) val rdd = sc.parall

Re: MatchError in JsonRDD.toLong

2015-01-16 Thread Tobias Pfeiffer
Hi again, On Fri, Jan 16, 2015 at 4:25 PM, Tobias Pfeiffer wrote: > Now I'm wondering where this comes from (I haven't touched this component > in a while, nor upgraded Spark etc.) [...] > So the reason that the error is showing up now is that suddenly data from a different

MatchError in JsonRDD.toLong

2015-01-15 Thread Tobias Pfeiffer
Hi, I am experiencing a weird error that suddenly popped up in my unit tests. I have a couple of HDFS files in JSON format and my test is basically creating a JsonRDD and then issuing a very simple SQL query over it. This used to work fine, but now suddenly I get: 15:58:49.039 [Executor task laun

Re: Testing if an RDD is empty?

2015-01-15 Thread Tobias Pfeiffer
Hi, On Fri, Jan 16, 2015 at 7:31 AM, freedafeng wrote: > > I myself saw many times that my app threw out exceptions because an empty > RDD cannot be saved. This is not big issue, but annoying. Having a cheap > solution testing if an RDD is empty would be nice if there is no such thing > available
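A cheap check that fetches at most one element (sketch; later Spark versions added an RDD.isEmpty built on the same idea):

    def isEmpty[T](rdd: org.apache.spark.rdd.RDD[T]): Boolean = rdd.take(1).isEmpty

    if (!isEmpty(someRdd)) someRdd.saveAsTextFile("/tmp/output")   // someRdd and the path are placeholders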

Re: Serializability: for vs. while loops

2015-01-15 Thread Tobias Pfeiffer
Aaron, thanks for your mail! On Thu, Jan 15, 2015 at 5:05 PM, Aaron Davidson wrote: > Scala for-loops are implemented as closures using anonymous inner classes > [...] > While loops, on the other hand, involve none of this trickery, and > everyone is happy. > Ah, I was suspecting something lik

Serializability: for vs. while loops

2015-01-14 Thread Tobias Pfeiffer
Hi, sorry, I don't like questions about serializability myself, but still... Can anyone give me a hint why for (i <- 0 to (maxId - 1)) { ... } throws a NotSerializableException in the loop body while var i = 0 while (i < maxId) { // same code as in the for loop i += 1 } work

Re:

2015-01-14 Thread Tobias Pfeiffer
Hi, On Thu, Jan 15, 2015 at 12:23 AM, Ted Yu wrote: > > On Wed, Jan 14, 2015 at 6:58 AM, Jianguo Li > wrote: > >> I am using Spark-1.1.1. When I used "sbt test", I ran into the following >> exceptions. Any idea how to solve it? Thanks! I think somebody posted this >> question before, but no on

Re: *ByKey aggregations: performance + order

2015-01-14 Thread Tobias Pfeiffer
Sean, thanks for your message. On Wed, Jan 14, 2015 at 8:36 PM, Sean Owen wrote: > On Wed, Jan 14, 2015 at 4:53 AM, Tobias Pfeiffer wrote: > > OK, it seems like even on a local machine (with no network overhead), the > > groupByKey version is about 5 times slower than a

Re: *ByKey aggregations: performance + order

2015-01-13 Thread Tobias Pfeiffer
Hi, On Wed, Jan 14, 2015 at 12:11 PM, Tobias Pfeiffer wrote: > > Now I don't know (yet) if all of the functions I want to compute can be > expressed in this way and I was wondering about *how much* more expensive > we are talking about. > OK, it seems like even on a lo

*ByKey aggregations: performance + order

2015-01-13 Thread Tobias Pfeiffer
Hi, I have an RDD[(Long, MyData)] where I want to compute various functions on lists of MyData items with the same key (this will in general be a rather short lists, around 10 items per key). Naturally I was thinking of groupByKey() but was a bit intimidated by the warning: "This operation may be
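For reference, a sketch of building the per-key lists with aggregateByKey, one of the *ByKey alternatives behind that warning (MyData and the sample values are made up; assumes sc from spark-shell):

    case class MyData(v: Double)
    val rdd = sc.parallelize(Seq((1L, MyData(1.0)), (1L, MyData(2.0)), (2L, MyData(3.0))))
    val grouped = rdd.aggregateByKey(List.empty[MyData])(
      (acc, item) => item :: acc,      // add a value to the partition-local list
      (acc1, acc2) => acc1 ::: acc2    // merge lists from different partitions
    )
    grouped.collect().foreach(println)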

Re: quickly counting the number of rows in a partition?

2015-01-13 Thread Tobias Pfeiffer
Hi again, On Wed, Jan 14, 2015 at 10:06 AM, Tobias Pfeiffer wrote: > If you think of > items.map(x => /* throw exception */).count() > then even though the count you want to get does not necessarily require > the evaluation of the function in map() (i.e., the number is the s

Re: quickly counting the number of rows in a partition?

2015-01-13 Thread Tobias Pfeiffer
Hi, On Mon, Jan 12, 2015 at 8:09 PM, Ganelin, Ilya wrote: > Use the mapPartitions function. It returns an iterator to each partition. > Then just get that length by converting to an array. > On Tue, Jan 13, 2015 at 2:50 PM, Kevin Burton wrote: > Doesn’t that just read in all the values? The
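A sketch that counts rows per partition without materializing them into an array (assumes sc from spark-shell):

    val rdd = sc.parallelize(1 to 1000, 4)
    val countsPerPartition = rdd
      .mapPartitionsWithIndex { (idx, iter) => Iterator((idx, iter.size)) }   // size walks the iterator but stores nothing
      .collect()
    countsPerPartition.foreach(println)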

Re: RDD Moving Average

2015-01-08 Thread Tobias Pfeiffer
Hi, On Wed, Jan 7, 2015 at 9:47 AM, Asim Jalis wrote: > One approach I was considering was to use mapPartitions. It is > straightforward to compute the moving average over a partition, except for > near the end point. Does anyone see how to fix that? > Well, I guess this is not a perfect use ca

Re: Create DStream consisting of HDFS and (then) Kafka data

2015-01-07 Thread Tobias Pfeiffer
Hi, On Thu, Jan 8, 2015 at 2:19 PM, wrote: > dstream processing bulk HDFS data- is something I don't feel is super well socialized yet, & fingers crossed that base gets built up a little > more. Just out of interest (and hoping not to hijack my own thread), why are you not doing plain RDD pro

Create DStream consisting of HDFS and (then) Kafka data

2015-01-07 Thread Tobias Pfeiffer
Hi, I have a setup where data from an external stream is piped into Kafka and also written to HDFS periodically for long-term storage. Now I am trying to build an application that will first process the HDFS files and then switch to Kafka, continuing with the first item that was not yet in HDFS. (

Re: How to replace user.id to user.names in a file

2015-01-06 Thread Tobias Pfeiffer
Hi, On Wed, Jan 7, 2015 at 11:13 AM, Riginos Samaras wrote: > exactly thats what I'm looking for, my code is like this: > //code > > val users_map = users_file.map{ s => > > val parts = s.split(",") > > (parts(0).toInt, parts(1)) > > }.distinct > > //code > > > but i get the error: > > error: va

Re: How to replace user.id to user.names in a file

2015-01-06 Thread Tobias Pfeiffer
Hi, On Wed, Jan 7, 2015 at 10:47 AM, Riginos Samaras wrote: > Yes something like this. Can you please give me an example to create a Map? > That depends heavily on the shape of your input file. What about something like: (for (line <- Source.fromFile(filename).getLines()) { val items = line.
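A sketch of that idea, assuming a local file of comma-separated "id,name" lines (the filename is a placeholder):

    import scala.io.Source
    val userMap: Map[Int, String] = Source.fromFile("users.csv").getLines().map { line =>
      val parts = line.split(",")
      parts(0).toInt -> parts(1)
    }.toMap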

Re: How to replace user.id to user.names in a file

2015-01-06 Thread Tobias Pfeiffer
Hi, it looks to me as if you need the whole user database on every node, so maybe put the id->name information as a Map[Id, String] in a broadcast variable and then do something like recommendations.map(line => { line.map(uid => usernames(uid)) }) or so? Tobias
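A minimal sketch of that broadcast idea (all names and sample data are placeholders; assumes sc from spark-shell):

    val usernames: Map[Int, String] = Map(1 -> "alice", 2 -> "bob", 3 -> "carol")
    val usernamesBc = sc.broadcast(usernames)
    val recommendations = sc.parallelize(Seq(Seq(1, 2), Seq(2, 3)))   // lists of user ids
    val withNames = recommendations.map(line => line.map(uid => usernamesBc.value(uid)))
    withNames.collect().foreach(println)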

Re: I think I am almost lost in the internals of Spark

2015-01-06 Thread Tobias Pfeiffer
Hi, On Tue, Jan 6, 2015 at 11:24 PM, Todd wrote: > I am a bit new to Spark, except that I tried simple things like word > count, and the examples given in the spark sql programming guide. > Now, I am investigating the internals of Spark, but I think I am almost > lost, because I could not grasp

Re: Add StructType column to SchemaRDD

2015-01-05 Thread Tobias Pfeiffer
Hi Michael, On Tue, Jan 6, 2015 at 3:43 PM, Michael Armbrust wrote: > Oh sorry, I'm rereading your email more carefully. Its only because you > have some setup code that you want to amortize? > Yes, exactly that. Concerning the docs, I'd be happy to contribute, but I don't really understand w

Add StructType column to SchemaRDD

2015-01-05 Thread Tobias Pfeiffer
Hi, I have a SchemaRDD where I want to add a column with a value that is computed from the rest of the row. As the computation involves a network operation and requires setup code, I can't use "SELECT *, myUDF(*) FROM rdd", but I wanted to use a combination of: - get schema of input SchemaRDD

Re: serialization issue with mapPartitions

2014-12-25 Thread Tobias Pfeiffer
Hi, On Fri, Dec 26, 2014 at 10:13 AM, ey-chih chow wrote: > I should rephrase my question as follows: > > How to use the corresponding Hadoop Configuration of a HadoopRDD in > defining > a function as an input parameter to the MapPartitions function? > Well, you could try to pull the `val confi

Re: Discourse: A proposed alternative to the Spark User list

2014-12-25 Thread Tobias Pfeiffer
Nick, uh, I would have expected a rather heated discussion, but the opposite seems to be the case ;-) Independent of my personal preferences w.r.t. usability, habits etc., I think it is not good for a software/tool/framework if questions and discussions are spread over too many places. I guess ev

Re: serialization issue with mapPartitions

2014-12-25 Thread Tobias Pfeiffer
Hi, On Fri, Dec 26, 2014 at 1:32 AM, ey-chih chow wrote: > > I got some issues with mapPartitions with the following piece of code: > > val sessions = sc > .newAPIHadoopFile( > "... path to an avro file ...", > classOf[org.apache.avro.mapreduce.AvroKeyInputFormat[ByteBuf

Re: unable to do group by with 1st column

2014-12-25 Thread Tobias Pfeiffer
Hi, On Fri, Dec 26, 2014 at 5:22 AM, Amit Behera wrote: > > How can I do it? Please help me to do. > Have you considered using groupByKey? http://spark.apache.org/docs/latest/programming-guide.html#transformations Tobias

Re: SchemaRDD to RDD[String]

2014-12-24 Thread Tobias Pfeiffer
Hi, On Wed, Dec 24, 2014 at 3:18 PM, Hafiz Mujadid wrote: > > I want to convert a schemaRDD into RDD of String. How can we do that? > > Currently I am doing like this which is not converting correctly no > exception but resultant strings are empty > > here is my code > Hehe, this is the most Jav
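One simple way (sketch; in Spark 1.x a SchemaRDD is an RDD of Rows, and a Row can be rendered like a sequence of column values — schemaRdd below is whatever SchemaRDD you already have):

    val strings: org.apache.spark.rdd.RDD[String] = schemaRdd.map(row => row.mkString(","))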

Re: How to run an action and get output?

2014-12-23 Thread Tobias Pfeiffer
Hi, On Fri, Dec 19, 2014 at 6:53 PM, Ashic Mahtab wrote: > > val doSomething(entry:SomeEntry, session:Session) : SomeOutput = { > val result = session.SomeOp(entry) > SomeOutput(entry.Key, result.SomeProp) > } > > I could use a transformation for rdd.map, but in case of failures, the map

Re: Semantics of foreachPartition()

2014-12-18 Thread Tobias Pfeiffer
Hi again, On Thu, Dec 18, 2014 at 6:43 PM, Tobias Pfeiffer wrote: > > tmpRdd.foreachPartition(iter => { > iter.map(item => { > println("xyz: " + item) > }) > }) > Uh, with iter.foreach(...) it works... the

Semantics of foreachPartition()

2014-12-18 Thread Tobias Pfeiffer
Hi, I have the following code in my application: tmpRdd.foreach(item => { println("abc: " + item) }) tmpRdd.foreachPartition(iter => { iter.map(item => { println("xyz: " + item) }) }) In the output, I see only the "abc" pr
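The reason, as the follow-up above notes, is that Iterator.map is lazy and nothing inside foreachPartition consumes the mapped iterator. A sketch of the working variant (assumes sc from spark-shell):

    val tmpRdd = sc.parallelize(1 to 10)
    tmpRdd.foreachPartition { iter =>
      iter.foreach(item => println("xyz: " + item))   // foreach actually consumes the iterator
    }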

Re: spark streaming kafa best practices ?

2014-12-17 Thread Tobias Pfeiffer
Hi, On Thu, Dec 18, 2014 at 3:08 AM, Patrick Wendell wrote: > > On Wed, Dec 17, 2014 at 5:43 AM, Gerard Maas > wrote: > > I was wondering why one would choose for rdd.map vs rdd.foreach to > execute a > > side-effecting function on an RDD. > Personally, I like to get the count of processed item

Re: Spark SQL DSL for joins?

2014-12-16 Thread Tobias Pfeiffer
Jerry, On Wed, Dec 17, 2014 at 3:35 PM, Jerry Raj wrote: > > Another problem with the DSL: > > t1.where('term == "dmin").count() returns zero. Looks like you need ===: https://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD Tobias
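That is, the language-integrated form would look like the sketch below (t1 being the SchemaRDD from the original question, with the SQLContext implicits in scope):

    t1.where('term === "dmin").count()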

Re: Adding a column to a SchemaRDD

2014-12-14 Thread Tobias Pfeiffer
Nathan, On Fri, Dec 12, 2014 at 3:11 PM, Nathan Kronenfeld < nkronenf...@oculusinfo.com> wrote: > > I can see how to do it if can express the added values in SQL - just run > "SELECT *,valueCalculation AS newColumnName FROM table" > > I've been searching all over for how to do this if my added val

Re: Running spark-submit from a remote machine using a YARN application

2014-12-14 Thread Tobias Pfeiffer
Hi, On Fri, Dec 12, 2014 at 7:01 AM, ryaminal wrote: > > Now our solution is to make a very simple YARN application which executes > as its command "spark-submit --master yarn-cluster > s3n://application/jar.jar > ...". This seemed so simple and elegant, but it has some weird issues. We > get "N

Re: spark-submit on YARN is slow

2014-12-08 Thread Tobias Pfeiffer
Hi, On Tue, Dec 9, 2014 at 4:39 AM, Sandy Ryza wrote: > > Can you try using the YARN Fair Scheduler and set > yarn.scheduler.fair.continuous-scheduling-enabled to true? > I'm using Cloudera 5.2.0 and my configuration says yarn.resourcemanager.scheduler.class = org.apache.hadoop.yarn.server.re

Count-based windows

2014-12-08 Thread Tobias Pfeiffer
Hi, I am interested in building an application that uses sliding windows not based on the time when the item was received, but on either * a timestamp embedded in the data, or * a count (like: every 10 items, look at the last 100 items). Also, I want to do this on stream data received from Kafka,

Re: spark-submit on YARN is slow

2014-12-07 Thread Tobias Pfeiffer
Hi, thanks for your responses! On Sat, Dec 6, 2014 at 4:22 AM, Sandy Ryza wrote: > > What version are you using? In some recent versions, we had a couple of > large hardcoded sleeps on the Spark side. > I am using Spark 1.1.1. As Andrew mentioned, I guess most of the 10 seconds waiting time p

Re: SPARK LIMITATION - more than one case class is not allowed !!

2014-12-04 Thread Tobias Pfeiffer
Rahul, On Fri, Dec 5, 2014 at 3:51 PM, Rahul Bindlish < rahul.bindl...@nectechnologies.in> wrote: > > 1. Copy csv files in current directory. > 2. Open spark-shell from this directory. > 3. Run "one_scala" file which will create object-files from csv-files in > current directory. > 4. Restart spar

Re: SPARK LIMITATION - more than one case class is not allowed !!

2014-12-04 Thread Tobias Pfeiffer
Rahul, On Fri, Dec 5, 2014 at 2:50 PM, Rahul Bindlish < rahul.bindl...@nectechnologies.in> wrote: > > I have done so thats why spark is able to load objectfile [e.g. person_obj] > and spark has maintained serialVersionUID [person_obj]. > > Next time when I am trying to load another objectfile [e.g

Re: SPARK LIMITATION - more than one case class is not allowed !!

2014-12-04 Thread Tobias Pfeiffer
Rahul, On Fri, Dec 5, 2014 at 1:29 PM, Rahul Bindlish < rahul.bindl...@nectechnologies.in> wrote: > > I have created objectfiles [person_obj,office_obj] from > csv[person_csv,office_csv] files using case classes[person,office] with API > (saveAsObjectFile) > > Now I restarted spark-shell and load

Re: SPARK LIMITATION - more than one case class is not allowed !!

2014-12-04 Thread Tobias Pfeiffer
On Fri, Dec 5, 2014 at 12:53 PM, Rahul Bindlish < rahul.bindl...@nectechnologies.in> wrote: > Is it a limitation that spark does not support more than one case class at > a > time. > What do you mean? I do not have the slightest idea what you *could* possibly mean by "to support a case class". T

Re: Stateful mapPartitions

2014-12-04 Thread Tobias Pfeiffer
Hi, On Fri, Dec 5, 2014 at 3:56 AM, Akshat Aranya wrote: > Is it possible to have some state across multiple calls to mapPartitions > on each partition, for instance, if I want to keep a database connection > open? > If you're using Scala, you can use a singleton object, this will exist once pe
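A sketch of that singleton-object pattern (the JDBC URL is a placeholder and needs the matching driver on the classpath; the lazy val is initialized once per executor JVM; assumes sc from spark-shell):

    object ConnectionHolder {
      lazy val connection: java.sql.Connection =
        java.sql.DriverManager.getConnection("jdbc:h2:mem:example")   // placeholder URL
    }

    val rdd = sc.parallelize(1 to 10)
    val annotated = rdd.mapPartitions { iter =>
      val conn = ConnectionHolder.connection   // reused by all partitions running in this JVM
      iter.map(item => (item, conn.getMetaData.getDatabaseProductName))
    }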

Re: Market Basket Analysis

2014-12-04 Thread Tobias Pfeiffer
Hi, On Thu, Dec 4, 2014 at 11:58 PM, Rohit Pujari wrote: > > I'd like to do market basket analysis using spark, what're my options? > To do it or not to do it ;-) Seriously, could you elaborate a bit on what you want to know? Tobias

Re: netty on classpath when using spark-submit

2014-12-03 Thread Tobias Pfeiffer
Markus, On Tue, Nov 11, 2014 at 10:40 AM, M. Dale wrote: > > I never tried to use this property. I was hoping someone else would jump > in. When I saw your original question I remembered that Hadoop has > something similar. So I searched and found the link below. A quick JIRA > search seems to

spark-submit on YARN is slow

2014-12-03 Thread Tobias Pfeiffer
Hi, I am using spark-submit to submit my application to YARN in "yarn-cluster" mode. I have both the Spark assembly jar file as well as my application jar file put in HDFS and can see from the logging output that both files are used from there. However, it still takes about 10 seconds for my appli

Re: Best way to have some singleton per worker

2014-12-03 Thread Tobias Pfeiffer
Hi, On Thu, Dec 4, 2014 at 2:59 AM, Ashic Mahtab wrote: > > I've been doing this with foreachPartition (i.e. have the parameters for > creating the singleton outside the loop, do a foreachPartition, create the > instance, loop over entries in the partition, close the partition), but > it's quite

Re: textFileStream() issue?

2014-12-03 Thread Tobias Pfeiffer
Hi, On Wed, Dec 3, 2014 at 5:31 PM, Bahubali Jain wrote: > > I am trying to use textFileStream("some_hdfs_location") to pick new files > from a HDFS location.I am seeing a pretty strange behavior though. > textFileStream() is not detecting new files when I "move" them from a > location with in hd

Re: Spark SQL UDF returning a list?

2014-12-03 Thread Tobias Pfeiffer
Hi, On Wed, Dec 3, 2014 at 4:31 PM, Jerry Raj wrote: > > Exception in thread "main" java.lang.RuntimeException: [1.57] failure: > ``('' expected but identifier myudf found > > I also tried returning a List of Ints, that did not work either. Is there > a way to write a UDF that returns a list? >

Does count() evaluate all mapped functions?

2014-12-03 Thread Tobias Pfeiffer
Hi, I have an RDD and a function that should be called on every item in this RDD once (say it updates an external database). So far, I used rdd.map(myFunction).count() or rdd.mapPartitions(iter => iter.map(myFunction)) but I am wondering if this always triggers the call of myFunction in both c
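If the goal is only to run the side effect once per element, foreach is the most direct action (sketch; myFunction is a made-up stand-in, and this sidesteps the question of whether count() alone forces the mapped function):

    def myFunction(i: Int): Unit = println("updating external system for " + i)
    val rdd = sc.parallelize(1 to 100)
    rdd.foreach(myFunction)   // an action: runs myFunction for every element (re-runs possible on task retry)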

Re: Passing Java Options to Spark AM launching

2014-12-01 Thread Tobias Pfeiffer
Hi, have a look at the documentation for spark.driver.extraJavaOptions (which seems to have disappeared since I looked it up last week) and spark.executor.extraJavaOptions at < http://spark.apache.org/docs/latest/configuration.html#runtime-environment>. Tobias

Re: kafka pipeline exactly once semantics

2014-11-30 Thread Tobias Pfeiffer
Josh, On Sun, Nov 30, 2014 at 10:17 PM, Josh J wrote: > > I would like to setup a Kafka pipeline whereby I write my data to a single > topic 1, then I continue to process using spark streaming and write the > transformed results to topic2, and finally I read the results from topic 2. > Not reall

Re: Determine number of running executors

2014-11-25 Thread Tobias Pfeiffer
Hi, Thanks for your help! Sandy, I had a bit of trouble finding the spark.executor.cores property. (It wasn't there although its value should have been 2.) I ended up throwing regular expressions on scala.util.Properties.propOrElse("sun.java.command", ""), which worked surprisingly well ;-) Than

Re: Setup Remote HDFS for Spark

2014-11-24 Thread Tobias Pfeiffer
Hi, On Sat, Nov 22, 2014 at 12:13 AM, EH wrote: > Unfortunately whether it is possible to have both Spark and HDFS running on > the same machine is not under our control. :( Right now we have Spark and > HDFS running in different machines. In this case, is it still possible to > hook up a rem

Re: Is spark streaming +MlLib for online learning?

2014-11-24 Thread Tobias Pfeiffer
Hi, On Tue, Nov 25, 2014 at 9:40 AM, Joanne Contact wrote: > > I seemed to read somewhere that spark is still batch learning, but spark > streaming could allow online learning. > Spark doesn't do Machine Learning itself, but MLlib does. MLlib currently can do online learning only for linear regr

Determine number of running executors

2014-11-20 Thread Tobias Pfeiffer
Hi, when running on YARN, is there a way for the Spark driver to know how many executors, cores per executor etc. there are? I want to know this so I can repartition to a good number. Thanks Tobias

spark-submit and logging

2014-11-20 Thread Tobias Pfeiffer
Hi, I am using spark-submit to submit my application jar to a YARN cluster. I want to deliver a single jar file to my users, so I would like to avoid to tell them "also, please put that log4j.xml file somewhere and add that path to the spark-submit command". I thought it would be sufficient that

Re: Cannot access data after a join (error: value _1 is not a member of Product with Serializable)

2014-11-19 Thread Tobias Pfeiffer
Hi, it looks what you are trying to use as a Tuple cannot be inferred to be a Tuple from the compiler. Try to add type declarations and maybe you will see where things fail. Tobias
