sortBy transformation shows as a job

2016-01-05 Thread Soumitra Kumar
Fellows, I have some simple code: sc.parallelize (Array (1, 4, 3, 2), 2).sortBy (i=>i).foreach (println) This results in 2 jobs (sortBy, foreach) in Spark's application master UI. I thought there was a one-to-one relationship between an RDD action and a job. Here, the only action is foreach, so there should be only on
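A minimal sketch reproducing the observation, assuming a local SparkContext: sortBy samples the data to compute range-partition boundaries, and that sampling is itself a job, so the UI shows two jobs even though foreach is the only action.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("sortby-jobs").setMaster("local[2]"))
    sc.parallelize(Array(1, 4, 3, 2), 2)
      .sortBy(i => i)      // job 1: sampling run by the RangePartitioner inside sortBy
      .foreach(println)    // job 2: the actual action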

Re: Use Spark Streaming for Batch?

2015-02-22 Thread Soumitra Kumar
See if https://issues.apache.org/jira/browse/SPARK-3660 helps you. My patch has been accepted, and this enhancement is scheduled for 1.3.0. It lets you specify an initialRDD for the updateStateByKey operation. Let me know if you need any information. On Sun, Feb 22, 2015 at 5:21 PM, Tobias Pfeiffer w
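A hedged sketch of the initialRDD overload that SPARK-3660 adds in 1.3.0; the ssc and words values are assumed to exist in the surrounding application.

    import org.apache.spark.HashPartitioner

    // Seed state carried over from a previous run (assumed source).
    val initialRDD = ssc.sparkContext.parallelize(Seq(("a", 1), ("b", 2)))

    val updateFunc = (values: Seq[Int], state: Option[Int]) =>
      Some(values.sum + state.getOrElse(0))

    val counts = words.map((_, 1)).updateStateByKey[Int](
      updateFunc,
      new HashPartitioner(ssc.sparkContext.defaultParallelism),
      initialRDD)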

Re: Error reporting/collecting for users

2015-01-27 Thread Soumitra Kumar
It is a streaming application, so how/when do you plan to access the accumulator on the driver? On Tue, Jan 27, 2015 at 6:48 PM, Tobias Pfeiffer wrote: > Hi, > > thanks for your mail! > > On Wed, Jan 28, 2015 at 11:44 AM, Tathagata Das < > tathagata.das1...@gmail.com> wrote: > >> That seems reasonab
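One hedged answer to that question (a sketch, not taken from the thread): read the accumulator inside foreachRDD, after the batch's work has run, so the value is observed on the driver once per batch. The errorCount name and isValid check are assumptions for illustration.

    val errorCount = ssc.sparkContext.accumulator(0L, "record errors")

    stream.foreachRDD { rdd =>
      rdd.foreach { record =>
        if (!isValid(record)) errorCount += 1L         // runs on the executors
      }
      println(s"errors so far: ${errorCount.value}")   // runs on the driver, once per batch
    }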

Re: Stop streaming context gracefully when SIGTERM is passed

2014-12-15 Thread Soumitra Kumar
Hi Adam, I have the following Scala actor-based code to do a graceful shutdown: class TimerActor (val timeout : Long, val who : Actor) extends Actor { def act { reactWithin (timeout) { case TIMEOUT => who ! SHUTDOWN } } } class SSCReactor (val ssc : StreamingContext
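A simpler alternative sketch (not the actor approach from the post): register a JVM shutdown hook so a SIGTERM stops the StreamingContext gracefully; ssc is assumed to be the running StreamingContext.

    sys.addShutdownHook {
      ssc.stop(stopSparkContext = true, stopGracefully = true)
    }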

Re: Question about textFileStream

2014-11-10 Thread Soumitra Kumar
Entire file in a window. On Mon, Nov 10, 2014 at 9:20 AM, Saiph Kappa wrote: > Hi, > > In my application I am doing something like this "new > StreamingContext(sparkConf, Seconds(10)).textFileStream("logs/")", and I > get some unknown exceptions when I copy a file with about 800 MB to that > fol

Print dependency graph as DOT file

2014-10-16 Thread Soumitra Kumar
Hello, Is there a way to print the dependency graph of a complete program, or of an RDD/DStream, as a DOT file? It would be very helpful to have such a thing. Thanks, -Soumitra.
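Not a DOT file, but a hedged pointer: RDD.toDebugString prints the dependency lineage of an RDD as indented text, which covers part of this need.

    val rdd = sc.parallelize(1 to 10).map(_ * 2).filter(_ > 5)
    println(rdd.toDebugString)   // textual lineage of the RDD, not DOT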

How to name a DStream

2014-10-16 Thread Soumitra Kumar
Hello, I am debugging my code to find out what else to cache. The following is a line from the log: 14/10/16 12:00:01 INFO TransformedDStream: Persisting RDD 6 for time 141348600 ms to StorageLevel(true, true, false, false, 1) at time 141348600 ms Is there a way to name a DStream? RDD has a nam
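RDD has setName; as far as I know a DStream does not expose a name setter in this era, but as a hedged workaround you can name the RDDs a DStream produces inside a transform, and that name then shows up in the storage UI.

    val named = dstream.transform { rdd => rdd.setName("my-stream-batch") }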

Re: How to add HBase dependencies and conf with spark-submit?

2014-10-16 Thread Soumitra Kumar
Great, it worked. I don't have an answer as to what is special about SPARK_CLASSPATH vs --jars; I just found the working setting through trial and error. - Original Message - From: "Fengyun RAO" To: "Soumitra Kumar" Cc: user@spark.apache.org, u...@hbase.apache.org S

Re: How to add HBase dependencies and conf with spark-submit?

2014-10-15 Thread Soumitra Kumar
I am writing to HBase; the following are my options: export SPARK_CLASSPATH=/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar spark-submit \ --jars /opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar,/opt/cloudera/parcels/CDH/lib/hbase/hbase-common.jar,/opt/cloudera/parcels/CDH/lib/hbase

Re: Kafka->HDFS to store as Parquet format

2014-10-07 Thread Soumitra Kumar
to new schema. - Original Message - From: "Buntu Dev" To: "Soumitra Kumar" Cc: u...@spark.incubator.apache.org Sent: Tuesday, October 7, 2014 10:18:16 AM Subject: Re: Kafka->HDFS to store as Parquet format Thanks for the info Soumitra.. its a good start for me.

Re: Kafka->HDFS to store as Parquet format

2014-10-07 Thread Soumitra Kumar
I have used it to write Parquet files as: val job = new Job val conf = job.getConfiguration conf.set (ParquetOutputFormat.COMPRESSION, CompressionCodecName.SNAPPY.name ()) ExampleOutputFormat.setSchema (job, MessageTypeParser.parseMessageType (parquetSchema)) rdd saveAsNewAPIHadoopFile (rddToFile
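A hedged completion of the pattern above, using parquet-mr's example API; rddToFile is assumed to be an RDD[(Void, Group)] and the output path is made up.

    import org.apache.hadoop.mapreduce.Job
    import parquet.example.data.Group
    import parquet.hadoop.ParquetOutputFormat
    import parquet.hadoop.example.ExampleOutputFormat
    import parquet.hadoop.metadata.CompressionCodecName
    import parquet.schema.MessageTypeParser

    val job = new Job
    val conf = job.getConfiguration
    conf.set(ParquetOutputFormat.COMPRESSION, CompressionCodecName.SNAPPY.name())
    ExampleOutputFormat.setSchema(job, MessageTypeParser.parseMessageType(parquetSchema))

    rddToFile.saveAsNewAPIHadoopFile(
      "hdfs:///tmp/out.parquet",      // assumed output path
      classOf[Void],
      classOf[Group],
      classOf[ExampleOutputFormat],
      conf)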

Re: How to initialize updateStateByKey operation

2014-09-23 Thread Soumitra Kumar
I thought I did a good job ;-) OK, so what is the best way to initialize the updateStateByKey operation? I have counts from a previous spark-submit run and want to load them in the next spark-submit job. - Original Message - From: "Soumitra Kumar" To: "spark users" Sent:

How to initialize updateStateByKey operation

2014-09-21 Thread Soumitra Kumar
I started with StatefulNetworkWordCount to keep a running count of the words seen. I have a file 'stored.count' which contains the word counts. $ cat stored.count a 1 b 2 I want to initialize StatefulNetworkWordCount with the values in the 'stored.count' file; how do I do that? I looked at the paper '
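A hedged sketch of the loading step, assuming stored.count holds whitespace-separated word/count pairs as shown above; what to do with the resulting RDD (union, join, or the 1.3.0 initialRDD parameter) is the open question in this thread.

    val initialCounts = ssc.sparkContext.textFile("stored.count").map { line =>
      val Array(word, count) = line.split("\\s+")
      (word, count.toInt)
    }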

Re: Bulk-load to HBase

2014-09-19 Thread Soumitra Kumar
I successfully did this once. Map the RDD to RDD[(ImmutableBytesWritable, KeyValue)], then: val conf = HBaseConfiguration.create() val job = new Job (conf, "CEF2HFile") job.setMapOutputKeyClass (classOf[ImmutableBytesWritable]); job.setMapOutputValueClass (classOf[KeyValue]); val table = new HTable(con
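A hedged end-to-end sketch of that bulk-load path; the table name, output path, and kvRdd (an RDD[(ImmutableBytesWritable, KeyValue)] already sorted by row key, as HFiles require) are assumptions.

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.KeyValue
    import org.apache.hadoop.hbase.client.HTable
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat
    import org.apache.hadoop.mapreduce.Job

    val conf = HBaseConfiguration.create()
    val job = new Job(conf, "CEF2HFile")
    job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
    job.setMapOutputValueClass(classOf[KeyValue])
    val table = new HTable(conf, "mytable")                   // assumed table name
    HFileOutputFormat.configureIncrementalLoad(job, table)    // wires up partitioning for HFiles

    kvRdd.saveAsNewAPIHadoopFile(
      "/tmp/hfiles",                                          // assumed output path
      classOf[ImmutableBytesWritable],
      classOf[KeyValue],
      classOf[HFileOutputFormat],
      job.getConfiguration)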

Re: Spark Streaming and ReactiveMongo

2014-09-19 Thread Soumitra Kumar
onStart should be non-blocking. You may try to create a thread in onStart instead. - Original Message - From: "t1ny" To: u...@spark.incubator.apache.org Sent: Friday, September 19, 2014 1:26:42 AM Subject: Re: Spark Streaming and ReactiveMongo Here's what we've tried so far as a first e
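A minimal receiver sketch illustrating that advice, following the custom receiver pattern; the source being polled is left abstract.

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class MongoLikeReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
      def onStart(): Unit = {
        // onStart returns immediately; the blocking loop lives in its own thread
        new Thread("receiver-loop") {
          override def run(): Unit = receive()
        }.start()
      }
      def onStop(): Unit = { /* the loop below exits once isStopped() flips */ }

      private def receive(): Unit = {
        while (!isStopped()) {
          store("record")   // replace with records read from the actual source
        }
      }
    }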

Re: Stable spark streaming app

2014-09-18 Thread Soumitra Kumar
I had a ton of "too many files open" errors :) >> - Use immutable objects as far as possible. If I use mutable objects >> within a method/class then I turn them into immutable before passing >> onto another class/method. >> - For logging, create a LogService o

Re: Stable spark streaming app

2014-09-17 Thread Soumitra Kumar
logic. Currently, my processing delay is lower than my dStream time window so all is good. I get a ton of these errors in my driver logs: 14/09/16 21:17:40 ERROR LiveListenerBus: Listener JobProgressListener threw an exception These seem related to: https://issues.apache.org/jira/browse/SPARK-2316 Be

Re: Stable spark streaming app

2014-09-17 Thread Soumitra Kumar
Hmm, no response to this thread! Adding to it, please share your experiences of building an enterprise-grade product based on Spark Streaming. I am exploring Spark Streaming for enterprise software and am cautiously optimistic about it. I see huge potential to improve the debuggability of Spark. -

Re: How to initialize StateDStream

2014-09-13 Thread Soumitra Kumar
Thanks for the pointers. I meant a previous run of spark-submit. For 1: this would be a bit more computation in every batch. For 2: it's a good idea, but it may be inefficient to retrieve each value. In general, for a generic state machine the initialization and input sequence is critical for correctne

Re: How to initialize StateDStream

2014-09-13 Thread Soumitra Kumar
I had looked at that. If I have a set of saved word counts from a previous run and want to load them in the next run, what is the best way to do it? I am thinking of hacking the Spark code to have an initial RDD in StateDStream, and use that for the first batch. On Fri, Sep 12, 2014 at 11:04 PM

How to initialize StateDStream

2014-09-12 Thread Soumitra Kumar
Hello, How do I initialize StateDStream used in updateStateByKey? -Soumitra.

Re: Spark Streaming and database access (e.g. MySQL)

2014-09-07 Thread Soumitra Kumar
I have the following code: stream foreachRDD { rdd => if (rdd.take (1).size == 1) { rdd foreachPartition { iterator => initDbConnection () iterator foreach { write to db
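A hedged reconstruction of that pattern as runnable code: one connection per partition, opened and closed inside foreachPartition on the executors; initDbConnection and writeToDb are assumed helper functions.

    stream.foreachRDD { rdd =>
      if (rdd.take(1).nonEmpty) {                  // skip empty batches
        rdd.foreachPartition { iterator =>
          val conn = initDbConnection()            // one connection per partition
          iterator.foreach(record => writeToDb(conn, record))
          conn.close()
        }
      }
    }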

Re: Spark Streaming: DStream - zipWithIndex

2014-08-28 Thread Soumitra Kumar
Yes, that is an option. I started with a function of batch time and index to generate the id as a long. This may be faster than generating a UUID, with the added benefit of sorting based on time. - Original Message - From: "Tathagata Das" To: "Soumitra Kumar" Cc: &q
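A hedged sketch of that batch-time-plus-index scheme; the multiplier just has to exceed the number of records in any single batch, and stream is the assumed input DStream.

    val withIds = stream.transform { (rdd, time) =>
      rdd.zipWithIndex.map { case (record, idx) =>
        (time.milliseconds * 1000000L + idx, record)   // unique across batches, time-ordered
      }
    }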

Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Soumitra Kumar
would contain more than 1 billion > records. Then you can use > > rdd.zipWithUniqueId().mapValues(uid => rdd.id * 1e9.toLong + uid) > > Just a hack .. > > On Wed, Aug 27, 2014 at 2:59 PM, Soumitra Kumar > wrote: > > So, I guess zipWithUniqueId will be similar. >

Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Soumitra Kumar
Then you can use > > rdd.zipWithUniqueId().mapValues(uid => rdd.id * 1e9.toLong + uid) > > Just a hack .. > > On Wed, Aug 27, 2014 at 2:59 PM, Soumitra Kumar > wrote: > > So, I guess zipWithUniqueId will be similar. > > > > Is there a way to get unique

Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Soumitra Kumar
So, I guess zipWithUniqueId will be similar. Is there a way to get a unique index? On Wed, Aug 27, 2014 at 2:39 PM, Xiangrui Meng wrote: > No. The indices start at 0 for every RDD. -Xiangrui > > On Wed, Aug 27, 2014 at 2:37 PM, Soumitra Kumar > wrote: > > Hell

Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Soumitra Kumar
Hello, If I do DStream transform { rdd.zipWithIndex.map { ... } }, is the index (inside the map) guaranteed to be unique across all RDDs here? Thanks, -Soumitra.

Re: Shared variable in Spark Streaming

2014-08-08 Thread Soumitra Kumar
.first() } ) > > This globalCount variable will reside in the driver and will keep being > updated after every batch. > > TD > > > On Thu, Aug 7, 2014 at 10:16 PM, Soumitra Kumar > wrote: > >> Hello, >> >> I want to count the number of elements in
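A hedged sketch of the running count that reply describes: a plain var on the driver, incremented with rdd.count() once per batch inside foreachRDD.

    var globalCount = 0L
    stream.foreachRDD { rdd =>
      globalCount += rdd.count()    // rdd.count() is an action, executed every batch
      println(s"elements so far: $globalCount")
    }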

Shared variable in Spark Streaming

2014-08-07 Thread Soumitra Kumar
Hello, I want to count the number of elements in the DStream, like RDD.count(). Since there is no such method in DStream, I thought of implementing DStream.count using an accumulator. How do I do DStream.count() to count the number of elements in a DStream? How do I create a shared variable in Spar

Re: HBase row count

2014-02-25 Thread Soumitra Kumar
If you will only be doing one pass through the data anyway (like running a > count every time on the full dataset) then caching is not going to help you. > > > On Tue, Feb 25, 2014 at 4:59 PM, Soumitra Kumar > wrote: > >> Thanks Nick. >> >> How do I figure out

Re: HBase row count

2014-02-25 Thread Soumitra Kumar
take the same amount of time whether cache is enabled or not. > The second time you call count on a cached RDD, you should see that it > takes a lot less time (assuming that the data fit in memory). > > > On Tue, Feb 25, 2014 at 9:38 AM, Soumitra Kumar > wrote: > >> I
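A tiny sketch of the behaviour described in this reply; hbaseRdd is the assumed HBase-backed RDD.

    hbaseRdd.cache()
    hbaseRdd.count()   // first pass: scans HBase and populates the cache
    hbaseRdd.count()   // second pass: served from memory, assuming the data fits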

Re: HBase row count

2014-02-24 Thread Soumitra Kumar
> In your main method after doing an action (e.g. count in your case), call val > totalCount = count.value. > > > > > On Tue, Feb 25, 2014 at 9:15 AM, Soumitra Kumar > wrote: > >> I have a code which reads an HBase table, and counts number of rows >> contain
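A hedged sketch of the pattern in this reply: increment an accumulator inside the computation, run an action, and only then read .value on the driver; the column family/qualifier names and hbaseRdd are assumptions.

    import org.apache.hadoop.hbase.util.Bytes

    val cf = Bytes.toBytes("cf")
    val qual = Bytes.toBytes("field")
    val count = sc.accumulator(0L)

    hbaseRdd.foreach { case (_, result) =>
      if (result.containsColumn(cf, qual)) count += 1L   // runs on the executors
    }
    val totalCount = count.value                         // read on the driver after the action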

HBase row count

2014-02-24 Thread Soumitra Kumar
I have some code which reads an HBase table and counts the number of rows containing a field. def readFields(rdd : RDD[(ImmutableBytesWritable, Result)]) : RDD[List[Array[Byte]]] = { return rdd.flatMap(kv => { // Set of interesting keys for this use case val keys = Li
