Fellows,
I have a simple piece of code:
sc.parallelize(Array(1, 4, 3, 2), 2).sortBy(i => i).foreach(println)
This results in 2 jobs (sortBy, foreach) in Spark's application master UI.
I thought there was a one-to-one relationship between RDD actions and jobs. Here,
the only action is foreach, so there should be only one job.
See if https://issues.apache.org/jira/browse/SPARK-3660 helps you. My patch
has been accepted, and this enhancement is scheduled for 1.3.0.
It lets you specify an initial RDD for the updateStateByKey operation. Let me
know if you need any information.
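For reference, a rough sketch of how the new overload can be called once you are
on 1.3.0 (wordCounts, savedCounts, and ssc are placeholder names, not from your code):

import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.dstream.DStream

// Placeholders: wordCounts: DStream[(String, Int)], savedCounts: RDD[(String, Int)]
def updateFunc(newValues: Seq[Int], state: Option[Int]): Option[Int] =
  Some(newValues.sum + state.getOrElse(0))

val stateDStream: DStream[(String, Int)] = wordCounts.updateStateByKey[Int](
  updateFunc _,
  new HashPartitioner(ssc.sparkContext.defaultParallelism),
  savedCounts)  // the initial-state RDD added by SPARK-3660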
On Sun, Feb 22, 2015 at 5:21 PM, Tobias Pfeiffer wrote:
It is a Streaming application, so how/when do you plan to access the
accumulator on the driver?
On Tue, Jan 27, 2015 at 6:48 PM, Tobias Pfeiffer wrote:
> Hi,
>
> thanks for your mail!
>
> On Wed, Jan 28, 2015 at 11:44 AM, Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
>> That seems reasonable
Hi Adam,
I have the following Scala actor-based code to do a graceful shutdown:
import scala.actors.{Actor, TIMEOUT}

// SHUTDOWN is a case object defined elsewhere in my code
class TimerActor(val timeout: Long, val who: Actor) extends Actor {
  def act() {
    reactWithin(timeout) {
      case TIMEOUT => who ! SHUTDOWN
    }
  }
}
class SSCReactor(val ssc: StreamingContext
Entire file in a window.
On Mon, Nov 10, 2014 at 9:20 AM, Saiph Kappa wrote:
> Hi,
>
> In my application I am doing something like this "new
> StreamingContext(sparkConf, Seconds(10)).textFileStream("logs/")", and I
> get some unknown exceptions when I copy a file with about 800 MB to that
> fol
Hello,
Is there a way to print the dependency graph of the complete program, or of an
RDD/DStream, as a DOT file? It would be very helpful to have such a thing.
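The closest thing I have found so far is RDD.toDebugString, which prints the
lineage as indented text rather than DOT, e.g.:

val rdd = sc.parallelize(1 to 10).map(_ * 2).filter(_ > 5)
println(rdd.toDebugString)  // prints the RDD lineage, one parent per line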
Thanks,
-Soumitra.
Hello,
I am debugging my code to find out what else to cache.
The following is a line in the log:
14/10/16 12:00:01 INFO TransformedDStream: Persisting RDD 6 for time
141348600 ms to StorageLevel(true, true, false, false, 1) at time
141348600 ms
Is there a way to name a DStream? RDD has a name.
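For a plain RDD I can do something like this (just a sketch), but I have not
found an equivalent on DStream:

val rdd = sc.parallelize(1 to 100)
rdd.setName("interesting-rdd")  // the name shows up in the web UI's Storage tab
rdd.cache()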
Great, it worked.
I don't have an answer as to what is special about SPARK_CLASSPATH vs --jars; I just
found the working setting through trial and error.
- Original Message -
From: "Fengyun RAO"
To: "Soumitra Kumar"
Cc: user@spark.apache.org, u...@hbase.apache.org
S
I am writing to HBase; the following are my options:
export SPARK_CLASSPATH=/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar
spark-submit \
--jars
/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar,/opt/cloudera/parcels/CDH/lib/hbase/hbase-common.jar,/opt/cloudera/parcels/CDH/lib/hbase
to new schema.
- Original Message -
From: "Buntu Dev"
To: "Soumitra Kumar"
Cc: u...@spark.incubator.apache.org
Sent: Tuesday, October 7, 2014 10:18:16 AM
Subject: Re: Kafka->HDFS to store as Parquet format
Thanks for the info, Soumitra; it's a good start for me.
I have used it to write Parquet files as:
val job = new Job
val conf = job.getConfiguration
conf.set(ParquetOutputFormat.COMPRESSION, CompressionCodecName.SNAPPY.name())
ExampleOutputFormat.setSchema(job, MessageTypeParser.parseMessageType(parquetSchema))
rdd saveAsNewAPIHadoopFile (rddToFile
I thought I did a good job ;-)
OK, so what is the best way to initialize the updateStateByKey operation? I have
counts from a previous spark-submit and want to load them in the next spark-submit
job.
- Original Message -
From: "Soumitra Kumar"
To: "spark users"
Sent:
I started with StatefulNetworkWordCount to have a running count of words seen.
I have a file 'stored.count' which contains the word counts.
$ cat stored.count
a 1
b 2
I want to initialize StatefulNetworkWordCount with the values in the 'stored.count'
file; how do I do that?
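So far I can load the file into (word, count) pairs (a sketch, assuming the
two-column format above and an existing SparkContext sc); the missing piece is
wiring this into the stateful stream:

// Parse lines like "a 1" into (word, count) pairs
val initialCounts = sc.textFile("stored.count").map { line =>
  val Array(word, count) = line.split("\\s+")
  (word, count.toInt)
}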
I looked at the paper '
I successfully did this once.
Map the RDD to RDD[(ImmutableBytesWritable, KeyValue)], then:
import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue}
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.mapreduce.Job

val conf = HBaseConfiguration.create()
val job = new Job(conf, "CEF2HFile")
job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
job.setMapOutputValueClass(classOf[KeyValue])
val table = new HTable(con
onStart should be non-blocking. You may try to create a thread in onStart
instead.
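Something along these lines, a rough sketch of the custom-receiver pattern (the
class name is made up, and receive() is a placeholder for your ReactiveMongo logic):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class MongoReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
  def onStart() {
    // Return immediately; do the blocking work on a separate thread
    new Thread("MongoReceiver") {
      override def run() { receive() }
    }.start()
  }

  def onStop() {
    // Nothing to do here; the thread checks isStopped() and exits on its own
  }

  private def receive() {
    while (!isStopped()) {
      // placeholder: fetch data and call store(...)
    }
  }
}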
- Original Message -
From: "t1ny"
To: u...@spark.incubator.apache.org
Sent: Friday, September 19, 2014 1:26:42 AM
Subject: Re: Spark Streaming and ReactiveMongo
Here's what we've tried so far as a first e
I had a ton of "too many files open" errors :)
>> - Use immutable objects as far as possible. If I use mutable objects
>> within a method/class then I turn them into immutable before passing
>> onto another class/method.
>> - For logging, create a LogService o
logic.
Currently, my processing delay is lower than my DStream time window, so all is
good on that front, but I get a ton of these errors in my driver logs:
14/09/16 21:17:40 ERROR LiveListenerBus: Listener JobProgressListener
threw an exception
These seem related to: https://issues.apache.org/jira/browse/SPARK-2316
Be
Hmm, no response to this thread!
Adding to it: please share your experiences of building an enterprise-grade product
based on Spark Streaming.
I am exploring Spark Streaming for enterprise software and am cautiously
optimistic about it. I see huge potential to improve the debuggability of Spark.
Thanks for the pointers. I meant a previous run of spark-submit.
For 1: this would add a bit more computation in every batch.
For 2: it's a good idea, but it may be inefficient to retrieve each value.
In general, for a generic state machine the initialization and input
sequence are critical for correctness
I had looked at that.
If I have a set of saved word counts from a previous run, and want to load
them in the next run, what is the best way to do it?
I am thinking of hacking the Spark code to have an initial RDD in
StateDStream, and using that for the first time.
On Fri, Sep 12, 2014 at 11:04 PM
Hello,
How do I initialize the StateDStream used in updateStateByKey?
-Soumitra.
I have the following code:
stream foreachRDD { rdd =>
  if (rdd.take(1).size == 1) {
    rdd foreachPartition { iterator =>
      initDbConnection()  // open one connection per partition
      iterator foreach { record =>
        // write record to db
      }
    }
  }
}
Yes, that is an option.
I started with a function of batch time and index to generate an id as a Long. This
may be faster than generating a UUID, with the added benefit of sorting by time.
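Roughly what I had in mind (just a sketch; stream is a placeholder DStream, and the
20-bit shift assumes fewer than about a million elements per batch):

stream.transform { (rdd, batchTime) =>
  rdd.zipWithIndex().map { case (value, index) =>
    // Pack the batch time (ms) and the per-batch index into a single Long id
    val id = (batchTime.milliseconds << 20) | index
    (id, value)
  }
}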
- Original Message -
From: "Tathagata Das"
To: "Soumitra Kumar"
Cc: &q
would contain more than 1 billion
> records. Then you can use
>
> rdd.zipWithUniqueId().mapValues(uid => rdd.id * 1e9.toLong + uid)
>
> Just a hack ..
>
> On Wed, Aug 27, 2014 at 2:59 PM, Soumitra Kumar
> wrote:
> > So, I guess zipWithUniqueId will be similar.
>
So, I guess zipWithUniqueId will be similar.
Is there a way to get a unique index?
On Wed, Aug 27, 2014 at 2:39 PM, Xiangrui Meng wrote:
> No. The indices start at 0 for every RDD. -Xiangrui
>
> On Wed, Aug 27, 2014 at 2:37 PM, Soumitra Kumar
> wrote:
> > Hello,
Hello,
If I do:
dstream transform { rdd =>
  rdd.zipWithIndex.map { ... }
}
is the index guaranteed to be unique across all RDDs here?
Thanks,
-Soumitra.
.first() } )
>
> This globalCount variable will reside in the driver and will keep being
> updated after every batch.
>
> TD
>
>
> On Thu, Aug 7, 2014 at 10:16 PM, Soumitra Kumar
> wrote:
>
>> Hello,
>>
>> I want to count the number of elements in
Hello,
I want to count the number of elements in a DStream, like RDD.count().
Since there is no such method on DStream, I thought of doing the count myself
with an accumulator.
How do I count the number of elements in a DStream, and how do I create a
shared variable in Spark?
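Concretely, something like this is what I am after (a sketch; ssc and lines are
placeholders for my StreamingContext and input stream):

// Accumulate per-batch counts into a driver-side accumulator
val totalCount = ssc.sparkContext.accumulator(0L)
lines.foreachRDD { rdd =>
  // rdd.count() is an action and already returns to the driver,
  // so a plain var on the driver would work here as well
  totalCount += rdd.count()
}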
; If you will only be doing one pass through the data anyway (like running a
> count every time on the full dataset) then caching is not going to help you.
>
>
> On Tue, Feb 25, 2014 at 4:59 PM, Soumitra Kumar
> wrote:
>
>> Thanks Nick.
>>
>> How do I figure out
take the same amount of time whether cache is enabled or not.
> The second time you call count on a cached RDD, you should see that it
> takes a lot less time (assuming that the data fit in memory).
>
>
> On Tue, Feb 25, 2014 at 9:38 AM, Soumitra Kumar
> wrote:
>
>> I
> In your main method after doing an action (e.g. count in your case), call val
> totalCount = count.value.
>
>
>
>
> On Tue, Feb 25, 2014 at 9:15 AM, Soumitra Kumar
> wrote:
>
>> I have a code which reads an HBase table, and counts number of rows
>> contain
I have code which reads an HBase table and counts the number of rows
containing a field:
def readFields(rdd: RDD[(ImmutableBytesWritable, Result)]): RDD[List[Array[Byte]]] = {
  rdd.flatMap(kv => {
    // Set of interesting keys for this use case
    val keys = Li