We've evaluated Spark Streaming vs. Storm and ended up sticking with Storm.
Some of the important drawbacks are:
Spark has no back pressure (the receiver rate limit can alleviate this to a
certain point, but it's far from ideal).
There are also no exactly-once semantics (updateStateByKey can achieve this,
but only when the state update is idempotent with respect to the
incoming event stream).
>
> Also, what do you mean by "No Back Pressure"?
>
> On Wednesday, 17 June 2015 11:57 AM, Enno Shioji wrote:
>
> We've evaluated Spark Streaming vs. Storm and ended up sticking with Storm.
>
> Some of the important drawbacks are:
How do I maintain a shared state ( total amount, total min, total data
> etc. ) so that I know how much I have accumulated at any given point, as events
> for the same phone can go to any node / executor?
>
> Can someone please tell me how I can achieve this in Spark? In Storm I
> can have a bolt which can do this.
On Wed, Jun 17, 2015 at 1:50 PM, Ashish Soni wrote:
> As per my best understanding, Spark Streaming offers exactly-once processing.
> Is this achieved only through updateStateByKey, or is there another way to
> do the same?
>
> Ashish
>
> On Wed, Jun 17, 2015 at 8:48 AM, Enno Shioji wrote:
00 PM, Enno Shioji wrote:
> So Spark (not streaming) does offer exactly once. Spark Streaming however,
> can only do exactly once semantics *if the update operation is idempotent*.
> updateStateByKey's update operation is idempotent, because it completely
> replaces the previous state.
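For illustration, a rough sketch of the per-phone running total asked about above, using updateStateByKey (the stream name and tuple layout are invented for the example):

import org.apache.spark.streaming.StreamingContext._ // pair operations on older Spark versions
import org.apache.spark.streaming.dstream.DStream

// usage: a hypothetical stream of (phoneNumber, minutesUsedInThisBatch) pairs.
// The new state for a batch is derived purely from the previous state plus that
// batch's data, so recomputing a batch after a failure yields the same state.
def runningTotals(usage: DStream[(String, Long)]): DStream[(String, Long)] =
  usage.updateStateByKey[Long] { (newMinutes: Seq[Long], previous: Option[Long]) =>
    Some(previous.getOrElse(0L) + newMinutes.sum)
  }

// Note: updateStateByKey requires a checkpoint directory (ssc.checkpoint(...)).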
free but
>> above it is $1 a min.
>>
>> How do I maintain a shared state ( total amount, total min, total data
>> etc. ) so that I know how much I have accumulated at any given point, as events
>> for the same phone can go to any node / executor?
>>
>> Can someone please tell me how I can achieve this in Spark?
at 2:09 PM, Ashish Soni wrote:
> A stream can also be processed in micro-batches, which is the main
> reason behind Spark Streaming, so what is the difference?
>
> Ashish
>
> On Wed, Jun 17, 2015 at 9:04 AM, Enno Shioji wrote:
>
>> PS just to elaborate on my first
on EMR or storm for fault
> tolerance and load balancing. Is it a correct approach?
> On 17 Jun 2015 23:07, "Enno Shioji" wrote:
>
>> Hi Ayan,
>>
>> Admittedly I haven't done much with Kinesis, but if I'm not mistaken you
>> should be able to use
ctly-once/#7
>
>
> https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
>
>
>
> On Wed, Jun 17, 2015 at 10:33 AM, Enno Shioji wrote:
>
>> AFAIK KCL is *supposed* to provide fault tolerance and load balancing
>> (plus additi
you'd have to make
> sure that updates to your own internal state (e.g. reduceByKeyAndWindow)
> are exactly-once too.
>
> Matei
>
>
> On Jun 17, 2015, at 8:26 AM, Enno Shioji wrote:
>
> The thing is, even with that improvement, you still have to make updates
nector.
> Before the Spark 1.3.0 release, one Spark worker would get all the streamed
> messages. We had to re-partition to distribute the processing.
>
> From the Spark 1.3.0 release, the Spark Direct API for Kafka supports parallel
> reads from Kafka, streamed to Spark workers.
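For reference, a rough sketch of the direct approach as of Spark 1.3 (broker list and topic name are placeholders):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("kafka-direct"), Seconds(5))

// Each partition of the resulting RDDs maps 1:1 to a Kafka partition, so reads
// are parallel across workers without an extra re-partition step.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))

stream.map { case (_, value) => value }.print()
ssc.start()
ssc.awaitTermination()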
Storm / Trident guide; even they call this exact
> conditional updating of Trident State a "transactional" operation. See
> "transactional spout" in the Trident State guide -
> https://storm.apache.org/documentation/Trident-state
>
> In the end, I am totally op
You are probably listening to the sample stream, and THEN filtering. This
means you listen to 1% of the Twitter stream and then look for the
tweet by Bloomberg, so there is a very good chance you don't see the
particular tweet.
In order to get all Bloomberg-related tweets, you must connect to the filter
stream instead of the sample stream.
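Roughly, with the spark-streaming-twitter module that would look like this (the keyword is just an example; twitter4j OAuth credentials are assumed to be set via system properties):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

val ssc = new StreamingContext(new SparkConf().setAppName("twitter-filter"), Seconds(2))

// Passing keywords makes the receiver use the filter stream rather than the 1%
// sample stream, so matching tweets are pushed to you server-side.
val tweets = TwitterUtils.createStream(ssc, None, Seq("Bloomberg"))
tweets.map(_.getText).print()

ssc.start()
ssc.awaitTermination()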
On Jul 23, 2015, at 4:17 PM, Enno Shioji wrote:
>
> You are probably listening to the sample stream, and THEN filtering.
> This means you listen to 1% of the Twitter stream and then look for the
> tweet by Bloomberg, so there is a very good chance you don't see the
> particular tweet.
If you start parallel Twitter streams, you will be in breach of their TOS.
They allow a small number of parallel streams in practice, but if you do it
on a massive scale they'll ban you (I'm speaking from experience ;) ).
If you really need that level of data, you need to talk to a company called
Gnip.
If you want to do it through streaming API you have to pay Gnip; it's not free.
You can go through non-streaming Twitter API and convert it to stream yourself
though.
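One possible sketch of that "convert it to a stream yourself" approach, polling the REST search API from a custom receiver (class name, keyword and polling interval are all invented for illustration):

import scala.collection.JavaConverters._
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver
import twitter4j.{Query, Status, TwitterFactory}

// Polls the non-streaming search API and replays the results as a DStream.
class SearchPollingReceiver(keyword: String)
    extends Receiver[Status](StorageLevel.MEMORY_ONLY) {

  def onStart(): Unit = {
    new Thread("twitter-search-poller") {
      override def run(): Unit = poll()
    }.start()
  }

  def onStop(): Unit = {} // the polling loop checks isStopped() and exits

  private def poll(): Unit = {
    val twitter = TwitterFactory.getSingleton // credentials from twitter4j.properties
    var sinceId = 1L
    while (!isStopped()) {
      val query = new Query(keyword)
      query.setSinceId(sinceId)
      twitter.search(query).getTweets.asScala.foreach { status =>
        store(status)
        sinceId = math.max(sinceId, status.getId)
      }
      Thread.sleep(15000) // stay well inside the REST rate limits
    }
  }
}

// Usage: val tweets = ssc.receiverStream(new SearchPollingReceiver("Bloomberg"))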
> On 4 Aug 2015, at 09:29, Sadaf wrote:
>
> Hi
> Is there any way to get all old tweets since the account was created?
>
Hi,
Suppose I have a job that uses some native libraries. I can launch
executors using a Docker container and everything is fine.
Now suppose I have some other job that uses some other native libraries
(and let's assume they just can't co-exist in the same docker image), but I
want to execute tho
> 1
You have 4 CPU cores and 34 threads (system-wide; you likely have many more,
by the way).
Think of it as having 4 espresso machines and 34 baristas. Does the fact
that you have only 4 espresso machines mean you can only have 4 baristas? Of
course not, there's plenty more work other than making espresso.
ake it outside of my map
> operation then it throws Serializable Exceptions (Caused by:
> java.io.NotSerializableException:
> com.fasterxml.jackson.module.scala.modifiers.SetTypeModifier).
>
> Thanks
> Best Regards
>
> On Sat, Feb 14, 2015 at 7:11 PM, Enno Shioji wrote:
>> object Holder extends Serializable {
>>   @transient lazy val mapper = new ObjectMapper() with ScalaObjectMapper
>>   mapper.registerModule(DefaultScalaModule)
>> }
>>
>> val jsonStream = myDStream.map(x => {
>>   Holder
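Since the snippet above is cut off, here is a self-contained version of the same pattern with the imports filled in (the readValue target type, and myDStream being a DStream[String], are assumptions for the example; the ScalaObjectMapper package differs between jackson-module-scala versions):

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper

// The mapper lives in a singleton and is @transient lazy, so each executor builds
// its own instance instead of trying to serialize one from the driver.
object Holder extends Serializable {
  @transient lazy val mapper: ObjectMapper with ScalaObjectMapper = {
    val m = new ObjectMapper() with ScalaObjectMapper
    m.registerModule(DefaultScalaModule)
    m
  }
}

val jsonStream = myDStream.map { x =>
  Holder.mapper.readValue[Map[String, Any]](x) // example target type only
}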
t way to do this in Spark. I know if I use Spark SQL with SchemaRDD
> and all it will be much faster, but I need that in Spark Streaming.
>
> Thanks
> Best Regards
>
> On Sat, Feb 14, 2015 at 8:04 PM, Enno Shioji wrote:
>
>> I see. I'd really benchmark how the parsing performs
My employer (adform.com) would like to use the Spark logo in a recruitment
event (to indicate that we are using Spark in our company). I looked in the
Spark repo (https://github.com/apache/spark/tree/master/docs/img) but
couldn't find a vector format.
Is a higher-res or vector format version available?
Is anybody experiencing this? It looks like a bug in JetS3t to me, but
thought I'd sanity check before filing an issue.
I'm writing to S3 using ReceiverInputDStream#saveAsTextFiles with an S3 URL
("s3://fake-test/1234").
The code does write to S3, but with double forward slashes
I'm wondering, also, why jets3tfilesystem is the AbstractFileSystem used by
> so many - is that the standard impl for storing using AbstractFileSystem
> interface?
>
> On Dec 23, 2014, at 6:06 AM, Enno Shioji wrote:
>
> Is anybody experiencing this? It looks like a bug in JetS3t
I filed a new issue HADOOP-11444. According to HADOOP-10372, s3 is likely
to be deprecated anyway in favor of s3n.
Also the comment section notes that Amazon has implemented an EmrFileSystem
for S3 which is built using AWS SDK rather than JetS3t.
On Tue, Dec 23, 2014 at 2:06 PM, Enno Shioji wrote:
Hi, I'm facing a weird issue. Any help appreciated.
When I execute the below code and compare "input" and "output", each record
in the output has some extra trailing data appended to it, and hence
corrupted. I'm just reading and writing, so the input and output should be
exactly the same.
I'm usi
This poor soul had the exact same problem and solution:
http://stackoverflow.com/questions/24083332/write-and-read-raw-byte-arrays-in-spark-using-sequence-file-sequencefile
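The gist of that fix, in case the link goes stale (the rdd/sc names and paths are placeholders; this assumes the records are written as NullWritable/BytesWritable pairs):

import org.apache.hadoop.io.{BytesWritable, NullWritable}

// Writing: wrap each byte array (rdd here is an assumed RDD[Array[Byte]]).
rdd.map(bytes => (NullWritable.get, new BytesWritable(bytes)))
  .saveAsSequenceFile("s3://bucket/raw-bytes")

// Reading: BytesWritable's backing array is padded and reused by Hadoop, so copy
// only getLength bytes (or use copyBytes()); otherwise you get trailing garbage.
val restored = sc
  .sequenceFile("s3://bucket/raw-bytes", classOf[NullWritable], classOf[BytesWritable])
  .map { case (_, bw) => java.util.Arrays.copyOfRange(bw.getBytes, 0, bw.getLength) }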
On Tue, Dec 30, 2014 at 10:58 AM, Enno Shioji wrote:
> Hi, I'm facing a weird issue. Any help appreciated.
Also the job was deployed from the master machine in the cluster.
On Wed, Dec 31, 2014 at 6:35 PM, Enno Shioji wrote:
> Oh sorry, that was an edit mistake. The code is essentially:
>
> val msgStream = kafkaStream
>   .map { case (k, v) => v }
iver? and are you running this from a machine that's not in
> your Spark cluster? Then in client mode you're shipping data back to a
> less-nearby machine, compared to with cluster mode. That could explain
> the bottleneck.
>
> On Wed, Dec 31, 2014 at 4:12 PM, Enno Shioji w
e number of
> executors/cores requested?
>
> TD
>
> On Wed, Dec 31, 2014 at 10:36 AM, Enno Shioji wrote:
>
>> Also the job was deployed from the master machine in the cluster.
>>
>> On Wed, Dec 31, 2014 at 6:35 PM, Enno Shioji wrote:
>>
>>
I have a hack to gather custom application metrics in a Streaming job, but
I wanted to know if there is any better way of doing this.
My hack consists of this singleton:
object Metriker extends Serializable {
  @transient lazy val mr: MetricRegistry = {
    val metricRegistry = new MetricRegistry
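A hedged guess at how a registry-plus-reporter singleton like the one above could be completed, using the Dropwizard metrics API (the reporter choice and metric names are arbitrary):

import java.util.concurrent.TimeUnit
import com.codahale.metrics.{MetricRegistry, Slf4jReporter}
import org.slf4j.LoggerFactory

object Metriker extends Serializable {
  @transient lazy val mr: MetricRegistry = {
    val metricRegistry = new MetricRegistry
    // Ship the metrics somewhere; an SLF4J reporter is the simplest stand-in.
    Slf4jReporter.forRegistry(metricRegistry)
      .outputTo(LoggerFactory.getLogger("metrics"))
      .convertRatesTo(TimeUnit.SECONDS)
      .convertDurationsTo(TimeUnit.MILLISECONDS)
      .build()
      .start(30, TimeUnit.SECONDS)
    metricRegistry
  }
}

// Because mr is @transient lazy, each executor gets its own registry, e.g.:
// rdd.foreach { _ => Metriker.mr.meter("events").mark() }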
MetricsSystem to
> build your own and configure it in metrics.properties with the class name so
> it gets loaded by the metrics system; for the details you can refer to
> http://spark.apache.org/docs/latest/monitoring.html or the source code.
>
>
>
> Thanks
>
> Jerry
>
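For reference, the metrics.properties route mentioned above looks roughly like this, here with the built-in GraphiteSink (host, port and prefix are placeholders; a custom sink or source class would be registered by its class name the same way):

*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.example.com
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=myapp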
Hi,
I extended org.apache.spark.streaming.TestSuiteBase for some testing, and I
was able to run this test fine:
test("Sliding window join with 3 second window duration") {
val input1 =
Seq(
Seq("req1"),
Seq("req2", "req3"),
Seq(),
Seq("req4", "req5", "req6"),
.github.com/ibuenros/9b94736c2bad2f4b8e23
On Mon, Jan 5, 2015 at 2:56 PM, Enno Shioji wrote:
> Hi Gerard,
>
> Thanks for the answer! I had a good look at it, but I couldn't figure out
> whether one can use that to emit metrics from your application code.
>
> Suppose I wanted to m
Had the same issue. I can't remember what the issue was but this works:
libraryDependencies ++= {
  val sparkVersion = "1.2.0"
  Seq(
    "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
    "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
    "org.apache.spark"
d just choosing the project
> folder and I get the same result. Not using gen-idea in sbt.
>
>
>
> On Wed, Jan 14, 2015 at 8:52 AM, Jay Vyas wrote:
>
>> I find importing a working SBT project into IntelliJ is the way to
>> go.
>>
>> How did you load th