Re: Spark or Storm

2015-06-16 Thread Enno Shioji
We've evaluated Spark Streaming vs. Storm and ended up sticking with Storm. Some of the important drawbacks are: Spark has no back pressure (the receiver rate limit can alleviate this to a certain point, but it's far from ideal). There are also no exactly-once semantics. (updateStateByKey can achieve t
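The receiver rate limit alluded to is Spark's `spark.streaming.receiver.maxRate` setting; a minimal sketch of how it would be set (the app name and the value are illustrative):

```scala
import org.apache.spark.SparkConf

// Cap each receiver at 1000 records/sec (illustrative value) so a slow
// downstream stage is not flooded. This is a static, per-receiver limit,
// not true back pressure -- it throttles even when capacity is available.
val conf = new SparkConf()
  .setAppName("rate-limited-stream")
  .set("spark.streaming.receiver.maxRate", "1000")
```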

Re: Spark or Storm

2015-06-17 Thread Enno Shioji
incoming event stream. > > Also, what do you mean by "No Back Pressure" ? > > > > > > On Wednesday, 17 June 2015 11:57 AM, Enno Shioji > wrote: > > > We've evaluated Spark Streaming vs. Storm and ended up sticking with Storm. > > Some of the important dra

Re: Spark or Storm

2015-06-17 Thread Enno Shioji
(total amount, total min, total data, etc.) so that I know how much I have accumulated at any given point, as events for the same phone can go to any node/executor. Can someone please tell me how I can achieve this in Spark, as in Storm I can have a bolt which can do this?
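The per-phone accumulation asked about here maps to Spark Streaming's updateStateByKey, which maintains per-key state across micro-batches regardless of which executor an event lands on. A minimal sketch; the Totals case class and its fields are illustrative, not from the thread:

```scala
// Illustrative per-key state; field names are assumptions.
case class Totals(amount: Double, minutes: Long)

// Pure update function handed to updateStateByKey: folds one phone's new
// events from the current micro-batch into that phone's running totals.
def updateTotals(newEvents: Seq[Totals], state: Option[Totals]): Option[Totals] = {
  val prev = state.getOrElse(Totals(0.0, 0L))
  Some(newEvents.foldLeft(prev) { (acc, e) =>
    Totals(acc.amount + e.amount, acc.minutes + e.minutes)
  })
}

// In the actual job, with events: DStream[(PhoneId, Totals)], it would be:
//   val running = events.updateStateByKey(updateTotals)
```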

Re: Spark or Storm

2015-06-17 Thread Enno Shioji
On Wed, Jun 17, 2015 at 1:50 PM, Ashish Soni wrote: > As per my best understanding, Spark Streaming offers exactly-once processing; > is this achieved only through updateStateByKey, or is there another way to > do the same? > > Ashish > > On Wed, Jun 17, 2015 at 8:48 AM, Enn

Re: Spark or Storm

2015-06-17 Thread Enno Shioji
00 PM, Enno Shioji wrote: > So Spark (not streaming) does offer exactly-once. Spark Streaming, however, > can only do exactly-once semantics *if the update operation is idempotent*. > updateStateByKey's update operation is idempotent, because it completely > replaces the previous state.

Re: Spark or Storm

2015-06-17 Thread Enno Shioji
free but >> above it is $1 a min. >> >> How do i maintain a shared state ( total amount , total min , total data >> etc ) so that i know how much i accumulated at any given point as events >> for same phone can go to any node / executor. >> >> Can some one ple

Re: Spark or Storm

2015-06-17 Thread Enno Shioji
at 2:09 PM, Ashish Soni wrote: > Streams can also be processed in micro-batches, which is the main > idea behind Spark Streaming, so what is the difference? > > Ashish > > On Wed, Jun 17, 2015 at 9:04 AM, Enno Shioji wrote: > >> PS just to elaborate on my first

Re: Spark or Storm

2015-06-17 Thread Enno Shioji
on EMR or storm for fault > tolerance and load balancing. Is it a correct approach? > On 17 Jun 2015 23:07, "Enno Shioji" wrote: > >> Hi Ayan, >> >> Admittedly I haven't done much with Kinesis, but if I'm not mistaken you >> should be able to us

Re: Spark or Storm

2015-06-17 Thread Enno Shioji
ctly-once/#7 > > > https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html > > > > On Wed, Jun 17, 2015 at 10:33 AM, Enno Shioji wrote: > >> AFAIK KCL is *supposed* to provide fault tolerance and load balancing >> (plus additi

Re: Spark or Storm

2015-06-17 Thread Enno Shioji
you'd have to make > sure that updates to your own internal state (e.g. reduceByKeyAndWindow) > are exactly-once too. > > Matei > > > On Jun 17, 2015, at 8:26 AM, Enno Shioji wrote: > > The thing is, even with that improvement, you still have to make updates

Re: RE: Spark or Storm

2015-06-18 Thread Enno Shioji
nector. > Before Spark 1.3.0 release one Spark worker would get all the streamed > messages. We had to re-partition to distribute the processing. > > > > From Spark 1.3.0 release the Spark Direct API for Kafka supported parallel > reads from Kafka streamed to Spark workers
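The Spark 1.3+ direct Kafka API described above can be sketched as follows (the broker list and topic name are placeholders; this assumes the spark-streaming-kafka artifact is on the classpath):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

// Direct API: one Spark partition per Kafka partition, so workers read
// from Kafka in parallel without the re-partitioning step the old
// receiver-based connector required.
def directStream(ssc: StreamingContext) = {
  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
  val topics = Set("events")
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topics)
}
```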

Re: RE: Spark or Storm

2015-06-19 Thread Enno Shioji
Storm / Trident guide, even they call this exact > conditional updating of Trident State as "transactional" operation. See > "transactional spout" in the Trident State guide - > https://storm.apache.org/documentation/Trident-state > > In the end, I am totally op

Re: Twitter4J streaming question

2015-07-23 Thread Enno Shioji
You are probably listening to the sample stream, and THEN filtering. This means you listen to 1% of the Twitter stream and then look for the tweet by Bloomberg, so there is a very good chance you don't see the particular tweet. In order to get all Bloomberg-related tweets, you must connect to
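A sketch of connecting to the filter endpoint with Twitter4J instead of the sample stream (the tracked keyword is illustrative; credentials are assumed to be configured, e.g. via twitter4j.properties):

```scala
import twitter4j.{FilterQuery, Status, StatusAdapter, TwitterStreamFactory}

// The filter endpoint delivers every tweet matching the tracked keywords,
// unlike sample(), which is only a ~1% random slice of all tweets.
val stream = new TwitterStreamFactory().getInstance()
stream.addListener(new StatusAdapter {
  override def onStatus(status: Status): Unit =
    println(s"@${status.getUser.getScreenName}: ${status.getText}")
})
stream.filter(new FilterQuery().track("bloomberg"))
```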

Re: Twitter4J streaming question

2015-07-23 Thread Enno Shioji
On Jul 23, 2015, at 4:17 PM, Enno Shioji wrote: > > You are probably listening to the sample stream, and THEN filtering. > This means you listen to 1% of the twitter stream, and then looking for the > tweet by Bloomberg, so there is a very good chance you don't see the > part

Re: Twitter streaming with apache spark stream only a small amount of tweets

2015-07-29 Thread Enno Shioji
If you start parallel Twitter streams, you will be in breach of their TOS. They allow a small number of parallel streams in practice, but if you do it on a massive scale they'll ban you (I'm speaking from experience ;) ). If you really need that level of data, you need to talk to a company called Gni

Re: Twitter live Streaming

2015-08-04 Thread Enno Shioji
If you want to do it through the streaming API you have to pay Gnip; it's not free. You can go through the non-streaming Twitter API and convert it to a stream yourself, though. > On 4 Aug 2015, at 09:29, Sadaf wrote: > > Hi > Is there any way to get all old tweets since the account was created >

How could one specify a Docker image for each job to be used by executors?

2016-12-05 Thread Enno Shioji
Hi, Suppose I have a job that uses some native libraries. I can launch executors using a Docker container and everything is fine. Now suppose I have some other job that uses some other native libraries (and let's assume they just can't co-exist in the same Docker image), but I want to execute tho

Re: Profiling in YourKit

2015-02-07 Thread Enno Shioji
> 1. You have 4 CPU cores and 34 threads (system-wide, you likely have many more, by the way). Think of it as having 4 espresso machines and 34 baristas. Does the fact that you have only 4 espresso machines mean you can only have 4 baristas? Of course not; there's plenty more work other than making esp
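The analogy can be demonstrated directly: a JVM happily runs far more threads than cores, because most threads spend their time blocked (on I/O, locks, timers) rather than computing. A small sketch, with the thread count mirroring the 34 from the question:

```scala
object ThreadsVsCores {
  // The number of "espresso machines" on this box.
  val cores: Int = Runtime.getRuntime.availableProcessors()

  // Start n daemon threads that just sleep -- "baristas" waiting around.
  def startSleepers(n: Int): Seq[Thread] =
    (1 to n).map { _ =>
      val t = new Thread(() => Thread.sleep(10000))
      t.setDaemon(true)
      t.start()
      t
    }
}

// Even on a 4-core machine, all 34 sleeping threads coexist:
//   ThreadsVsCores.startSleepers(34).count(_.isAlive)
```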

Re: SparkStreaming Low Performance

2015-02-14 Thread Enno Shioji
ake it outside of my map > operation then it throws Serializable Exceptions (Caused by: > java.io.NotSerializableException: > com.fasterxml.jackson.module.scala.modifiers.SetTypeModifier). > > Thanks > Best Regards > > On Sat, Feb 14, 2015 at 7:11 PM, Enno Shioji wrote: &

Re: SparkStreaming Low Performance

2015-02-14 Thread Enno Shioji
object Holder extends Serializable {
  @transient lazy val mapper = new ObjectMapper() with ScalaObjectMapper
  mapper.registerModule(DefaultScalaModule)
}

val jsonStream = myDStream.map(x => {
  Holder
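The reason the Holder trick works: @transient lazy val fields are skipped by Java serialization and rebuilt on first use after deserialization on each executor. A self-contained sketch of the same pattern using a stand-in for the non-serializable resource (the Parser class is illustrative, standing in for something like Jackson's ObjectMapper):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Stand-in for an expensive, non-serializable resource.
class Parser { def parse(s: String): Int = s.trim.toInt }

object Holder extends Serializable {
  // Skipped during serialization; re-created lazily after deserialization.
  @transient lazy val parser = new Parser
}

// Java-serialization round trip, mimicking what Spark does when it ships
// a closure referencing Holder to an executor.
def roundTrip[A](a: A): A = {
  val bos = new ByteArrayOutputStream()
  new ObjectOutputStream(bos).writeObject(a)
  new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray))
    .readObject().asInstanceOf[A]
}
```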

Re: SparkStreaming Low Performance

2015-02-14 Thread Enno Shioji
t way to do this in Spark; I know if I use Spark SQL with SchemaRDD > and all it will be much faster, but I need that in Spark Streaming. > > Thanks > Best Regards > > On Sat, Feb 14, 2015 at 8:04 PM, Enno Shioji wrote: > >> I see. I'd really benchmark how the parsing perfor

Is a higher-res or vector version of Spark logo available?

2015-04-23 Thread Enno Shioji
My employer (adform.com) would like to use the Spark logo in a recruitment event (to indicate that we are using Spark in our company). I looked in the Spark repo (https://github.com/apache/spark/tree/master/docs/img) but couldn't find a vector format. Is a higher-res or vector format version avail

ReceiverInputDStream#saveAsTextFiles with an S3 URL results in double forward slash key names in S3

2014-12-23 Thread Enno Shioji
Is anybody experiencing this? It looks like a bug in JetS3t to me, but thought I'd sanity-check before filing an issue. I'm writing to S3 using ReceiverInputDStream#saveAsTextFiles with an S3 URL ("s3://fake-test/1234"). The code does write to S3, but with double forward slashes

Re: ReceiverInputDStream#saveAsTextFiles with an S3 URL results in double forward slash key names in S3

2014-12-23 Thread Enno Shioji
I'm wondering, also why Jets3tFileSystem is the AbstractFileSystem used by > so many - is that the standard impl for storing using the AbstractFileSystem > interface? > > On Dec 23, 2014, at 6:06 AM, Enno Shioji wrote: > > Is anybody experiencing this? It looks like a bug in JetS3t

Re: ReceiverInputDStream#saveAsTextFiles with an S3 URL results in double forward slash key names in S3

2014-12-23 Thread Enno Shioji
I filed a new issue HADOOP-11444. According to HADOOP-10372, s3 is likely to be deprecated anyway in favor of s3n. Also the comment section notes that Amazon has implemented an EmrFileSystem for S3 which is built using the AWS SDK rather than JetS3t. On Tue, Dec 23, 2014 at 2:06 PM, Enno Shioji

Writing and reading sequence file results in trailing extra data

2014-12-30 Thread Enno Shioji
Hi, I'm facing a weird issue; any help appreciated. When I execute the code below and compare "input" and "output", each record in the output has some extra trailing data appended to it and is hence corrupted. I'm just reading and writing, so the input and output should be exactly the same. I'm usi

[SOLVED] Re: Writing and reading sequence file results in trailing extra data

2014-12-30 Thread Enno Shioji
This poor soul had the exact same problem and solution: http://stackoverflow.com/questions/24083332/write-and-read-raw-byte-arrays-in-spark-using-sequence-file-sequencefile On Tue, Dec 30, 2014 at 10:58 AM, Enno Shioji wrote: > Hi, I'm facing a weird issue. Any help appreciated.
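The fix in that Stack Overflow answer boils down to copying only the valid prefix of the writable's backing array, since Hadoop's BytesWritable#getBytes returns an internal buffer that may be longer than the record. A sketch using a stand-in for that padded-buffer shape (PaddedBuffer is illustrative):

```scala
import java.util.Arrays

// Mimics BytesWritable's shape: a backing array that may be over-sized
// (its capacity grows geometrically) plus the count of valid bytes.
final case class PaddedBuffer(backing: Array[Byte], length: Int)

// Keep only the valid prefix; everything past `length` is stale padding.
def trimmed(b: PaddedBuffer): Array[Byte] =
  Arrays.copyOfRange(b.backing, 0, b.length)

// With the real Hadoop API the equivalent is:
//   Arrays.copyOfRange(bytesWritable.getBytes, 0, bytesWritable.getLength)
```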

Re: Big performance difference between "client" and "cluster" deployment mode; is this expected?

2014-12-31 Thread Enno Shioji
Also the job was deployed from the master machine in the cluster. On Wed, Dec 31, 2014 at 6:35 PM, Enno Shioji wrote: > Oh sorry, that was an edit mistake. The code is essentially: > > val msgStream = kafkaStream > .map { case (k, v) => v

Re: Big performance difference between "client" and "cluster" deployment mode; is this expected?

2014-12-31 Thread Enno Shioji
iver? and are you running this from a machine that's not in > your Spark cluster? Then in client mode you're shipping data back to a > less-nearby machine, compared to with cluster mode. That could explain > the bottleneck. > > On Wed, Dec 31, 2014 at 4:12 PM, Enno Shioji w

Re: Big performance difference between "client" and "cluster" deployment mode; is this expected?

2014-12-31 Thread Enno Shioji
e number of > executors/cores requested? > > TD > > On Wed, Dec 31, 2014 at 10:36 AM, Enno Shioji wrote: > >> Also the job was deployed from the master machine in the cluster. >> >> On Wed, Dec 31, 2014 at 6:35 PM, Enno Shioji wrote: >> >>

Better way of measuring custom application metrics

2015-01-03 Thread Enno Shioji
I have a hack to gather custom application metrics in a Streaming job, but I wanted to know if there is any better way of doing this. My hack consists of this singleton: object Metriker extends Serializable { @transient lazy val mr: MetricRegistry = { val metricRegistry = new MetricRegistry

Re: Better way of measuring custom application metrics

2015-01-04 Thread Enno Shioji
etricsSystem to > build your own and configure it in metrics.properties with class name to > let it loaded by metrics system, for the details you can refer to > http://spark.apache.org/docs/latest/monitoring.html or source code. > > > > Thanks > > Jerry > > > &g

TestSuiteBase-based unit test using a sliding window join times out

2015-01-07 Thread Enno Shioji
Hi, I extended org.apache.spark.streaming.TestSuiteBase for some testing, and I was able to run this test fine:

test("Sliding window join with 3 second window duration") {
  val input1 = Seq(
    Seq("req1"),
    Seq("req2", "req3"),
    Seq(),
    Seq("req4", "req5", "req6"),

Re: Registering custom metrics

2015-01-08 Thread Enno Shioji
.github.com/ibuenros/9b94736c2bad2f4b8e23 On Mon, Jan 5, 2015 at 2:56 PM, Enno Shioji wrote: > Hi Gerard, > > Thanks for the answer! I had a good look at it, but I couldn't figure out > whether one can use that to emit metrics from your application code. > > Suppose I wanted to m

Re: Problems with Spark Core 1.2.0 SBT project in IntelliJ

2015-01-13 Thread Enno Shioji
Had the same issue. I can't remember what the issue was but this works:

libraryDependencies ++= {
  val sparkVersion = "1.2.0"
  Seq(
    "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
    "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
    "org.apache.spark"

Re: Problems with Spark Core 1.2.0 SBT project in IntelliJ

2015-01-13 Thread Enno Shioji
d just choosing the project > folder and I get the same result. Not using gen-idea in sbt. > > > > On Wed, Jan 14, 2015 at 8:52 AM, Jay Vyas > wrote: > >> I find importing a working SBT project into IntelliJ is the way to >> go. >> >> How did you load th