Piggybacking on this question: I have a similar use case, but a bit different. My job consumes a stream from Kafka, and I need to join the Kafka stream with a reference table from MySQL (a kind of data validation and enrichment). I need to process this stream every 1 minute. The data in MySQL does not change very often, maybe once every few days.

So my requirements are:
* I cannot easily use a broadcast variable, because the data does change, although not very often.
* I am not sure whether it is good practice to read the data from MySQL in every batch (in my case, every 1 minute).

Has anyone done this before? Any suggestions and feedback are appreciated.
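One option I am considering is a broadcast variable that is rebuilt on the driver when it gets stale, so MySQL is re-read only every few hours instead of every batch. A rough Scala sketch of the idea (the connection string, table and column names, and the 6-hour TTL are all made up for illustration; the MySQL JDBC driver needs to be on the classpath):

import java.sql.DriverManager
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import scala.collection.mutable

object RefTable {
  private val ttlMs = 6 * 60 * 60 * 1000L  // illustrative: re-read at most every 6h
  @volatile private var bc: Broadcast[Map[String, String]] = _
  @volatile private var loadedAt = 0L

  // Called on the driver once per batch; only reloads when the TTL expires.
  def get(sc: SparkContext): Broadcast[Map[String, String]] = synchronized {
    if (bc == null || System.currentTimeMillis() - loadedAt > ttlMs) {
      if (bc != null) bc.unpersist()
      bc = sc.broadcast(load())
      loadedAt = System.currentTimeMillis()
    }
    bc
  }

  // Plain JDBC read of the reference table (names are illustrative).
  private def load(): Map[String, String] = {
    val conn = DriverManager.getConnection(
      "jdbc:mysql://db-host:3306/refdb", "user", "password")
    try {
      val rs = conn.createStatement()
        .executeQuery("SELECT id, label FROM reference_table")
      val m = mutable.Map[String, String]()
      while (rs.next()) m(rs.getString("id")) = rs.getString("label")
      m.toMap
    } finally conn.close()
  }
}

// Per batch: transform's body runs on the driver, so the broadcast handle is
// refreshed there, while executors just read ref.value.
// kafkaStream.transform { rdd =>
//   val ref = RefTable.get(rdd.sparkContext)
//   rdd.map { case (k, v) => (k, v, ref.value.getOrElse(k, "unknown")) }
// }

The unpersist() on the old handle is there so executors do not accumulate stale copies. Does that look reasonable, or is re-reading MySQL every batch actually fine at this data size?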
Chen

On Sun, Jul 5, 2015 at 11:50 AM, Ashic Mahtab <as...@live.com> wrote:

> If it is indeed a reactive use case, then Spark Streaming would be a good
> choice.
>
> One approach worth considering: is it possible to receive a message via
> Kafka (or some other queue)? That would not need any polling, and you
> could use standard consumers. If polling isn't an issue, then writing a
> custom receiver will work fine. The way a receiver works is this:
>
> * Your receiver has a receive() function, where you'd typically start a
> loop. In your loop, you'd fetch items and call store(entry).
> * You control everything in the receiver. If you're listening on a queue,
> you receive messages, store() them, and ack your queue. If you're polling,
> it's up to you to ensure delays between db calls.
> * The things you store() go on to make up the RDDs in your DStream. So
> intervals, windowing, etc. apply to those. The receiver is the boundary
> between your data source and the DStream RDDs. In other words, if your
> interval is 15 seconds with no windowing, then the things that went to
> store() in those 15 seconds are bunched up into an RDD of your DStream.
> That's kind of a simplification, but it should give you the idea that your
> "db polling" interval and streaming interval are not tied together.
>
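> In code, a bare-bones polling receiver would look roughly like this (just
> a sketch against the 1.x receiver API; fetchBatch() is a stand-in for
> whatever source you poll, and the delay handling is deliberately naive):
>
> import org.apache.spark.storage.StorageLevel
> import org.apache.spark.streaming.receiver.Receiver
>
> class PollingReceiver(pollIntervalMs: Long)
>   extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
>
>   def onStart(): Unit = {
>     // onStart() must not block, so poll on a separate thread.
>     new Thread("polling-receiver") {
>       override def run(): Unit = receive()
>     }.start()
>   }
>
>   def onStop(): Unit = {}  // the loop below exits once isStopped() is true
>
>   private def receive(): Unit = {
>     while (!isStopped) {
>       fetchBatch().foreach(item => store(item)) // store() feeds the DStream
>       Thread.sleep(pollIntervalMs)              // you control the delay
>     }
>   }
>
>   // Replace with your real fetch (db query, queue drain, etc.).
>   private def fetchBatch(): Seq[String] = Seq.empty
> }
>
> You'd hook it up with ssc.receiverStream(new PollingReceiver(15000)).
>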
> -Ashic.
>
> ------------------------------
> Date: Mon, 6 Jul 2015 01:12:34 +1000
> Subject: Re: JDBC Streams
> From: guha.a...@gmail.com
> To: as...@live.com
> CC: ak...@sigmoidanalytics.com; user@spark.apache.org
>
> Hi,
>
> Thanks for the reply. Here is my situation: I have a DB which enables
> synchronous CDC; think of this as a DB trigger which writes the "changed"
> values to a table as soon as something changes in the production table. My
> job will need to pick up the data as soon as it arrives, which can be at a
> 1-minute interval. Ideally it will pick up the changes, transform them
> into JSON, and put them to Kinesis. In short, I am emulating a Kinesis
> producer with a DB source (don't even ask why; let's say these are the
> constraints :) ).
>
> Please advise: (a) is Spark a good choice here, and (b) what is your
> suggestion either way?
>
> I understand I can easily do it using a simple Java/Python app, but I am a
> little worried about managing scaling/fault tolerance, and that's where my
> concern is.
>
> TIA
> Ayan
>
> On Mon, Jul 6, 2015 at 12:51 AM, Ashic Mahtab <as...@live.com> wrote:
>
> Hi Ayan,
> How "continuous" is your workload? As Akhil points out, with streaming
> you'll give up at least one core for receiving, and will need at least one
> more core for processing. Unless you're running on something like Mesos,
> this means that those cores are dedicated to your app and can't be
> leveraged by other apps/jobs.
>
> If it's something periodic (once an hour, once every 15 minutes, etc.),
> then I'd simply write a "normal" Spark application and trigger it
> periodically. There are many things that can take care of that -
> sometimes a simple cron job is enough!
>
> ------------------------------
> Date: Sun, 5 Jul 2015 22:48:37 +1000
> Subject: Re: JDBC Streams
> From: guha.a...@gmail.com
> To: ak...@sigmoidanalytics.com
> CC: user@spark.apache.org
>
> Thanks Akhil. In case I go with Spark Streaming, I guess I will have to
> implement a custom receiver, and Spark Streaming will call this receiver
> every batch interval; is that correct? Any gotchas you see in this plan?
>
> TIA... Best,
> Ayan
>
> On Sun, Jul 5, 2015 at 5:40 PM, Akhil Das <ak...@sigmoidanalytics.com>
> wrote:
>
> If you want a long-running application, then go with Spark Streaming
> (which kind of blocks your resources). On the other hand, if you use a job
> server, then you can actually use the resources (CPUs) for other jobs when
> your DB job is not using them.
>
> Thanks
> Best Regards
>
> On Sun, Jul 5, 2015 at 5:28 AM, ayan guha <guha.a...@gmail.com> wrote:
>
> Hi All,
>
> I have a requirement to connect to a DB every few minutes and bring the
> data to HBase. Can anyone suggest whether Spark Streaming would be
> appropriate for this scenario, or should I look into a job server?
>
> Thanks in advance
>
> --
> Best Regards,
> Ayan Guha

--
Chen Song