Yes
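A minimal sketch of the pattern Chen describes - a per-JVM cache held in a Scala object, with Guava reloading entries from MySQL on expiry. This assumes Guava and the MySQL JDBC driver are on the executor classpath; the URL, table, and TTL below are illustrative, not prescribed:

    import java.sql.DriverManager
    import java.util.concurrent.TimeUnit
    import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache}

    // One instance per executor JVM: a Scala object is initialized lazily,
    // once per JVM, so all tasks on that executor share this cache.
    object ReferenceCache {
      // Illustrative connection settings; substitute your own.
      private val url =
        "jdbc:mysql://mysql-host:3306/refdb?user=app&password=secret"

      val cache: LoadingCache[String, String] =
        CacheBuilder.newBuilder()
          .expireAfterWrite(6, TimeUnit.HOURS) // entries reload lazily after expiry
          .maximumSize(100000)
          .build[String, String](new CacheLoader[String, String] {
            // Called only on a miss or after expiry; hits never touch MySQL.
            override def load(key: String): String = {
              val conn = DriverManager.getConnection(url)
              try {
                val ps = conn.prepareStatement(
                  "SELECT value FROM reference WHERE id = ?")
                ps.setString(1, key)
                val rs = ps.executeQuery()
                if (rs.next()) rs.getString(1) else ""
              } finally conn.close()
            }
          })
    }

    // Inside the streaming job, e.g. enriching a Kafka stream:
    // kafkaStream.map { case (k, v) => (v, ReferenceCache.cache.get(v)) }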
On Wed, Aug 26, 2015 at 10:23 AM, Chen Song <chen.song...@gmail.com> wrote:

Thanks Cody.

Are you suggesting putting the cache in a global context in each executor JVM, in a Scala object for example? Then have a scheduled task refresh the cache (or have refreshes triggered by the expiry, if using Guava)?

Chen

On Wed, Aug 26, 2015 at 10:51 AM, Cody Koeninger <c...@koeninger.org> wrote:

If your data only changes every few days, why not restart the job every few days and just broadcast the data?

Or you can keep a local per-JVM cache with an expiry (e.g. a Guava cache) to avoid many MySQL reads.

On Wed, Aug 26, 2015 at 9:46 AM, Chen Song <chen.song...@gmail.com> wrote:

Piggybacking on this question.

I have a similar use case, but a bit different. My job consumes a stream from Kafka, and I need to join the Kafka stream with a reference table from MySQL (a kind of data validation and enrichment). I need to process this stream every 1 min. The data in MySQL does not change very often, maybe once every few days.

So my requirements are:

* I cannot easily use a broadcast variable, because the data does change, although not very often.
* I am not sure it is good practice to read data from MySQL in every batch (in my case, every 1 min).

Has anyone done this before? Any suggestions and feedback are appreciated.

Chen

On Sun, Jul 5, 2015 at 11:50 AM, Ashic Mahtab <as...@live.com> wrote:

If it is indeed a reactive use case, then Spark Streaming would be a good choice.

One approach worth considering - is it possible to receive a message via Kafka (or some other queue)? That wouldn't need any polling, and you could use standard consumers. If polling isn't an issue, then writing a custom receiver will work fine. The way a receiver works is this:

* Your receiver has a receive() function, where you'd typically start a loop. In your loop, you'd fetch items and call store(entry).
* You control everything in the receiver. If you're listening on a queue, you receive messages, store() them, and ack your queue. If you're polling, it's up to you to ensure delays between db calls.
* The things you store() go on to make up the RDDs in your DStream. So intervals, windowing, etc. apply to those. The receiver is the boundary between your data source and the DStream RDDs. In other words, if your interval is 15 seconds with no windowing, then the things that went to store() every 15 seconds are bunched up into an RDD of your DStream. That's kind of a simplification, but it should give you the idea that your "db polling" interval and streaming interval are not tied together.

-Ashic.
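A rough sketch of such a polling receiver, assuming the Spark 1.x receiver API; the query, table, and intervals are made up for illustration:

    import java.sql.DriverManager
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    // Polls a table on its own schedule; the streaming batch interval
    // independently bunches whatever was store()d into RDDs.
    class JdbcPollingReceiver(url: String, pollMs: Long)
        extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      def onStart(): Unit = {
        // Receive on a separate thread so onStart() returns promptly.
        new Thread("jdbc-poller") {
          override def run(): Unit = receive()
        }.start()
      }

      def onStop(): Unit = {} // isStopped() ends the loop below

      private def receive(): Unit = {
        while (!isStopped()) {
          val conn = DriverManager.getConnection(url)
          try {
            // Illustrative query; a real one would track a high-water mark
            // so rows are not re-read on every poll.
            val rs = conn.createStatement()
              .executeQuery("SELECT payload FROM changes ORDER BY id")
            while (rs.next()) store(rs.getString(1))
          } finally conn.close()
          Thread.sleep(pollMs) // you control the delay between db calls
        }
      }
    }

    // val stream = ssc.receiverStream(
    //   new JdbcPollingReceiver("jdbc:mysql://host/db?user=u&password=p", 60000L))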
------------------------------
Date: Mon, 6 Jul 2015 01:12:34 +1000
Subject: Re: JDBC Streams
From: guha.a...@gmail.com
To: as...@live.com
CC: ak...@sigmoidanalytics.com; user@spark.apache.org

Hi

Thanks for the reply. Here is my situation: I have a DB which enables synchronous CDC; think of this as a DB trigger which writes "changed" values to a table as soon as something changes in the production table. My job will need to pick up the data "as soon as it arrives", which can be every 1 min. Ideally it will pick up the changes, transform them into JSON, and put them to Kinesis. In short, I am emulating a Kinesis producer with a DB source (don't even ask why, let's say these are the constraints :) ).

Please advise (a) whether Spark is a good choice here, and (b) what your suggestion is either way.

I understand I can easily do it using a simple Java/Python app, but I am a little worried about managing scaling/fault tolerance, and that's where my concern is.

TIA
Ayan

On Mon, Jul 6, 2015 at 12:51 AM, Ashic Mahtab <as...@live.com> wrote:

Hi Ayan,

How "continuous" is your workload? As Akhil points out, with streaming you'll give up at least one core for receiving and will need at least one more core for processing. Unless you're running on something like Mesos, this means that those cores are dedicated to your app and can't be leveraged by other apps/jobs.

If it's something periodic (once an hour, once every 15 minutes, etc.), then I'd simply write a "normal" Spark application and trigger it periodically. There are many things that can take care of that - sometimes a simple cron job is enough!

------------------------------
Date: Sun, 5 Jul 2015 22:48:37 +1000
Subject: Re: JDBC Streams
From: guha.a...@gmail.com
To: ak...@sigmoidanalytics.com
CC: user@spark.apache.org

Thanks Akhil. In case I go with Spark Streaming, I guess I have to implement a custom receiver, and Spark Streaming will call this receiver every batch interval - is that correct? Any gotchas you see in this plan? TIA... Best, Ayan

On Sun, Jul 5, 2015 at 5:40 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:

If you want a long-running application, then go with Spark Streaming (which kind of blocks your resources). On the other hand, if you use a job server, then you can actually use the resources (CPUs) for other jobs when your DB job is not using them.

Thanks
Best Regards

On Sun, Jul 5, 2015 at 5:28 AM, ayan guha <guha.a...@gmail.com> wrote:

Hi All

I have a requirement to connect to a DB every few minutes and bring data to HBase. Can anyone suggest whether Spark Streaming would be appropriate for this scenario, or whether I should look into the job server?

Thanks in advance

--
Best Regards,
Ayan Guha

--
Chen Song
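For the periodic variant Ashic suggests - a plain Spark application fired by cron rather than a streaming job - a minimal sketch might look like the following. It assumes the Spark 1.x DataFrame JDBC reader; the table name and connection string are illustrative, and the sink is left schematic:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // A "normal" Spark app: run it from cron every few minutes, read the
    // changed rows over JDBC, push them wherever they need to go, and exit.
    object PeriodicDbJob {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("periodic-db-job"))
        val sqlContext = new SQLContext(sc)

        // Illustrative source table; a real job would filter on a
        // high-water mark (e.g. last processed change id) kept elsewhere.
        val changes = sqlContext.read.format("jdbc")
          .option("url", "jdbc:mysql://host:3306/db?user=u&password=p")
          .option("dbtable", "changes")
          .load()

        // Schematic sink: serialize rows to JSON and hand each partition
        // to whatever client (Kinesis, HBase, ...) the use case needs.
        changes.toJSON.foreachPartition { rows =>
          rows.foreach(println) // placeholder for the real writer
        }

        sc.stop()
      }
    }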