Here is the solution, and it looks perfect for me. Thanks for all your help: http://www.confluent.io/blog/bottled-water-real-time-integration-of-postgresql-and-kafka/
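For anyone finding this thread later: Bottled Water publishes PostgreSQL changes into Kafka (one topic per table, Avro-encoded), and those topics can be consumed from Spark Streaming to keep a downstream store such as HBase in sync. The sketch below is only my assumption of what that consumer could look like with the Spark 1.x direct Kafka API; the topic name and broker address are made up, and it uses plain string decoders where a real job would decode the Avro payloads.

// Minimal sketch, assuming a change-data topic produced by Bottled Water.
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object PostgresChangeStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("postgres-cdc-sync")
    val ssc  = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map("metadata.broker.list" -> "kafka-broker:9092") // assumed broker address
    val topics      = Set("my_table")                                    // assumed topic (one per table)

    val changes = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Apply each change to the downstream store, one connection per partition.
    changes.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        records.foreach { case (key, value) =>
          // placeholder: replace with an upsert into HBase or another store
          println(s"change for key $key: $value")
        }
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}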
On 28 July 2015 at 23:27, Jörn Franke <jornfra...@gmail.com> wrote:

Can you put some transparent cache in front of the database? Or some JDBC proxy?

On Tue, 28 Jul 2015 at 19:34, Jeetendra Gangele <gangele...@gmail.com> wrote:

> Can the source write to Kafka/Flume/HBase in addition to Postgres?

No, it can't. There are many applications producing this PostgreSQL data, and I can't really ask all of those teams to start writing to another source. The velocity of the application is too high.

On 28 July 2015 at 21:50, <santosh...@gmail.com> wrote:

Sqoop's incremental data fetch will reduce the data size you need to pull from the source, but by the time that incremental fetch completes, isn't the data stale again if its velocity is high?

Maybe you can put a trigger in Postgres to send data to the big data cluster as soon as changes are made. Or, as I was saying in another email, can the source write to Kafka/Flume/HBase in addition to Postgres?

From: Jeetendra Gangele <gangele...@gmail.com>
Sent: Tuesday, July 28, 2015 5:43 AM
To: santosh...@gmail.com
Cc: ayan guha <guha.a...@gmail.com>, felixcheun...@hotmail.com, user@spark.apache.org

I am trying to do that, but there will always be a data mismatch, since by the time Sqoop is fetching, the main database will have received many more updates. Sqoop does have an incremental data fetch, but it hits the database directly rather than reading the WAL.

On 28 July 2015 at 02:52, <santosh...@gmail.com> wrote:

Why can't you bulk pre-fetch the data to HDFS (for example with Sqoop) instead of hitting Postgres multiple times?

From: ayan guha <guha.a...@gmail.com>
Sent: Monday, July 27, 2015 4:41 PM
To: Jeetendra Gangele <gangele...@gmail.com>
Cc: felixcheun...@hotmail.com, user@spark.apache.org

You can open the DB connection once per partition. Please have a look at the design patterns for the foreachRDD construct in the Spark Streaming documentation. How big is your data in the DB, and how often does it change? You would be better off if the data were already in Spark.

On 28 Jul 2015 04:48, "Jeetendra Gangele" <gangele...@gmail.com> wrote:

Thanks for your reply.

In parallel I will be making around 6000 calls to PostgreSQL, which is not good; my database will die, and these calls will keep increasing. Handling millions of requests is not an issue with HBase/NoSQL.

Any other alternative?

On 27 July 2015 at 23:18, <felixcheun...@hotmail.com> wrote:

You can have Spark read from PostgreSQL through the data access API. Do you have any concerns with that approach, since you mention copying that data into HBase?
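To make that suggestion concrete, here is a rough sketch of loading a PostgreSQL table into Spark with the DataFrame JDBC data source (Spark 1.4 syntax). The connection URL, table name, and credentials are placeholders, and the PostgreSQL JDBC driver has to be on the classpath.

// Minimal sketch, assuming a reference table in PostgreSQL.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc         = new SparkContext(new SparkConf().setAppName("pg-to-spark"))
val sqlContext = new SQLContext(sc)

val reference = sqlContext.read.format("jdbc").options(Map(
  "url"     -> "jdbc:postgresql://pg-host:5432/mydb?user=user&password=password", // assumed URL
  "dbtable" -> "reference_table",                                                 // assumed table
  "driver"  -> "org.postgresql.Driver"
)).load()

reference.cache()                        // avoid re-hitting Postgres for every query
reference.registerTempTable("reference") // queryable from SparkSQL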
From: Jeetendra Gangele <gangele...@gmail.com>
Sent: Monday, July 27, 6:00 AM
Subject: Data from PostgreSQL to Spark
To: user

Hi All,

I have a use case where I am consuming events from RabbitMQ using Spark Streaming. Each event has some fields on which I want to query PostgreSQL, bring back the data, join the event data with the PostgreSQL data, and put the aggregated result into HDFS, so that I can run analytics queries over it using SparkSQL.

My concern is that the PostgreSQL data is production data, so I don't want to hit it so many times. At any given second I may have 3000 events, which means I would need to fire 3000 parallel queries against PostgreSQL, and this volume keeps growing, so my database will go down.

I can't migrate this PostgreSQL data since lots of systems use it, but I can copy the data into a NoSQL store like HBase and query HBase instead; the issue there is how to make sure HBase has up-to-date data.

Can anyone suggest the best approach/method to handle this case?

Regards
Jeetendra
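For reference, a minimal sketch of the connection-per-partition pattern ayan guha suggests above, applied to the enrichment step described in this question. The events DStream, connection URL, table, and query are all assumed placeholders (for example, events arriving as plain strings from a custom RabbitMQ receiver); a real job would parse each event, extract the lookup key, and write a proper format such as Parquet instead of text.

// Minimal sketch: one JDBC connection per partition instead of one query per event.
import java.sql.DriverManager
import org.apache.spark.streaming.dstream.DStream

def enrichAndStore(events: DStream[String]): Unit = {
  events.foreachRDD { (rdd, batchTime) =>
    val enriched = rdd.mapPartitions { partition =>
      // one connection per partition, not per record
      val conn = DriverManager.getConnection(
        "jdbc:postgresql://pg-host:5432/mydb", "user", "password")  // assumed URL and credentials
      val stmt = conn.prepareStatement(
        "SELECT some_column FROM reference_table WHERE id = ?")     // assumed query
      val out = partition.map { event =>
        stmt.setString(1, event)  // in a real job, extract the lookup key from the event
        val rs  = stmt.executeQuery()
        val ref = if (rs.next()) rs.getString(1) else ""
        rs.close()
        (event, ref)
      }.toList                    // materialise before closing the connection
      stmt.close(); conn.close()
      out.iterator
    }
    // write each micro-batch to HDFS so SparkSQL can query it later
    enriched.saveAsTextFile(s"hdfs:///events/enriched/${batchTime.milliseconds}")
  }
}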