Can you put a transparent cache in front of the database? Or some JDBC proxy?
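If a caching layer or JDBC proxy in front of PostgreSQL is hard to deploy, a rough approximation on the Spark side is a per-executor lookup cache, sketched below in Scala. The Guava dependency, the connection URL, credentials, table and column names, and the 60-second expiry are all assumptions for illustration, not details from this thread.

```scala
import java.sql.DriverManager
import java.util.concurrent.TimeUnit

import com.google.common.cache.CacheBuilder // assumption: Guava is available on the executors

object PgLookupCache {
  // One cache per executor JVM: repeated keys within the expiry window never reach PostgreSQL.
  private val cache = CacheBuilder.newBuilder()
    .maximumSize(100000L)
    .expireAfterWrite(60, TimeUnit.SECONDS) // accept up to 60s of staleness
    .build[String, String]()

  // One long-lived connection per executor JVM (placeholder URL/credentials).
  private lazy val conn =
    DriverManager.getConnection("jdbc:postgresql://pg-host:5432/proddb", "dbuser", "dbpass")

  def lookup(key: String): Option[String] =
    Option(cache.getIfPresent(key)).orElse {
      // Cache miss: hit PostgreSQL once and remember the answer.
      val st = conn.prepareStatement("SELECT value FROM lookup_table WHERE key = ?")
      try {
        st.setString(1, key)
        val rs = st.executeQuery()
        val value = if (rs.next()) Some(rs.getString(1)) else None
        value.foreach(cache.put(key, _))
        value
      } finally st.close()
    }
}
```

Executors would call `PgLookupCache.lookup` from inside `mapPartitions`, so only cache misses reach PostgreSQL, at the price of tolerating up to a minute of staleness; production code would also want a small connection pool rather than a single shared connection.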
On Tue, 28 Jul 2015 at 19:34, Jeetendra Gangele <gangele...@gmail.com> wrote:

> > can the source write to Kafka/Flume/Hbase in addition to Postgres?
>
> No, it can't, because there are many applications producing this PostgreSQL data. I can't really ask all of those teams to start writing to some other sink as well.
>
> The velocity of the application is too high.
>
> On 28 July 2015 at 21:50, <santosh...@gmail.com> wrote:
>
>> Sqoop's incremental data fetch will reduce the data size you need to pull from the source, but by the time that incremental fetch is complete, isn't the data stale again if its velocity is high?
>>
>> Maybe you can put a trigger in Postgres to send data to the big data cluster as soon as changes are made. Or, as I was saying in another email, can the source write to Kafka/Flume/Hbase in addition to Postgres?
>>
>> Sent from Windows Mail
>>
>> *From:* Jeetendra Gangele <gangele...@gmail.com>
>> *Sent:* Tuesday, July 28, 2015 5:43 AM
>> *To:* santosh...@gmail.com
>> *Cc:* ayan guha <guha.a...@gmail.com>, felixcheun...@hotmail.com, user@spark.apache.org
>>
>> I am trying to do that, but there will always be a data mismatch, since by the time Sqoop has fetched, the main database will have received many more updates. There is an incremental data fetch in Sqoop, but it hits the database rather than reading the WAL edits.
>>
>> On 28 July 2015 at 02:52, <santosh...@gmail.com> wrote:
>>
>>> Why can't you bulk pre-fetch the data to HDFS (for example with Sqoop) instead of hitting Postgres multiple times?
>>>
>>> Sent from Windows Mail
>>>
>>> *From:* ayan guha <guha.a...@gmail.com>
>>> *Sent:* Monday, July 27, 2015 4:41 PM
>>> *To:* Jeetendra Gangele <gangele...@gmail.com>
>>> *Cc:* felixcheun...@hotmail.com, user@spark.apache.org
>>>
>>> You can call the DB connect once per partition. Please have a look at the design patterns for the foreach construct in the documentation.
>>> How big is your data in the DB? How often does it change? You would be better off if the data were already in Spark.
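For reference, here is a minimal sketch of that connection-per-partition pattern (it mirrors the "Design Patterns for using foreachRDD" section of the Spark Streaming programming guide). The `Event` case class, the `eventStream` DStream, and the PostgreSQL URL, credentials, table and column names are placeholders, not details taken from this thread.

```scala
import java.sql.DriverManager

import org.apache.spark.streaming.dstream.DStream

case class Event(key: String, payload: String) // placeholder event type

def enrich(eventStream: DStream[Event]): Unit = {
  eventStream.foreachRDD { rdd =>
    rdd.foreachPartition { events =>
      // One JDBC connection per partition, not one per record.
      val conn = DriverManager.getConnection("jdbc:postgresql://pg-host:5432/proddb", "dbuser", "dbpass")
      try {
        val st = conn.prepareStatement("SELECT value FROM lookup_table WHERE key = ?")
        events.foreach { e =>
          st.setString(1, e.key)
          val rs = st.executeQuery()
          // ... combine e with the looked-up row and write the result out ...
          rs.close()
        }
        st.close()
      } finally {
        conn.close()
      }
    }
  }
}
```

This cuts connection churn, but it still issues one query per event, so it complements rather than replaces batching or caching the lookups.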
>>> On 28 Jul 2015 04:48, "Jeetendra Gangele" <gangele...@gmail.com> wrote:
>>>
>>>> Thanks for your reply.
>>>>
>>>> In parallel I will be hitting PostgreSQL with around 6,000 calls, which is not good; my database will die. These calls to the database will keep on increasing. Handling millions of requests is not an issue with HBase/NoSQL.
>>>>
>>>> Any other alternative?
>>>>
>>>> On 27 July 2015 at 23:18, <felixcheun...@hotmail.com> wrote:
>>>>
>>>>> You can have Spark read from PostgreSQL through the data access API. Do you have any concern with that approach, since you mention copying that data into HBase?
>>>>>
>>>>> From: Jeetendra Gangele
>>>>> Sent: Monday, July 27, 6:00 AM
>>>>> Subject: Data from PostgreSQL to Spark
>>>>> To: user
>>>>>
>>>>> Hi All,
>>>>>
>>>>> I have a use case where I am consuming events from RabbitMQ using Spark Streaming. Each event has some fields on which I want to query PostgreSQL, bring back the matching data, join it with the event data, and put the aggregated result into HDFS, so that I can run analytics queries over it using Spark SQL.
>>>>>
>>>>> My question: the PostgreSQL data is production data, so I don't want to hit it too many times.
>>>>>
>>>>> At any given second I may have 3,000 events, which means I would need to fire 3,000 parallel queries at PostgreSQL, and this volume keeps growing, so my database will go down.
>>>>>
>>>>> I can't migrate this PostgreSQL data since lots of systems use it, but I can copy the data to some NoSQL store like HBase and query HBase instead. The issue there is: how can I make sure HBase has up-to-date data?
>>>>>
>>>>> Can anyone suggest the best approach/method to handle this case?
>>>>>
>>>>> Regards,
>>>>> Jeetendra
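To make the "data access API" suggestion above concrete for this use case, here is a rough sketch against the Spark 1.4-era API: load the PostgreSQL table through the JDBC data source once per micro-batch, join it with the RabbitMQ events, and append the result to HDFS for later Spark SQL queries. The URL, credentials, table name, join column, and output path are all placeholders, and `Event` is the same made-up event type as in the earlier sketch.

```scala
import java.util.Properties

import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.dstream.DStream

case class Event(key: String, payload: String) // placeholder event type

def enrichAndStore(sqlContext: SQLContext, eventStream: DStream[Event]): Unit = {
  import sqlContext.implicits._

  val props = new Properties()
  props.setProperty("user", "dbuser")
  props.setProperty("password", "dbpass")

  eventStream.foreachRDD { rdd =>
    // One table scan per micro-batch instead of thousands of point queries.
    val pgDF = sqlContext.read.jdbc("jdbc:postgresql://pg-host:5432/proddb", "lookup_table", props)

    rdd.toDF()
      .join(pgDF, "key") // assumes both sides share a "key" column
      .write
      .mode("append")
      .parquet("hdfs:///data/enriched_events")
  }
}
```

If the reference table is small and slow-changing, it could instead be cached (or broadcast) and refreshed on a schedule rather than re-read every batch, and the per-batch Parquet output will eventually need compaction.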