HBase for real-time queries? HBase was designed with batch in mind. Impala would be a better choice, but I do not know what Druid can do.
Cheers,

Alonso Isidoro Roman
https://about.me/alonso.isidoro.roman

2016-08-30 8:56 GMT+02:00 Mich Talebzadeh <mich.talebza...@gmail.com>:

> Hi Chanh,
>
> Druid sounds like a good choice.
>
> But again, the point is: what else does Druid bring on top of HBase?
>
> Unless one decides to use Druid for both historical data and real-time data in place of HBase!
>
> Is it easier to write an API against Druid than against HBase? You still want a UI dashboard?
>
> Cheers
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>
> On 30 August 2016 at 03:19, Chanh Le <giaosu...@gmail.com> wrote:
>
>> Hi everyone,
>>
>> It seems a lot of people use Druid for real-time dashboards.
>> I'm just wondering about using Druid as the main storage engine, because Druid can store the raw data and can also integrate with Spark (in theory).
>> In that case, do we need two separate stores: Druid (which keeps its segments in HDFS) and HDFS itself?
>> BTW, did anyone try this one: https://github.com/SparklineData/spark-druid-olap?
>>
>> Regards,
>> Chanh
>>
>> On Aug 30, 2016, at 3:23 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>> Thanks Bhaarat and everyone.
>>
>> This is an updated version of the same diagram:
>>
>> <LambdaArchitecture.png>
>>
>> The frequency of "Recent data" is defined by the window length in Spark Streaming. It can vary between 0.5 seconds and an hour (I don't think we can push Spark's granularity below 0.5 seconds in anger). For applications like credit card transactions and fraud detection, data is stored in real time by Spark in HBase tables; the HBase tables will be on HDFS as well. The same Spark Streaming job will write asynchronously to HDFS Hive tables.
>> One school of thought is to never write to Hive from Spark: write straight to HBase and then read the HBase tables into Hive periodically.
>>
>> The third component in this layer is the Serving Layer, which can combine data from the current store (HBase) and the historical store (Hive tables) to give the user visual analytics. That visual analytics can be a real-time dashboard on top of the Serving Layer. The Serving Layer could be an in-memory NoSQL offering, or data from HBase (red box) combined with Hive tables.
>>
>> I am not aware of any industrial-strength real-time dashboard. The idea is that one uses such a dashboard in real time. Dashboard in this sense means a general-purpose API to a data store of some type, such as the Serving Layer, to provide visual analytics in real time on demand, combining real-time data and aggregate views. As usual, the devil is in the detail.
>>
>> Let me know your thoughts. Anyway, this is a first-cut pattern.
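A minimal sketch of the dual write described above, assuming a Spark 2.x DStream job: each micro-batch is written to HBase (the "Recent data" store) and appended to a Hive table on HDFS (the historical store). The socket source, the "txnId,amount" record layout and the table/column-family names are illustrative placeholders only; in practice the source would typically be a Kafka direct stream.

// Sketch of the speed-layer + batch-layer dual write; names and schema are placeholders.
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SpeedAndBatchWriter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("speed-and-batch-sketch")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // Micro-batch interval; the "Recent data" window discussed above sits on top of this.
    val ssc = new StreamingContext(spark.sparkContext, Seconds(2))

    // Stand-in source; a Kafka direct stream would normally go here.
    val lines  = ssc.socketTextStream("localhost", 9999)
    // Assume each line looks like "txnId,amount".
    val events = lines.map(_.split(",")).map(a => (a(0), a(1).toDouble))

    events.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        // 1) Speed layer: put each record into an HBase table (placeholder name/column family).
        rdd.foreachPartition { part =>
          val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
          val table = conn.getTable(TableName.valueOf("transactions_recent"))
          part.foreach { case (id, amount) =>
            val put = new Put(Bytes.toBytes(id))
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("amount"), Bytes.toBytes(amount.toString))
            table.put(put)
          }
          table.close()
          conn.close()
        }
        // 2) Batch layer: append the same micro-batch to a Hive table on HDFS (placeholder name).
        rdd.toDF("txn_id", "amount")
          .write.mode(SaveMode.Append)
          .saveAsTable("transactions_hist")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

Opening the HBase connection once per partition rather than per record keeps the write path cheap; a production job would more likely use something like the hbase-spark module or bulk puts, and append to Hive less often than every micro-batch.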
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com
>>
>> On 29 August 2016 at 18:53, Bhaarat Sharma <bhaara...@gmail.com> wrote:
>>
>>> Hi Mich,
>>>
>>> This is really helpful. I'm trying to wrap my head around the last diagram you shared (the one with Kafka). In that diagram Spark Streaming is pushing data to HDFS and NoSQL. However, I'm confused by the "Real Time Queries, Dashboards" annotation. Based on this diagram, will real-time queries be running on Spark or HBase?
>>>
>>> PS: My intention was not to steer the conversation away from what Ashok asked, but I found the diagrams shared by Mich very insightful.
>>>
>>> On Sun, Aug 28, 2016 at 7:18 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> In terms of positioning, Spark is really the first Big Data platform to integrate batch, streaming and interactive computations in a unified framework. What this boils down to is that whichever way one looks at it, there is somewhere Spark can make a contribution. In general, there are a few design patterns common to Big Data:
>>>>
>>>> - *ETL & Batch*
>>>>
>>>> The first one is the most common, with established tools like Sqoop and Talend for ETL and HDFS for storage of some kind. Spark can be used as the execution engine for Hive at the storage level, which actually makes it a truly vendor-independent processing engine (BTW, Impala, Tez and LLAP are offered by vendors). Personally I use Spark at the ETL layer by extracting data from sources through plug-ins (JDBC and others) and storing it on HDFS in some format; a minimal sketch follows the three patterns below.
>>>>
>>>> - *Batch, real time plus Analytics*
>>>>
>>>> In this pattern you have data coming in real time and you want to query it in real time through a real-time dashboard. HDFS is not ideal for updating data in real time, nor for random access. The source could be all sorts of web servers, which would need a Flume agent. At the storage layer we are probably looking at something like HBase. The crucial point is that saved data needs to be ready for queries immediately. The dashboards require HBase APIs. The analytics can be done through Hive, again running on the Spark engine. Again, note that we should ideally process batch and real time separately.
>>>>
>>>> - *Real time / Streaming*
>>>>
>>>> This is most relevant to Spark as we are moving to near real time, which is where Spark excels. We need to capture the incoming events (logs, sensor data, pricing, emails) through interfaces like Kafka, message queues, etc., and process these events with minimum latency. Again, Spark is a very good candidate here with its Spark Streaming and micro-batching capabilities. There are others like Storm, Flink, etc. that are event-based, but you don't hear much about them. Again, for a streaming architecture you need to sink data in real time into something like HBase, Cassandra (?) and others as the real-time store, or HDFS or Hive as the forever storage.
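For the ETL & Batch pattern, here is a minimal sketch of Spark used as the extraction layer, assuming Spark 2.x and the generic JDBC data source; the connection URL, credentials, source table, partition column and target locations are all made-up placeholders.

// Sketch of a Spark JDBC extraction landing data on HDFS; all names are placeholders.
import org.apache.spark.sql.{SaveMode, SparkSession}

object JdbcToHdfs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("jdbc-etl-sketch")
      .enableHiveSupport()   // only needed if also writing to a Hive table
      .getOrCreate()

    // Extract: read the source table through the generic JDBC data source.
    val src = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/SALES")  // placeholder URL
      .option("dbtable", "scott.transactions")                 // placeholder table
      .option("user", "scott")
      .option("password", "tiger")
      .load()

    // Load: store it on HDFS in a columnar format, partitioned by an assumed date column.
    src.write
      .mode(SaveMode.Overwrite)
      .partitionBy("trans_date")
      .parquet("hdfs:///data/raw/transactions")

    // Optionally also append the data to a Hive staging table for the batch layer.
    src.write.mode(SaveMode.Append).saveAsTable("transactions_staging")

    spark.stop()
  }
}

Hive, itself running on the Spark engine as described above, can then query the landed Parquet data or the staging table directly for the batch/analytics side.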
>>>>
>>>> In general there is also the *Lambda Architecture*, which is designed for streaming analytics. The streaming data ends up in both the batch layer and the speed layer. The batch layer is used to answer batch queries; the speed layer, on the other hand, is used to handle fast/real-time queries. This model is really cool as Spark Streaming can feed both the batch layer and the speed layer.
>>>>
>>>> At a high level this looks like the diagram below, from http://lambda-architecture.net/:
>>>>
>>>> <image.png>
>>>>
>>>> My favourite would be something like the diagram below, with Spark playing a major role:
>>>>
>>>> <LambdaArchitecture.png>
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>> On 28 August 2016 at 19:43, Sivakumaran S <siva.kuma...@me.com> wrote:
>>>>
>>>>> Spark fits best for processing. But depending on the use case, you could expand the scope of Spark to moving data using the native connectors. The only thing that Spark is not is storage; connectors are available for most storage options, though.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Sivakumaran S
>>>>>
>>>>> On 28-Aug-2016, at 6:04 PM, Ashok Kumar <ashok34...@yahoo.com.INVALID> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> There are design patterns that use Spark extensively. I am new to this area, so I would appreciate it if someone could explain where Spark fits in, especially within faster or streaming use cases.
>>>>>
>>>>> What are the best practices involving Spark? Is it always best to deploy it as the processing engine?
>>>>>
>>>>> For example, when we have a pattern
>>>>>
>>>>> Input Data -> Data in Motion -> Processing -> Storage
>>>>>
>>>>> where does Spark best fit in?
>>>>>
>>>>> Thanking you