> I still don't understand why writing to a transactional database with
> locking and concurrency (reads and writes) through JDBC will be fast for
> this sort of data ingestion.
Who cares about fast if your data is wrong? And it's still plenty fast enough:

https://youtu.be/NVl9_6J1G60?list=WL&t=1819
https://www.citusdata.com/blog/2016/09/22/announcing-citus-mx/

On Thu, Sep 29, 2016 at 11:16 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> The way I see this, there are three things involved:
>
> 1. Data ingestion from source into Kafka
> 2. Data conversion and storage (ETL/ELT)
> 3. Presentation
>
> Item 2 is the one that needs to be designed correctly. I presume the raw
> data has to conform to some form of MDM that requires schema mapping etc.
> before being put into persistent storage (DB, HDFS etc.). Which one to
> choose depends on your volume of ingestion, your cluster size and the
> complexity of the data conversion. Then your users will use some form of
> UI (Tableau, QlikView, Zeppelin, direct SQL) to query the data one way or
> another. Your users can directly use a UI like Tableau that offers
> built-in analytics on SQL (Spark SQL offers the same). Your mileage
> varies according to your needs.
>
> I still don't understand why writing to a transactional database with
> locking and concurrency (reads and writes) through JDBC will be fast for
> this sort of data ingestion. If you asked me to choose an RDBMS to write
> to as my sink, I would use Oracle, which offers the best locking and
> concurrency among RDBMSs and also handles key-value pairs (assuming that
> is what you want). In addition, it can be used as a data warehouse as
> well.
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
>
> On 29 September 2016 at 16:49, Ali Akhtar <ali.rac...@gmail.com> wrote:
>> The business use case is to read a user's data from a variety of
>> different services through their APIs, and then to allow the user to
>> query that data, on a per-service basis as well as in aggregate across
>> all services.
>>
>> The way I'm considering doing it is to do some basic ETL (drop all the
>> unnecessary fields, rename some fields into something more manageable,
>> etc.) and then store the data in Cassandra / Postgres.
>>
>> Then, when the user wants to view a particular report, query the
>> respective table in Cassandra / Postgres (select .. from data where
>> user = ? and date between <start> and <end> and some_field = ?).
>>
>> How will Spark Streaming help with aggregation? Couldn't the data be
>> queried from Cassandra / Postgres via the Kafka consumer and aggregated
>> that way?
>>
>> On Thu, Sep 29, 2016 at 8:43 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>> No, the direct stream in and of itself won't ensure an end-to-end
>>> guarantee, because it doesn't know anything about your output actions.
>>>
>>> You still need to do some work. The point is that having easy access
>>> to offsets for batches on a per-partition basis makes it easier to do
>>> that work, especially in conjunction with aggregation.
>>>
>>> On Thu, Sep 29, 2016 at 10:40 AM, Deepak Sharma <deepakmc...@gmail.com> wrote:
>>>> If you use Spark direct streams, it ensures an end-to-end guarantee
>>>> for messages.
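The offset-aware pattern Cody is pointing at might look roughly like the
following in Java, against the Spark 1.6 / Kafka 0.8 direct stream API. It
is a sketch only: the topic name (raw_events), the tables (events,
stream_offsets), and the connection strings are illustrative assumptions,
not anything from the thread. Results and the per-partition offset range
are committed in a single Postgres transaction, so a replayed batch
overwrites itself instead of duplicating data.

// Sketch: exactly-once output with the Kafka direct stream by committing
// data and offsets in one Postgres transaction. Spark 1.6 / Kafka 0.8 API;
// topic, table, and connection details below are illustrative.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

import kafka.serializer.StringDecoder;
import org.apache.spark.SparkConf;
import org.apache.spark.TaskContext;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.HasOffsetRanges;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.apache.spark.streaming.kafka.OffsetRange;

public class OffsetAwareIngest {
  public static void main(String[] args) throws Exception {
    JavaStreamingContext jssc = new JavaStreamingContext(
        new SparkConf().setAppName("ingest"), Durations.seconds(10));

    Map<String, String> kafkaParams = new HashMap<>();
    kafkaParams.put("metadata.broker.list", "localhost:9092");
    Set<String> topics = Collections.singleton("raw_events");

    JavaPairInputDStream<String, String> stream = KafkaUtils.createDirectStream(
        jssc, String.class, String.class, StringDecoder.class,
        StringDecoder.class, kafkaParams, topics);

    stream.foreachRDD(rdd -> {
      // The direct stream's RDDs know their per-partition offset ranges.
      OffsetRange[] ranges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();

      rdd.foreachPartition(records -> {
        OffsetRange range = ranges[TaskContext.get().partitionId()];
        try (Connection c =
                 DriverManager.getConnection("jdbc:postgresql://localhost/app")) {
          c.setAutoCommit(false);
          // Upserts make replays idempotent (Postgres 9.5+ ON CONFLICT).
          PreparedStatement up = c.prepareStatement(
              "insert into events (key, payload) values (?, ?) "
                  + "on conflict (key) do update set payload = excluded.payload");
          while (records.hasNext()) {
            scala.Tuple2<String, String> kv = records.next();
            up.setString(1, kv._1());
            up.setString(2, kv._2());
            up.executeUpdate();
          }
          // The offset range rides in the same transaction as the data.
          PreparedStatement off = c.prepareStatement(
              "insert into stream_offsets (topic, part, until_offset) "
                  + "values (?, ?, ?) on conflict (topic, part) "
                  + "do update set until_offset = excluded.until_offset");
          off.setString(1, range.topic());
          off.setInt(2, range.partition());
          off.setLong(3, range.untilOffset());
          off.executeUpdate();
          c.commit();
        }
      });
    });

    jssc.start();
    jssc.awaitTermination();
  }
}

On restart, the stored until_offset values would be read back and passed to
the fromOffsets overload of createDirectStream, so the stream resumes
exactly where the last committed transaction left off.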
>>>> On Thu, Sep 29, 2016 at 9:05 PM, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>>>> My concern with Postgres / Cassandra is only scalability. I will look
>>>>> further into Postgres horizontal scaling, thanks.
>>>>>
>>>>> Writes could be idempotent if done as upserts; otherwise updates will
>>>>> be idempotent, but not inserts.
>>>>>
>>>>> Data should not be lost. The system should be as fault-tolerant as
>>>>> possible.
>>>>>
>>>>> What's the advantage of using Spark for reading Kafka instead of
>>>>> direct Kafka consumers?
>>>>>
>>>>> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>>>>> I wouldn't give up the flexibility and maturity of a relational
>>>>>> database unless you have a very specific use case. I'm not trashing
>>>>>> Cassandra (I've used Cassandra), but if all I know is that you're
>>>>>> doing analytics, I wouldn't want to give up the ability to easily do
>>>>>> ad-hoc aggregations without a lot of forethought. If you're worried
>>>>>> about scaling, there are several options for horizontally scaling
>>>>>> Postgres in particular. One of the current best, from what I've
>>>>>> worked with, is Citus.
>>>>>>
>>>>>> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma <deepakmc...@gmail.com> wrote:
>>>>>>> Hi Cody,
>>>>>>> Spark direct stream is just fine for this use case.
>>>>>>> But why Postgres and not Cassandra?
>>>>>>> Is there anything specific here that I may not be aware of?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Deepak
>>>>>>>
>>>>>>> On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>>>>>>> How are you going to handle ETL failures? Do you care about lost /
>>>>>>>> duplicated data? Are your writes idempotent?
>>>>>>>>
>>>>>>>> Absent any other information about the problem, I'd stay away from
>>>>>>>> Cassandra/Flume/HDFS/HBase/whatever and use a Spark direct stream
>>>>>>>> feeding Postgres.
>>>>>>>>
>>>>>>>> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>>>>>>>> Is there an advantage to that vs directly consuming from Kafka?
>>>>>>>>> Nothing is being done to the data except some light ETL, and then
>>>>>>>>> it is stored in Cassandra.
>>>>>>>>>
>>>>>>>>> On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma <deepakmc...@gmail.com> wrote:
>>>>>>>>>> It's better to use Spark's direct stream to ingest from Kafka.
>>>>>>>>>>
>>>>>>>>>> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>>>>>>>>>> I don't think I need separate speed storage and batch storage.
>>>>>>>>>>> Just taking in raw data from Kafka, standardizing it, and
>>>>>>>>>>> storing it somewhere the web UI can query it seems like it will
>>>>>>>>>>> be enough.
>>>>>>>>>>>
>>>>>>>>>>> I'm thinking about:
>>>>>>>>>>>
>>>>>>>>>>> - Reading data from Kafka via Spark Streaming
>>>>>>>>>>> - Standardizing, then storing it in Cassandra
>>>>>>>>>>> - Querying Cassandra from the web UI
>>>>>>>>>>>
>>>>>>>>>>> That seems like it will work. My question now is whether to use
>>>>>>>>>>> Spark Streaming to read Kafka, or to use Kafka consumers
>>>>>>>>>>> directly.
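For comparison, the raw-consumer route Ali asks about might look like the
following; a sketch assuming kafka-clients 0.10 and the DataStax Java
driver, with the topic, keyspace, and column names invented for
illustration. Offsets are committed manually after the Cassandra write,
giving at-least-once delivery; since Cassandra inserts are upserts by
primary key, replays are idempotent.

// Sketch: the same light ETL as a plain Kafka consumer feeding Cassandra.
// Assumes kafka-clients 0.10 and the DataStax Java driver; topic,
// keyspace, and column names are illustrative.
import java.util.Collections;
import java.util.Properties;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class EtlWorker {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("group.id", "etl-worker");
    props.put("enable.auto.commit", "false"); // commit only after the write
    props.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
         Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
         Session session = cluster.connect("analytics")) {
      consumer.subscribe(Collections.singletonList("raw_events"));
      // Cassandra inserts are upserts by primary key, so a
      // rewind-and-replay after a crash is idempotent.
      PreparedStatement insert = session.prepare(
          "insert into events (user_id, event_time, payload) values (?, ?, ?)");

      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(1000);
        for (ConsumerRecord<String, String> r : records) {
          String cleaned = standardize(r.value()); // drop/rename fields, etc.
          session.execute(insert.bind(
              r.key(), new java.util.Date(r.timestamp()), cleaned));
        }
        consumer.commitSync(); // at-least-once: offsets trail the writes
      }
    }
  }

  // Placeholder for the real field-dropping / renaming logic.
  static String standardize(String raw) {
    return raw;
  }
}

What this route gives up, per Cody's and Deepak's comments above, is
Spark's batching, windowed aggregation, and ready-made per-batch offset
ranges; what it gains is one fewer moving part.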
>>>>>>>>>>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>>>>> - Spark Streaming to read data from Kafka
>>>>>>>>>>>> - Storing the data on HDFS using Flume
>>>>>>>>>>>>
>>>>>>>>>>>> You don't need Spark Streaming to read data from Kafka and
>>>>>>>>>>>> store it on HDFS. That is a waste of resources.
>>>>>>>>>>>>
>>>>>>>>>>>> Couple Flume to use Kafka as the source and HDFS as the sink
>>>>>>>>>>>> directly:
>>>>>>>>>>>>
>>>>>>>>>>>> KafkaAgent.sources = kafka-sources
>>>>>>>>>>>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>>>>>>>>>>>>
>>>>>>>>>>>> That will be your batch layer. To analyse it, you can read the
>>>>>>>>>>>> HDFS files directly with Spark, or simply store the data in a
>>>>>>>>>>>> database of your choice via cron or something. Do not mix your
>>>>>>>>>>>> batch layer with your speed layer.
>>>>>>>>>>>>
>>>>>>>>>>>> Your speed layer will ingest the same data directly from Kafka
>>>>>>>>>>>> into Spark Streaming, and that will be online or near real time
>>>>>>>>>>>> (defined by your window).
>>>>>>>>>>>>
>>>>>>>>>>>> Then you have a serving layer to present data from both the
>>>>>>>>>>>> speed layer (the one from Spark Streaming) and the batch layer.
>>>>>>>>>>>>
>>>>>>>>>>>> HTH
>>>>>>>>>>>>
>>>>>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>>>>>
>>>>>>>>>>>> On 29 September 2016 at 15:15, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>>>>>>>>>>>> The web UI is actually the speed layer; it needs to be able to
>>>>>>>>>>>>> query the data online and show the results in real time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> It also needs a custom front-end, so a system like Tableau
>>>>>>>>>>>>> can't be used; it must have a custom backend + front-end.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for the recommendation of Flume. Do you think this will
>>>>>>>>>>>>> work:
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Spark Streaming to read data from Kafka
>>>>>>>>>>>>> - Storing the data on HDFS using Flume
>>>>>>>>>>>>> - Using Spark to query the data in the backend of the web UI?
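Filled out, the Flume agent Mich sketches above might look roughly like
this. The property names follow the Flume 1.6 Kafka source and HDFS sink
documentation; the channel, ZooKeeper address, topic, and HDFS path are
illustrative assumptions.

# Sketch of a complete Kafka -> HDFS agent (Flume 1.6 property names;
# channel, ZooKeeper address, topic, and path are illustrative).
KafkaAgent.sources = kafka-sources
KafkaAgent.channels = mem-channel
KafkaAgent.sinks = hdfs-sinks

KafkaAgent.sources.kafka-sources.type = org.apache.flume.source.kafka.KafkaSource
KafkaAgent.sources.kafka-sources.zookeeperConnect = localhost:2181
KafkaAgent.sources.kafka-sources.topic = raw_events
KafkaAgent.sources.kafka-sources.channels = mem-channel

KafkaAgent.channels.mem-channel.type = memory
KafkaAgent.channels.mem-channel.capacity = 10000

KafkaAgent.sinks.hdfs-sinks.type = hdfs
KafkaAgent.sinks.hdfs-sinks.channel = mem-channel
KafkaAgent.sinks.hdfs-sinks.hdfs.path = hdfs://namenode/flume/kafka/%Y-%m-%d
KafkaAgent.sinks.hdfs-sinks.hdfs.useLocalTimeStamp = true
KafkaAgent.sinks.hdfs-sinks.hdfs.fileType = DataStream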
>>>>>>>>>>>>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>>>>>>> You need a batch layer and a speed layer. Data from Kafka can
>>>>>>>>>>>>>> be stored on HDFS using Flume.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Query this data to generate reports / analytics (there will
>>>>>>>>>>>>>> be a web UI which will be the front-end to the data, and will
>>>>>>>>>>>>>> show the reports)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is basically the batch layer, and you need something
>>>>>>>>>>>>>> like Tableau or Zeppelin to query the data.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You will also need Spark Streaming to query the data online
>>>>>>>>>>>>>> for the speed layer. That data could be stored in some
>>>>>>>>>>>>>> transient fabric like Ignite, or even Druid.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> HTH
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 29 September 2016 at 15:01, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>>>>>>>>>>>>>> It needs to be able to scale to a very large amount of data,
>>>>>>>>>>>>>>> yes.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma <deepakmc...@gmail.com> wrote:
>>>>>>>>>>>>>>>> What is the message inflow? If it's really high, Spark will
>>>>>>>>>>>>>>>> definitely be of great use.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>> Deepak
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <ali.rac...@gmail.com> wrote:
>>>>>>>>>>>>>>>>> I have a somewhat tricky use case, and I'm looking for
>>>>>>>>>>>>>>>>> ideas.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I have 5-6 Kafka producers, reading various APIs and
>>>>>>>>>>>>>>>>> writing their raw data into Kafka.
>>>>>>>>>>>>>>>>> I need to:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - Do ETL on the data, and standardize it.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - Store the standardized data somewhere (HBase / Cassandra
>>>>>>>>>>>>>>>>> / raw HDFS / Elasticsearch / Postgres)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - Query this data to generate reports / analytics (there
>>>>>>>>>>>>>>>>> will be a web UI which will be the front-end to the data,
>>>>>>>>>>>>>>>>> and will show the reports)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Java is being used as the backend language for everything
>>>>>>>>>>>>>>>>> (the backend of the web UI, as well as the ETL layer).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm considering:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the
>>>>>>>>>>>>>>>>> ETL layer (receive raw data from Kafka, standardize it &
>>>>>>>>>>>>>>>>> store it)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - Using Cassandra, HBase, or raw HDFS for storing the
>>>>>>>>>>>>>>>>> standardized data, and to allow queries
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - In the backend of the web UI, either using Spark to run
>>>>>>>>>>>>>>>>> queries across the data (mostly filters), or directly
>>>>>>>>>>>>>>>>> running queries against Cassandra / HBase
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'd appreciate some thoughts / suggestions on which of
>>>>>>>>>>>>>>>>> these alternatives I should go with (e.g. using raw Kafka
>>>>>>>>>>>>>>>>> consumers vs Spark for ETL, which persistent data store to
>>>>>>>>>>>>>>>>> use, and how to query that data store in the backend of
>>>>>>>>>>>>>>>>> the web UI, for displaying the reports).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks.
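On the last point: if the sink ends up being Postgres, the mostly-filter
reports Ali describes don't need Spark in the web UI backend at all; a
parameterized JDBC query in the shape of his earlier pseudo-SQL is enough.
A sketch, with the table and column names assumed, not taken from any
actual schema in the thread:

// Sketch: the per-user report query over JDBC; table and column names
// are assumptions modeled on the pseudo-SQL earlier in the thread.
import java.sql.Connection;
import java.sql.Date;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.time.LocalDate;

public class ReportDao {
  public static void printReport(String userId, LocalDate start,
                                 LocalDate end, String someField)
      throws Exception {
    try (Connection c =
             DriverManager.getConnection("jdbc:postgresql://localhost/app");
         PreparedStatement ps = c.prepareStatement(
             "select event_date, payload from data "
                 + "where user_id = ? and event_date between ? and ? "
                 + "and some_field = ?")) {
      ps.setString(1, userId);
      ps.setDate(2, Date.valueOf(start));
      ps.setDate(3, Date.valueOf(end));
      ps.setString(4, someField);
      try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
          System.out.println(rs.getDate("event_date") + " "
              + rs.getString("payload"));
        }
      }
    }
  }
}

The cross-service aggregate is then an ordinary group by over the same
table, which is exactly the ad-hoc flexibility Cody's relational-database
recommendation is trading on.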