Hi

How you store state/intermediate data in realtime processing depends on the
throughput and latency your application requires. There are a lot of
technologies that can help you build this realtime datastore. Some examples
include HBase, MemSQL, etc., or in some cases an RDBMS like MySQL itself.
This is a judgement call that you will have to make.
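Whichever store you pick, the shape of the intermediate state discussed in this thread is the same: a running total keyed by the dimensions you group on. A minimal in-memory sketch (plain Python, using the field names from the thread; the real backing store would be HBase, MemSQL, Redis, etc., not a dict):

```python
from collections import defaultdict

# In-memory stand-in for the "Stats" table discussed below:
# key = (date, branch_id, product_id), value = [total_qty, total_dollar].
stats = defaultdict(lambda: [0, 0.0])

def record_transaction(date, branch_id, product_id, qty, price):
    """Fold one transaction into the running totals."""
    entry = stats[(date, branch_id, product_id)]
    entry[0] += qty
    entry[1] += qty * price

record_transaction("2016-03-13", "B1", "P1", 2, 10.0)
record_transaction("2016-03-13", "B1", "P1", 3, 10.0)
# stats[("2016-03-13", "B1", "P1")] is now [5, 50.0]
```

The dashboard then only ever reads these small pre-aggregated rows instead of scanning raw transactions.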

Regards,
Skanda

On Sun, Mar 13, 2016 at 11:23 PM, trung kien <kient...@gmail.com> wrote:

> Thanks all for actively sharing your experience.
>
> @Chris: using something like Redis is something I am trying to figure out.
> I have a lot of transactions, so I couldn't trigger an update event for
> every single transaction.
> I'm looking at Spark Streaming because it provides batch processing (e.g. I
> can update the cache every 5 seconds). In addition, Spark can scale pretty
> well and I don't have to worry about losing data.
>
> Now the cache has the following information:
>              * Date
>              * BranchID
>              * ProductID
>              TotalQty
>              TotalDollar
>
> (* marks a key column; note that I have history data as well, by day.)
>
> Now I want to use Zeppelin for querying against the cache (while the cache
> is updating).
> I don't need Zeppelin to update automatically (I can hit the run button
> myself :) )
> Just curious whether Parquet is the right solution for us?
>
>
>
> On Sun, Mar 13, 2016 at 3:25 PM, Chris Miller <cmiller11...@gmail.com>
> wrote:
>
>> Cool! Thanks for sharing.
>>
>>
>> --
>> Chris Miller
>>
>> On Sun, Mar 13, 2016 at 12:53 AM, Todd Nist <tsind...@gmail.com> wrote:
>>
>>> Below is a link to an example which Silvio Fiorito put together
>>> demonstrating how to link Zeppelin with Spark Streaming for real-time
>>> charts. I think the original thread was back in early November 2015,
>>> subject: "Real time chart in Zeppelin", if you care to try to find it.
>>>
>>> https://gist.github.com/granturing/a09aed4a302a7367be92
>>>
>>> HTH.
>>>
>>> -Todd
>>>
>>> On Sat, Mar 12, 2016 at 6:21 AM, Chris Miller <cmiller11...@gmail.com>
>>> wrote:
>>>
>>>> I'm pretty new to all of this stuff, so bear with me.
>>>>
>>>> Zeppelin isn't really intended for realtime dashboards as far as I
>>>> know. Its reporting features (tables, graphs, etc.) are more for
>>>> displaying the results of something you run. As far as I know, there
>>>> isn't really anything to "watch" a dataset and have updates pushed to
>>>> the Zeppelin UI.
>>>>
>>>> As for Spark, unless you're doing a lot of processing that you didn't
>>>> mention here, I don't think it's a good fit just for this.
>>>>
>>>> If it were me (just off the top of my head), I'd just build a simple
>>>> web service that uses websockets to push updates to the client which could
>>>> then be used to update graphs, tables, etc. The data itself -- that is, the
>>>> accumulated totals -- you could store in something like Redis. When an
>>>> order comes in, just add that quantity and price to the existing value and
>>>> trigger your code to push out an updated value to any clients via the
>>>> websocket. You could use something like a Redis pub/sub channel to trigger
>>>> the web app to notify clients of an update.
>>>>
>>>> There are about 5 million other ways you could design this, but I would
>>>> just keep it as simple as possible. I just threw one idea out...
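The accumulate-and-notify flow Chris describes can be sketched in-process. Here a plain Python callback list stands in for the Redis pub/sub channel and the "subscribers" stand in for connected websocket clients; both stand-ins are assumptions for illustration, not real Redis or websocket code:

```python
# totals accumulates per (branch, product), like the Redis values Chris
# mentions; subscribers play the role of websocket clients to notify.
totals = {}       # (branch_id, product_id) -> [total_qty, total_dollar]
subscribers = []  # callbacks invoked on every update ("pub/sub" stand-in)

def subscribe(callback):
    subscribers.append(callback)

def on_order(branch_id, product_id, qty, price):
    # Add the incoming quantity and dollar amount to the existing value...
    entry = totals.setdefault((branch_id, product_id), [0, 0.0])
    entry[0] += qty
    entry[1] += qty * price
    # ...then "publish" the updated value to every client.
    for notify in subscribers:
        notify((branch_id, product_id), list(entry))

received = []
subscribe(lambda key, value: received.append((key, value)))
on_order("B1", "P1", 2, 5.0)
on_order("B1", "P1", 1, 5.0)
# received now ends with (("B1", "P1"), [3, 15.0])
```

In a real deployment, `on_order` would run in the web service, the totals would live in Redis, and the notify loop would be a Redis `PUBLISH` picked up by the websocket server.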
>>>>
>>>> Good luck.
>>>>
>>>>
>>>> --
>>>> Chris Miller
>>>>
>>>> On Sat, Mar 12, 2016 at 6:58 PM, trung kien <kient...@gmail.com> wrote:
>>>>
>>>>> Thanks Chris and Mich for replying.
>>>>>
>>>>> Sorry for not explaining my problem clearly. Yes, I am talking about a
>>>>> flexible dashboard when I mention Zeppelin.
>>>>>
>>>>> Here is the problem I am having:
>>>>>
>>>>> I am running a commercial website where we sell many products and we
>>>>> have many branches in many places. We have a lot of realtime
>>>>> transactions and want to analyze them in realtime.
>>>>>
>>>>> We don't want to aggregate every single transaction (each transaction
>>>>> has BranchID, ProductID, Qty, Price) every time we do analytics. So we
>>>>> maintain intermediate data which contains: BranchID, ProductID,
>>>>> totalQty, totalDollar
>>>>>
>>>>> Ideally, we have 2 tables:
>>>>>    Transaction (BranchID, ProductID, Qty, Price, Timestamp)
>>>>>
>>>>> And an intermediate table, Stats, which is just the sum of every
>>>>> transaction grouped by BranchID and ProductID (I am using Spark
>>>>> Streaming to calculate this table in realtime).
>>>>>
>>>>> My thinking is that doing statistics (a realtime dashboard) on the
>>>>> Stats table is much easier, and this table is also not too hard to
>>>>> maintain.
>>>>>
>>>>> I'm just wondering, what's the best way to store the Stats table (a
>>>>> database or a Parquet file?)
>>>>> What exactly are you trying to do? Zeppelin is for interactive
>>>>> analysis of a dataset. What do you mean by "realtime analytics" -- do
>>>>> you mean build a report or dashboard that automatically updates as new
>>>>> data comes in?
>>>>>
>>>>>
>>>>> --
>>>>> Chris Miller
>>>>>
>>>>> On Sat, Mar 12, 2016 at 3:13 PM, trung kien <kient...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I've just viewed some of Zeppelin's videos. The integration between
>>>>>> Zeppelin and Spark is really amazing and I want to use it for my
>>>>>> application.
>>>>>>
>>>>>> In my app, I will have a Spark Streaming app to do some basic
>>>>>> realtime aggregation (intermediate data). Then I want to use Zeppelin
>>>>>> to do some realtime analytics on the intermediate data.
>>>>>>
>>>>>> My question is: what's the most efficient storage engine to store
>>>>>> realtime intermediate data? Is a Parquet file somewhere suitable?
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
>
> --
> Thanks
> Kien
>
