OK, like any work, you need to start this from a simple model. Take one bus only (identified by bus number, which is unique).
For any bus number N you have two logs, LOG A and LOG B, plus LOG C, the coordinator log from the central computer that sends the estimated time of arrival (ETA) to the bus stops. Pretty simple. What is the difference in timestamp between LOG A and LOG B for bus N? Are they the same?

Your window for a given bus would run from a start time (departure from the station; deterministic, already known) to an end time (round complete). The end time could be, say, the start time + 1 hour, or longer.

val ssc = new StreamingContext(sparkConf, Seconds(xx))

Then it is pretty simple. You need to work out:

val windowLength = xx
val slidingInterval = yy

For each bus you have two topics (LOG A and LOG B) plus LOG C, which you need to update based on LOG A and LOG B.

Start from this simple heuristic model first.

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


On 29 April 2016 at 09:54, Esa Heikkinen <esa.heikki...@student.tut.fi> wrote:

> Hi
>
> I will try to explain my case.
>
> The situation is not so simple in my logs and solution. There are also many types of logs, and they come from many sources. They are in CSV format, and the header line includes the names of the columns.
>
> This is a simplified description of the input logs.
>
> LOG A's: bus coordinate logs (every bus has its own log):
> - timestamp
> - bus number
> - coordinates
>
> LOG B: bus login/logout (to/from line) message log:
> - timestamp
> - bus number
> - line number
>
> LOG C: log from central computers:
> - timestamp
> - bus number
> - bus stop number
> - estimated arrival time to bus stop
>
> LOG A is updated every 30 seconds (I also have another system with a 1-second interval). LOG B is updated when a bus starts from the terminal bus stop and arrives at the final bus stop on a line. LOG C is updated when the central computer sends a new arrival time estimate to a bus stop.
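The log layouts described above are plain CSV with a header line, so they can be sketched in a few lines of plain Python (not Spark code; the exact column names and the sample rows below are assumptions based on the descriptions, not the real log schemas):

```python
import csv
import io
from datetime import datetime

def parse_log(text):
    """Parse a CSV log whose header line names the columns, as described above."""
    return list(csv.DictReader(io.StringIO(text)))

# Hypothetical sample of LOG C, following the described column layout.
log_c = parse_log(
    "timestamp,bus_number,bus_stop_number,estimated_arrival\n"
    "2016-04-29 10:00:00,42,7,2016-04-29 10:15:00\n"
    "2016-04-29 10:10:00,42,7,2016-04-29 10:17:00\n"
)

def last_estimate(records, bus, stop):
    """Latest ETA sent by the central computer for one bus at one stop."""
    rows = [r for r in records
            if r["bus_number"] == bus and r["bus_stop_number"] == stop]
    return max(rows, key=lambda r: r["timestamp"])["estimated_arrival"]

def estimation_error_seconds(estimated, actual):
    """Error between the estimated and the real arrival time, in seconds."""
    fmt = "%Y-%m-%d %H:%M:%S"
    return (datetime.strptime(actual, fmt)
            - datetime.strptime(estimated, fmt)).total_seconds()

eta = last_estimate(log_c, "42", "7")
err = estimation_error_seconds(eta, "2016-04-29 10:19:30")
```

With the sample rows above, the last estimate for bus 42 at stop 7 is 10:17:00, and an actual arrival at 10:19:30 gives a 150-second error; the same two helpers cover the "last estimated arrival time vs. real arrival" check that is the point of the analysis.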
>
> I also need metadata for the logs (and the analyzer), for example the coordinates of the bus stop areas.
>
> The main purpose of the analysis is to check the accuracy (error) of the estimated arrival times to the bus stops.
>
> Because there are many buses and lines, it is too time-consuming to check all of them, so I check only specific lines with specific bus stops. There are many buses (logged in to lines) coming to one bus stop, and I am interested in only a certain bus.
>
> To do that, I have to read the logs partly out of time order (upstream), in this sequence:
> 1. From LOG C, the bus number is searched.
> 2. From LOG A, the time when the bus left the terminal bus stop is searched.
> 3. From LOG B, the time when the bus sent a login to the line is searched.
> 4. From LOG A, the time when the bus arrived at the bus stop is searched.
> 5. From LOG C, the last estimated arrival time to the bus stop is searched, and the error between the real and estimated values is calculated.
>
> In my understanding, (almost) all log file analyzers read all data (lines) in time order from the log files. My need is only for a specific part of the log (lines). To achieve that, my solution is to read the logs in an arbitrary order (within a given time window).
>
> I know this solution is not suitable for all cases (for example, for very fast analysis or very big data). It is suitable for very complex (targeted) analysis. It can be too slow and memory-consuming, but well-done pre-processing of the log data can help a lot.
>
> ---
> Esa Heikkinen
>
>
> On 28.4.2016, 14:44, Michael Segel wrote:
>
> I don't.
>
> I believe that there have been a couple of hack-a-thons, like one done in Chicago a few years back, using public transportation data.
>
> The first question is what sort of data do you get from the city?
>
> I mean, it could be as simple as time_stamp, bus_id, route and GPS (x,y). Or they could provide more information, like last stop, distance to next stop, avg current velocity…
>
> Then there is the frequency of the updates.
> Every second? Every 3 seconds? 5 or 6 seconds…
>
> This will determine how much work you have to do.
>
> Maybe they provide the routes of the buses via a different API call, since that data is relatively static.
>
> This will drive your solution more than the underlying technology.
>
> Oh, and while I focused on buses, there are also other modes of public transportation: light rail, trains, etc.
>
> HTH
>
> -Mike
>
> On Apr 28, 2016, at 4:10 AM, Esa Heikkinen <esa.heikki...@student.tut.fi> wrote:
>
> Do you know any good examples of how to use Spark Streaming in tracking public transportation systems?
>
> Or Storm, or some other tool example?
>
> Regards
> Esa Heikkinen
>
> On 28.4.2016, 3:16, Michael Segel wrote:
>
> Uhm… I think you need to clarify a couple of things…
>
> First, there is this thing called analog signal processing… Is that continuous enough for you?
>
> But more to the point, Spark Streaming does micro-batching, so if you're processing a continuous stream of tick data, you will have more than 50K ticks per second while the markets are open and trading. Even at 50K a second, that means 1 tick every 0.02 ms, or 50 ticks a ms.
>
> And you don't want to wait until you have a batch to start processing; you want to process when the data hits the queue and pull it from the queue as quickly as possible.
>
> Spark Streaming can pull batches in as little as 500ms. So if you pull a batch at t0 and immediately have a tick in your queue, you won't process that data until t0+500ms. And said batch would contain 25,000 entries.
>
> Depending on what you are doing… that 500ms delay can be enough to be fatal to your trading process.
>
> If you don't like stock data, there are other examples, mainly when pulling data from real-time embedded systems.
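The micro-batch arithmetic in the tick-data example above is easy to sanity-check (the numbers are taken straight from the thread):

```python
ticks_per_second = 50_000
batch_interval_ms = 500  # the smallest Spark Streaming batch interval discussed here

# At 50K ticks/s there is one tick every 0.02 ms, i.e. 50 ticks per millisecond.
ms_per_tick = 1000 / ticks_per_second   # 0.02
ticks_per_ms = ticks_per_second / 1000  # 50.0

# A 500 ms micro-batch therefore holds 25,000 ticks, and a tick arriving
# just after a batch is pulled waits up to the full interval before processing.
ticks_per_batch = int(ticks_per_second * batch_interval_ms / 1000)  # 25000
worst_case_wait_ms = batch_interval_ms                              # 500
```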
>
> If you go back and read what I said: if your data flow is >> (much slower than) 500ms, and/or the time to process is >> 500ms (much longer), you could use Spark Streaming. If not… and there are applications which require that type of speed… then you shouldn't use Spark Streaming.
>
> If you do have that constraint, then you can look at systems like Storm/Flink/Samza/whatever, where you have a continuous queue and listener and no micro-batch delays. Then for each bolt (Storm) you can have a Spark context for processing the data. (Depending on what sort of processing you want to do.)
>
> To put this in perspective… if you're using Spark Streaming / Akka / Storm / etc. to handle real-time requests from the web, 500ms of added delay can be a long time.
>
> Choose the right tool.
>
> For the OP's problem: sure, tracking public transportation could be done using Spark Streaming. It could also be done using half a dozen other tools, because the rate of data generation is much slower than 500ms.
>
> HTH
>
> On Apr 27, 2016, at 4:34 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> A couple of things.
>
> There is no such thing as continuous data streaming, just as there is no such thing as continuous availability.
>
> There is such a thing as discrete data streaming and high availability, but they only reduce the finite unavailability to a minimum. In terms of business needs, 5 SIGMA is good enough and acceptable. Even candles set to a predefined time interval, say 2, 4 or 15 seconds, overlap. No FX-savvy trader makes a sell or buy decision on the basis of a 2-second candlestick.
>
> The calculation itself in measurements is subject to finite error, as defined by its confidence level (CL) using the standard deviation function.
>
> So far I have never noticed a tool that requires that level of granularity. That stuff from Flink etc. is, in practical terms, of little value and does not make commercial sense.
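The ">> 500ms" rule of thumb from earlier in the thread can be written as a tiny helper. This is only a sketch: the 10x factor standing in for "much greater" is my own assumption, not something stated in the thread.

```python
def micro_batching_is_adequate(event_interval_ms, processing_ms,
                               batch_interval_ms=500):
    """Rule of thumb from the thread: Spark Streaming's micro-batches are fine
    when events arrive much more slowly than the batch interval, and/or each
    event takes much longer than the batch interval to process."""
    much = 10  # "much greater" taken as 10x; an assumption, not from the thread
    return (event_interval_ms >= much * batch_interval_ms
            or processing_ms >= much * batch_interval_ms)

# A bus position update every 30 s easily clears the bar...
assert micro_batching_is_adequate(event_interval_ms=30_000, processing_ms=100)
# ...while 50K ticks/s (0.02 ms apart) with sub-ms processing does not.
assert not micro_batching_is_adequate(event_interval_ms=0.02, processing_ms=0.5)
```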
>
> Now, with regard to your needs, Spark micro-batching is perfectly adequate.
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
>
> On 27 April 2016 at 22:10, Esa Heikkinen <esa.heikki...@student.tut.fi> wrote:
>
>> Hi
>>
>> Thanks for the answer.
>>
>> I have developed a log file analyzer for an RTPIS (Real Time Passenger Information System), where buses drive lines and the system tries to estimate the arrival times to the bus stops. There are many different log files (and events), and the analysis situation can be very complex. Spatial data can also be included in the log data.
>>
>> The analyzer also has a query (or analysis) language, which describes an expected behavior. This can be a requirement of the system. The analyzer can also be thought of as a test oracle.
>>
>> I have published a paper (at the SPLST'15 conference) about my analyzer and its language. The paper is maybe too technical, but it can be found at: http://ceur-ws.org/Vol-1525/paper-19.pdf
>>
>> I do not know yet where it belongs. I think it can be some kind of "CEP with delays". Or do you know better?
>> My analyzer can also do somewhat more complex and time-consuming analyses. There is no need for real time.
>>
>> And is it possible to do "CEP with delays" reasonably with some existing analyzer (for example Spark)?
>>
>> Regards
>> PhD student at Tampere University of Technology, Finland, www.tut.fi
>> Esa Heikkinen
>>
>> On 27.4.2016, 15:51, Michael Segel wrote:
>>
>> Spark and CEP? It depends…
>>
>> OK, I know that's not the answer you want to hear, but it's a bit more complicated…
>>
>> If you consider Spark Streaming, you have some issues. Spark Streaming isn't a real-time solution because it is a micro-batch solution.
>> The smallest window is 500ms. This means that if your compute time is >> 500ms and/or your event flow is >> 500ms, this could work. (E.g., 'real-time' image processing on a system that is capturing 60 FPS, where the processing time is >> 500ms.)
>>
>> So Spark Streaming wouldn't be the best solution…
>>
>> However, you can combine Spark with other technologies like Storm, Akka, etc., where you have continuous streaming. So you could instantiate a Spark context per worker in Storm…
>>
>> I think if there are no class collisions between Akka and Spark, you could use Akka, which may have better potential for communication between workers. So there you can handle CEP events.
>>
>> HTH
>>
>> On Apr 27, 2016, at 7:03 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>> Please see my other reply.
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> On 27 April 2016 at 10:40, Esa Heikkinen <esa.heikki...@student.tut.fi> wrote:
>>
>>> Hi
>>>
>>> I have followed with interest the discussion about CEP and Spark. It is quite close to my research, which is complex analysis of log files and "history" data (not actually real-time streams).
>>>
>>> I have a few questions:
>>>
>>> 1) Is CEP only for (real-time) stream data, and not for "history" data?
>>>
>>> 2) Is it possible to search "backward" (upstream) by CEP within a given time window, if the start time of the time window is earlier than the current stream time?
>>>
>>> 3) Do you know any good tools or software for "CEP" used on log data?
>>>
>>> 4) Do you know any good (scientific) papers I should read about CEP?
>>>
>>> Regards
>>> PhD student at Tampere University of Technology, Finland, www.tut.fi
>>> Esa Heikkinen
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>> The opinions expressed here are mine; while they may reflect a cognitive thought, that is purely accidental. Use at your own risk.
>> Michael Segel
>> michael_segel (AT) hotmail.com