OK, like any work, you need to start this from a simple model. Take one bus only (identified by bus number, which is unique).
For any bus number N you have two logs, LOG A and LOG B, plus LOG C, the coordinator log from the central computer that sends the estimated time of arrival (ETA) to the bus stops. Pretty simple. What is the difference in timestamp between LOG A and LOG B for bus N? Are they the same?

Your window for a given bus would run from a start time (departure from the station; deterministic, already known) to an end time (round complete). The end time could be, say, the start time + 1 hour, or longer.

val ssc = new StreamingContext(sparkConf, Seconds(xx))

Then it is pretty simple. You need to work out:

val windowLength = xx
val slidingInterval = yy

For each bus you have two topics (LOG A and LOG B) plus LOG C, which you need to update based on LOG A and LOG B.

Start from this simple heuristic model first.

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


On 29 April 2016 at 09:54, Esa Heikkinen <esa.heikki...@student.tut.fi> wrote:

> Hi
>
> I will try to explain my case.
>
> The situation is not so simple in my logs and solution. There are also many types of logs, and they come from many sources. They are in CSV format, and the header line includes the names of the columns.
>
> This is a simplified description of the input logs.
>
> LOG A's: bus coordinate logs (every bus has its own log):
> - timestamp
> - bus number
> - coordinates
>
> LOG B: bus login/logout (to/from line) message log:
> - timestamp
> - bus number
> - line number
>
> LOG C: log from central computers:
> - timestamp
> - bus number
> - bus stop number
> - estimated arrival time to bus stop
>
> LOG A is updated every 30 seconds (I also have another system with a 1-second interval). LOG B is updated when a bus starts from the terminal bus stop and arrives at the final bus stop on a line. LOG C is updated when the central computer sends a new arrival time estimate to a bus stop.
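The log layouts described above are plain CSV with a header line, so they can be sketched in a few lines of plain Python (not Spark code; the exact column names and the sample rows below are assumptions based on the descriptions, not the real log schemas):

```python
import csv
import io
from datetime import datetime

def parse_log(text):
    """Parse a CSV log whose header line names the columns, as described above."""
    return list(csv.DictReader(io.StringIO(text)))

# Hypothetical sample of LOG C, following the described column layout.
log_c = parse_log(
    "timestamp,bus_number,bus_stop_number,estimated_arrival\n"
    "2016-04-29 10:00:00,42,7,2016-04-29 10:15:00\n"
    "2016-04-29 10:10:00,42,7,2016-04-29 10:17:00\n"
)

def last_estimate(records, bus, stop):
    """Latest ETA sent by the central computer for one bus at one stop."""
    rows = [r for r in records
            if r["bus_number"] == bus and r["bus_stop_number"] == stop]
    return max(rows, key=lambda r: r["timestamp"])["estimated_arrival"]

def estimation_error_seconds(estimated, actual):
    """Error between the estimated and the real arrival time, in seconds."""
    fmt = "%Y-%m-%d %H:%M:%S"
    return (datetime.strptime(actual, fmt)
            - datetime.strptime(estimated, fmt)).total_seconds()

eta = last_estimate(log_c, "42", "7")
err = estimation_error_seconds(eta, "2016-04-29 10:19:30")
```

With the sample rows above, the last estimate for bus 42 at stop 7 is 10:17:00, and an actual arrival at 10:19:30 gives a 150-second error; the same two helpers cover the "last estimated arrival time vs. real arrival" check that is the point of the analysis.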
>
> I also need metadata for the logs (and the analyzer), for example the coordinates of the bus stop areas.
>
> The main purpose of the analysis is to check the accuracy (error) of the estimated arrival times to the bus stops.
>
> Because there are many buses and lines, it is too time-consuming to check all of them, so I check only specific lines with specific bus stops. There are many buses (logged in to lines) coming to one bus stop, and I am interested in only a certain bus.
>
> To do that, I have to read the logs partly out of time order (upstream), in this sequence:
> 1. From LOG C, the bus number is searched.
> 2. From LOG A, the time when the bus left the terminal bus stop is searched.
> 3. From LOG B, the time when the bus sent a login to the line is searched.
> 4. From LOG A, the time when the bus arrived at the bus stop is searched.
> 5. From LOG C, the last estimated arrival time to the bus stop is searched, and the error between the real and estimated values is calculated.
>
> In my understanding, (almost) all log file analyzers read all data (lines) in time order from the log files. My need is only for a specific part of the log (lines). To achieve that, my solution is to read the logs in an arbitrary order (within a given time window).
>
> I know this solution is not suitable for all cases (for example, for very fast analysis or very big data). It is suitable for very complex (targeted) analysis. It can be too slow and memory-consuming, but well-done pre-processing of the log data can help a lot.
>
> ---
> Esa Heikkinen
>
>
> On 28.4.2016, 14:44, Michael Segel wrote:
>
> I don't.
>
> I believe that there have been a couple of hack-a-thons, like one done in Chicago a few years back, using public transportation data.
>
> The first question is what sort of data do you get from the city?
>
> I mean, it could be as simple as time_stamp, bus_id, route and GPS (x,y). Or they could provide more information, like last stop, distance to next stop, avg current velocity…
>
> Then there is the frequency of the updates.
> Every second? Every 3 seconds? 5 or 6 seconds…
>
> This will determine how much work you have to do.
>
> Maybe they provide the routes of the buses via a different API call, since that data is relatively static.
>
> This will drive your solution more than the underlying technology.
>
> Oh, and while I focused on buses, there are also other modes of public transportation: light rail, trains, etc.
>
> HTH
>
> -Mike
>
> On Apr 28, 2016, at 4:10 AM, Esa Heikkinen <esa.heikki...@student.tut.fi> wrote:
>
> Do you know any good examples of how to use Spark Streaming in tracking public transportation systems?
>
> Or Storm, or some other tool example?
>
> Regards
> Esa Heikkinen
>
> On 28.4.2016, 3:16, Michael Segel wrote:
>
> Uhm… I think you need to clarify a couple of things…
>
> First, there is this thing called analog signal processing… Is that continuous enough for you?
>
> But more to the point, Spark Streaming does micro-batching, so if you're processing a continuous stream of tick data, you will have more than 50K ticks per second while the markets are open and trading. Even at 50K a second, that means 1 tick every 0.02 ms, or 50 ticks a ms.
>
> And you don't want to wait until you have a batch to start processing; you want to process when the data hits the queue and pull it from the queue as quickly as possible.
>
> Spark Streaming can pull batches in as little as 500ms. So if you pull a batch at t0 and immediately have a tick in your queue, you won't process that data until t0+500ms. And said batch would contain 25,000 entries.
>
> Depending on what you are doing… that 500ms delay can be enough to be fatal to your trading process.
>
> If you don't like stock data, there are other examples, mainly when pulling data from real-time embedded systems.
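The micro-batch arithmetic in the tick-data example above is easy to sanity-check (the numbers are taken straight from the thread):

```python
ticks_per_second = 50_000
batch_interval_ms = 500  # the smallest Spark Streaming batch interval discussed here

# At 50K ticks/s there is one tick every 0.02 ms, i.e. 50 ticks per millisecond.
ms_per_tick = 1000 / ticks_per_second   # 0.02
ticks_per_ms = ticks_per_second / 1000  # 50.0

# A 500 ms micro-batch therefore holds 25,000 ticks, and a tick arriving
# just after a batch is pulled waits up to the full interval before processing.
ticks_per_batch = int(ticks_per_second * batch_interval_ms / 1000)  # 25000
worst_case_wait_ms = batch_interval_ms                              # 500
```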
>
> If you go back and read what I said: if your data flow is >> (much slower than) 500ms, and/or the time to process is >> 500ms (much longer), you could use Spark Streaming. If not… and there are applications which require that type of speed… then you shouldn't use Spark Streaming.
>
> If you do have that constraint, then you can look at systems like Storm/Flink/Samza/whatever, where you have a continuous queue and listener and no micro-batch delays. Then for each bolt (Storm) you can have a Spark context for processing the data. (Depending on what sort of processing you want to do.)
>
> To put this in perspective… if you're using Spark Streaming / Akka / Storm / etc. to handle real-time requests from the web, 500ms of added delay can be a long time.
>
> Choose the right tool.
>
> For the OP's problem: sure, tracking public transportation could be done using Spark Streaming. It could also be done using half a dozen other tools, because the rate of data generation is much slower than 500ms.
>
> HTH
>
> On Apr 27, 2016, at 4:34 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> A couple of things.
>
> There is no such thing as continuous data streaming, just as there is no such thing as continuous availability.
>
> There is such a thing as discrete data streaming and high availability, but they only reduce the finite unavailability to a minimum. In terms of business needs, 5 SIGMA is good enough and acceptable. Even candles set to a predefined time interval, say 2, 4 or 15 seconds, overlap. No FX-savvy trader makes a sell or buy decision on the basis of a 2-second candlestick.
>
> The calculation itself in measurements is subject to finite error, as defined by its confidence level (CL) using the standard deviation function.
>
> So far I have never noticed a tool that requires that level of granularity. That stuff from Flink etc. is, in practical terms, of little value and does not make commercial sense.
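The ">> 500ms" rule of thumb from earlier in the thread can be written as a tiny helper. This is only a sketch: the 10x factor standing in for "much greater" is my own assumption, not something stated in the thread.

```python
def micro_batching_is_adequate(event_interval_ms, processing_ms,
                               batch_interval_ms=500):
    """Rule of thumb from the thread: Spark Streaming's micro-batches are fine
    when events arrive much more slowly than the batch interval, and/or each
    event takes much longer than the batch interval to process."""
    much = 10  # "much greater" taken as 10x; an assumption, not from the thread
    return (event_interval_ms >= much * batch_interval_ms
            or processing_ms >= much * batch_interval_ms)

# A bus position update every 30 s easily clears the bar...
assert micro_batching_is_adequate(event_interval_ms=30_000, processing_ms=100)
# ...while 50K ticks/s (0.02 ms apart) with sub-ms processing does not.
assert not micro_batching_is_adequate(event_interval_ms=0.02, processing_ms=0.5)
```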
>
> Now, with regard to your needs, Spark micro-batching is perfectly adequate.
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
>
> On 27 April 2016 at 22:10, Esa Heikkinen <esa.heikki...@student.tut.fi> wrote:
>
>> Hi
>>
>> Thanks for the answer.
>>
>> I have developed a log file analyzer for an RTPIS (Real Time Passenger Information System), where buses drive lines and the system tries to estimate the arrival times to the bus stops. There are many different log files (and events), and the analysis situation can be very complex. Spatial data can also be included in the log data.
>>
>> The analyzer also has a query (or analysis) language, which describes an expected behavior. This can be a requirement of the system. The analyzer can also be thought of as a test oracle.
>>
>> I have published a paper (at the SPLST'15 conference) about my analyzer and its language. The paper is maybe too technical, but it can be found at: http://ceur-ws.org/Vol-1525/paper-19.pdf
>>
>> I do not know yet where it belongs. I think it can be some kind of "CEP with delays". Or do you know better?
>> My analyzer can also do somewhat more complex and time-consuming analyses. There is no need for real time.
>>
>> And is it possible to do "CEP with delays" reasonably with some existing analyzer (for example Spark)?
>>
>> Regards
>> PhD student at Tampere University of Technology, Finland, www.tut.fi
>> Esa Heikkinen
>>
>> On 27.4.2016, 15:51, Michael Segel wrote:
>>
>> Spark and CEP? It depends…
>>
>> OK, I know that's not the answer you want to hear, but it's a bit more complicated…
>>
>> If you consider Spark Streaming, you have some issues. Spark Streaming isn't a real-time solution because it is a micro-batch solution.
>> The smallest window is 500ms. This means that if your compute time is >> 500ms and/or your event flow is >> 500ms, this could work. (E.g., 'real-time' image processing on a system that is capturing 60 FPS, where the processing time is >> 500ms.)
>>
>> So Spark Streaming wouldn't be the best solution…
>>
>> However, you can combine Spark with other technologies like Storm, Akka, etc., where you have continuous streaming. So you could instantiate a Spark context per worker in Storm…
>>
>> I think if there are no class collisions between Akka and Spark, you could use Akka, which may have better potential for communication between workers. So there you can handle CEP events.
>>
>> HTH
>>
>> On Apr 27, 2016, at 7:03 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>> Please see my other reply.
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> On 27 April 2016 at 10:40, Esa Heikkinen <esa.heikki...@student.tut.fi> wrote:
>>
>>> Hi
>>>
>>> I have followed with interest the discussion about CEP and Spark. It is quite close to my research, which is complex analysis of log files and "history" data (not actually real-time streams).
>>>
>>> I have a few questions:
>>>
>>> 1) Is CEP only for (real-time) stream data, and not for "history" data?
>>>
>>> 2) Is it possible to search "backward" (upstream) by CEP within a given time window, if the start time of the time window is earlier than the current stream time?
>>>
>>> 3) Do you know any good tools or software for "CEP" used on log data?
>>>
>>> 4) Do you know any good (scientific) papers I should read about CEP?
>>>
>>> Regards
>>> PhD student at Tampere University of Technology, Finland, www.tut.fi
>>> Esa Heikkinen
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>> The opinions expressed here are mine; while they may reflect a cognitive thought, that is purely accidental. Use at your own risk.
>> Michael Segel
>> michael_segel (AT) hotmail.com