Perhaps you've thought of this already, but it sounds like streaming relational algebra could be a good fit here.
https://calcite.apache.org/docs/stream.html

--
Michael Mior
[email protected]

On Sun, Dec 16, 2018 at 18:39, Julian Feinauer <[email protected]> wrote:

> Hi Calcite-devs,
>
> I just had a very interesting mail exchange with Julian (Hyde) on the
> incubator list [1]. It was about our project CRUNCH (which is mostly about
> time series analysis and signal processing) and its relation to relational
> algebra, and I wanted to bring the discussion to this list to continue here.
> We already had some discussion about how time series would work in Calcite
> [2], and it's closely related to MATCH_RECOGNIZE.
>
> But I have a more general question in mind to ask the experts here on
> the list.
> I ask myself whether we can see signal processing and analysis tasks as a
> proper application of relational algebra.
> Disclaimer: I'm a mathematician, so I know the formalism of (relational)
> algebra pretty well, but I'm lacking a lot of experience and knowledge in
> database theory. Most of my knowledge there comes from Calcite's source
> code and the book by Garcia-Molina and Ullman.
>
> So if we take, for example, a stream of signals from a sensor, then we can
> of course do filtering or smoothing on it, and this can be seen as a mapping
> between the input relation and the output relation. But as we usually need
> more than just one tuple at a time, we lose many of the advantages of
> relational theory. And then, if we analyze the signal, we can again model
> it as a mapping between relations, but the input relation is a "time
> series" and the output relation consists of "events", so these are in some
> way different dimensions. In this situation it becomes mostly obvious where
> the main differences between time series and relational algebra are. Think
> of something simple: an event should be registered whenever the signal
> switches from FALSE to TRUE (so not for every TRUE). This could also be
> modelled with MATCH_RECOGNIZE pretty easily.
> But, for me, it feels "unnatural" because we cannot use any indices (we
> don't care about the ratio of TRUE and FALSE in the DB, except perhaps for
> some very rough outer bounds). And we are lacking the "right" information
> for the optimizer, like estimates of the number of analysis results.
> It gets even more complicated when moving to continuous-valued signals
> (INT, DOUBLE, ...), e.g., temperature readings or something.
> If we want to analyze the number of times where we have a temperature
> change of more than 5 degrees in under 4 hours, this should also be doable
> with MATCH_RECOGNIZE, but again there is no index to help us and we have no
> information for the optimizer, so it feels very "black box" for the
> relational algebra.
>
> I'm not sure if you get my point, but for me, the elegance of relational
> algebra was always this optimization stuff, which starts from a declarative
> query and ends in an "optimal" physical plan. And I do not see how we can
> use much of this for the examples given above.
>
> Perhaps one solution would be to do the same as for spatial queries (or
> the JSON / JSONB support in Postgres [3]): add specialized indices,
> statistics, and optimizer rules. Then this would make it more "relational
> algebra"-esque, in the sense that there really is a possibility to apply
> transformations to a given query.
>
> What do you think? Am I seeing things as too complicated, or am I missing
> something?
>
> Julian
>
> [1]
> https://lists.apache.org/thread.html/1d5a5aae1d4f5f5a966438a2850860420b674f98b0db7353e7b476f2@%3Cgeneral.incubator.apache.org%3E
> [2]
> https://lists.apache.org/thread.html/250575a56165851ab55351b90a26eaa30e84d5bbe2b31203daaaefb9@%3Cdev.calcite.apache.org%3E
> [3] https://www.postgresql.org/docs/9.4/datatype-json.html
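For concreteness, the FALSE-to-TRUE transition described in the quoted mail could be sketched with standard row pattern recognition roughly as below. This is only a sketch: the stream name `sensor_stream` and the columns `rowtime` (timestamp) and `signal` (boolean) are illustrative assumptions, not names from the thread.

```sql
-- Sketch (assumed schema): emit one event per rising edge of `signal`,
-- i.e. a row where signal = FALSE immediately followed by one where
-- signal = TRUE. A run of TRUEs produces only one event.
SELECT event_time
FROM sensor_stream
  MATCH_RECOGNIZE (
    ORDER BY rowtime
    MEASURES B.rowtime AS event_time   -- timestamp of the first TRUE row
    ONE ROW PER MATCH
    AFTER MATCH SKIP PAST LAST ROW
    PATTERN (A B)
    DEFINE
      A AS A.signal = FALSE,
      B AS B.signal = TRUE
  ) AS t;
```

As the mail notes, this is easy to express, but nothing in the pattern gives the planner cardinality estimates or an index to exploit, which is exactly the "black box" concern raised above.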
