Regarding the scenario I am highlighting to show the problem with in-band (in-line) provenance exfil, what I was pointing out is not:

  MiNiFi -> SystemA -> SystemB -> ... -> NiFi

but rather:

  MiNiFi -> SystemA
  MiNiFi -> SystemB

where the data is sent to A and B in parallel (not in series) and may, for instance, be the same piece of data. This would look like a "fan out" graph.

The current model that we've followed supports generation and transmission of the provenance graph regardless of the shape of the graph describing how data flows within the system. The current approach we have for exfil of events is to leverage reporting tasks, and this too has worked well. We can filter events in such tasks, we can manage the bandwidth used, etc.

Can we rebase the discussion on the problems we're trying to solve? That will help us better discuss solutions to those problems. If I look at the original thread I see "#4 and #5" being used to articulate what I think became the s2s alteration proposal. But I don't quite follow what #4 or #5 mean, so can we restate/rephrase the core problem?

Regarding ETL patterns and fundamental disagreement: It wasn't clear to me what part of the discussion that was referring to, and I'm not familiar with the public papers you've released. I would be happy to read through them to better understand your perspective. Can you share the links here?

Regarding contributions and branching: I don't believe anyone has pushed back on your idea to provide an alternative implementation of the repositories. Please do feel free to contribute your alternative implementation. It would be great to have both available and run side by side. This sort of pluggability also promotes good interface design for the repositories, so it will be healthy regardless of the outcome.

Regarding issues getting contributions into NiFi: Is there a specific engagement you've found has been left hanging? I see a couple of JIRAs and contribs you were involved in that culminated in merged commits and one that appears to have hit some snags and has not progressed.
Is that the one you're talking about or are there other challenges? Let's take these cases and work through them.

Thanks
Joe

On Tue, Nov 29, 2016 at 10:35 AM, Daniel Cave <[email protected]> wrote:
> "Yes but there can be other hubs too and in parallel."
> [Daniel] For MiNiFi C++ -> SystemA -> SystemB -> ... -> NiFi, if you don't
> want provenance to travel then I don't see it as an issue since the outgoing
> message would be identical to what you have now. If you feel it's going to
> be extremely confusing then I could make it a new clone of the S2S MiNiFi
> C++ processor, but I don't see a point to just hide a toggle. On the NiFi
> side for this case you would use the normal S2S intake methods you use now.
> No change. Also, if you're going from MiNiFi C++ -> SystemA there is no
> change.
> For MiNiFi C++ -> MiNiFi C++ ->....-> NiFi, if you want provenance travel
> then yes you are locked into using n*(MiNiFi C++) -> NiFi with the
> provenance toggled on and using the new S2S receiving processors in MiNiFi
> C++/NiFi (it has to be a new one to avoid backwards compatibility issues)
> that can handle provenance. Again, I don't see this as an issue either
> since you are clearly wanting this functionality if you're doing this.
> Am I missing something in my logic flow that you are seeing that I need to
> account for?
>
> "You've mentioned this a couple times now. "
> [Daniel] Agreed and this is how this discussion is meant to be taken.
>
> "I'm not quite sure I understand so please elaborate if my
> comments don't apply."
> [Daniel] It has to do with when and how it's consumed. On current path Atlas
> won't answer the issues, but as you said there are others and I have my own
> in progress as well. I fundamentally disagree with the current
> sink-retrieve-sink ETL paradigm (as you've seen from my public papers, there
> are others not public yet as well) as it is a complete waste of time and
> resources at this point.
> In all my work, data is handled as available (near
> real-time) rather than waiting for some ETL processes to run at some
> arbitrary point in the future. By doing this you avoid unnecessary traffic,
> storage, processing, maintenance, and design all while improving data
> availability. More specifically to this discussion, the issue comes down to
> access from the point of origin. In an embedded or background instance of
> MiNiFi C++, bidirectional followup calls for provenance only are not always
> going to be available. Additionally, where they are available they are not
> going to be current and hence are fairly useless for security applications.
> Think of trying this on your laptop, IoT devices, or on financial
> transactions. If I find out 12-36hrs later when you reconnect or I can send
> someone to the field to retrieve it or the ETL processes run that there was
> an issue, it doesn't do me any good. As Randy mentioned, you can recombine
> all this later, however it is a very resource consuming process. There is
> no reason not to have it available when the data is available since it's
> just a matter of allowing for its transfer in line with the data. NiFi is
> not assuming responsibility for anything it doesn't already, this just
> extends its reach to the full NiFi/MiNiFi instance so there should not be
> an ownership concern. This requires an extremely minor update in NiFi, but
> is for a fundamental need in MiNiFi C++.
>
> "Ok so I think what you're saying is"
> [Daniel] Right, and since you can just disable it if you don't need it there
> is no performance or bandwidth hit unless you enable it.
>
> "It is really important to propose and advocate"
> [Daniel] I don't see this as a model change, as per my previous questions
> MiNiFi C++ seems to not yet have a solid model as the time and effort is
> mainly being put into MiNiFi Java.
> Since I have very specific ideas
> around MiNiFi C++ (and have discussed them with you last year and others at
> HW when MiNiFi was only going to be in C) I have not seen this as a radical
> departure but an elaboration on what we had already discussed. If you or
> the community wants to go a different path, I have no issue branching and
> going a separate way with these and the LevelDB changes rather than
> introducing these changes into the current path. Being open source, there is
> no right answer, so I'm certainly open to any suggestions, but I think
> you'll find what I'm proposing here is going to be important when you get to
> actual implementations of it and it's easier to change now than when you're
> locked in later, especially given my issues getting our contributions into
> NiFi. As stated above, I don't see how this affects any other
> implementations or use cases of MiNiFi C++/NiFi as proposed.
>
> --
> View this message in context:
> http://apache-nifi-developer-list.39713.n7.nabble.com/MiNiFi-C-Data-Provenance-and-Related-Issues-tp14024p14048.html
> Sent from the Apache NiFi Developer List mailing list archive at Nabble.com.
