> I also think that it is frustrating for reviewers when concerns are not
> addressed.

> But there are strong and serious objections, not just from me, around the
> technical approach. These objections have not been addressed.
I generally agree with these statements regarding frustration from unresolved concerns/objections, as I also maintain other open-source projects - but I'd like to see any comments on any mailing thread or PRs regarding this proposal where this is the case. I'll apologize in advance if I missed anything, but I believe all objections have been duly responded to and/or resolved in a prompt manner.

> As a project we should strive for solutions that our users can safely use
> without having to understand complex details. Adding more knobs that users
> must understand before they can use Polaris makes the adoption of Polaris too
> difficult. The most successful projects and products do not have any
> mandatory configuration options that require reading and understanding (lots
> of) documentation to get started, solely because things are self-explaining
> and easy to use.

I agree with you here - mandatory configuration options should not be required for this feature; they always increase the user's mental model complexity and should be used as sparingly as possible. However, I am not introducing any mandatory configuration options here; the single line configuring which Event Listener implementation the user wants already exists. All configuration options for the Event Listener implementations have conservative default values (as I stated in the previous email) that only "power users" will want to tinker with. Providing additional optional configurations rarely increases the general user's mental model complexity, as general users will not need or use these options.

> On top: there is the tight relation to the Iceberg proposal having serious
> implications to the persistence (writes and queries) of these events: the
> size of table/view metadata and updates to those, the presence of these
> per-event attributes having a huge payload is required by the Iceberg
> proposal as it stands. You raised that concern as well.
> There is still no
> answer to or consensus on that yet.

As I stated in my previous email, I believe we have identified workarounds for this, and I don't see that these workarounds are entirely unreasonable. To be clear, let me list some of these workarounds in more detail here, and we can debate why none of them can ultimately work for Polaris:

* Don't store fields within payloads above a certain size, such as the major concern of `TableMetadata` objects. Show `null` back to users who query for this particular event.
* Truncate specific fields within larger objects, such as the `snapshot` field within `TableMetadata` objects, if the overall object exceeds a certain size.

> Auditing is a mechanism to (later) inspect what happened, who did what, etc
> etc., which means that auditing has strong consistency and ordering
> requirements. These requirements can, in general, not be met with the
> guarantees mentioned in the Iceberg proposal, and in particular not with the
> proposed implementation.

I'd like to peel this back a bit: what do you mean by "ordering requirements"? As I mentioned in my previous email, ordering is still possible when showing events back to users - but not as an ordering guarantee for event ingestion. I don't see ordering guarantees on event ingestion as a requirement, and I have still not gotten any good reasoning for why this should be the case.

For consistency, I believe we have done the best we can with the proposed implementation for an open-source deployment. I would actually say it is almost impossible to guarantee SLAs or consistency for any open-source deployment where the project does not control the infrastructure underneath or around it. This is exactly why many users move to managed deployments of open-source projects, where other companies are paid for the upkeep of that infrastructure and are on the hook for providing such guarantees.
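To make the payload-size workarounds listed above concrete, here is a minimal sketch. The `PayloadGuard` class and the byte threshold are hypothetical illustrations of the idea, not code from PR 1844:

```java
import java.nio.charset.StandardCharsets;

// Hedged sketch of the two payload-size workarounds: drop an oversized
// serialized field entirely (readers see null), or keep only a truncated
// prefix. PayloadGuard and its threshold are illustrative, not PR 1844 code.
public class PayloadGuard {
    private final int maxFieldBytes;

    public PayloadGuard(int maxFieldBytes) {
        this.maxFieldBytes = maxFieldBytes;
    }

    // Workaround 1: don't store oversized fields; queries return null.
    public String dropIfOversized(String serializedField) {
        byte[] bytes = serializedField.getBytes(StandardCharsets.UTF_8);
        return bytes.length > maxFieldBytes ? null : serializedField;
    }

    // Workaround 2: keep a truncated prefix of oversized fields. A real
    // implementation would avoid cutting a multi-byte UTF-8 character in
    // half; this sketch assumes ASCII content for simplicity.
    public String truncateIfOversized(String serializedField) {
        byte[] bytes = serializedField.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= maxFieldBytes) {
            return serializedField;
        }
        return new String(bytes, 0, maxFieldBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        PayloadGuard guard = new PayloadGuard(8);
        System.out.println(guard.dropIfOversized("tiny"));                      // tiny
        System.out.println(guard.dropIfOversized("a-very-long-metadata-blob")); // null
        System.out.println(guard.truncateIfOversized("a-very-long-metadata-blob")); // a-very-l
    }
}
```

Either variant keeps row sizes bounded regardless of how large an individual `TableMetadata` payload grows, which is the property the persistence layer needs.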
But I do not think this means there cannot be a best-effort, out-of-the-box experience that an open-source project offers as an optional feature. If you feel that some design decision I've made has severe ramifications for users, let's definitely discuss those specifically, rather than through general commentary that does not move this proposal forward.

> Some questions I find very useful to think about before starting with a
> technical approach:
> * What happens if the process crashes?
> * What happens during network hiccups?
> * What happens to the server/disk/database/network/service if there are (too)
> many concurrent requests?
> * What happens during and after a STW GC phase?
> * How do things behave in a horizontally scaled setup?
> * How do things behave in a geo-distributed setup?
> * How do things recover from x/y/z?
> There are more questions, but those are important detailed ones. Those are
> not "isolated" questions, but rather relate to each other. And those are just
> the lower level ones, not even the higher level ones like UX, use cases,
> SLAs, etc.

Thanks for the specific questions here - I'm glad to answer all of them in the context of the proposed PR, specifically with the file-based buffer implementation. The in-memory buffer implementation obviously has larger drawbacks for consistency, but it exists to serve users who absolutely will not accept any writing to disk yet would still like best-effort events.

> * What happens if the process crashes?

Events remain in the file buffers and will be flushed to persistence once Polaris and/or the persistence recover.

> * What happens during network hiccups?
> * What happens to the server/disk/database/network/service if there are (too)
> many concurrent requests?

File buffers will not delete buffered events until they have been successfully flushed to persistence. No events will be lost, and flushing will continue to be retried.
In the case of incoming events being generated, they are simply written to disk, and the optional customer configurations to dump to persistence after a certain number of events will help bail users out if too many events are generated too quickly.

> * What happens during and after a STW GC phase?

In the current implementation, the same thing that would happen to the service as a whole: everything pauses. The threads running as part of the file-buffer event listener will also be paused and will resume when the JVM allows it, alongside the rest of the service.

> * How do things behave in a horizontally scaled setup?
> * How do things behave in a geo-distributed setup?

Each buffer event listener operates only on its own Polaris instance and is, as a result, responsible for only its own set of buffers. As long as it can connect to the persistence instance it is supposed to communicate with (a base requirement for the entire Polaris service in general), there are no issues in managing a horizontally scaled or geo-distributed set of buffers. However, this does give me one good idea: what do we do if a particular Polaris instance shuts down? I believe we can add a shutdown cleanup for these buffers. Please make a comment on the PR regarding this, and I can begin investigating.

> PS: I can go into the nitty gritty details, but I think it's worth to
> consider the above first.

I believe I've responded to all the above questions/concerns. Please do go deeper into the details - I'd prefer as little conversation latency as possible, so please list all concerns as thoroughly as you can. Going through fractional concerns little by little will only make our time to resolve them unnecessarily longer.

Best,
Adnan Hemani

> On Jun 20, 2025, at 6:01 AM, Robert Stupp <sn...@snazy.de> wrote:
>
> Let me first second your point on frustration about long standing proposals -
> I completely feel that pain.
> I also think that it is frustrating for
> reviewers when concerns are not addressed. But it is also worth noting that
> getting to a consensus takes time. Getting something into an OSS project can
> take a very long time. People have many things on their plate, not only the
> "review of my particular effort". This is very different from "single entity
> owned" and "closed source" projects. It is important to keep that in mind.
>
> As a project we should strive for solutions that our users can safely use
> without having to understand complex details. Adding more knobs that users
> must understand before they can use Polaris makes the adoption of Polaris too
> difficult. The most successful projects and products do not have any
> mandatory configuration options that require reading and understanding (lots
> of) documentation to get started, solely because things are self-explaining
> and easy to use.
>
> Nobody objects to the effort on having events in Polaris.
>
> But there are strong and serious objections, not just from me, around the
> technical approach. These objections have not been addressed.
>
> On top: there is the tight relation to the Iceberg proposal having serious
> implications to the persistence (writes and queries) of these events: the
> size of table/view metadata and updates to those, the presence of these
> per-event attributes having a huge payload is required by the Iceberg
> proposal as it stands. You raised that concern as well. There is still no
> answer to or consensus on that yet.
>
> Auditing is a mechanism to (later) inspect what happened, who did what, etc
> etc., which means that auditing has strong consistency and ordering
> requirements. These requirements can, in general, not be met with the
> guarantees mentioned in the Iceberg proposal, and in particular not with the
> proposed implementation.
>
> Some questions I find very useful to think about before starting with a
> technical approach:
> * What happens if the process crashes?
> * What happens during network hiccups?
> * What happens to the server/disk/database/network/service if there are (too)
> many concurrent requests?
> * What happens during and after a STW GC phase?
> * How do things behave in a horizontally scaled setup?
> * How do things behave in a geo-distributed setup?
> * How do things recover from x/y/z?
> There are more questions, but those are important detailed ones. Those are
> not "isolated" questions, but rather relate to each other. And those are just
> the lower level ones, not even the higher level ones like UX, use cases,
> SLAs, etc.
>
> Building technically challenging things is often fun and probably the reason
> why we build code. Our users only want things to be "boring" in the sense of
> "it just works". Our job is to make the complex and complicated things
> boring.
>
> Robert
>
> PS: I can go into the nitty gritty details, but I think it's worth to
> consider the above first.
>
>
> On 19.06.25 04:39, Adnan Hemani wrote:
>> First of all, sorry all for the misformatting in my previous email - seems my
>> mail client wasn't playing well with the Apache mail server.
>>
>>> Adding that there's been consensus in the meeting to start with a pure Java
>>> interface and go from there.
>> I'm not sure what this means - can you expand on this? I can't agree there's
>> a consensus on this unless we're all clear as to what this actually means.
>>
>>> I'm not sure that the statement "Ordering guarantees are **only** possible
>>> ... event creation time" (emphasis mine) is correct.
>> This is in the context of the implementation that was shared on the PR. In that
>> context, there is no guarantee on ordering - and I don't agree that there is
>> a good reason why this is a hard requirement when it is still possible to
>> show results in an ordered manner to users.
>>
>>> During the meeting I mentioned that I strongly believe that it is not a
>>> good idea to let users (as Apache Polaris we do NOT have customers) figure
>>> out constraints and limitations and issues on their own.
>> Your objection is noted; but event listeners are inherently configurable by
>> users - and this level of "power-user features" is and always should be
>> configurable by the users themselves. As part of my implementation, I've
>> provided very conservative default values - but users are free to modify them as
>> they require. I don't think there is anything wrong with providing
>> reasonable defaults and then letting users decide based on their knowledge
>> of their usage patterns.
>>
>>> Neither the Iceberg proposal nor PR 1844 are suitable for the auditing use
>>> case, because auditing is a _very_ different beast and implies way more
>>> requirements than "just store some data w/o any consistency guarantees".
>> Happy to discuss what other requirements are not being fulfilled here for
>> auditing use cases. The way I look at it, this implementation is the
>> most reasonable, resilient one that does not require a crazy amount of
>> infrastructure for an out-of-the-box experience. An implementation involving
>> Kafka or another message broker to fire-and-forget events (relying on Kafka to sink
>> the events properly) would, of course, be much more resilient - but it no
>> longer works out of the box for Apache Polaris. And to underscore the above
>> point - the choice of which event listener to use is again the user's. They
>> may choose to use one of the implementations out of the box - or write one
>> of their own - or not enable events/auditing altogether. It should
>> ultimately, always be, the user's choice.
>>
>>> There are still concerns mentioned in the Iceberg events proposal, with a
>>> huge impact to this effort.
>>> So I strongly believe that both efforts are
>>> very tightly coupled and not orthogonal efforts that could be handled
>>> independently. For example, I do not see how the "metadata payload size
>>> concern" is mitigated in PR 1844. We had the same discussion around
>>> "caching Iceberg metadata in Polaris".
>> This is something I've brought up in the Iceberg proposal and would
>> like to see changed as well; but ultimately, this does not force a drastic
>> change regardless of the outcome. We have a variety of
>> workarounds on this particular topic (such as not storing the larger
>> metadata payloads altogether and showing a null object to the user - or
>> truncating the larger fields, etc.). I don't find a decision on this
>> to be required for an MVP change - but if you feel that this is the case,
>> please bring all topics that fit this criterion and I am happy to discuss and
>> service them as required.
>>
>>> During the meeting some people raised concerns about the "buffering" that
>>> is strictly speaking not a necessity, but also introduced in 1844. That
>>> introduces additional consistency issues, additional risk of losing events
>>> and additional ordering issues. That is a very different problem than "just
>>> storing some data".
>> I believe this was mostly Alex who brought up this concern; as per a
>> different mailing thread, we had discussed this already:
>> https://lists.apache.org/thread/fqfsy03855rv3mwscol3qnxnf4xcnc3v
>> I still don't agree that we can write back to the persistence on
>> read-generated events on a per-call basis. But if that's what it
>> will take to get this PR merged, I'm happy to remove the buffer
>> implementations and write back to persistence on every event generated.
>> What
>> I will not take responsibility for, in that case, is potential user
>> complaints on latency and DB load once they enable writing back to the
>> persistence (on either write-only or read-write event generation). Please
>> let me know how you and Alex would like to proceed.
>>
>> To also add context, the above mailing list thread has been open for over a
>> month detailing all of this and did not receive these concerns at any time.
>> It is immensely frustrating that contributors follow all the
>> processes recommended - yet still end up with the possibility of wasted
>> efforts at the 11th hour.
>>
>> Best,
>> Adnan Hemani
>>
>>> On Jun 18, 2025, at 3:40 AM, Robert Stupp <sn...@snazy.de
>>> <mailto:sn...@snazy.de>> wrote:
>>>
>>> Adding that there's been consensus in the meeting to start with a pure Java
>>> interface and go from there.
>>>
>>> I'm not sure that the statement "Ordering guarantees are **only** possible
>>> ... event creation time" (emphasis mine) is correct.
>>>
>>> During the meeting I mentioned that I strongly believe that it is not a
>>> good idea to let users (as Apache Polaris we do NOT have customers) figure
>>> out constraints and limitations and issues on their own.
>>>
>>> Neither the Iceberg proposal nor PR 1844 are suitable for the auditing use
>>> case, because auditing is a _very_ different beast and implies way more
>>> requirements than "just store some data w/o any consistency guarantees".
>>>
>>> There are still concerns mentioned in the Iceberg events proposal, with a
>>> huge impact to this effort. So I strongly believe that both efforts are
>>> very tightly coupled and not orthogonal efforts that could be handled
>>> independently. For example, I do not see how the "metadata payload size
>>> concern" is mitigated in PR 1844. We had the same discussion around
>>> "caching Iceberg metadata in Polaris".
>>>
>>> During the meeting some people raised concerns about the "buffering" that
>>> is strictly speaking not a necessity, but also introduced in 1844. That
>>> introduces additional consistency issues, additional risk of losing events
>>> and additional ordering issues. That is a very different problem than "just
>>> storing some data".
>>>
>>>
>>> On 17.06.25 21:16, Adnan Hemani wrote:
>>>> Hi everyone,
>>>>
>>>> In lieu of a recording of today's Community Sync on Events, I am posting
>>>> some notes regarding what was discussed:
>>>>
>>>> What is the relationship between Iceberg Events API and Polaris Events,
>>>> which are proposed in https://github.com/apache/polaris/pull/1844?
>>>> Persisting Polaris events is a prerequisite of the Iceberg Events API -
>>>> but is not strictly tied to it. Users could find value in being able to
>>>> persist the Polaris Events without using the Iceberg Events API.
>>>>
>>>> What Query Patterns are we expecting?
>>>> Going based on the assumption that the Iceberg Events API will be a
>>>> primary consumer of the Polaris Events and that it is almost finalized.
>>>> The proposed data schema for events is designed to work efficiently with
>>>> the current state of the Iceberg Events API.
>>>>
>>>> What's the Intended Use-Case?
>>>> This will go out in a different email later today under the original
>>>> proposal thread to ensure all context is in the same email thread.
>>>>
>>>> If auditing is a potential use-case, then what guarantees are we able to
>>>> provide?
>>>> Ordering guarantees are only possible in that the event creation time is
>>>> listed with the Polaris Event. When querying Polaris Events from the
>>>> database, we can always sort events based on this timestamp.
>>>> Durability guarantees can be found in some implementations - but it is
>>>> up to the customer to choose which implementation they use and how
>>>> they'd like to configure it. All of these configurations
>>>> are present in the PR as it stands today.
>>>> A potential Kafka implementation may help with these concerns - but it lacks
>>>> an end-to-end customer experience within Polaris and may be pushing the
>>>> concerns forward to Kafka rather than solving them. Unsure how this may
>>>> work with the Iceberg Events API in the future.
>>>>
>>>> Can the PR be broken up further?
>>>> Yes, it is possible - but unclear what parts are not necessary at this
>>>> time. Community to review and make suggestions on the PR.
>>>>
>>>> Next Steps/Action Items:
>>>> Community: review the PR as it stands and provide high-level
>>>> recommendations/suggestions.
>>>> Adnan Hemani: Send email regarding intended use cases.
>>>> Adnan Hemani: Respond to all reviews on PRs.
>>>>
>>>> Please do respond to this email with anything I may have missed!
>>>> Thanks to everyone who was able to make it to this morning's sync and for
>>>> everyone's contributions :)
>>>>
>>>> Best,
>>>> Adnan Hemani
>>>>
>>>>
>>>>> On Jun 13, 2025, at 4:43 PM, Adnan Hemani <adnan.hem...@snowflake.com
>>>>> <mailto:adnan.hem...@snowflake.com>> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> As we were not able to discuss at the previous community sync, I'm
>>>>> setting up a quick sync early next week to discuss Events in Persistence
>>>>> (PR: https://github.com/apache/polaris/pull/1844).
>>>>> Everyone is welcome to join and discuss next steps here. Thanks!
>>>>>
>>>>> Best,
>>>>> ADNAN HEMANI
>>>>>
>>>>> Polaris Community Sync on Events
>>>>> Tuesday, June 17 · 9:00 – 9:30am
>>>>> Time zone: America/Los_Angeles
>>>>> Google Meet joining info
>>>>> Video call link: https://meet.google.com/ear-kiij-sur
>>>>> Or dial: (US) +1 402-410-2280 PIN: 350 919 847#
>>>>> More phone numbers: https://tel.meet/ear-kiij-sur?pin=5036846369686