@greg which library of trie you were thinking of ? There is one in the commons-collection I see.
On Fri, May 17, 2024 at 3:55 PM Claude Warren <cla...@xenei.com> wrote: > > > > This has implications for execution complexity: If we can't compute > > whether two patterns overlap, then we need to run both of them on each > > piece of input to test if they both match. Under the current > > LITERAL/PREFIX system, we can optimize execution with a trie, but that > > option wouldn't be available to us with MATCH. > > > > If we consider the case of an asterisk representing 1 or more characters > then determining if 2 patterns overlap can be fairly simple and very fast. > Let's call the text from the ACL the target, and the text from the wildcard > matcher the candidate. > > When a wildcard pattern, excluding '*', is created: > > - the candidate text is broken into fragments separated by characters > that are not Character.isLetterOrDigit() (See > https://docs.oracle.com/javase/8/docs/api/java/lang/Character.html). > - fragments that contain 1 character are ignored. > - fragments that contains 2 or more characters are scanned and every > every pair of characters used to create a Hasher (See > > https://commons.apache.org/proper/commons-collections/apidocs/org/apache/commons/collections4/bloomfilter/Hasher.html > ) > that hasher is added to a Bloom filter associated with the wildcard > pattern. > > When a target is being evaluated and matching wildcard entries are to be > located. Split and create a bloom filter using entry and the same strategy > as for the wildcard patterns above. These bloom filters will have had more > pairs of characters added than for a matching wildcard pattern. > > Each filter contains the pattern for the unique pairs in the fragments of > the original text: > > Now we can check the Bloom filters. > > To find potential matching patterns we can check to see if the Bloom filter > for the pattern is contained within the bloom filter for the entry text. > If so then we know that it is likely that the pairs of characters specified > in the wild card (non-wildcard text) appear in the entry text. > > At this point we can evaluate the reduced number of patterns to see which > ones actually match. > > Having reduced the candidates to the matching patterns we can then use a > standard measure of similarity like the Levenshtein distance to determine > which candidates require the fewest edits to evolve into the target. The > one with the fewest edits is the most specific and should be applied. > > Now if you want to know which patterns overlap you have a similar process. > > Each Bloom filter can calculate the estimated number of unique fragments > that were added to it. If the filters are sorted by this estimation, the > ones with the highest counts may contain the ones with the lowest counts, > but a lower count can never contain a higher count. The calculation to > perform the estimation can be skipped and the cardinality value of each > Bloom filter can be used instead. You can then check the smaller filters > against the larger ones and find where one candidate is contained within > another. > > If you want to know if they intersect the BloomFilter from > commons-collections has an estimateIntersection as well as an estimateUnion > method. Quick tests can be made between filters to see if there is any > overlap before more complex analysis is performed. > > The solution here does not make the final comparisons easier, it simply > reduces the search space to find the items that need to be compared. > > > On Mon, May 6, 2024 at 9:03 PM Greg Harris <greg.har...@aiven.io.invalid> > wrote: > > > Hi Murali, > > > > Thanks for the KIP! > > > > I think I understand the motivation for this KIP in situations where > > there are a "cross product" of topics for two or more variables X and > > Y, and want to write ACLs for each of the variable axes. > > If you format your topics "X-Y-suffix", it's not easy to write rules > > that apply to all "Y" topics, because you need to enumerate all of the > > "X" values, and the problem persists even if you reorder the topic > > name. > > > > In my recent work on KIP-986 I found it necessary to introduce > > "namespaces" to group topics together, and I was going to replicate > > the ACL system to specify those namespaces. This change to the ACL > > system could increase the expressiveness and complexity of that > > feature, if it is ever implemented. > > One of the primitives I needed when specifying namespaces was the > > ability to tell when two namespaces overlapped (i.e. does there exist > > any topic which is present in both namespaces). This is trivial to do > > with the current PREFIX and LITERAL system, as we can find the > > maximum-length common prefix with just some length comparisons and > > equality checks. > > I considered specifying namespaces via regular expressions, and found > > that it was computationally much more difficult. Computing the > > intersection of two regexes appears to be exponential in the length of > > the regexes, leading me to avoid adding it. > > > > I understand that you're not suggesting full REGEX support, and that > > "namespaces" don't need to support MATCH, but I think MATCH may run > > into related difficulties. Any MATCH can overlap with any other MATCH > > or PREFIX if it includes a sufficient number of wildcards. For > > example: > > MATCH *-accounts-* has overlap with PREFIX nl as they can both match > > "nl-accounts-localtopic", but that isn't sensitive to the contents > > "nl", it is true for any PREFIX. > > MATCH *-accounts-* has overlap with MATCH *localtopic, as they can > > both match "nl-accounts-localtopic", but that isn't actually sensitive > > to the contents "localtopic", it's true for any MATCH which includes a > > wildcard at the beginning. > > > > This has implications for execution complexity: If we can't compute > > whether two patterns overlap, then we need to run both of them on each > > piece of input to test if they both match. Under the current > > LITERAL/PREFIX system, we can optimize execution with a trie, but that > > option wouldn't be available to us with MATCH. > > > > The current system makes users evaluate a trade-off: > > 1. Optimize the number of ACLs by organizing topics according to > > prefixes (for example, "accounts-localtopic-nl" and PREFIX "accounts", > > PREFIX "accounts-localtopic") > > 2. Use less-structured topic names, with a corresponding ACL scheme > > that has more individual rules. > > The system currently informs users of this tradeoff by making them > > write multiple ACLs, and making them think "there has got to be a > > better way!". Perhaps we can find a better way to surface this best > > practice, or better inform users about it. > > > > I understand that there are going to be situations more complex than > > your example, where multiple individual rules will always be necessary > > with only PREFIX evaluation. I think even in those situations, a > > number of efficient-to-evaluate rules is preferable to just one > > expensive-to-evaluate rule. > > > > One alternative that I thought of could be "PARAMETERIZED" ACLs which > > are like PREFIXED, but allow some parameter substitution. For example > > PARAMETERIZED "(nl|de|cz)-accounts-". I'm lifting regex syntax here, > > but this isn't actually a regex, and wouldn't allow arbitrary numbers > > of characters, or the * or + operators. > > In the background it could evaluate exactly like the 3 individual > > PREFIX rules, but be easier to evaluate on the backend, and support > > the intersection query I mentioned earlier. It could also support > > [a-zA-Z] notation in case the parameter values aren't known ahead of > > time, but have a fixed length. > > > > Thanks, > > Greg > > > > On Mon, May 6, 2024 at 11:17 AM Claude Warren <cla...@xenei.com> wrote: > > > > > > I have an idea for how to reduce the time for ACL lookups in general > and > > > particularly where wildcards are involved using sequence > > > characterization techniques from bioinformatics. But I need a set of > ACL > > > patterns and associated topics to test with. > > > > > > On Fri, May 3, 2024 at 2:45 PM Haruki Okada <ocadar...@gmail.com> > wrote: > > > > > > > Hi, Murali. > > > > > > > > First, could you add the KIP-1042 to the index ( > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals > > > > ) > > > > as well so that everyone can find it easily? > > > > > > > > I took a look at the KIP, then I have 2 questions: > > > > > > > > 1. Though the new MATCH resource pattern type may reduce the effort > of > > > > adding ACLs in some cases, do you have any concrete use case you are > in > > > > mind? (When prefixed ACL was introduced in KIP-290, there was a > > use-case > > > > that using it for implementing multi-tenancy) > > > > > > > > 2. As you may know, ACL lookup is in the hot-path which the > > performance is > > > > very important. ( > > > > > > > > > > > https://github.com/apache/kafka/blob/240243b91d69c2b65b5e456065fdcce90c121b04/core/src/main/scala/kafka/security/authorizer/AclAuthorizer.scala#L539 > > > > ). > > > > Do you have ideas how do we update `matchingAcls` to support > > MATCH-type ACL > > > > without introducing performance issue? > > > > > > > > > > > > Thanks, > > > > > > > > 2024年5月3日(金) 19:51 Claude Warren, Jr <claude.war...@aiven.io > .invalid>: > > > > > > > > > As I wrote in [1], the ACL evaluation algorithm needs to be > specified > > > > with > > > > > respect to the specificity of the pattern so that we know exactly > > which > > > > if > > > > > *-accounts-* takes precedence over nl-accounts-* or visa versa. > > > > > > > > > > I think that we should spell out that precedence is evaluated as > > follows: > > > > > > > > > > 1. Remove patterns that do not match > > > > > 2. More specific patterns take precedence over less specific > patterns > > > > > 3. for patterns of the same precedence DENY overrides ALLOW > > > > > > > > > > Determining specificity: > > > > > > > > > > Specificity is based on the Levenshtein distance between the > pattern > > and > > > > > the text being evaluated. The lower the distance the more specific > > the > > > > > rule. > > > > > Using the topic name: nl-accounts-localtopic we can evaluate the > > > > > Levenshtein distance for various patterns. > > > > > nl-accounts-localtopic = 0 > > > > > *-accounts-localtopic = 2 > > > > > nl-accounts-local* = 5 > > > > > *-accounts-local* = 7 > > > > > nl-accounts-* = 10 > > > > > *-accounts-* = 12 > > > > > > > > > > In the special case of matching principles User matches are more > > specific > > > > > than Group matches. > > > > > > > > > > I don't know if this should be added to KIP-1042 or presented as a > > new > > > > KIP. > > > > > > > > > > Claude > > > > > > > > > > [1] > https://lists.apache.org/thread/0l88tkbxq3ol9rnx0ljnmswj5y6pho1f > > > > > < > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1042%3A+Support+for+wildcard+when+creating+new+acls > > > > > > > > > > > > > > > > On Fri, May 3, 2024 at 12:18 PM Claude Warren <cla...@xenei.com> > > wrote: > > > > > > > > > > > Took me awhile to find it but the link to the KIP is > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1042%3A+Support+for+wildcard+when+creating+new+acls > > > > > > > > > > > > On Fri, May 3, 2024 at 10:13 AM Murali Basani < > > murali.bas...@gmail.com > > > > > > > > > > > wrote: > > > > > > > > > > > > > Hello, > > > > > > > > > > > > > > I'd like to propose a suggestion to our resource patterns in > > Kafka > > > > > ACLs. > > > > > > > > > > > > > > Currently, when adding new ACLs in Kafka, we have two types of > > > > resource > > > > > > > patterns for topics: > > > > > > > > > > > > > > - LITERAL > > > > > > > - PREFIXED > > > > > > > > > > > > > > However, when it comes to listing or removing ACLs, we have a > > couple > > > > > more > > > > > > > options: > > > > > > > > > > > > > > - MATCH > > > > > > > - ANY (will match any pattern type) > > > > > > > > > > > > > > > > > > > > > If we can extend creating acls as well with 'MATCH' pattern > > type, it > > > > > > would > > > > > > > be very beneficial. Even though this kind of acl should be > > created > > > > with > > > > > > > utmost care, it will help organizations streamline their ACL > > > > management > > > > > > > processes. > > > > > > > > > > > > > > Example scenarios : > > > > > > > > > > > > > > Let's say we need to create ACLs for the following six topics: > > > > > > > nl-accounts-localtopic, nl-accounts-remotetopic, > > > > > de-accounts-localtopic, > > > > > > > de-accounts-remotetopic, cz-accounts-localtopic, > > > > > cz-accounts-remotetopic > > > > > > > > > > > > > > Currently, we achieve this using existing functionality by > > creating > > > > > three > > > > > > > prefixed ACLs as shown below: > > > > > > > > > > > > > > kafka-acls --bootstrap-server localhost:9092 \ > > > > > > > > --add \ > > > > > > > > --allow-principal > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > User:CN=serviceuser,OU=ServiceUsers,O=Unknown,L=Unknown,ST=Unknown,C=Unknown > > > > > > > > \ > > > > > > > > --producer \ > > > > > > > > --topic nl-accounts- \ > > > > > > > > --resource-pattern-type prefixed > > > > > > > > > > > > > > > > > > > > > kafka-acls --bootstrap-server localhost:9092 \ > > > > > > > > --add \ > > > > > > > > --allow-principal > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > User:CN=serviceuser,OU=ServiceUsers,O=Unknown,L=Unknown,ST=Unknown,C=Unknown > > > > > > > > \ > > > > > > > > --producer \ > > > > > > > > --topic de-accounts- \ > > > > > > > > --resource-pattern-type prefixed > > > > > > > > > > > > > > > > > > > > > kafka-acls --bootstrap-server localhost:9092 \ > > > > > > > > --add \ > > > > > > > > --allow-principal > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > User:CN=serviceuser,OU=ServiceUsers,O=Unknown,L=Unknown,ST=Unknown,C=Unknown > > > > > > > > \ > > > > > > > > --producer \ > > > > > > > > --topic cz-accounts- \ > > > > > > > > --resource-pattern-type prefixed > > > > > > > > > > > > > > > > > > > > > However, if we had the 'MATCH' pattern type available, we could > > > > > > accomplish > > > > > > > this with a single ACL, as illustrated here: > > > > > > > > > > > > > > kafka-acls --bootstrap-server localhost:9092 \ > > > > > > > > --add \ > > > > > > > > --allow-principal > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > User:CN=serviceuser,OU=ServiceUsers,O=Unknown,L=Unknown,ST=Unknown,C=Unknown > > > > > > > > \ > > > > > > > > --producer \ > > > > > > > > --topic *-accounts-* \ > > > > > > > > --resource-pattern-type match > > > > > > > > > > > > > > > > > > > > > > > > > > > > This pattern closely resembles PREFIXED but offers broader > > allow/deny > > > > > > > rules. > > > > > > > > > > > > > > Implementing this change could significantly reduce the effort > in > > > > > several > > > > > > > acl management processes. > > > > > > > > > > > > > > I welcome your thoughts and any concerns you may have regarding > > this > > > > > > > proposal. > > > > > > > > > > > > > > Thanks, > > > > > > > Murali > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > LinkedIn: http://www.linkedin.com/in/claudewarren > > > > > > > > > > > > > > > > > > > > > > > -- > > > > ======================== > > > > Okada Haruki > > > > ocadar...@gmail.com > > > > ======================== > > > > > > > > > > > > > -- > > > LinkedIn: http://www.linkedin.com/in/claudewarren > > > > > -- > LinkedIn: http://www.linkedin.com/in/claudewarren >