Hi Ryan & Josh,

I just joined the email list. I was pointed to the recent discussion on Bloom filters and am belatedly responding.
I believe we have a use case that warrants the use of Bloom filters, but would like to hear your (and others') feedback. The use case is the General Data Protection Regulation (GDPR), which recently went into effect and impacts any company with exposure in Europe.

In our GDPR use case we have two sources of input:
1) table(s) with petabytes of clickstream data, partitioned by date, where each row contains an array of identities (device, user, household, etc.)
2) a stream of GDPR Requests to Access or Delete data based on a unique identity ID (UUID)

Our first problem is identifying the tables that have identities. Once we have done that, we need to scan through each of those tables to find matching rows for a given GDPR Request/identity. We are looking to leverage a Bloom filter to answer the question: has the identity in the GDPR Request been seen by table X? If the answer is yes, then we go to the next level down and ask the same question at the partition level to determine whether we need to scan within a partition.

In our GDPR case we have 1) UUIDs, 2) multiple identities per row, and 3) tight latency requirements on ingesting data and making it available downstream. Sorting by identity doesn't work with the last two, and the first lends itself to Bloom filters.

Given the growing requirements of GDPR across industries, this seems like less of a narrow use case? While we could build this outside of Iceberg, specific to our GDPR use case, we would be interested to see whether it can be generalized within Iceberg, making it more generally available.

Thanks,
Shone

On 2019/03/05 18:10:20, Ryan Blue <r...@netflix.com.INVALID> wrote:
> +Iceberg Dev List <de...@iceberg.apache.org>, the project has moved to Apache.
>
> Hi Josh,
>
> We have considered bloom filters. There are some cases where they could be useful, but there are generally better ways to accomplish the same task.
>
> I typically recommend sorting on an ID field to take advantage of the lower and upper bounds that Iceberg already supports.
> In addition to the range bounds, this also maximizes the likelihood that Parquet dictionaries are used for encoding. When dictionaries are available, that's better than a row group because it is already generated and stored; plus it can be used for filtering without any false-positives. Sorting in Spark also handles skew, which is a great bonus.
>
> The use case where bloom filters can provide value is when you have a high rate of unique values (like a UUID used to identify a record) and cannot sort the data because of the volume and when it needs to be available for downstream consumption. Iceberg also erodes this use case because you can sort the data in the background and atomically swap the unsorted data for sorted data. Because you can safely swap in data you've optimized, the data is available quickly and only a small portion of the new data takes a long time to scan through before it is optimized for reads.
>
> I think that the remaining use case for bloom filters is a narrow one. If you'd still like to work on it, we can think through what we would need to add to the spec. Bloom filters are too large to add to the existing metadata structures, like manifests, but we could add an index location for each file that stores a bloom filter separately.
>
> rb
>
> On Sun, Mar 3, 2019 at 9:40 PM Joshua Hollander <jh...@gmail.com> wrote:
>
> > Hello, really interesting project. Has any consideration been given to adding bloom filters to the column stats in the manifests?
> >
> > I've developed a custom metastore which stores bloom filters for pruning alongside the lower and upper bounds. This allows us to do reasonably fast needle in a haystack searches in our data lake.
> > I know it might be a bit of a unique use case as most folks are looking for aggregates and trends.
> >
> > I realize that it would likely cause a severe bloating of the manifest file considering bloom filter sizes for this kind of data (we use the scalable bloom filter variant in an attempt to mitigate this). Any interest? I'd be interested in possibly contributing to the feature if there was.
> >
> > Thanks,
> > -Josh
>
> --
> Ryan Blue
> Software Engineer
> Netflix
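
For reference, the two-level check described at the top of this message (ask the table-level filter first, then the per-partition filters) could be sketched roughly as below. This is only an illustration under stated assumptions, not anything Iceberg provides: the `BloomFilter` class, `build_index`, `partitions_to_scan`, the sizing parameters, and the sample identity strings are all hypothetical, and a real implementation would need to decide where the filters are stored.

```python
import hashlib
import math


class BloomFilter:
    """Minimal in-memory Bloom filter (hypothetical helper, not an Iceberg API)."""

    def __init__(self, expected_items, false_positive_rate=0.01):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        n = max(1, expected_items)
        self.num_bits = max(8, math.ceil(-n * math.log(false_positive_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / n * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two halves of one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def build_index(partitions, expected_items_per_partition=1000):
    """partitions maps a partition date to rows; each row is a list of identity IDs."""
    table_filter = BloomFilter(expected_items_per_partition * max(1, len(partitions)))
    partition_filters = {}
    for date, rows in partitions.items():
        partition_filter = BloomFilter(expected_items_per_partition)
        for identities in rows:
            for identity in identities:
                partition_filter.add(identity)
                table_filter.add(identity)
        partition_filters[date] = partition_filter
    return table_filter, partition_filters


def partitions_to_scan(identity, table_filter, partition_filters):
    """Return the partitions that may contain the identity (false positives possible)."""
    if not table_filter.might_contain(identity):
        return []  # the table has definitely never seen this identity
    return [date for date, pf in partition_filters.items() if pf.might_contain(identity)]


# Made-up sample data: two date partitions, several identities per row.
sample = {
    "2019-03-01": [["device-a", "user-a"], ["device-b", "user-b", "household-b"]],
    "2019-03-02": [["device-c", "user-a"]],
}
table_filter, partition_filters = build_index(sample)
print(partitions_to_scan("user-a", table_filter, partition_filters))
```

A Bloom filter never returns a false negative, so every partition that really holds the identity is always in the result; a false positive only costs an unnecessary scan. Multiple identities per row are not a problem here, because each identity is simply added to the filter on ingest, with no sorting required.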