Hi Ryan & Josh,

Just joined the email list. I was pointed to the previous discussion on Bloom 
filters that recently came up and am belatedly responding.

I believe we have a use-case that warrants the use of Bloom filters, but would 
like to hear your (and others') feedback. The use-case is the General Data 
Protection Regulation (GDPR), which recently went into effect and impacts any 
company with exposure in Europe.

In our GDPR use-case we have two sources of input:

1) table(s) with petabytes of clickstream data, partitioned by date; each row 
contains an array of identities (device, user, household, etc.)
2) a stream of GDPR Requests to Access or Delete data based on a unique 
identity id (UUID)

Our first problem is identifying the tables that have identities, but once we 
have done that we need to scan through each of those tables to find matching 
rows for a given GDPR Request/identity.

We are looking to leverage a Bloom filter to answer the question: has the 
identity in the GDPR Request been seen by table X? If the answer is yes, then 
we go to the next level down and ask the same question at the partition level 
to determine whether we need to scan within a partition.
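
To make the two-level check concrete, here is a rough sketch in Python. It is 
only an illustration under my own assumptions: the names BloomFilter, 
TableIndex and partitions_to_scan are hypothetical and are not Iceberg APIs, 
and the sizing parameters are placeholders.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter: never a false negative, tunable false-positive rate."""

    def __init__(self, expected_items: int, fp_rate: float = 0.01):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(8, math.ceil(-expected_items * math.log(fp_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two halves of one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


class TableIndex:
    """Hypothetical index: one filter for the whole table, one per date partition."""

    def __init__(self, expected_ids_per_table: int, expected_ids_per_partition: int):
        self.expected_ids_per_partition = expected_ids_per_partition
        self.table_filter = BloomFilter(expected_ids_per_table)
        self.partition_filters = {}

    def ingest_row(self, partition: str, identities: list) -> None:
        # Every identity in the row's array goes into both levels of the index.
        partition_filter = self.partition_filters.setdefault(
            partition, BloomFilter(self.expected_ids_per_partition)
        )
        for identity in identities:
            self.table_filter.add(identity)
            partition_filter.add(identity)

    def partitions_to_scan(self, identity: str) -> list:
        # Level 1: has the table ever seen this identity? If not, skip the table.
        if not self.table_filter.might_contain(identity):
            return []
        # Level 2: only partitions whose filter matches need a real scan.
        return sorted(
            p for p, f in self.partition_filters.items() if f.might_contain(identity)
        )


index = TableIndex(expected_ids_per_table=10_000, expected_ids_per_partition=1_000)
index.ingest_row("2019-03-01", ["device-a", "user-1", "household-x"])
index.ingest_row("2019-03-02", ["device-b", "user-2"])
print(index.partitions_to_scan("user-1"))
```

A matching partition still has to be scanned, since a Bloom filter can return 
false positives; what it guarantees is that a partition it rules out does not 
contain the identity.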

In our GDPR case we have 1) UUIDs, 2) multiple identities per row, and 3) tight 
latency requirements on ingesting data and making it available downstream. 
Sorting by identity doesn't work with the last two, and the first lends itself 
to Bloom filters. Given the growing requirements of GDPR across industries, 
this seems like less of a narrow use-case?

While we could build this outside of Iceberg, specific to our GDPR use-case, we 
would be interested to see whether it can be generalized within Iceberg and 
made more widely available.

Thanks,
Shone



On 2019/03/05 18:10:20, Ryan Blue <r...@netflix.com.INVALID> wrote:
> +Iceberg Dev List <de...@iceberg.apache.org>, the project has moved to
> Apache.
>
> Hi Josh,
>
> We have considered bloom filters. There are some cases where they could be
> useful, but there are generally better ways to accomplish the same task.
>
> I typically recommend sorting on an ID field to take advantage of the lower
> and upper bounds that Iceberg already supports. In addition to the range
> bounds, this also maximizes the likelihood that Parquet dictionaries are
> used for encoding. When dictionaries are available, that's better than a
> bloom filter because it is already generated and stored; plus it can be used
> for filtering without any false-positives. Sorting in Spark also handles
> skew, which is a great bonus.
>
> The use case where bloom filters can provide value is when you have a high
> rate of unique values (like a UUID used to identify a record) and cannot
> sort the data because of the volume and when it needs to be available for
> downstream consumption. Iceberg also erodes this use case because you can
> sort the data in the background and atomically swap the unsorted data for
> sorted data. Because you can safely swap in data you've optimized, the data
> is available quickly and only a small portion of the new data takes a long
> time to scan through before it is optimized for reads.
>
> I think that the remaining use case for bloom filters is a narrow one. If
> you'd still like to work on it, we can think through what we would need to
> add to the spec. Bloom filters are too large to add to the existing
> metadata structures, like manifests, but we could add an index location for
> each file that stores a bloom filter separately.
>
> rb
>
> On Sun, Mar 3, 2019 at 9:40 PM Joshua Hollander <jh...@gmail.com> wrote:
>
> > Hello, really interesting project.  Has any consideration been given to
> > adding bloom filters to the column stats in the manifests?
> >
> > I've developed a custom metastore which stores bloom filters for pruning
> > alongside the lower and upper bounds.  This allows us to do reasonably
> > fast needle-in-a-haystack searches in our data lake.  I know it might be a
> > bit of a unique use case as most folks are looking for aggregates and
> > trends.
> >
> > I realize that it would likely cause a severe bloating of the manifest
> > file considering bloom filter sizes for this kind of data (we use the
> > scalable bloom filter variant in an attempt to mitigate this).  Any
> > interest?  I'd be interested in possibly contributing to the feature if
> > there was.
> >
> > Thanks,
> > -Josh
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
