[ https://issues.apache.org/jira/browse/HUDI-3091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
sivabalan narayanan closed HUDI-3091.
-------------------------------------
    Resolution: Fixed

> Make simple index as the default hoodie.index.type
> --------------------------------------------------
>
>                 Key: HUDI-3091
>                 URL: https://issues.apache.org/jira/browse/HUDI-3091
>             Project: Apache Hudi
>          Issue Type: New Feature
>          Components: index
>            Reporter: Vinoth Govindarajan
>            Assignee: sivabalan narayanan
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 0.11.0
>
>   Original Estimate: 1h
>          Time Spent: 2h
>  Remaining Estimate: 0h
>
> When performing upserts with derived datasets, we often run into OOM issues
> with the bloom filter, so we changed the index type of all those datasets to
> simple to resolve the issue.
>
> Some of the tables were non-partitioned tables, for which the bloom index is
> not the right choice.
> I'm proposing to make the simple index the default value; on a case-by-case
> basis, folks can choose the bloom index for the additional performance gains
> offered by bloom filters.
>
> I agree that performance will not be optimal, but for regular use cases the
> simple index gives sub-optimal read/write performance without breaking any
> ingestion or derived jobs.
>
>
> Tests to validate the flip:
> Trigger some ingestions (either Spark datasource or DeltaStreamer) with
> record keys having some timestamp characteristics.
> Updates: 5 to 10%.
> Dataset size: 100 GB.
> Measure index lookup time with the bloom index and with the simple index.
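
For readers who want to pin the index type explicitly rather than rely on the default, here is a minimal sketch of the write options for an upsert. It only builds a plain dict, so it runs without Spark. The helper name `hudi_upsert_options`, the table name `trips`, and the field names `uuid` and `ts` are hypothetical and do not come from the issue; the set of index types is a subset chosen for illustration, and Hudi supports others.

```python
# Sketch only: builds Hudi datasource write options as a plain dict.
# The helper name, table name, and field names are illustrative assumptions.

# Subset of values accepted by hoodie.index.type (Hudi supports others too).
VALID_INDEX_TYPES = {"SIMPLE", "BLOOM", "GLOBAL_SIMPLE", "GLOBAL_BLOOM"}


def hudi_upsert_options(table_name, record_key, precombine_field,
                        index_type="SIMPLE"):
    """Return write options for an upsert.

    index_type defaults to SIMPLE, mirroring the default proposed in this
    issue; pass "BLOOM" to opt in to the bloom index for a given table.
    """
    index_type = index_type.upper()
    if index_type not in VALID_INDEX_TYPES:
        raise ValueError(f"unsupported index type: {index_type}")
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.index.type": index_type,
    }


# Default: simple index.
default_opts = hudi_upsert_options("trips", "uuid", "ts")

# Case-by-case override: bloom index.
bloom_opts = hudi_upsert_options("trips", "uuid", "ts", index_type="bloom")
```

With a Spark DataFrame `df`, options like these would typically be passed as `df.write.format("hudi").options(**default_opts).mode("append").save(base_path)`, where `base_path` is the table location. Running the same ingestion once with each dict is one way to compare index lookup time between the two index types.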