In addition to what Eno already mentioned, here's some quick feedback:

- Just for reference: 20 GB of state is not necessarily "massive" in absolute
terms. I have talked to users whose apps manage much more state than that
(1-2 orders of magnitude more). Whether 20 GB is massive for your use case is
a different question, of course, particularly if you expected much less than
20 GB. ;-)
- One further option would be to use probabilistic data structures instead of
the default RocksDB or in-memory key-value stores. I put up a PoC that
demonstrates how to perform probabilistic counting with a Count-Min Sketch
backed state store [1]. I think a similar approach would also work for e.g.
Bloom filters, which might be one way to downsize your de-duplication problem
(rough sketches follow below the quoted thread).

Best,
Michael

[1] https://github.com/confluentinc/examples/pull/100

On Fri, Mar 10, 2017 at 12:22 PM, Eno Thereska <eno.there...@gmail.com> wrote:
> It's not necessarily the wrong tool, since deduplication is a standard
> scenario, but just setting expectations. If you have enough memory, I wonder
> if it would make sense to do it all in memory with an in-memory store.
> Depends on whether disk or memory space is at a premium.
>
> Thanks
> Eno
>
> > On Mar 10, 2017, at 11:05 AM, Ian Duffy <i...@ianduffy.ie> wrote:
> >
> > Hi Eno,
> >
> > Thanks for the fast response.
> >
> > We are doing a deduplication process here, so yes, you are correct: the
> > keys are normally unique. Sounds like a wrong-tool-for-the-job issue on
> > my end.
> >
> > Thanks for your input here.
> >
> > On 10 March 2017 at 10:59, Eno Thereska <eno.there...@gmail.com> wrote:
> >
> >> Hi Ian,
> >>
> >> Sounds like you have a total topic size of ~20 GB (96 partitions x 200 MB).
> >> If most keys are unique, then group-and-reduce might not be as effective
> >> in grouping/reducing. Can you comment on the key distribution? Are most
> >> keys unique? Or do you expect lots of keys to be the same in the topic?
> >>
> >> Thanks
> >> Eno
> >>
> >>> On Mar 10, 2017, at 9:05 AM, Ian Duffy <i...@ianduffy.ie> wrote:
> >>>
> >>> Hi All,
> >>>
> >>> I'm doing a groupBy and reduce on a KStream, which results in a state
> >>> store being created.
> >>>
> >>> This state store is growing to be massive; it has filled up a 20 GB
> >>> drive. This feels very unexpected. Is there some cleanup or flushing
> >>> process for the state stores that I'm missing, or is such a large size
> >>> expected?
> >>>
> >>> The topic in question has 96 partitions and the state is about ~200 MB
> >>> on average for each one.
> >>>
> >>> 175M 1_0
> >>> 266M 1_1
> >>> 164M 1_10
> >>> 177M 1_11
> >>> 142M 1_12
> >>> 271M 1_13
> >>> 158M 1_14
> >>> 280M 1_15
> >>> 286M 1_16
> >>> 181M 1_17
> >>> 185M 1_18
> >>> 187M 1_19
> >>> 281M 1_2
> >>> 278M 1_20
> >>> 188M 1_21
> >>> 262M 1_22
> >>> 166M 1_23
> >>> 177M 1_24
> >>> 268M 1_25
> >>> 264M 1_26
> >>> 147M 1_27
> >>> 179M 1_28
> >>> 276M 1_29
> >>> 177M 1_3
> >>> 157M 1_30
> >>> 137M 1_31
> >>> 247M 1_32
> >>> 275M 1_33
> >>> 169M 1_34
> >>> 267M 1_35
> >>> 283M 1_36
> >>> 171M 1_37
> >>> 166M 1_38
> >>> 277M 1_39
> >>> 160M 1_4
> >>> 273M 1_40
> >>> 278M 1_41
> >>> 279M 1_42
> >>> 170M 1_43
> >>> 139M 1_44
> >>> 272M 1_45
> >>> 179M 1_46
> >>> 283M 1_47
> >>> 263M 1_48
> >>> 267M 1_49
> >>> 181M 1_5
> >>> 282M 1_50
> >>> 166M 1_51
> >>> 161M 1_52
> >>> 176M 1_53
> >>> 152M 1_54
> >>> 172M 1_55
> >>> 148M 1_56
> >>> 268M 1_57
> >>> 144M 1_58
> >>> 177M 1_59
> >>> 271M 1_6
> >>> 279M 1_60
> >>> 266M 1_61
> >>> 194M 1_62
> >>> 177M 1_63
> >>> 267M 1_64
> >>> 177M 1_65
> >>> 271M 1_66
> >>> 175M 1_67
> >>> 168M 1_68
> >>> 140M 1_69
> >>> 175M 1_7
> >>> 173M 1_70
> >>> 179M 1_71
> >>> 178M 1_72
> >>> 166M 1_73
> >>> 180M 1_74
> >>> 177M 1_75
> >>> 276M 1_76
> >>> 177M 1_77
> >>> 162M 1_78
> >>> 266M 1_79
> >>> 194M 1_8
> >>> 158M 1_80
> >>> 187M 1_81
> >>> 162M 1_82
> >>> 163M 1_83
> >>> 177M 1_84
> >>> 286M 1_85
> >>> 165M 1_86
> >>> 171M 1_87
> >>> 162M 1_88
> >>> 179M 1_89
> >>> 145M 1_9
> >>> 166M 1_90
> >>> 190M 1_91
> >>> 159M 1_92
> >>> 284M 1_93
> >>> 172M 1_94
> >>> 149M 1_95
> >>
> >>
> >
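
For illustration, here is a rough sketch of the Bloom filter idea from above.
It is only a sketch: it uses Guava's BloomFilter rather than the Count-Min
Sketch store from the PoC in [1], the topic names and sizing numbers are made
up, it is written against a newer Kafka Streams DSL than the 0.10.x API
discussed in this thread, and the filter lives only on the JVM heap, so it is
neither fault-tolerant nor shared across instances (the PoC shows how to back
such a structure with a real state store instead).

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

import java.nio.charset.StandardCharsets;

public class BloomDedupSketch {

  public static void main(final String[] args) {
    final StreamsBuilder builder = new StreamsBuilder();

    // Sized for ~20M distinct keys at a 1% false-positive rate; this needs on
    // the order of tens of MB of heap instead of tens of GB of keyed state.
    final BloomFilter<String> seen =
        BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 20_000_000, 0.01);

    final KStream<String, String> input =
        builder.stream("events", Consumed.with(Serdes.String(), Serdes.String()));

    // Keep a record only if its key has (probably) not been seen before.
    // The price is false positives: a small fraction of genuinely new keys
    // will be dropped as if they were duplicates.
    final KStream<String, String> deduped = input.filter((key, value) -> {
      if (seen.mightContain(key)) {
        return false;   // probably a duplicate -> drop it
      }
      seen.put(key);
      return true;      // first sighting -> keep it
    });

    deduped.to("events-deduped", Produced.with(Serdes.String(), Serdes.String()));
    // ... wrap builder.build() in a KafkaStreams instance and start it as usual.
  }
}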
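
And if memory rather than disk is the cheaper resource, Eno's in-memory
suggestion might look roughly like the following with the current DSL. Again,
only a sketch: the topic, store name, and serdes are assumptions, and the whole
keyed state then has to fit in the heap of the running instances.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

public class InMemoryReduceSketch {

  public static void main(final String[] args) {
    final StreamsBuilder builder = new StreamsBuilder();

    // The same groupByKey + reduce, but materialized into an in-memory
    // key-value store instead of the default RocksDB store on disk.
    final KTable<String, String> latestPerKey = builder
        .stream("events", Consumed.with(Serdes.String(), Serdes.String()))
        .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
        .reduce(
            (oldValue, newValue) -> newValue,   // keep the latest value per key
            Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as(
                    Stores.inMemoryKeyValueStore("dedup-store"))
                .withKeySerde(Serdes.String())
                .withValueSerde(Serdes.String()));

    // ... build and start the topology as usual. The store is still backed by
    // a changelog topic, so it survives restarts, but all of its contents must
    // fit in memory.
  }
}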