In addition to what Eno already mentioned, here's some quick feedback:

- Just for reference, I'd add that 20 GB of state is not necessarily
"massive" in absolute terms.  I have talked to users whose apps manage
much more state than that (1-2 orders of magnitude more).  Whether or not
20 GB is massive for your use case is a different question of course,
particularly if you expected much less than 20 GB. ;-)

- One further option would be to use probabilistic data structures instead
of the default RocksDB or in-memory key-value stores.  I put up a POC that
demonstrates how to perform probabilistic counting with a Count-Min Sketch
backed state store [1].  I think a similar approach would also work for
e.g. Bloom filters, which might be one way to shrink the state needed for
your de-duplication use case; see the rough sketch below.

Best,
Michael



[1] https://github.com/confluentinc/examples/pull/100



On Fri, Mar 10, 2017 at 12:22 PM, Eno Thereska <eno.there...@gmail.com>
wrote:

> It’s not necessarily the wrong tool, since deduplication is a standard
> scenario; I was just setting expectations. If you have enough memory I
> wonder if it would make sense to do it all in memory with an in-memory
> store. It depends on whether disk or memory space is at a premium.
>
> Thanks
> Eno
>
> > On Mar 10, 2017, at 11:05 AM, Ian Duffy <i...@ianduffy.ie> wrote:
> >
> > Hi Eno,
> >
> > Thanks for the fast response.
> >
> > We are doing a deduplication process here, so yes you are correct the
> > keys are normally unique. Sounds like a wrong tool for the job issue
> > on my end.
> >
> > Thanks for your input here.
> >
> >
> >
> > On 10 March 2017 at 10:59, Eno Thereska <eno.there...@gmail.com> wrote:
> >
> >> Hi Ian,
> >>
> >> Sounds like you have a total topic size of ~20GB (96 partitions x
> >> 200mb). If most keys are unique then group and reduce might not be as
> >> effective in grouping/reducing. Can you comment on the key
> >> distribution? Are most keys unique? Or do you expect lots of keys to
> >> be the same in the topic?
> >>
> >> Thanks
> >> Eno
> >>
> >>
> >>> On Mar 10, 2017, at 9:05 AM, Ian Duffy <i...@ianduffy.ie> wrote:
> >>>
> >>> Hi All,
> >>>
> >>> I'm doing a groupBy and reduce on a KStream which results in a state
> >>> store being created.
> >>>
> >>> This state store is growing to be massive; it has filled up a 20GB
> >>> drive. This feels very unexpected. Is there some cleanup or flushing
> >>> process for the state stores that I'm missing, or is such a large
> >>> size expected?
> >>>
> >>> The topic in question has 96 partitions, and the state is roughly
> >>> 200MB on average for each one:
> >>>
> >>> 175M 1_0
> >>> 266M 1_1
> >>> 164M 1_10
> >>> 177M 1_11
> >>> 142M 1_12
> >>> 271M 1_13
> >>> 158M 1_14
> >>> 280M 1_15
> >>> 286M 1_16
> >>> 181M 1_17
> >>> 185M 1_18
> >>> 187M 1_19
> >>> 281M 1_2
> >>> 278M 1_20
> >>> 188M 1_21
> >>> 262M 1_22
> >>> 166M 1_23
> >>> 177M 1_24
> >>> 268M 1_25
> >>> 264M 1_26
> >>> 147M 1_27
> >>> 179M 1_28
> >>> 276M 1_29
> >>> 177M 1_3
> >>> 157M 1_30
> >>> 137M 1_31
> >>> 247M 1_32
> >>> 275M 1_33
> >>> 169M 1_34
> >>> 267M 1_35
> >>> 283M 1_36
> >>> 171M 1_37
> >>> 166M 1_38
> >>> 277M 1_39
> >>> 160M 1_4
> >>> 273M 1_40
> >>> 278M 1_41
> >>> 279M 1_42
> >>> 170M 1_43
> >>> 139M 1_44
> >>> 272M 1_45
> >>> 179M 1_46
> >>> 283M 1_47
> >>> 263M 1_48
> >>> 267M 1_49
> >>> 181M 1_5
> >>> 282M 1_50
> >>> 166M 1_51
> >>> 161M 1_52
> >>> 176M 1_53
> >>> 152M 1_54
> >>> 172M 1_55
> >>> 148M 1_56
> >>> 268M 1_57
> >>> 144M 1_58
> >>> 177M 1_59
> >>> 271M 1_6
> >>> 279M 1_60
> >>> 266M 1_61
> >>> 194M 1_62
> >>> 177M 1_63
> >>> 267M 1_64
> >>> 177M 1_65
> >>> 271M 1_66
> >>> 175M 1_67
> >>> 168M 1_68
> >>> 140M 1_69
> >>> 175M 1_7
> >>> 173M 1_70
> >>> 179M 1_71
> >>> 178M 1_72
> >>> 166M 1_73
> >>> 180M 1_74
> >>> 177M 1_75
> >>> 276M 1_76
> >>> 177M 1_77
> >>> 162M 1_78
> >>> 266M 1_79
> >>> 194M 1_8
> >>> 158M 1_80
> >>> 187M 1_81
> >>> 162M 1_82
> >>> 163M 1_83
> >>> 177M 1_84
> >>> 286M 1_85
> >>> 165M 1_86
> >>> 171M 1_87
> >>> 162M 1_88
> >>> 179M 1_89
> >>> 145M 1_9
> >>> 166M 1_90
> >>> 190M 1_91
> >>> 159M 1_92
> >>> 284M 1_93
> >>> 172M 1_94
> >>> 149M 1_95
> >>
> >>
>
>
