Hi All,

It looks like this proposal has approval and we can conclude the discussion.

Additionally, I agree with Piotr that we should require a proven good reason for setting the capacity, to avoid the confusion, redundancy and other issues already mentioned when reading and maintaining the code. Ideally, the need for setting the capacity should either be immediately clear (e.g. performance) or explained in a comment if it is non-trivial. That said, this can easily become a grey zone, so I would not strictly demand a performance measurement as proof, e.g. if the size is known and it is "per record" code. At the end of the day it is a decision for the code author and the reviewer.
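To make this concrete, here is a minimal sketch of the two cases as I see them (the class and method names are made up for illustration, they are not taken from the Flink code base):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class InitialCapacityExample {

    // Good reason: the size is known up front in "per record" code,
    // so pre-sizing avoids intermediate array copies on a hot path.
    static List<String> copyFields(String[] recordFields) {
        List<String> fields = new ArrayList<>(recordFields.length);
        for (String field : recordFields) {
            fields.add(field);
        }
        return fields;
    }

    // No good reason: the final size is unknown, so the JVM default
    // capacity is fine and there is nothing to keep in sync.
    static List<String> keepNonEmpty(List<String> input) {
        List<String> result = new ArrayList<>();
        for (String value : input) {
            if (!value.isEmpty()) {
                result.add(value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(copyFields(new String[] {"a", "b", "c"}));
        System.out.println(keepNonEmpty(Arrays.asList("a", "", "b")));
    }
}

In the first method the reasoning is immediately clear from the code itself; in the second one a hand-picked capacity would only be noise that has to be kept in sync with the filter logic.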
The conclusion is then:
Set the initial capacity only if there is a good, proven reason to do it. Otherwise do not clutter the code with it.

Best,
Andrey

On Thu, Aug 1, 2019 at 5:10 PM Piotr Nowojski <pi...@ververica.com> wrote:

> Hi,
>
> > - A bit more code, increases the maintenance burden.
>
> I think there is even more to that. It’s almost like code duplication,
> albeit expressed in a very different way, with all of the drawbacks of
> duplicated code: the initial capacity can drift out of sync, causing
> confusion. Also it’s not just “a bit more code”; it might take non-trivial
> reasoning/calculation to set the initial value. Whenever we change or
> refactor the code, the “maintenance burden” will mostly come from that.
>
> Also, I think this usually just falls under the premature optimisation rule.
>
> Besides:
>
> > The conclusion is the following at the moment:
> > Only set the initial capacity if you have a good idea about the expected
> > size.
>
> I would add a clause to set the initial capacity “only for good, proven
> reasons”. It’s not about whether we can set it, but whether it makes sense
> to do so (to avoid the before-mentioned “maintenance burden”).
>
> Piotrek
>
> > On 1 Aug 2019, at 14:41, Xintong Song <tonysong...@gmail.com> wrote:
> >
> > +1 on setting the initial capacity only when we have a good expectation
> > of the collection size.
> >
> > Thank you~
> >
> > Xintong Song
> >
> >
> >
> > On Thu, Aug 1, 2019 at 2:32 PM Andrey Zagrebin <and...@ververica.com> wrote:
> >
> >> Hi all,
> >>
> >> As you probably already noticed, Stephan has triggered a discussion
> >> thread about a code style guide for Flink [1]. Recently we were
> >> discussing some smaller concerns internally and I would like to start
> >> separate threads for them.
> >>
> >> This thread is about always creating collections with an initial
> >> capacity. As you might have seen, some parts of our code base always
> >> initialise collections with some non-default capacity. You can even
> >> activate a check in IntelliJ IDEA that monitors and highlights the
> >> creation of collections without an initial capacity.
> >>
> >> Pros:
> >> - performance gain if there is good reasoning about the initial capacity
> >> - the capacity is always deterministic and does not depend on any
> >> changes of its default value in Java
> >> - easy to follow: always initialise, has IDE support for detection
> >>
> >> Cons (for initialising w/o good reasoning):
> >> - We are trying to outsmart the JVM. When there is no good reasoning
> >> about the initial capacity, we can rely on the JVM default value.
> >> - It is even confusing, e.g. for hash maps, as the real size depends on
> >> the load factor.
> >> - It would only add a minor performance gain.
> >> - A bit more code, increases the maintenance burden.
> >>
> >> The conclusion is the following at the moment:
> >> Only set the initial capacity if you have a good idea about the expected
> >> size.
> >>
> >> Please feel free to share your thoughts.
> >>
> >> Best,
> >> Andrey
> >>
> >> [1]
> >> http://mail-archives.apache.org/mod_mbox/flink-dev/201906.mbox/%3ced91df4b-7cab-4547-a430-85bc710fd...@apache.org%3E
> >>
> >
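P.S. For anyone who wonders why the hash map point in the cons list above is confusing in practice, here is a small self-contained example (the class name is made up; the numbers assume the JDK HashMap default load factor of 0.75):

import java.util.HashMap;
import java.util.Map;

public class HashMapCapacityExample {

    public static void main(String[] args) {
        // "new HashMap<>(100)" does NOT mean "holds 100 entries without
        // resizing": the table is rounded up to the next power of two (128)
        // and, with the default load factor of 0.75, it is resized once the
        // map grows past 128 * 0.75 = 96 entries.
        Map<Integer, Integer> naive = new HashMap<>(100);

        // To really avoid resizing for an expected size n, the capacity has
        // to be at least n / 0.75, which is exactly the kind of non-obvious
        // reasoning that should either be explained in a comment or simply
        // left to the JVM defaults.
        int expectedSize = 100;
        Map<Integer, Integer> preSized = new HashMap<>((int) (expectedSize / 0.75f) + 1);

        for (int i = 0; i < expectedSize; i++) {
            naive.put(i, i);
            preSized.put(i, i);
        }
        // Both maps end up with the same contents; only the internal
        // resizing behaviour along the way differs.
        System.out.println(naive.size() + " / " + preSized.size());
    }
}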