Thinking about it further, I realized that adding a map function after the read is probably the more general approach. Is there any significant performance difference between using such a dedicated map function (one that just reads a row, increments an accumulator, and returns the row immediately) and adding the accumulator directly to the input formats?
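To make the map-based variant concrete, here is a minimal sketch of what I have in mind, assuming the DataSet API. The class names `RowCountExample` and `RowCounter` and the accumulator name "rowCount" are made up for illustration; this is not an existing Flink utility, and it says nothing about the performance question above.

```java
import org.apache.flink.api.common.JobExecutionResult;
import org.apache.flink.api.common.accumulators.LongCounter;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.io.DiscardingOutputFormat;
import org.apache.flink.configuration.Configuration;

public class RowCountExample {

    /** Hypothetical pass-through mapper: counts each row and returns it unchanged. */
    public static class RowCounter<T> extends RichMapFunction<T, T> {
        private final String accumulatorName;
        private final LongCounter counter = new LongCounter();

        public RowCounter(String accumulatorName) {
            this.accumulatorName = accumulatorName;
        }

        @Override
        public void open(Configuration parameters) {
            // Register the counter so the job result reports the merged total.
            getRuntimeContext().addAccumulator(accumulatorName, counter);
        }

        @Override
        public T map(T value) {
            counter.add(1L);
            return value;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // fromElements stands in for whatever InputFormat the real job reads from.
        env.fromElements("a", "b", "c")
            .map(new RowCounter<String>("rowCount"))
            .returns(String.class) // explicit type hint, since RowCounter is generic
            .output(new DiscardingOutputFormat<>());

        JobExecutionResult result = env.execute("row count sketch");
        Long rows = result.getAccumulatorResult("rowCount");
        System.out.println("rows read: " + rows);
    }
}
```

With this approach the input formats stay untouched, and the same mapper can be placed after any source.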
On Mon, Feb 4, 2019 at 10:18 AM Flavio Pompermaier <pomperma...@okkam.it> wrote:

> Hi to all,
> we often need to track the number of rows of a dataset.
> In order not to burden the job complexity, we use accumulators to track
> this information.
> The problem is that we have to extend all the InputFormats that we use in
> order to properly handle such a row-count accumulator... my question is: what
> about introducing it as a first-class citizen (forcing all input formats to
> handle a rowCount accumulator when required)?
>
> What do you think? Will it be useful in general?
>
> Best,
> Flavio