Thanks Viktor, this was what I was looking for. Ok, so due to existing difficulties in translating push to pull, it makes sense why this wouldn't work.
I really look forward to upgrading my codebase soon. The Gatherers fixed window completely negates this problem. At least, it appears to.

On Mon, Oct 21, 2024, 6:54 AM Viktor Klang <viktor.kl...@oracle.com> wrote:

> Hi David,
>
> Stream::spliterator() and Stream::iterator() suffer from inherent
> limitations (see https://bugs.openjdk.org/browse/JDK-8268483 ) because
> they attempt to convert push-style streams into pull-style constructs
> (Iterator, Spliterator). The only way to know if there's something to
> pull is for something to get pushed, but who's doing the pushing? (Answer:
> the same one who tries to do the pulling.)
>
> For sequential Streams this is often not as problematic, but for parallel
> streams it's not the caller which evaluates the stream but rather a
> task-tree submitted to a ForkJoinPool. You can see the implementation here:
> https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/util/stream/StreamSpliterators.java#L272
>
> As an example, Stream::flatMap(…) had corner cases where its use of
> Stream::spliterator() would end up not terminating, which was fixed in Java
> 23: https://github.com/openjdk/jdk/pull/18625
> (8196106: Support nested infinite or recursive flat mapped streams)
>
> Cheers,
> √
>
> *Viktor Klang*
> Software Architect, Java Platform Group
> Oracle
> ------------------------------
> *From:* core-libs-dev <core-libs-dev-r...@openjdk.org> on behalf of David
> Alayachew <davidalayac...@gmail.com>
> *Sent:* Saturday, 19 October 2024 07:54
> *To:* core-libs-dev <core-libs-dev@openjdk.org>
> *Subject:* Streams, parallelization, and OOME.
>
> Hello Core Libs Dev Team,
>
> I have a file that I am streaming from a service, and I am trying to split
> it into multiple parts based on a certain attribute found on each line. I am
> sending each part up to a different service.
>
> I am using BufferedReader.lines(). However, I cannot read the whole file
> into memory because it is larger than the amount of RAM that I have on the
> machine. So, since I don't have access to Java 22's Preview Gatherers Fixed
> Window, I used the iterator() method on my stream, wrapped that in another
> iterator that can grab my batch size worth of data, then built a
> spliterator from that, which I then used to create a new stream. In short,
> this wrapper iterator isn't Iterator<T>, it's Iterator<List<T>>.
>
> When I ran this sequentially, everything worked well. However, my CPU usage
> was low and we definitely have a performance problem -- our team needs this
> number as fast as we can get it. Plus, we had plenty of network bandwidth to
> spare, so I had (imo) good reason to use parallelism.
>
> As soon as I turned on parallelism, the stream's behaviour changed
> completely. Instead of fetching a batch and processing it, it started
> grabbing SEVERAL BATCHES and processing NONE OF THEM. Or at the very least,
> it grabbed so many batches that it ran out of memory before it could get to
> processing them.
>
> To give some numbers, this is a 4 core machine, and we can safely hold
> about 30-40 batches worth of data in memory before crashing. But again,
> when running sequentially, this thing only grabs 1 batch, processes that
> one batch, sends out the results, and then starts the next one, all as
> expected. I thought that adding parallelism would simply make it so that we
> have this happening 4 or 8 times at once.
>
> After a very long period of digging, I managed to find this link.
>
> https://stackoverflow.com/questions/30825708/java-8-using-parallel-in-a-stream-causes-oom-error
>
> Tagir Valeev gives an answer which doesn't go very deep into the "why" at
> all. And the answer is directed more at the user's specific question than
> at solving this particular problem.
>
> After digging through a bunch of other solutions (plus my own testing), it
> seems that the answer is that the engine that does parallelization for
> Streams tries to grab a large enough "buffer" before doing any parallel
> processing. I could be wrong, and as for how large that buffer is, I have
> no idea.
>
> Regardless, that's about where I gave up and went sequential, since the
> clock was ticking.
>
> But I still have a performance problem. How would one suggest going about
> this in Java 8?
>
> Thank you for your time and help.
> David Alayachew
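
For anyone following the thread, here is a minimal Java 8 sketch of the kind of batching wrapper described in the quoted message: an Iterator<List<T>> over an Iterator<T>, turned back into a sequential stream. The names BatchingIterator and BatchingDemo are hypothetical, and this is only an illustration of the idea, not the code that was actually run.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.StreamSupport;

// Hypothetical helper: groups elements of a source iterator into lists
// holding at most batchSize elements each (the last batch may be smaller).
final class BatchingIterator<T> implements Iterator<List<T>> {
    private final Iterator<T> source;
    private final int batchSize;

    BatchingIterator(Iterator<T> source, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.source = source;
        this.batchSize = batchSize;
    }

    @Override
    public boolean hasNext() {
        return source.hasNext();
    }

    @Override
    public List<T> next() {
        if (!source.hasNext()) {
            throw new NoSuchElementException();
        }
        List<T> batch = new ArrayList<>(batchSize);
        // Pull from the source only until this batch is full.
        while (batch.size() < batchSize && source.hasNext()) {
            batch.add(source.next());
        }
        return batch;
    }
}

final class BatchingDemo {
    public static void main(String[] args) {
        // Stand-in for BufferedReader.lines().iterator().
        Iterator<String> lines = Arrays.asList("a", "b", "c", "d", "e").iterator();
        Iterator<List<String>> batches = new BatchingIterator<>(lines, 2);

        // Wrap the batching iterator in a spliterator and build a
        // sequential stream from it (second argument false = not parallel).
        StreamSupport.stream(
                Spliterators.spliteratorUnknownSize(batches, Spliterator.ORDERED),
                false)
            .forEach(System.out::println);
        // prints [a, b], then [c, d], then [e]
    }
}
```

Run sequentially, each batch is pulled only when the previous one has been consumed. Passing true instead of false to StreamSupport.stream gives the parallel case discussed in this thread, where batches are buffered before they are processed. On JDK versions where Gatherers are available, Stream.gather(Gatherers.windowFixed(n)) provides the same grouping without the iterator round trip.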