Re: combineGroup get false results

2019-08-22 Thread Fabian Hueske
Hi, If all key fields are primitive types (long) or String, their hash values should be deterministic. There are two things that can go wrong: 1) Records are assigned to the wrong group. 2) The computation of a group is buggy. I'd first check that 1) is correct. Can you replace the sum function

Re: combineGroup get false results

2019-08-22 Thread anissa moussaoui
Hi Fabian, My GroupReduce function sums one column of the input rows in each group. My key fields are an array of multiple types, in this case String and long. The result that I'm posting is just a sample of the output dataset. Thank you in advance! Anissa On Thu, 22 Aug 2019 at 11:24, Fabian
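
The message above describes a grouping key built as an array mixing a String and a long. As a side note for readers, here is a minimal plain-Java sketch (not Flink code, and not taken from the poster's job) of why the container used for a composite key can matter: Java arrays inherit identity-based `equals` and `hashCode`, so two arrays with the same contents count as different keys, while a `List` compares by content. The class and method names are hypothetical, and whether this applies to the job in this thread is an assumption that the thread does not confirm.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: plain Java collections, no Flink API involved.
public class KeyGroupingSketch {

    // Sums values per key, using the array object itself as the map key.
    // Arrays use identity-based equals/hashCode, so equal contents do not merge.
    static Map<Object, Long> sumByArrayKey(Object[][] keys, long[] values) {
        Map<Object, Long> sums = new HashMap<>();
        for (int i = 0; i < keys.length; i++) {
            sums.merge(keys[i], values[i], Long::sum);
        }
        return sums;
    }

    // Sums values per key, wrapping each array in a List.
    // Lists use content-based equals/hashCode, so equal contents merge.
    static Map<List<Object>, Long> sumByListKey(Object[][] keys, long[] values) {
        Map<List<Object>, Long> sums = new HashMap<>();
        for (int i = 0; i < keys.length; i++) {
            sums.merge(Arrays.asList(keys[i]), values[i], Long::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        Object[][] keys = { {"a", 1L}, {"a", 1L}, {"b", 2L} };
        long[] values = {10L, 20L, 5L};
        System.out.println("groups with array keys: " + sumByArrayKey(keys, values).size());
        System.out.println("groups with list keys:  " + sumByListKey(keys, values).size());
    }
}
```

With array keys the two `("a", 1)` rows stay in separate groups. With list keys they merge into one group whose sum is 30.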

Re: combineGroup get false results

2019-08-22 Thread Fabian Hueske
Hi Anissa, This looks strange. If I understand your code correctly, your GroupReduce function is summing up a field. Looking at the results that you posted, it seems as if there is some data missing (the total sum does not seem to match). For groupReduce it is important that the grouping keys are

Re: combineGroup get false results

2019-08-22 Thread anissa moussaoui
Thanks for your feedback! Sorry, I did in fact use reduceGroup, but it gives different results when I change the parallelism to 8 (more than 1). The correct results are the ones obtained with parallelism 1, and I want to set it to 8. I do not know how to get the same result when modifying the parallelism us

Re: combineGroup get false results

2019-08-22 Thread Fabian Hueske
Hi Anissa, Are you using combineGroup or reduceGroup? Your question refers to combineGroup, but the code only shows reduceGroup. combineGroup is non-deterministic by design to enable efficient partial results without network and disk IO. reduceGroup is deterministic given a deterministic key extr
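
To make the distinction in the message above concrete, here is a small plain-Java simulation (a sketch only; it does not use the Flink API, and the round-robin partitioning and all names are assumptions made for illustration). A combine-style step sums each key locally inside every partition, so its output depends on how records happen to be spread across partitions. A final reduce-style step that merges all partial sums per key yields the same totals for any parallelism.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical simulation of "partial combine, then final reduce" in plain Java.
public class CombineThenReduceSketch {

    // Combine step: records are spread round-robin over `parallelism` partitions
    // (an assumed layout) and each partition sums its own records per key.
    static List<Map<String, Long>> combine(String[] keys, long[] values, int parallelism) {
        List<Map<String, Long>> partials = new ArrayList<>();
        for (int p = 0; p < parallelism; p++) {
            partials.add(new TreeMap<>());
        }
        for (int i = 0; i < keys.length; i++) {
            partials.get(i % parallelism).merge(keys[i], values[i], Long::sum);
        }
        return partials;
    }

    // Final reduce step: merge the partial sums of all partitions per key.
    static Map<String, Long> reduce(List<Map<String, Long>> partials) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((key, sum) -> totals.merge(key, sum, Long::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        String[] keys = {"a", "b", "a", "a", "b"};
        long[] values = {1L, 2L, 3L, 4L, 5L};
        System.out.println("partials, parallelism 1: " + combine(keys, values, 1));
        System.out.println("partials, parallelism 2: " + combine(keys, values, 2));
        System.out.println("totals,   parallelism 1: " + reduce(combine(keys, values, 1)));
        System.out.println("totals,   parallelism 2: " + reduce(combine(keys, values, 2)));
    }
}
```

In this sketch the partial sums differ between parallelism 1 and 2, while the merged totals are identical. Treating the combine output as the final result is therefore only safe when a full reduce over the same keys follows it.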

combineGroup get false results

2019-08-20 Thread anissa moussaoui
Hi, I used the combineGroup function to reduce groups of a very large dataset. With parallelism 1 I get different results than with parallelism 8, and the correct results are those obtained with parallelism 1. I also used the Table API to group the dataset and select sum