Hi,
If all key fields are primitive types (long) or String, their hash values
should be deterministic.
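To illustrate why this matters: if a key's hash depends on object identity rather than value (as, for example, a raw array used as a grouping key would), records with equal key values can be routed to different partitions. A plain-Python simulation of hash partitioning (illustrative only, not Flink code; the `Key` class and `partition` helper are hypothetical):

```python
class Key:
    """Wraps a (string, long) key pair. It defines no value-based
    __hash__, so hashing it directly falls back to object identity."""
    def __init__(self, name, num):
        self.name, self.num = name, num

def partition(records, parallelism, key_hash):
    """Route each (key, value) record to a partition by key hash."""
    parts = [[] for _ in range(parallelism)]
    for key, value in records:
        parts[key_hash(key) % parallelism].append((key, value))
    return parts

# Two records with equal key *values* but distinct key *objects*.
records = [(Key("a", 1), 10), (Key("a", 1), 32)]

# Identity-based hash: the two records usually land in different
# partitions, so each partial group sums only part of the data.
identity_parts = partition(records, 8, id)

# Value-based hash: both records always share one partition.
value_parts = partition(records, 8, lambda k: hash((k.name, k.num)))
```

With the value-based hash, the per-key result is independent of parallelism; with the identity-based hash it is not.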
There are two things that can go wrong:
1) Records are assigned to the wrong group.
2) The computation of a group is buggy.
I'd first check that 1) is correct.
Can you replace the sum function
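One way to check point 1) without Flink is to compare per-key record counts at parallelism 1 and 8: if grouping is correct, the counts must match before the sums can. A hypothetical plain-Python sketch of that check (the `group_counts` helper is illustrative, not part of any API):

```python
from collections import Counter

def group_counts(records, parallelism):
    """Hash-partition records by key, then count records per key per
    partition and merge the partial counts."""
    parts = [[] for _ in range(parallelism)]
    for key, value in records:
        parts[hash(key) % parallelism].append((key, value))
    total = Counter()
    for part in parts:
        for key, _ in part:
            total[key] += 1
    return total

records = [(("a", 1), 10), (("a", 1), 32), (("b", 2), 5)]

# With deterministic (value-based) key hashes, the merged per-key
# counts are independent of the parallelism.
assert group_counts(records, 1) == group_counts(records, 8)
```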
Hi Fabian,
My GroupReduce function sums one column of the input rows of each group.
My key fields are an array of mixed types, in this case String and long.
The results I posted are just a sample of the output dataset.
Thank you in advance!
Anissa
On Thu, Aug 22, 2019 at 11:24, Fabian
Hi Anissa,
This looks strange. If I understand your code correctly, your GroupReduce
function is summing up a field.
Looking at the results that you posted, it seems as if there is some data
missing (the total sum does not seem to match).
For groupReduce it is important that the grouping keys are
Thanks for your feedback!
Sorry, I did indeed use reduceGroup, but it gives different results when I
change the parallelism to 8 (more than 1); the correct results are those
obtained with parallelism 1, and I want to set it to 8.
I do not know how to get the same result when modifying the parallelism
us
Hi Anissa,
Are you using combineGroup or reduceGroup?
Your question refers to combineGroup, but the code only shows reduceGroup.
combineGroup is non-deterministic by design to enable efficient partial
results without network and disk IO.
reduceGroup is deterministic given a deterministic key extraction.
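The difference can be sketched in plain Python (a simulation of the described semantics, not Flink's actual API): a combiner runs independently on each local partition and emits partial results, so its raw output depends on how records were split, whereas a grouped reduce sees all records of a key and is deterministic.

```python
from collections import defaultdict

def combine_per_partition(partitions):
    """Like combineGroup: each partition emits its own partial sums,
    without any shuffle across partitions."""
    out = []
    for part in partitions:
        sums = defaultdict(int)
        for key, value in part:
            sums[key] += value
        out.extend(sums.items())
    return out

def reduce_grouped(records):
    """Like groupBy(...).reduceGroup: exactly one summed record per key."""
    sums = defaultdict(int)
    for key, value in records:
        sums[key] += value
    return sorted(sums.items())

records = [("a", 10), ("a", 32), ("b", 5)]

# Parallelism 2: the "a" records are split across two partitions.
partitions = [[("a", 10), ("b", 5)], [("a", 32)]]

partial = combine_per_partition(partitions)  # emits two partial "a" records
final = reduce_grouped(records)              # one record per key

# Partial combining is only safe if a final grouped reduce merges it:
assert reduce_grouped(partial) == final
```

This is why a standalone combineGroup yields different output at parallelism 8 than at parallelism 1, while reduceGroup does not.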
Hi,
I used the combineGroup function to reduce groups of a very large dataset.
With parallelism 1 I get different results than with parallelism 8, and the
correct results are those obtained with parallelism 1.
I also used the Table API to group the dataset and select the sum