Hello,

I have a very large dataset A that I need to left join with a dataset B that is half its size. That is to say, half of A's records will each be matched with one record of B, and the other half will be matched with null values.
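
For illustration only, the expected result of such a left join can be sketched in plain Java. This is not the actual job and does not use the CoGroup API; the class name `LeftJoinSketch`, the method `leftJoin` and the (key, value) record layout are all assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of left-join semantics; not the real batch job.
class LeftJoinSketch {

    // Each record is a String[] {key, value}.
    // Returns one String[] {key, valueFromA, valueFromB} per record of A;
    // valueFromB is null when B has no record for that key.
    static List<String[]> leftJoin(List<String[]> a, List<String[]> b) {
        // Index B by key. Assumes at most one B record per key,
        // as described in the message above.
        Map<String, String> bByKey = new HashMap<>();
        for (String[] rec : b) {
            bByKey.put(rec[0], rec[1]);
        }

        List<String[]> joined = new ArrayList<>();
        for (String[] rec : a) {
            // Map.get returns null for keys absent from B (the unmatched half of A).
            joined.add(new String[] {rec[0], rec[1], bByKey.get(rec[0])});
        }
        return joined;
    }
}
```

With A holding keys k1 to k4 and B holding only k1 and k2, the output keeps all four A records, and the rows for k3 and k4 carry a null on the B side.
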
I used a CoGroup for that, but my batch fails because YARN kills the container due to memory problems. I guess that is because one worker receives half of dataset A (the unmatched records), and that is too much for a single JVM.

Is my diagnosis right? Is there a better way to left join unbalanced datasets?

Best regards,
Arnaud