This is not fully correct. If you want fewer files, you need to move some
data to other nodes, because not all of the data for a given file is on the
node that writes it (this also applies within a single node, but then it is
cheaper from a network perspective). Hence a shuffle is needed.
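To make the point above concrete, here is a toy model in plain Python. It is only a sketch under stated assumptions and does not reflect how Spark actually schedules or merges partitions: `plan_merge`, the node names, and the record counts are all hypothetical. It groups input partitions into fewer output files and counts how many records each writer already holds locally versus how many it would have to fetch from other nodes.

```python
from collections import defaultdict


def plan_merge(partitions, num_files):
    """Group partitions into num_files output files and report data movement.

    partitions: list of (node, record_count) tuples, one per input partition.
    Returns one dict per output file with the writer node and the number of
    records that are local to it or must be fetched from other nodes.
    (Hypothetical helper for illustration; not a Spark API.)
    """
    if not 1 <= num_files <= len(partitions):
        raise ValueError("num_files must be between 1 and the partition count")

    # Keep partitions from the same node adjacent so merged groups are
    # local wherever possible.
    ordered = sorted(partitions, key=lambda p: p[0])

    # Split the ordered list into num_files contiguous groups of
    # near-equal length.
    base, extra = divmod(len(ordered), num_files)
    groups, start = [], 0
    for i in range(num_files):
        size = base + (1 if i < extra else 0)
        groups.append(ordered[start:start + size])
        start += size

    plan = []
    for group in groups:
        per_node = defaultdict(int)
        for node, count in group:
            per_node[node] += count
        # Write the file on the node that already holds most of its records.
        writer = max(sorted(per_node), key=lambda n: per_node[n])
        total = sum(per_node.values())
        plan.append({
            "writer": writer,
            "local_records": per_node[writer],
            "remote_records": total - per_node[writer],
        })
    return plan


partitions = [("node-a", 100), ("node-a", 120), ("node-b", 90), ("node-c", 110)]
for entry in plan_merge(partitions, 2):
    print(entry)
```

In this toy input, the first output file is built entirely from data already on its writer node, while the second has to pull one partition from another node. Keeping all four partitions as separate files moves nothing; merging everything into one file moves the most.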
> Am 15.10.2018 um 05:04 schrie
Thanks John,
Actually, I need the full date and time difference, not just the date
difference, which I guess is not supported.
Let me know if it's possible, or if any UDF is available for this.
Thanks And Regards,
Paras
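For reference, the computation being asked for (a full date and time difference rather than a whole-day difference) can be sketched in plain Python. This is only an illustration of the logic a UDF might wrap, not an existing built-in: `datetime_diff`, the timestamp format, and the sample values are assumptions, and time zones are not handled.

```python
from datetime import datetime


def datetime_diff(start, end, fmt="%Y-%m-%d %H:%M:%S"):
    """Return the difference end - start broken into days, hours, minutes, seconds.

    Both arguments are strings in the given format (assumed here to be
    'YYYY-MM-DD HH:MM:SS'). Values are treated as naive timestamps, so no
    time-zone or daylight-saving adjustment is applied.
    """
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    total_seconds = int(delta.total_seconds())
    sign = -1 if total_seconds < 0 else 1

    # Break the absolute difference into calendar-independent units.
    remaining = abs(total_seconds)
    days, remaining = divmod(remaining, 86400)
    hours, remaining = divmod(remaining, 3600)
    minutes, seconds = divmod(remaining, 60)

    return {
        "sign": sign,
        "days": days,
        "hours": hours,
        "minutes": minutes,
        "seconds": seconds,
        "total_seconds": total_seconds,
    }


result = datetime_diff("2018-10-12 09:48:00", "2018-10-15 05:04:30")
print(result)
```

A date-only difference would report these two sample timestamps as 3 days apart, while the full difference is 2 days, 19 hours, 16 minutes and 30 seconds.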
From: John Zhuge
Sent: Friday, October 12, 2018 9:48:
Hi, ishizaki-san,
Cool activity, I left some comments on the doc.
best,
takeshi
On Mon, Oct 15, 2018 at 12:05 AM Kazuaki Ishizaki
wrote:
> Hello community,
>
> I am writing this e-mail in order to start a discussion about adding
> a structured intermediate representation for generating Java code
sure, i understand that currently the workaround is to add a shuffle. but
that's just a workaround, not a satisfactory solution: we shouldn't have to
introduce another shuffle (an expensive operation) just to reduce the
number of files.
logically all you need is a map-phase with fewer tasks after the re
You have a heavy workload. You want to run it with many tasks for better
performance and stability (no OOM), but you also want to run it with few
tasks to avoid too many small files. The reality is that you mostly can't
reach both goals at once; they conflict with each other. The solution I can
thin
Hello community,
I am writing this e-mail in order to start a discussion about adding
a structured intermediate representation for generating Java code from a
program using the DataFrame or Dataset API, in addition to the current
String-based representation.
This addition is based on the discussions i