Re: Coalesce behaviour

2018-10-14 Thread Jörn Franke
This is not fully correct. If you have fewer files, then you need to move some data to other nodes, because not all of the data is there for writing (this holds even on the same node, although it is then easier from a network perspective). Hence a shuffle is needed. > Am 15.10.2018 um 05:04 schrie

Re: Timestamp Difference/operations

2018-10-14 Thread Paras Agarwal
Thanks John. I actually need the full date and time difference, not just the date difference, which I guess is not supported. Let me know if it's possible, or whether any UDF is available for this. Thanks and regards, Paras From: John Zhuge Sent: Friday, October 12, 2018 9:48:
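
A minimal sketch of the arithmetic behind a "full date and time difference", written in plain Python rather than Spark so that it runs on its own. The function name `timestamp_diff`, the timestamp format, and the sample values are assumptions made for illustration and do not come from the thread. It assumes the end timestamp is not earlier than the start.

```python
from datetime import datetime

# Hypothetical helper: the name and default format are illustrative assumptions.
def timestamp_diff(start: str, end: str, fmt: str = "%Y-%m-%d %H:%M:%S"):
    """Return (days, hours, minutes, seconds) between two timestamp strings.

    Assumes end >= start; a negative difference is not handled here.
    """
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    total_seconds = int(delta.total_seconds())
    # Break the total number of seconds into calendar-independent units.
    days, rem = divmod(total_seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return days, hours, minutes, seconds

print(timestamp_diff("2018-10-12 09:48:00", "2018-10-14 12:05:30"))
# prints (2, 2, 17, 30)
```

In Spark SQL the same breakdown could presumably be applied to a difference of epoch seconds, for example the values returned by `unix_timestamp`, but that is a suggestion and not something confirmed in this thread.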

Re: SPIP: SPARK-25728 Structured Intermediate Representation (Tungsten IR) for generating Java code

2018-10-14 Thread Takeshi Yamamuro
Hi, ishizaki-san, Cool activity, I left some comments on the doc. best, takeshi On Mon, Oct 15, 2018 at 12:05 AM Kazuaki Ishizaki wrote: > Hello community, > > I am writing this e-mail in order to start a discussion about adding > structure intermediate representation for generating Java code

Re: Coalesce behaviour

2018-10-14 Thread Koert Kuipers
sure, i understand currently the workaround is to add a shuffle. but that's just a workaround, not a satisfactory solution: we shouldn't have to introduce another shuffle (an expensive operation) just to reduce the number of files. logically all you need is a map-phase with fewer tasks after the re

Re: Coalesce behaviour

2018-10-14 Thread Wenchen Fan
You have a heavy workload. You want to run it with many tasks for better performance and stability (no OOM), but you also want to run it with few tasks to avoid too many small files. The reality is, mostly you can't reach these 2 goals together, they conflict with each other. The solution I can thin
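
The trade-off discussed in this thread can be sketched with a toy model in plain Python (no Spark required). The function names `stage_task_counts` and `coalesce_groups` are hypothetical, and the contiguous grouping is a simplification; Spark's real coalescing logic also takes data locality into account. The model only assumes what the thread describes: without a shuffle, reducing the number of output files also reduces the number of tasks doing the upstream work, while adding a shuffle keeps the upstream parallelism at the cost of an extra stage.

```python
# Toy model, not Spark code: all names here are illustrative assumptions.

def stage_task_counts(upstream_partitions: int, output_files: int, shuffle: bool):
    """Return the number of tasks in each stage of the job."""
    if shuffle:
        # Shuffle: the heavy upstream work keeps its parallelism,
        # then a second stage writes the smaller number of files.
        return [upstream_partitions, output_files]
    # No shuffle: the merge is folded into the same stage, so the
    # upstream work itself runs with only as many tasks as output files.
    return [min(upstream_partitions, output_files)]

def coalesce_groups(num_partitions: int, num_tasks: int):
    """Assign contiguous parent partitions to each task (no record movement)."""
    num_tasks = min(num_tasks, num_partitions)
    groups = [[] for _ in range(num_tasks)]
    for p in range(num_partitions):
        groups[p * num_tasks // num_partitions].append(p)
    return groups

print(stage_task_counts(2000, 20, shuffle=False))  # prints [20]
print(stage_task_counts(2000, 20, shuffle=True))   # prints [2000, 20]
print(coalesce_groups(8, 2))  # prints [[0, 1, 2, 3], [4, 5, 6, 7]]
```

In this model the shuffle-free plan produces few files but squeezes the whole workload into 20 tasks, while the shuffled plan keeps 2000 tasks for the heavy work and pays for an additional stage.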

SPIP: SPARK-25728 Structured Intermediate Representation (Tungsten IR) for generating Java code

2018-10-14 Thread Kazuaki Ishizaki
Hello community, I am writing this e-mail in order to start a discussion about adding a structured intermediate representation for generating Java code from a program using the DataFrame or Dataset API, in addition to the current String-based representation. This addition is based on the discussions i