Have you tried something like this?

spark.conf.set("spark.sql.shuffle.partitions", "5")
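For context, a minimal sketch of where that setting would go, assuming a SparkSession named spark; the Avro path and "some_column" are placeholders. Note that spark.sql.shuffle.partitions applies to stages that shuffle data (joins, aggregations), not to the initial file scan:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("shuffle-partitions-example")
    .getOrCreate()

  // Sets the number of partitions used when shuffling data for joins and aggregations.
  spark.conf.set("spark.sql.shuffle.partitions", "5")

  // Hypothetical Avro input path; the "avro" format needs the spark-avro package (Spark 2.4+).
  val df = spark.read.format("avro").load("/path/to/avro/files")

  // The first (scan) stage is still driven by the input file layout, but the groupBy
  // below shuffles, so the stage after it runs with 5 tasks.
  df.groupBy("some_column").count().show()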
On Wed, Apr 3, 2019 at 8:37 PM Arthur Li <arthur...@flipp.com> wrote:
> Hi Sparkers,
>
> I noticed that in my Spark application, the number of tasks in the first
> stage is equal to the number of files read by the application (at least for
> Avro) when the number of CPU cores is less than the number of files. If
> there are more CPU cores than files, it is usually equal to the default
> parallelism. Why does it behave like this? Does this require a lot of
> resources from the driver? Is there anything we can do to decrease the
> number of tasks (partitions) in the first stage without merging files
> before loading?
>
> Thanks,
> Arthur

--
Thanks,
Jason