Have you tried something like this?
spark.conf.set("spark.sql.shuffle.partitions", "5")
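
Note that the setting above governs the stages after a shuffle. For the first (file scan) stage, as far as I understand Spark 2.x, the task count comes from how the input files are split and packed, which is controlled by spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes. Below is a simplified plain-Python model of that planning, not Spark's actual code. The name plan_partitions and its parameters are hypothetical; the 128 MB and 4 MB values are the documented defaults of those two settings, and the files are assumed to be splittable (as Avro is).

```python
# Simplified model (an assumption, not Spark's code) of how a file-based
# scan is planned into first-stage partitions.
MB = 1024 * 1024


def plan_partitions(file_sizes, default_parallelism,
                    max_partition_bytes=128 * MB, open_cost=4 * MB):
    """Return the modelled number of scan partitions (first-stage tasks)."""
    # Target split size: bounded above by maxPartitionBytes and below by
    # openCostInBytes, otherwise total bytes divided by the parallelism.
    total = sum(size + open_cost for size in file_sizes)
    bytes_per_core = total // default_parallelism
    max_split = min(max_partition_bytes, max(open_cost, bytes_per_core))

    # Cut every file into chunks of at most max_split bytes.
    chunks = []
    for size in file_sizes:
        offset = 0
        while offset < size:
            chunks.append(min(max_split, size - offset))
            offset += max_split
    chunks.sort(reverse=True)

    # Pack chunks largest first; close the current partition when the
    # next chunk would push it past max_split.
    partitions = 0
    current = 0      # bytes (plus open cost) in the partition being filled
    open_chunks = 0  # number of chunks in the partition being filled
    for chunk in chunks:
        if open_chunks and current + chunk > max_split:
            partitions += 1
            current, open_chunks = 0, 0
        current += chunk + open_cost
        open_chunks += 1
    return partitions + (1 if open_chunks else 0)


files = [100 * MB] * 10
few_cores = plan_partitions(files, default_parallelism=4)
many_cores = plan_partitions(files, default_parallelism=40)
bigger_splits = plan_partitions(files, default_parallelism=4,
                                max_partition_bytes=512 * MB)
```

With ten 100 MB files, this model gives 10 tasks on 4 cores (one per file), 40 tasks on 40 cores (equal to the parallelism), and 5 tasks on 4 cores once max_partition_bytes is raised to 512 MB. If the model holds for your job, raising spark.sql.files.maxPartitionBytes would be the knob to try for fewer first-stage tasks without merging files.
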
On Wed, Apr 3, 2019 at 8:37 PM Arthur Li <[email protected]> wrote:
> Hi Sparkers,
>
> I noticed that in my Spark application, the number of tasks in the first
> stage equals the number of files read by the application (at least for
> Avro) when the number of CPU cores is less than the number of files. When
> there are more CPU cores than files, it is usually equal to the default
> parallelism. Why does it behave like this? Would this require a lot of
> resources from the driver? Is there any way to decrease the number of
> tasks (partitions) in the first stage without merging files before loading?
>
> Thanks,
> Arthur
--
Thanks,
Jason