> If you use a huge amount of data then you will see more tasks - that means
> it has some kind of lower bound on num-tasks. It may require some digging.
> Other formats did not seem to have this issue.
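> 
> A quick way to see this from the pyspark shell - a minimal sketch with
> placeholder paths, comparing a small and a large parquet source:
> 
>     # one task per partition of the read stage
>     small_df = sqlCtx.read.parquet("/path/to/one_small_file.parquet")
>     print(small_df.rdd.getNumPartitions())
> 
>     large_df = sqlCtx.read.parquet("/path/to/many_gigabytes/")
>     print(large_df.rdd.getNumPartitions())  # grows with the amount of data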
>
> On Sun, May 8, 2016 at 12:10 AM, Johnny W. wrote:
>
>> The file s

Ashish Dubey wrote:
> How big is your file, and can you also share the code snippet?
>
>
> On Saturday, May 7, 2016, Johnny W. wrote:
>
>> hi spark-user,
>>
>> I am using Spark 1.6.0. When I call sqlCtx.read.parquet to create a
>> dataframe from a parquet data source with a single parquet file, it
>> yields a stage with lots of small tasks.

hi spark-user,
I am using Spark 1.6.0. When I call sqlCtx.read.parquet to create a
dataframe from a parquet data source with a single parquet file, it yields
a stage with lots of small tasks. It seems the number of tasks depends on
how many executors I have instead of how many parquet files/partitions.
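In case it helps, a minimal sketch of the call (the path is a placeholder),
plus how the partition count of the scan can be compared against the
cluster's default parallelism, which is what seems to track the task count:

    df = sqlCtx.read.parquet("/path/to/single_file.parquet")

    # the read stage runs one task per partition of the scan
    print(df.rdd.getNumPartitions())

    # defaultParallelism scales with the total executor cores
    print(sc.defaultParallelism)
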
Hi spark-user,
I am using Spark 1.6 to build a reverse index for one month of Twitter data
(~50GB). The HDFS split size is 1GB, so by default sc.textFile creates 50
partitions. I'd like to increase the parallelism by increasing the number of
input partitions. Thus, I use textFile(..., 200) to yield 200 partitions.
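
A minimal sketch of what I mean (the path is a placeholder):

    rdd = sc.textFile("hdfs:///data/twitter/2016-04/*", minPartitions=200)
    print(rdd.getNumPartitions())  # roughly 200 instead of the default ~50

(repartition(200) after the read would also work, at the cost of a shuffle,
but the goal here is the extra parallelism at read time.)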