What is this dataset? A text file or a Parquet file?

There is a serialization issue in Spark SQL that makes this very slow;
see https://issues.apache.org/jira/browse/SPARK-6055. It will be fixed
very soon.

Davies

On Fri, Feb 27, 2015 at 1:59 PM, Guillaume Guy
<guillaume.c....@gmail.com> wrote:
> Hi Sean:
>
> Thanks for your feedback. Scala is much faster: the count completes in ~1
> minute (vs. 17 minutes). I would expect Scala to be 2-5x faster, but this gap
> seems to be more than that. Is that also your conclusion?
>
> Thanks.
>
>
> Best,
>
> Guillaume Guy
>  +1 919 - 972 - 8750
>
> On Fri, Feb 27, 2015 at 9:12 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>> That's very slow, and there are a lot of possible explanations. The
>> first one that comes to mind is: I assume your YARN and HDFS are on
>> the same machines, but are you running executors on all HDFS nodes
>> when you run this? If not, a lot of these reads could be remote.
>>
>> You have 6 executor slots, but your data exists in 96 blocks on HDFS,
>> so you could read with up to 96-way parallelism. You say you're
>> CPU-bound, though; normally I'd wonder if this was simply a case of
>> under-using parallelism.
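(Sean's parallelism arithmetic can be sketched as follows; the HDFS path in
the comments is hypothetical and the figures are taken from the thread, not
measured.)

```python
import math

# Figures from the thread: ~10 GB of data stored in 96 HDFS blocks,
# read by 3 executors x 2 cores = 6 task slots.
num_blocks = 96
task_slots = 3 * 2

# With Spark's default of one partition per HDFS block, the 96 read tasks
# run in ceil(96 / 6) "waves" over the 6 available slots.
waves = math.ceil(num_blocks / task_slots)
print(waves)  # 16

# In PySpark you can inspect and raise the read parallelism, e.g.:
#   rdd = sc.textFile("hdfs:///path/to/data", minPartitions=96)  # path is hypothetical
#   rdd.getNumPartitions()
# Here 96 partitions already exceed the 6 slots, so adding executors/cores
# (more slots), rather than more partitions, is what would shorten the waves.
```

This is only back-of-the-envelope reasoning about task scheduling, not a
measurement of the actual job.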
>>
>> I also wonder whether the bottleneck is something to do with PySpark in
>> this case; it might be good to try it in the spark-shell to check.
>>
>> On Fri, Feb 27, 2015 at 2:00 PM, Guillaume Guy
>> <guillaume.c....@gmail.com> wrote:
>> > Dear Spark users:
>> >
>> > I want to see if anyone has an idea of the performance for a small
>> > cluster.
>> >
>> > Reading from HDFS, what should the performance of a count() operation
>> > on a 10GB RDD with 100M rows be, using PySpark? I looked at the CPU
>> > usage; all 6 cores are at 100%.
>> >
>> > Details:
>> >
>> > master yarn-client
>> > num-executors 3
>> > executor-cores 2
>> > driver-memory 5g
>> > executor-memory 2g
>> > Distribution: Cloudera
>> >
>> > I also attached the screenshot.
>> >
>> > Right now I'm at 17 minutes, which seems quite slow. Any idea what
>> > decent performance with a similar configuration would look like?
>> >
>> > If it's way off, I would appreciate any pointers as to ways to improve
>> > performance.
>> >
>> > Thanks.
>> >
>> > Best,
>> >
>> > Guillaume
>> >
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: user-h...@spark.apache.org
>
>
