Hi Imran,

" How do I tell if it's spilling to disk?"

That is a very valid question. I do not have a quantitative metric I can
use to state that, out of X GB of data in Spark, Y GB has been spilled to
disk because of the volume of data.

Unlike an RDBMS, Spark uses memory as opposed to shared memory. When an
RDBMS hits its memory limit it starts swapping, which one can see with
swap -l.

The only way I believe one can gauge it is by looking at the parameters
passed to spark-submit:

${SPARK_HOME}/bin/spark-submit \
                --packages com.databricks:spark-csv_2.11:1.3.0 \
                --jars /home/hduser/jars/spark-streaming-kafka-assembly_2.10-1.6.1.jar \
                --class "${FILE_NAME}" \
                --master spark://50.140.197.217:7077 \
                --executor-memory=12G \
                --executor-cores=12 \
                --num-executors=2 \
                ${JAR_FILE}
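
For completeness, the same executor settings can also be specified in code
rather than on the command line. A minimal PySpark sketch (assuming a
Spark 1.6-style SparkConf/SQLContext; the app name is hypothetical and the
values simply mirror the example above):

    # Build a SparkConf carrying the same executor sizing as the
    # spark-submit example; the app name is made up for illustration.
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    conf = (SparkConf()
            .setAppName("spill-check")                 # hypothetical name
            .setMaster("spark://50.140.197.217:7077")  # standalone master above
            .set("spark.executor.memory", "12g")       # --executor-memory=12G
            .set("spark.executor.cores", "12"))        # --executor-cores=12

    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)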

So I have not seen a tool that shows the spillage of data quantitatively.

HTH

Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 28 April 2016 at 16:36, Imran Akbar <skunkw...@gmail.com> wrote:

> Thanks Dr. Mich, Jorn,
>
> It's about 150 million rows in the cached dataset.  How do I tell if it's
> spilling to disk?  I didn't really see any logs to that effect.
> How do I determine the optimal number of partitions for a given input
> dataset?  What's too much?
>
> regards,
> imran
>
> On Mon, Apr 25, 2016 at 3:55 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Are you sure it is not spilling to disk?
>>
>> How many rows are cached in your result set -> sqlContext.sql("SELECT *
>> FROM raw WHERE (dt_year=2015 OR dt_year=2016)")
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn:
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 25 April 2016 at 23:47, Imran Akbar <skunkw...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I'm running a simple query like this through Spark SQL:
>>>
>>> sqlContext.sql("SELECT MIN(age) FROM data WHERE country = 'GBR' AND
>>> dt_year=2015 AND dt_month BETWEEN 1 AND 11 AND product IN
>>> ('cereal')").show()
>>>
>>> which takes 3 minutes to run against an in-memory cache of 9 GB of data.
>>>
>>> The data was 100% cached in memory before I ran the query (see
>>> screenshot 1).
>>> The data was cached like this:
>>> data = sqlContext.sql("SELECT * FROM raw WHERE (dt_year=2015 OR
>>> dt_year=2016)")
>>> data.cache()
>>> data.registerTempTable("data")
>>> and then I ran an action query to load the data into the cache.
>>>
>>> I see lots of rows of logs like this:
>>> 16/04/25 22:39:11 INFO MemoryStore: Block rdd_13136_2856 stored as
>>> values in memory (estimated size 2.5 MB, free 9.7 GB)
>>> 16/04/25 22:39:11 INFO BlockManager: Found block rdd_13136_2856 locally
>>> 16/04/25 22:39:11 INFO MemoryStore: 6 blocks selected for dropping
>>> 16/04/25 22:39:11 INFO BlockManager: Dropping block rdd_13136_3866 from
>>> memory
>>>
>>> Screenshot 2 shows the job page of the longest job.
>>>
>>> The data was partitioned in Parquet by month, country, and product
>>> before I cached it.
>>>
>>> Any ideas what the issue could be?  This is running on localhost.
>>>
>>> regards,
>>> imran
>>>
>>>
>>
>>
>
