Hi bitfox,
You need to run pip install sparkmeasure first. Then you can launch it in pyspark.
>>> from sparkmeasure import StageMetrics
>>> stagemetrics = StageMetrics(spark)
>>> stagemetrics.runandmeasure(locals(), 'spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(100)").show()')
+---------+
| count(1)|
+---------+
|100000000|
+---------+
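In case it helps to see why the --packages flag alone was not enough: as far as I know, it only fetches the JVM jar, while the Python module comes from the separate pip package. Below is a minimal sketch of a check you could run before the import. The helper name python_wrapper_installed is made up for illustration and is not part of sparkmeasure.

```python
import importlib.util


def python_wrapper_installed(module_name="sparkmeasure"):
    """Return True if module_name can be found by the current Python interpreter.

    Hypothetical helper for illustration only; it does not import the module,
    it just asks the import system whether it could be located.
    """
    return importlib.util.find_spec(module_name) is not None


# Report whether the Python-side wrapper is visible to this interpreter.
print("sparkmeasure importable:", python_wrapper_installed())
```

If this prints False inside the pyspark shell, the package is most likely missing from the Python environment that pyspark is using.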
Regards,
Hollis
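P.S. If you only want a rough wall-clock number while you get sparkmeasure working, a plain Python timer around an action is one option. This is only a sketch under my own assumptions: the wall_clock helper and the timings dict are illustrative names, and the sum() call is a stand-in for a Spark action such as a count(). As Gourav and Luca pointed out, a single elapsed time hides most of what happens in a distributed job, so treat the number with care.

```python
import time
from contextlib import contextmanager


@contextmanager
def wall_clock(label, results):
    """Store the elapsed wall-clock seconds of the with-block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with wall_clock("demo", timings):
    # Stand-in workload. In pyspark, put an action here (for example a count()),
    # because transformations alone are lazy and would not run anything.
    sum(range(1_000_000))

print(f"demo took {timings['demo']:.4f} s")
```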
At 2021-12-24 09:18:19, [email protected] wrote:
>Hello list,
>
>I am running Spark 3.2.0.
>
>After I started pyspark with:
>$ pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
>
>I can't load from the module sparkmeasure:
>
>>>> from sparkmeasure import StageMetrics
>Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
>ModuleNotFoundError: No module named 'sparkmeasure'
>
>Do you know why? @Luca thanks.
>
>
>On 2021-12-24 04:20, [email protected] wrote:
>> Thanks Gourav and Luca. I will try the tools you provided on
>> GitHub.
>>
>> On 2021-12-23 23:40, Luca Canali wrote:
>>> Hi,
>>>
>>> I agree with Gourav that just measuring execution time is a simplistic
>>> approach that may lead you to miss important details, in particular
>>> when running distributed computations.
>>>
>>> WebUI, REST API, and metrics instrumentation in Spark can be quite
>>> useful for further drill down. See
>>> https://spark.apache.org/docs/latest/monitoring.html
>>>
>>> You can also have a look at this tool that takes care of automating
>>> collecting and aggregating some executor task metrics:
>>> https://github.com/LucaCanali/sparkMeasure
>>>
>>> Best,
>>>
>>> Luca
>>>
>>> From: Gourav Sengupta <[email protected]>
>>> Sent: Thursday, December 23, 2021 14:23
>>> To: [email protected]
>>> Cc: user <[email protected]>
>>> Subject: Re: measure running time
>>>
>>> Hi,
>>>
>>> I do not think that such time comparisons make any sense at all in
>>> distributed computation. Just saying that an operation in RDD and
>>> Dataframe can be compared based on their start and stop time may not
>>> provide any valid information.
>>>
>>> You will have to look into the details of the timing and the steps. For
>>> example, please look at the Spark UI to see how timings are calculated
>>> in distributed computing mode. There are several well-written papers
>>> on this.
>>>
>>> Thanks and Regards,
>>>
>>> Gourav Sengupta
>>>
>>> On Thu, Dec 23, 2021 at 10:57 AM <[email protected]> wrote:
>>>
>>>> hello community,
>>>>
>>>> In pyspark, how can I measure the running time of a command?
>>>> I just want to compare the running time of the RDD API and the DataFrame
>>>> API, as in this blog post of mine:
>>>>
>>>> https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/
>>>>
>>>> I tried spark.time(), but it doesn't work.
>>>> Thank you.
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: [email protected]