Well, that's not strictly correct. I had two different memory leaks on the
driver side because of caching, both of them in streaming jobs: one in
Scala (I forgot to unpersist the cached DataFrame) and the other in
PySpark (unpersisting the cached DataFrames wasn't enough because of the
Python bindings).
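
For the Scala case, the fix boiled down to pairing every persist with a
matching unpersist. A minimal sketch of the pattern (in Java, to match the
snippets further down the thread; the path and names are made up for
illustration):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.storage.StorageLevel;

public class CacheLifecycle {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().getOrCreate();

    // Illustrative reference dataset reused across micro-batches.
    Dataset<Row> ref = spark.read().parquet("/tmp/reference");
    ref.persist(StorageLevel.MEMORY_AND_DISK());

    // ... join ref against each micro-batch here ...

    // The leak came from skipping this step: without a matching
    // unpersist(), the cached blocks and their driver-side bookkeeping
    // accumulate for the lifetime of the stream.
    ref.unpersist();

    spark.stop();
  }
}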

I'm planning to write articles for both cases.

On Mon, 17 Feb 2025, 21:25, Subhasis Mukherjee <subhtec...@gmail.com>
wrote:

> > I understood that caching a table pegged the RDD partitions into the
> memory of the executors holding the partition.
>
> Your understanding is correct. Nothing to worry about on the driver side
> while creating the temp view.
>
> On Sun, Feb 16, 2025, 10:47 PM Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> Ok, let us look at this:
>>
>>    - Temporary views: metadata is stored on the driver; the data remains
>>    distributed across the executors.
>>    - Caching/persisting: data is stored in the executors' memory or on
>>    disk.
>>    - The statement "created on driver memory" refers to the metadata of
>>    temporary views, not the actual data. The data itself is not loaded
>>    into the driver unless explicitly collected.
>>
>>
>> In summary:
>>
>>    - Data is stored in the executors' memory or disk during normal
>>    operations.
>>    - The driver only holds metadata unless you explicitly collect data
>>    to it.
>>    - Temporary views and caching/persisting are different mechanisms
>>    with different memory implications.
>>
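>> A small illustration of the distinction, as a Java sketch (assuming an
>> existing spark session; the path and view name are made up for the
>> example):
>>
>> // Registering a temp view stores only metadata on the driver; the
>> // rows stay partitioned across the executors.
>> Dataset<Row> df = spark.read().parquet("/tmp/example");
>> df.createOrReplaceTempView("my_view");
>>
>> // Caching keeps those partitions in executor memory (or on disk)
>> // once materialised; still nothing substantial lands on the driver.
>> spark.catalog().cacheTable("my_view");
>>
>> // Only an explicit action such as collectAsList() pulls the actual
>> // rows into driver memory, which is where a driver OOM can occur.
>> java.util.List<Row> rows =
>>     spark.sql("SELECT * FROM my_view").collectAsList();
>>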
>> HTH
>>
>> Dr Mich Talebzadeh,
>>
>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>
>>    view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>> On Sun, 16 Feb 2025 at 13:29, Tim Robertson <timrobertson...@gmail.com>
>> wrote:
>>
>>> Thanks Mich
>>>
>>> > created on driver memory
>>>
>>> That I hadn't anticipated. Are you sure?
>>> I understood that caching a table pegged the RDD partitions into the
>>> memory of the executors holding the partition.
>>>
>>> On Sun, Feb 16, 2025 at 11:17 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Yep, created in driver memory. Watch for OOM if the size becomes too
>>>> large:
>>>>
>>>> spark-submit --driver-memory 8G ...
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh,
>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>>
>>>>    view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>> On Sun, 16 Feb 2025 at 09:16, Tim Robertson <timrobertson...@gmail.com>
>>>> wrote:
>>>>
>>>>> Answering my own question: global temp views get created in the
>>>>> global_temp database, so they can be accessed as follows.
>>>>>
>>>>> Thanks
>>>>>
>>>>> // Global temp views live in the reserved global_temp database, so
>>>>> // the cache call must use the qualified name:
>>>>> Dataset<Row> s = spark.read().parquet("/tmp/svampeatlas/*");
>>>>> s.createOrReplaceGlobalTempView("occurrence_svampe");
>>>>> spark.catalog().cacheTable("global_temp.occurrence_svampe");
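>>>>>
>>>>> Any query against it then needs the qualified name as well, e.g.
>>>>>
>>>>> spark.sql("SELECT * FROM global_temp.occurrence_svampe")
>>>>>     .write()
>>>>>     .parquet("/tmp/export");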
>>>>>
>>>>>
>>>>> On Sun, Feb 16, 2025 at 10:05 AM Tim Robertson <
>>>>> timrobertson...@gmail.com> wrote:
>>>>>
>>>>>> Hi folks
>>>>>>
>>>>>> Is it possible to cache a table for shared use across sessions with
>>>>>> Spark Connect?
>>>>>> I'd like to load a read-only table once, which many sessions will
>>>>>> then query, to improve performance.
>>>>>>
>>>>>> This is an example of the kind of thing that I have been trying, but
>>>>>> have not succeeded with.
>>>>>>
>>>>>>   SparkSession spark =
>>>>>> SparkSession.builder().remote("sc://localhost").getOrCreate();
>>>>>>   Dataset<Row> s = spark.read().parquet("/tmp/svampeatlas/*");
>>>>>>
>>>>>>   // this works if it is not "global"
>>>>>>   s.createOrReplaceGlobalTempView("occurrence_svampe");
>>>>>>   spark.catalog().cacheTable("occurrence_svampe");
>>>>>>
>>>>>>   // this fails with a table not found when a global view is used
>>>>>>   spark
>>>>>>       .sql("SELECT * FROM occurrence_svampe")
>>>>>>       .write()
>>>>>>       .parquet("/tmp/export");
>>>>>>
>>>>>> Thank you
>>>>>> Tim
>>>>>>
>>>>>
