OK, let us look at this:

   - Temporary views: metadata is stored on the driver; the data itself
   remains distributed across the executors.
   - Caching/persisting: data is stored in the executors' memory or on disk.
   - The statement "created on driver memory" refers to the metadata of
   temporary views, not the actual data. The data itself is not loaded onto
   the driver unless explicitly collected.


In summary:

   - Data is stored in the executors' memory or disk during normal
   operations.
   - The driver only holds metadata unless you explicitly collect data to
   it.
   - Temporary views and caching/persisting are different mechanisms with
   different memory implications.
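
The distinction can be sketched with a minimal example. This is only an illustration, not the setup from the thread: a local master stands in for a real cluster, and the data is generated rather than read from the parquet files Tim mentioned. The table name follows the thread.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CacheSketch {
    public static void main(String[] args) {
        // Assumption: local master for illustration only; on a real cluster
        // the cached partitions live in the executors' memory/disk.
        SparkSession spark = SparkSession.builder()
                .master("local[*]")
                .appName("cache-sketch")
                .getOrCreate();

        Dataset<Row> df = spark.range(1000).toDF("id");

        // Registering a temp view stores only metadata (name -> logical plan)
        // on the driver; no rows move anywhere.
        df.createOrReplaceTempView("occurrence_svampe");

        // Caching pins the computed partitions in executor storage
        // (memory, spilling to disk), not on the driver.
        spark.catalog().cacheTable("occurrence_svampe");

        // The driver only receives rows when you explicitly collect;
        // an aggregate like count() ships back a single number.
        long n = spark.sql("SELECT * FROM occurrence_svampe").count();
        System.out.println(n);

        spark.stop();
    }
}
```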

HTH

Dr Mich Talebzadeh,

Architect | Data Science | Financial Crime | Forensic Analysis | GDPR

   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

On Sun, 16 Feb 2025 at 13:29, Tim Robertson <timrobertson...@gmail.com>
wrote:

> Thanks Mich
>
> > created on driver memory
>
> That I hadn't anticipated. Are you sure?
> I understood that caching a table pegged the RDD partitions into the
> memory of the executors holding the partition.
>
>
>
>
> On Sun, Feb 16, 2025 at 11:17 AM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> yep. created on driver memory. watch for OOM if the size becomes too large
>>
>> spark-submit --driver-memory 8G ...
>>
>> HTH
>>
>> Dr Mich Talebzadeh,
>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>
>> On Sun, 16 Feb 2025 at 09:16, Tim Robertson <timrobertson...@gmail.com>
>> wrote:
>>
>>> Answering my own question. Global temp views get created in the
>>> global_temp database, so can be accessed thusly.
>>>
>>> Thanks
>>>
>>> Dataset<Row> s = spark.read().parquet("/tmp/svampeatlas/*");
>>> s.createOrReplaceGlobalTempView("occurrence_svampe");
>>> spark.catalog().cacheTable("global_temp.occurrence_svampe");
>>>
>>>
>>> On Sun, Feb 16, 2025 at 10:05 AM Tim Robertson <
>>> timrobertson...@gmail.com> wrote:
>>>
>>>> Hi folks
>>>>
>>>> Is it possible to cache a table for shared use across sessions with
>>>> spark connect?
>>>> I'd like to load a read only table once that many sessions will then
>>>> query to improve performance.
>>>>
>>>> This is an example of the kind of thing that I have been trying, but
>>>> have not succeeded with.
>>>>
>>>>   SparkSession spark =
>>>> SparkSession.builder().remote("sc://localhost").getOrCreate();
>>>>   Dataset<Row> s = spark.read().parquet("/tmp/svampeatlas/*");
>>>>
>>>>   // this works if it is not "global"
>>>>   s.createOrReplaceGlobalTempView("occurrence_svampe");
>>>>   spark.catalog().cacheTable("occurrence_svampe");
>>>>
>>>>   // this fails with a table not found when a global view is used
>>>>   spark
>>>>       .sql("SELECT * FROM occurrence_svampe")
>>>>       .write()
>>>>       .parquet("/tmp/export");
>>>>
>>>> Thank you
>>>> Tim
>>>>
>>>
