Well, I don't know what an "in-memory only" Spark would achieve. The Spark UI shows the amount of disk usage pretty well, and by default memory is used first anyway.
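For context, where Spark writes to local disk and how it splits the heap between execution and caching are configurable. A minimal sketch (paths and layout here are illustrative, not recommendations; the property names and defaults are from Spark's configuration docs):

```
# conf/spark-defaults.conf -- illustrative sketch, paths hypothetical
spark.local.dir              /mnt/disk1/spark,/mnt/disk2/spark   # shuffle files and spills land here
spark.memory.fraction        0.6   # fraction of heap shared by execution and storage (default)
spark.memory.storageFraction 0.5   # portion of that protected for cached blocks (default)
```

Even with generous memory, shuffle map outputs are still written to local disk under spark.local.dir, which is one reason a purely memory-only deployment is hard to achieve.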
Spark is predominantly an in-memory application, but it is still an application on top of the OS. Effectively it performs the classic disk-based Hadoop map-reduce operation "in memory" to speed up processing. So, like most applications, there is a state of Spark, the running code and the OS(s) where disk usage will be needed. This is akin to swap space on the OS itself, and I quote: "Swap space is used when your operating system decides that it needs physical memory for active processes and the amount of available (unused) physical memory is insufficient. When this happens, inactive pages from the physical memory are then moved into the swap space, freeing up that physical memory for other uses."

$ free
              total        used        free      shared  buff/cache   available
Mem:       65659732    30116700     1429436     2341772    34113596    32665372
Swap:     104857596      550912   104306684

HTH

view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

*Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On Fri, 20 Aug 2021 at 12:50, Jacek Laskowski <ja...@japila.pl> wrote:

> Hi,
>
> I've been exploring BlockManager and the stores for a while now and am
> tempted to say that a memory-only Spark setup would be possible (except
> shuffle blocks). Is this correct?
>
> What about shuffle blocks? Do they have to be stored on disk (in
> DiskStore)?
>
> I think broadcast variables are in-memory first, so except where on-disk
> storage level is explicitly used (by Spark devs), there's no reason not to
> have Spark in-memory only.
>
> (I was told that one of the differences between Trino/Presto vs Spark SQL
> is that Trino keeps all processing in-memory only and will blow up, while
> Spark uses disk to avoid OOMEs.)
>
> Pozdrawiam,
> Jacek Laskowski
> ----
> https://about.me/JacekLaskowski
> "The Internals Of" Online Books <https://books.japila.pl/>
> Follow me on https://twitter.com/jaceklaskowski