Re: Does Spark read the same file twice if two stages are using the same DataFrame?

2023-05-06 Thread Winston Lai
When memory is not sufficient to keep the cached data for your jobs across two different stages, the file might be read twice because Spark may have to evict the earlier cache to make room for other jobs. In those cases, a spill may be triggered when Spark writes your data from memory to disk. One way to check
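
A minimal sketch of that check, not from the original message and assuming a hypothetical input path "data.parquet": persisting with MEMORY_AND_DISK lets evicted partitions spill to local disk instead of forcing a second scan of the source file, and df.storageLevel (or the Storage tab in the Spark UI) shows what is actually in effect.

    # Hedged sketch; "data.parquet" is a made-up path.
    from pyspark.sql import SparkSession
    from pyspark import StorageLevel

    spark = SparkSession.builder.appName("cache-check").getOrCreate()

    df = spark.read.parquet("data.parquet")
    # MEMORY_AND_DISK: partitions evicted from memory spill to local disk
    # rather than being recomputed by re-reading the source file.
    df.persist(StorageLevel.MEMORY_AND_DISK)

    n1 = df.count()  # first action: scans the file and populates the cache
    n2 = df.count()  # later action: served from cache (memory or spilled disk)

    print(df.storageLevel)  # confirm the storage level in effect

Whether a later stage actually reuses the cache still depends on available executor memory, which is exactly the situation described above.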

Re: Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-06 Thread Mich Talebzadeh
You can create a DataFrame from your SQL result set and work with it in Python the way you want.

    ## you don't need all these
    import findspark
    findspark.init()
    from pyspark.sql import SparkSession
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql.functions import udf, col, curren
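
As a hedged sketch of that approach (the sample data and column names here are made up, and this stands in for whatever SQL produces your result set): once the result is a DataFrame, the array column can be turned into a map of element counts with an explode / groupBy / map_from_entries pipeline.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("array-to-map").getOrCreate()

    # Stand-in for the real query result; replace with your own spark.sql(...).
    df = spark.createDataFrame([(1, ["a", "b", "a"])], ["id", "arr"])

    counts = (
        df.select("id", F.explode("arr").alias("elem"))   # one row per element
          .groupBy("id", "elem").count()                  # count each element
          .groupBy("id")
          .agg(F.map_from_entries(                        # fold rows back into a map
                   F.collect_list(F.struct("elem", "count"))
               ).alias("elem_counts"))
    )
    counts.show(truncate=False)   # id=1 -> {a -> 2, b -> 1}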

Re: Write DataFrame with Partition and choose Filename in PySpark

2023-05-06 Thread Mich Talebzadeh
So what are you intending to do with the result set produced?

Mich Talebzadeh, Lead Solutions Architect/Engineering Lead, Palantir Technologies Limited, London, United Kingdom. View my LinkedIn profile: https://en.everybodywiki.com/Mich_