Thanks Mich. I understand what I am supposed to do now and will try these options.
I still don't understand how Spark will split the large file. I have a 10 GB file which I want to split automatically after reading. I could split the file before loading it, but that would be a very big requirement change for all our data pipelines. Is there a way to split the file once it is read to achieve parallelism? I will try a groupBy on one column to see if that improves my job.

On Wed, Mar 24, 2021 at 10:56 AM Mich Talebzadeh <[email protected]> wrote:

> How does Spark establish there is a csv header as a matter of interest?
>
> Example
>
> val df = spark.read.option("header", true).csv(location)
>
> I need to tell spark to ignore the header correct?
>
> From Spark Read CSV file into DataFrame - SparkByExamples
> <https://sparkbyexamples.com/spark/spark-read-csv-file-into-dataframe/>
>
> If you have a header with column names on file, you need to explicitly
> specify true for header option using option("header",true)
> <https://sparkbyexamples.com/spark/spark-read-csv-file-into-dataframe/#header>
> Without this, the API treats the header as a data record.
>
> Second point which may not be applicable to the newer versions of Spark. My
> understanding is that the gz file is not splittable, therefore Spark needs
> to read the whole file using a single core which will slow things down (CPU
> intensive). After the read is done the data can be shuffled to increase
> parallelism.
>
> HTH
>
> On Wed, 24 Mar 2021 at 12:40, Sean Owen <[email protected]> wrote:
>
>> No need to do that. Reading the header with Spark automatically is
>> trivial.
>>
>> On Wed, Mar 24, 2021 at 5:25 AM Mich Talebzadeh <[email protected]> wrote:
>>
>>> If it is a csv then it is a flat file somewhere in a directory I guess.
>>>
>>> Get the header out by doing
>>>
>>> /usr/bin/zcat csvfile.gz | head -n 1
>>> Title Number,Tenure,Property
>>> Address,District,County,Region,Postcode,Multiple Address Indicator,Price
>>> Paid,Proprietor Name (1),Company Registration No. (1),Proprietorship
>>> Category (1),Country Incorporated (1),Proprietor (1) Address (1),Proprietor
>>> (1) Address (2),Proprietor (1) Address (3),Proprietor Name (2),Company
>>> Registration No. (2),Proprietorship Category (2),Country Incorporated
>>> (2),Proprietor (2) Address (1),Proprietor (2) Address (2),Proprietor (2)
>>> Address (3),Proprietor Name (3),Company Registration No. (3),Proprietorship
>>> Category (3),Country Incorporated (3),Proprietor (3) Address (1),Proprietor
>>> (3) Address (2),Proprietor (3) Address (3),Proprietor Name (4),Company
>>> Registration No. (4),Proprietorship Category (4),Country Incorporated
>>> (4),Proprietor (4) Address (1),Proprietor (4) Address (2),Proprietor (4)
>>> Address (3),Date Proprietor Added,Additional Proprietor Indicator
>>>
>>> 10GB is not much of a big CSV file
>>>
>>> that will resolve the header anyway.
>>>
>>> Also how are you running Spark, in local mode (single JVM) or
>>> other distributed modes (yarn, standalone)?
>>>
>>> HTH
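
For reference, the "read first, then shuffle to increase parallelism" idea from this thread could look roughly like the sketch below. It is only an illustration under stated assumptions, not anyone's actual pipeline: the object and method names, the path /tmp/csvfile.gz, local mode, and the partition count of 80 are all made up for the example, and District is simply one of the header columns quoted in the thread.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object RepartitionAfterRead {

  // Reads a CSV file that has a header row, then redistributes the rows
  // across numPartitions so that later stages can run in parallel.
  def readAndRepartition(spark: SparkSession, path: String, numPartitions: Int): DataFrame = {
    val df = spark.read
      .option("header", "true") // first line holds column names, not data
      .csv(path)

    // A gzip file is not splittable, so the read itself runs as one task.
    // repartition() adds a shuffle that spreads the rows out afterwards.
    df.repartition(numPartitions)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("repartition-after-read")
      .master("local[*]") // assumption: local mode, for illustration only
      .getOrCreate()

    val path = "/tmp/csvfile.gz"                  // hypothetical location
    val df = readAndRepartition(spark, path, 80)  // 80 is an arbitrary choice

    println(s"partitions after repartition: ${df.rdd.getNumPartitions}")

    // Example aggregation on one column from the header quoted in the thread.
    df.groupBy("District").count().show(10)

    spark.stop()
  }
}
```

Note that the repartition only helps the stages after the read; decompressing the single .gz file still happens on one core. If the file is stored uncompressed (or in a splittable format), Spark splits the input by itself at read time, with the split size governed by spark.sql.files.maxPartitionBytes, so no explicit repartition is needed just to get parallel reads.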
