This doesn't work as given here (https://stackoverflow.com/questions/36107581/change-output-filename-prefix-for-dataframe-write), but the answer suggests using the FileOutputFormat class. Will try that. Thanks. Regards.
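For the archives, roughly what I plan to try, based on that answer. This is an untested sketch: "jobA" is a placeholder prefix, 'spark' is the active SparkSession, and it drops to the RDD API with plain-text output, so keeping parquet would need a parquet-specific OutputFormat instead of TextOutputFormat:

import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

// Hadoop's new-API FileOutputFormat names its files
// "<mapreduce.output.basename>-r-NNNNN" (default "part"),
// so each job can set its own prefix.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("mapreduce.output.basename", "jobA")  // placeholder prefix

df.toJSON.rdd
  .map(line => (NullWritable.get(), new Text(line)))
  .saveAsNewAPIHadoopFile(
    someDirectory,  // same path the external table points at
    classOf[NullWritable],
    classOf[Text],
    classOf[TextOutputFormat[NullWritable, Text]],
    hadoopConf)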
On Sun, Jul 18, 2021 at 12:44 AM Jörn Franke <jornfra...@gmail.com> wrote:

Spark heavily depends on Hadoop for writing files. You can try to set the Hadoop property mapreduce.output.basename:

https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#hadoopConfiguration--

On 18.07.2021 at 01:15, Eric Beabes <mailinglist...@gmail.com> wrote:

Mich - You're suggesting changing the "Path". The problem is that we have an EXTERNAL table created on top of this path, so the "Path" CANNOT change. If we could change it, this problem would be easy to solve. My question is about changing the "Filename".

As Ayan pointed out, Spark doesn't seem to allow "prefixes" for the filenames!

On Sat, Jul 17, 2021 at 1:58 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Use this:

df.write.mode("overwrite").format("parquet").saveAsTable("test.ABCD")

That will create a parquet table in the database test, which is essentially a Hive partition in the format

/user/hive/warehouse/test.db/abcd/000000_0

On Sat, 17 Jul 2021 at 20:45, Eric Beabes <mailinglist...@gmail.com> wrote:

I am not sure you've understood the question. Here's how we're saving the DataFrame:

df
  .coalesce(numFiles)
  .write
  .partitionBy(partitionDate)
  .mode("overwrite")
  .format("parquet")
  .save(someDirectory)

Now where would I add a 'prefix' in this one?

On Sat, Jul 17, 2021 at 10:54 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Try it and see if it works:

fullyQualifiedTableName = appName + '_' + tableName

On Sat, 17 Jul 2021 at 18:02, Eric Beabes <mailinglist...@gmail.com> wrote:

I don't think Spark allows adding a 'prefix' to the file name, does it? If it does, please tell me how. Thanks.

On Sat, Jul 17, 2021 at 9:47 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Jobs have names in Spark. You can prefix the app name to the file names when writing to the directory, I guess:

val sparkConf = new SparkConf().
  setAppName(sparkAppName).
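Since Spark itself doesn't take a file-name option, one way to act on that would be to rename the part files after the write. A rough, untested sketch: it assumes the write into outputDir has already happened and that the directory is flat; with partitionBy you would have to recurse into the partition subdirectories:

import org.apache.hadoop.fs.{FileSystem, Path}

// Prefix each freshly written part file with the job's app name.
val prefix = spark.sparkContext.appName  // or any job-specific tag
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

fs.listStatus(new Path(outputDir))
  .filter(_.getPath.getName.startsWith("part-"))
  .foreach { status =>
    val src = status.getPath
    fs.rename(src, new Path(src.getParent, s"$prefix-${src.getName}"))
  }

Renames are cheap on HDFS, but note that on object stores like S3 a rename is a copy.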
On Sat, 17 Jul 2021 at 17:40, Eric Beabes <mailinglist...@gmail.com> wrote:

The reason we have two jobs writing to the same directory is that the data is partitioned by 'day' (yyyymmdd) but the job runs hourly. Maybe the only way to do this is to create an hourly partition (/yyyymmdd/hh). Is that the only way to solve this?

On Fri, Jul 16, 2021 at 5:45 PM ayan guha <guha.a...@gmail.com> wrote:

IMHO this is a bad idea, especially in failure scenarios.

How about creating a subfolder for each of the jobs?

On Sat, 17 Jul 2021 at 9:11 am, Eric Beabes <mailinglist...@gmail.com> wrote:

We've two (or more) jobs that write data into the same directory via the DataFrame save method. We need to be able to figure out which job wrote which file, maybe by providing a 'prefix' to the file names. I was wondering if there's any 'option' that allows us to do this. Googling didn't come up with any solution, so I thought of asking the Spark experts on this mailing list.

Thanks in advance.

--
Best Regards,
Ayan Guha
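For completeness, a minimal sketch of the hourly-partition layout (/yyyymmdd/hh) Eric describes above. It assumes an event-timestamp column named ts, and reuses the partitionDate and someDirectory names from the write shown earlier in the thread:

import org.apache.spark.sql.functions.{col, date_format}

// One extra partition column gives each hourly run its own directory
// (.../partitionDate=yyyymmdd/hh=HH), so the jobs stop colliding.
df.withColumn("hh", date_format(col("ts"), "HH"))
  .write
  .partitionBy(partitionDate, "hh")
  .mode("overwrite")
  .format("parquet")
  .save(someDirectory)

With overwrite mode you would likely also want spark.sql.sources.partitionOverwriteMode=dynamic (Spark 2.3+), so that a run replaces only the partitions it actually touches.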