Hi Eric - yes, that may be the best way to resolve this. I have not seen any
specific way to define the names of the actual files written by Spark.
Finally, make sure you optimize the number of files written.
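
For the file count, a repartition (or coalesce) just before the write is the
usual knob. A rough, untested sketch - paths and the target file count are
placeholders you would tune for your own data volume:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("writer").getOrCreate()
val df = spark.read.parquet("/data/input")  // placeholder input path

// Collapse the output to a bounded number of files per run;
// pick the count from how much data each run produces.
df.repartition(8)
  .write
  .mode("append")
  .parquet("/data/output")                  // placeholder output path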

On Sun, Jul 18, 2021 at 2:39 AM Eric Beabes <mailinglist...@gmail.com>
wrote:

> The reason we have two jobs writing to the same directory is that the data
> is partitioned by 'day' (yyyymmdd) but the job runs hourly. Maybe the only
> way to do this is to create an hourly partition (/yyyymmdd/hh). Is that the
> only way to solve this?
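>
> Something like this (untested) is what I mean, assuming the rows carry an
> 'event_ts' timestamp column we can derive 'day' and 'hour' from - note
> Spark's partitionBy writes key=value directories, i.e. day=yyyymmdd/hour=hh:
>
> import org.apache.spark.sql.functions.{col, date_format}
>
> // df is the hourly batch being written
> val withHour = df
>   .withColumn("day", date_format(col("event_ts"), "yyyyMMdd"))
>   .withColumn("hour", date_format(col("event_ts"), "HH"))
>
> // Each hourly run then lands in its own day=.../hour=.. subdirectory.
> withHour.write
>   .partitionBy("day", "hour")
>   .mode("append")
>   .parquet("/data/output")  // placeholder output path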
>
> On Fri, Jul 16, 2021 at 5:45 PM ayan guha <guha.a...@gmail.com> wrote:
>
>> IMHO this is a bad idea, especially in failure scenarios.
>>
>> How about creating a separate subfolder for each job?
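>>
>> i.e. each job appends under its own subpath, so provenance lives in the
>> directory name rather than the file name. A rough sketch - 'jobId' is
>> whatever identifier you give each job:
>>
>> val jobId = args(0)  // e.g. passed in via spark-submit arguments
>>
>> df.write
>>   .mode("append")
>>   .parquet(s"/data/output/job=$jobId")  // placeholder base path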
>>
>> On Sat, 17 Jul 2021 at 9:11 am, Eric Beabes <mailinglist...@gmail.com>
>> wrote:
>>
>>> We have two (or more) jobs that write data into the same directory via the
>>> DataFrame.save method. We need to be able to figure out which job wrote
>>> which file - maybe by providing a 'prefix' for the file names. I was
>>> wondering if there's an 'option' that allows us to do this. Googling
>>> didn't turn up any solution, so I thought of asking the Spark experts on
>>> this mailing list.
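>>>
>>> Ideally something like the below would work, but as far as I can tell no
>>> such option exists - the option name here is made up, purely to illustrate
>>> what we're after:
>>>
>>> df.write
>>>   .option("fileNamePrefix", "jobA")  // hypothetical option, not a real Spark API
>>>   .mode("append")
>>>   .parquet("/data/output")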
>>>
>>> Thanks in advance.
>>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>

-- 
Best Regards,
Ayan Guha
