Hi Kostas,

thanks for the quick answer.

I wouldn't dare to delete files in the InputFormat if they were split and processed in parallel...

As for using notifyCheckpointComplete - thanks for the suggestion, it looks pretty interesting, I'll try it out. Although I wonder a bit whether relying only on the modification timestamp is enough - many things may happen in one ms :)

thanks,

macie


On 18/10/2016 11:14, Kostas Kloudas wrote:
Hi Maciek,

Just a follow-up on the previous email: given that splits are read in parallel, when the ContinuousFileMonitoringFunction forwards the last split, it does not mean that the final split is going to be processed last. If the node it gets assigned to is fast enough, it may finish before the others.

This assumption only holds if you have a parallelism of 1.

Cheers,
Kostas

On Oct 18, 2016, at 11:05 AM, Kostas Kloudas <k.klou...@data-artisans.com> 
wrote:

Hi Maciek,

Currently this functionality is not supported, but it seems like a good addition.
Actually, given that the feature is rather new, we were thinking of opening a discussion on the dev mailing list in order to

i) discuss some current limitations of the Continuous File Processing source
ii) see how people use it and adjust our features accordingly

I will let you know as soon as I open this thread.

By the way, for your use case: we should probably have a callback in notifyCheckpointComplete() that informs the source that a given checkpoint was successfully completed, and only then purge the already processed files. This could be a good solution.
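To make the idea concrete, here is a minimal stand-alone sketch of that pattern in plain Java (the class and method names are hypothetical stand-ins; in Flink the hook would be CheckpointListener.notifyCheckpointComplete(long checkpointId)). The source records which files were fully forwarded as of each checkpoint, and deletes them only once that checkpoint is confirmed, so a restore from an older checkpoint never needs a file that is already gone:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Tracks processed files per checkpoint and purges them once the
 *  checkpoint is confirmed (mirrors the checkpoint-listener pattern). */
class ProcessedFilePurger {
    // files fully processed since the last snapshot
    private final List<Path> pending = new ArrayList<>();
    // files frozen at each checkpoint, keyed by checkpoint id
    private final NavigableMap<Long, List<Path>> byCheckpoint = new TreeMap<>();

    void fileProcessed(Path p) { pending.add(p); }

    /** Called when a checkpoint is triggered: freeze the current pending set. */
    void snapshotState(long checkpointId) {
        byCheckpoint.put(checkpointId, new ArrayList<>(pending));
        pending.clear();
    }

    /** Called when a checkpoint completes: it is now safe to purge every
     *  file frozen at or before that checkpoint. */
    void notifyCheckpointComplete(long checkpointId) throws IOException {
        Map<Long, List<Path>> done = byCheckpoint.headMap(checkpointId, true);
        for (List<Path> files : done.values()) {
            for (Path p : files) {
                Files.deleteIfExists(p);
            }
        }
        done.clear(); // clears the view, removing the entries from the map
    }
}

public class PurgeDemo {
    public static void main(String[] args) throws IOException {
        Path a = Files.createTempFile("batch1-", ".csv");
        Path b = Files.createTempFile("batch2-", ".csv");

        ProcessedFilePurger purger = new ProcessedFilePurger();
        purger.fileProcessed(a);
        purger.snapshotState(1L);      // checkpoint 1 covers file a
        purger.fileProcessed(b);       // b was processed after checkpoint 1

        purger.notifyCheckpointComplete(1L);
        System.out.println("a deleted: " + !Files.exists(a)); // a deleted: true
        System.out.println("b kept: " + Files.exists(b));     // b kept: true
        Files.deleteIfExists(b);       // clean up
    }
}
```

Note that files processed after the last snapshot (file b above) survive until a later checkpoint confirms them, which is exactly the behavior you want for exactly-once restores.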

Thanks,
Kostas

On Oct 18, 2016, at 9:40 AM, Maciek Próchniak <m...@touk.pl> wrote:

Hi,

we want to monitor an HDFS (or local) directory, read the csv files that appear, and after successful processing - delete them (mainly not to run out of disk space...)

I'm not quite sure how to achieve this with the current implementation. Previously, when we read binary data (unsplittable files), we made a small hack and deleted them in our FileInputFormat - but now we want to use splits, and detecting which split is 'the last one' is no longer so obvious - of course it's also problematic when it comes to checkpointing...

So my question is - is there an idiomatic way of deleting processed files?


thanks,

maciek


