Hi:
I am working on a project where a few thousand text files (~20M in size) will
be dropped in an HDFS directory every 15 minutes. Data from the files will be
used to update counters in Cassandra (a non-idempotent operation). I was
wondering what the best way to deal with this is:
* Use text streaming and process the files as they are added to the
directory
* Use non-streaming text input and launch a Spark driver every 15
minutes to process files from a specified directory (a new directory for each
15-minute batch).
* Use a message queue to ingest data from the files and then read data
from the queue.
Also, is there a way to find out which text file is being processed, and when a
file has finished processing, for both the streaming and non-streaming RDDs? I
believe the filename is available in the WholeTextFileInputFormat, but is it
also available in standard or streaming text RDDs?
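
To make the filename part of the question concrete, here is a minimal PySpark
sketch of the whole-file approach I am aware of. It assumes an existing
SparkContext passed in as `sc`; the helper names and the directory path in the
usage note are made up for illustration. `wholeTextFiles` returns one
(file_path, file_content) record per file, so every file is read into memory
as a single record.

```python
def tag_lines_with_filename(pair):
    """Turn one (file_path, file_content) record into (file_path, line) records."""
    file_path, content = pair
    return [(file_path, line) for line in content.splitlines()]


def lines_by_file(sc, directory):
    # sc is an existing SparkContext (assumed to be created elsewhere).
    # wholeTextFiles yields one (file_path, file_content) pair per file,
    # and flatMap expands each pair into per-line records that keep the path.
    return sc.wholeTextFiles(directory).flatMap(tag_lines_with_filename)
```

It would be called as something like
`lines_by_file(sc, "hdfs:///incoming/batch-0001")` (a hypothetical path). What
I have not found is an equivalent for `textFile` or for the streaming text
input.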
Thanks
Mans