Hi all,

We’ve got time-stamped directories containing text files, stored in HDFS.

New files are added regularly, so we’re using a FileSource with a 
monitoring duration, so that it continuously picks up any new files.

The challenge is that we need to include the parent directory’s timestamp in 
the output, for doing time-window joins of this enrichment data with another 
stream.
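
To make the goal concrete, here is a minimal sketch of the value I want attached to each record. It only covers deriving the timestamp from a file path and says nothing about how to get the path out of the FileSource, which is the open question. The directory naming (yyyyMMdd-HHmm, read as UTC), the class name and the method name are all assumptions for illustration, not our real layout.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class ParentDirTimestamp {

    // Assumption: parent directories are named like "20210815-1200".
    private static final DateTimeFormatter DIR_FORMAT =
            DateTimeFormatter.ofPattern("yyyyMMdd-HHmm");

    /**
     * Returns the epoch-millisecond timestamp encoded in the name of the
     * directory that directly contains the given file. The directory name
     * is interpreted as UTC (also an assumption).
     */
    public static long parentTimestampMillis(String filePath) {
        int fileSep = filePath.lastIndexOf('/');
        if (fileSep <= 0) {
            throw new IllegalArgumentException("No parent directory in: " + filePath);
        }
        // Start of the parent directory name; -1 + 1 = 0 if there is no earlier '/'.
        int dirSep = filePath.lastIndexOf('/', fileSep - 1);
        String dirName = filePath.substring(dirSep + 1, fileSep);
        return LocalDateTime.parse(dirName, DIR_FORMAT)
                .toInstant(ZoneOffset.UTC)
                .toEpochMilli();
    }
}
```

Each emitted line would then carry something like parentTimestampMillis(pathOfTheFileBeingRead) alongside its text, so the downstream time-window join can use it as the event time.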

Previously I could extend the input format 
<https://stackoverflow.com/a/68764550/231762> to extract path information and 
emit a Tuple2<LongWritable, Text>.

But with the new FileSource architecture, I’m really not sure if it’s possible, 
or if so, the right way to go about doing it.

I’ve wandered through the source code (FileSource, AbstractFileSource, 
SourceReader, FileSourceReader, FileSourceSplit, ad nauseam) but haven’t seen 
any happy path to making that all work.

There might be a way using some really ugly hacks to TextLineFormat, 
reverse-engineering the FSDataInputStream to try to find information about 
the original file, but that feels very fragile.

Any suggestions?

Thanks!

— Ken


--------------------------
Ken Krugler
http://www.scaleunlimited.com
Custom big data solutions
Flink, Pinot, Solr, Elasticsearch


