Hi all,

We’ve got time-stamped directories containing text files, stored in HDFS.
New files are added regularly, so we’re using a FileSource with a monitoring duration so that it continuously picks up any new files.

The challenge is that we need to include the parent directory’s timestamp in the output, for doing time-window joins of this enrichment data with another stream.

Previously I could extend the input format (see <https://stackoverflow.com/a/68764550/231762>) to extract path information, and emit a Tuple2<LongWritable, Text>. But with the new FileSource architecture, I’m really not sure if it’s possible or, if so, what the right way to go about it is.

I’ve wandered through the source code (FileSource, AbstractFileSource, SourceReader, FileSourceReader, FileSourceSplit, ad nauseam) but haven’t seen any happy path to making that all work. There might be a way using some really ugly hacks to TextLineFormat, where it would reverse-engineer the FSDataInputStream to try to find information about the original file, but that feels very fragile.

Any suggestions?

Thanks!

-- Ken

--------------------------
Ken Krugler
http://www.scaleunlimited.com
Custom big data solutions
Flink, Pinot, Solr, Elasticsearch