Hi Andrea,

How large are these data files? The implementation you've shown reads each file into a single String, so it is only usable if the files are very small. If so, you're fine. If not, read on...
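To make the size concern concrete: an XML file can also be walked as a stream, without first loading it into one String. Below is a minimal sketch using the JDK's StAX API, assuming GPX track points are `trkpt` elements with `lat` and `lon` attributes. The names `GpxPoints`, `Point` and `readPoints` are hypothetical, made up for illustration; they are not part of Flink or any GPX library.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not a Flink class.
final class GpxPoints {

    /** Hypothetical value type: one latitude/longitude pair. */
    static final class Point {
        final double lat;
        final double lon;

        Point(double lat, double lon) {
            this.lat = lat;
            this.lon = lon;
        }
    }

    /** Streams through the XML and collects every trkpt element's lat/lon. */
    static List<Point> readPoints(InputStream in) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // GPX needs no DTD, so DTD processing is switched off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        XMLStreamReader reader = factory.createXMLStreamReader(in);
        List<Point> points = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "trkpt".equals(reader.getLocalName())) {
                    // Assumption: lat and lon are always present on trkpt.
                    double lat = Double.parseDouble(reader.getAttributeValue(null, "lat"));
                    double lon = Double.parseDouble(reader.getAttributeValue(null, "lon"));
                    points.add(new Point(lat, lon));
                }
            }
        } finally {
            reader.close();
        }
        return points;
    }

    public static void main(String[] args) throws Exception {
        String gpx = "<gpx xmlns=\"http://www.topografix.com/GPX/1/1\"><trk><trkseg>"
                + "<trkpt lat=\"49.5897\" lon=\"11.0078\"/>"
                + "<trkpt lat=\"49.5901\" lon=\"11.0082\"/>"
                + "</trkseg></trk></gpx>";
        List<Point> points =
                readPoints(new ByteArrayInputStream(gpx.getBytes(StandardCharsets.UTF_8)));
        System.out.println(points.size() + " points read");
    }
}
```

This only covers parsing one file. It still collects all points of that file in a list, and it does nothing about splitting a large file across parallel readers.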
Processing XML input files in parallel is tricky. As you've seen, XML is not a great format for this type of processing: the files are tricky to split and more complex to iterate through than simpler formats. However, others have implemented XMLInputFormat classes for Hadoop. Have you looked at these? Mahout has an XMLInputFormat implementation, for example, but I haven't used it directly.

Anyway, you can reuse Hadoop InputFormat implementations in Flink directly. This is likely a good route. See Flink's HadoopInputFormat class.

-Jamie

On Tue, Jun 7, 2016 at 7:35 AM, Andrea Cisternino <a.cistern...@gmail.com> wrote:

> Hi all,
>
> I am evaluating Apache Flink for processing large sets of Geospatial data.
> The use case I am working on will involve reading a certain number of GPX
> files stored on Amazon S3.
>
> GPX files are actually XML files and therefore cannot be read on a line by
> line basis.
> One GPX file will produce one or more Java objects that will contain the
> geospatial data we need to process (mostly a list of geographical points).
>
> To cover this use case I tried to extend the FileInputFormat class:
>
> public class WholeFileInputFormat extends FileInputFormat<String>
> {
>     private boolean hasReachedEnd = false;
>
>     public WholeFileInputFormat() {
>         unsplittable = true;
>     }
>
>     @Override
>     public void open(FileInputSplit fileSplit) throws IOException {
>         super.open(fileSplit);
>         hasReachedEnd = false;
>     }
>
>     @Override
>     public String nextRecord(String reuse) throws IOException {
>         // uses apache.commons.io.IOUtils
>         String fileContent = IOUtils.toString(stream, StandardCharsets.UTF_8);
>         hasReachedEnd = true;
>         return fileContent;
>     }
>
>     @Override
>     public boolean reachedEnd() throws IOException {
>         return hasReachedEnd;
>     }
> }
>
> This class returns the content of the whole file as a string.
>
> Is this the right approach?
> It seems to work when run locally with local files but I wonder if it would
> run into problems when tested in a cluster.
>
> Thanks in advance.
> Andrea.
>
> --
> Andrea Cisternino, Erlangen, Germany
> GitHub: http://github.com/acisternino
> GitLab: https://gitlab.com/u/acisternino
>

--
Jamie Grier
data Artisans, Director of Applications Engineering
@jamiegrier <https://twitter.com/jamiegrier>
ja...@data-artisans.com