Hi Andrea,

How large are these data files?  The implementation you've shown reads each
file into a single String in memory, so it is only usable if the files are
very small.  If they are, you're fine.  If not, read on...

Processing XML input files in parallel is tricky.  As you've seen, it's not
a great format for this type of processing: XML files are hard to split and
more complex to iterate through than simpler formats.  However, others have
implemented XMLInputFormat classes for Hadoop.  Have you looked at these?
Mahout has an XMLInputFormat implementation, for example, but I haven't used
it directly.
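
To make the splitting problem concrete, here is a minimal sketch of the
general idea behind tag-delimited XML readers: scan the stream for a start
tag, buffer everything up to the matching end tag, and emit that as one
record.  This is only an illustration using the plain JDK.  The class name
TagRecordScanner and the choice of <trk> as the record element are my own
assumptions; it is not the Mahout code and not a Flink API.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of Flink, Hadoop or Mahout. */
public final class TagRecordScanner {

    private TagRecordScanner() {}

    /**
     * Returns every chunk of text that begins with startTag and ends with the
     * next endTag. A record that is still open at end of input is dropped.
     * Nested elements with the same name are not handled.
     */
    public static List<String> extractRecords(Reader reader, String startTag, String endTag)
            throws IOException {
        if (startTag.isEmpty() || endTag.isEmpty()) {
            throw new IllegalArgumentException("tags must not be empty");
        }
        List<String> records = new ArrayList<>();
        StringBuilder window = new StringBuilder(); // last few chars seen outside a record
        StringBuilder record = null;                // non-null while inside a record
        int c;
        while ((c = reader.read()) != -1) {
            if (record == null) {
                // Slide a window of startTag.length() chars over the input.
                window.append((char) c);
                if (window.length() > startTag.length()) {
                    window.deleteCharAt(0);
                }
                if (startTag.contentEquals(window)) {
                    record = new StringBuilder(startTag);
                    window.setLength(0);
                }
            } else {
                // Inside a record: buffer until the text ends with endTag.
                record.append((char) c);
                int from = record.length() - endTag.length();
                if (from >= 0 && record.indexOf(endTag, from) == from) {
                    records.add(record.toString());
                    record = null;
                }
            }
        }
        return records;
    }

    /** Convenience overload for in-memory text. */
    public static List<String> extractRecords(String text, String startTag, String endTag) {
        try {
            return extractRecords(new StringReader(text), startTag, endTag);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For a GPX file you might call extractRecords(reader, "<trk>", "</trk>") and
parse each returned chunk separately.  Note what the sketch leaves out: a
real split-aware reader also has to decide which split owns a record that
straddles a split boundary, and that is the part that makes XML awkward.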

Anyway, you can reuse Hadoop InputFormat implementations in Flink
directly.  This is likely a good route.  See Flink's HadoopInputFormat
class.

-Jamie


On Tue, Jun 7, 2016 at 7:35 AM, Andrea Cisternino <a.cistern...@gmail.com>
wrote:

> Hi all,
>
> I am evaluating Apache Flink for processing large sets of geospatial data.
> The use case I am working on will involve reading a certain number of GPX
> files stored on Amazon S3.
>
> GPX files are actually XML files and therefore cannot be read on a
> line-by-line basis.
> One GPX file will produce one or more Java objects that will contain the
> geospatial data we need to process (mostly a list of geographical points).
>
> To cover this use case I tried to extend the FileInputFormat class:
>
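> import java.io.IOException;
> import java.nio.charset.StandardCharsets;
>
> import org.apache.commons.io.IOUtils;
> import org.apache.flink.api.common.io.FileInputFormat;
> import org.apache.flink.core.fs.FileInputSplit;
>
> public class WholeFileInputFormat extends FileInputFormat<String>
> {
>   // Set once the single record (the whole file) has been emitted.
>   private boolean hasReachedEnd = false;
>
>   public WholeFileInputFormat() {
>     // Never split a file: each file is read by exactly one task.
>     unsplittable = true;
>   }
>
>   @Override
>   public void open(FileInputSplit fileSplit) throws IOException {
>     super.open(fileSplit);
>     hasReachedEnd = false;
>   }
>
>   @Override
>   public String nextRecord(String reuse) throws IOException {
>     // Reads the entire file into memory (uses apache.commons.io.IOUtils).
>     String fileContent = IOUtils.toString(stream, StandardCharsets.UTF_8);
>     hasReachedEnd = true;
>     return fileContent;
>   }
>
>   @Override
>   public boolean reachedEnd() throws IOException {
>     return hasReachedEnd;
>   }
> }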
>
> This class returns the content of the whole file as a string.
>
> Is this the right approach?
> It seems to work when run locally with local files, but I wonder if it
> would run into problems when tested in a cluster.
>
> Thanks in advance.
>   Andrea.
>
> --
> Andrea Cisternino, Erlangen, Germany
> GitHub: http://github.com/acisternino
> GitLab: https://gitlab.com/u/acisternino
>



-- 

Jamie Grier
data Artisans, Director of Applications Engineering
@jamiegrier <https://twitter.com/jamiegrier>
ja...@data-artisans.com
