Hello,

I have multiple CSV files (file1, file2, file3), each with different columns and data. The possible column headers are finite and their format is known, so I would like to read the files and parse them based on their column structure. I already have the parsers.

e.g.:

file1 has columns (id, firstname, lastname)
file2 has columns (id, name)
file3 has columns (id, name_1, name_2, name_3, name_4)

I would like to read all those files, parse them, and output the resulting objects to a sink as Person { id, fullName }.
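
For clarity, Person is just a plain POJO along these lines (a simplified sketch; the real class may carry more fields):

public class Person {
    public String id;
    public String fullName;

    public Person() {}                       // no-arg constructor so Flink treats it as a POJO

    public Person(String id, String fullName) {
        this.id = id;
        this.fullName = fullName;
    }

    @Override
    public String toString() {
        return "Person { " + id + ", " + fullName + " }";
    }
}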

Example files would be:

file1:
------
id, firstname, lastname
33, John, Smith
55, Labe, Soni

file2:
------
id, name
5, Mitr Kompi
99, Squi Masw

file3:
------
id, name_1, name_2, name_3, name_4
1, Peter, Hov, Risti, Pena
2, Rii, Koni, Ques,

Expected output of my program would be:

Person { 33, John Smith }
Person { 55, Labe Soni }
Person { 5, Mitr Kompi }
Person { 99, Squi Masw }
Person { 1, Peter Hov Risti Pena }
Person { 2, Rii Koni Ques }



What I do now is:

My code (very simplified) is:

env.readFile().flatMap(new MyParser()).addSink(new MySink())

MyParser receives the rows one by one as plain strings. This means that when I run with parallelism > 1, I receive lines from any of the files and cannot tell which line comes from which file.
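
In a bit more detail, the job looks roughly like this (a simplified sketch; the path, job name, MyParser and MySink stand in for my real code):

import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class Job {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        String dir = "/path/to/input";   // directory containing file1, file2, file3

        env.readFile(new TextInputFormat(new Path(dir)), dir)   // emits the raw lines as Strings
           .flatMap(new MyParser())                             // FlatMapFunction<String, Person>
           .addSink(new MySink());                              // SinkFunction<Person>

        env.execute("csv-to-person");
    }
}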



What I would like to do is:

Be able to figure out which file I am reading from. Since I can only identify the file type from its first row (the column headers), I need to either send the first row to MyParser(), or send a tuple of <first row of the file being read, current row of the file being read> (see the sketch below for how MyParser could then dispatch on the header). Another option I can think of is some keyed function based on the first row, but I am not sure how to achieve that using readFile.
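
To illustrate the tuple idea, if each record arrived as <header line of the source file, current data line>, MyParser could look something like this (a hypothetical sketch that assumes the Person class above and that header lines themselves are filtered out upstream):

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

// Hypothetical parser: input is Tuple2<header of the source file, one data line>.
public class MyParser implements FlatMapFunction<Tuple2<String, String>, Person> {

    @Override
    public void flatMap(Tuple2<String, String> value, Collector<Person> out) {
        String header = value.f0;                      // e.g. "id, firstname, lastname"
        String[] cols = value.f1.split("\\s*,\\s*");   // e.g. ["33", "John", "Smith"]

        if (header.contains("firstname")) {            // file1 layout: id, firstname, lastname
            out.collect(new Person(cols[0], cols[1] + " " + cols[2]));
        } else if (header.contains("name_1")) {        // file3 layout: id, name_1 .. name_4
            StringBuilder fullName = new StringBuilder();
            for (int i = 1; i < cols.length; i++) {
                if (!cols[i].isEmpty()) {              // skip empty trailing name columns
                    if (fullName.length() > 0) {
                        fullName.append(' ');
                    }
                    fullName.append(cols[i]);
                }
            }
            out.collect(new Person(cols[0], fullName.toString()));
        } else {                                       // file2 layout: id, name
            out.collect(new Person(cols[0], cols[1]));
        }
    }
}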


Is there a way I can achieve this?


Regards,
Nikola
