Hi all,

I am currently evaluating a use case where I have to deal with wide,
structured data from CSV files (about 50-60 columns, i.e. definitely
more than the 25 supported by the Tuple types), with a schema that is
potentially generated dynamically at runtime or inferred automatically
from the CSV file.
SparkSQL works very well for this case, because I can generate or
infer the schema dynamically at runtime, access fields in UDFs via
index or name (via the Row API), generate new schemata for UDF results
on the fly, and use those schemata to read and write from/to CSV.
Obviously Spark and SparkSQL have other quirks and I'd like to find a
good solution to do this with Flink.
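To illustrate, this is roughly what I am doing in Spark today, sketched
from memory against the current Spark APIs (paths and column names are
made up):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkWideCsv {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("wide-csv").getOrCreate();

        // Schema is inferred from the file at runtime; no compile-time classes needed.
        Dataset<Row> input = spark.read()
            .option("header", "true")
            .option("inferSchema", "true")
            .csv("input.csv");                       // made-up path

        // Inside UDFs/maps I can address fields by position or by name:
        Row first = input.first();
        Object byIndex = first.get(0);
        Object byName = first.getAs("some_column");  // made-up column name

        // The (possibly transformed) result can be written back out as CSV:
        input.write().option("header", "true").csv("output");
    }
}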

The main limitation seems to be that I can't have DataSets of tuples
whose length and field types are arbitrary (i.e. unknown at compile
time). The Record API/type looks like it was meant to provide something
like that, but it appears to be getting deprecated and is not well
supported by the DataSet APIs (e.g. I can't join Records by field
index, nor does the CsvReader API support Records), and it has no
concept of field names either.
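For contrast, this is roughly what I can do today when the schema is
known at compile time (paths and types are made up); what I am missing
is the equivalent when the number and types of columns only become
known at runtime:

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;

public class FlinkFixedArity {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Works fine, but arity and field types are fixed at compile time:
        DataSet<Tuple3<String, Integer, Double>> a = env
            .readCsvFile("a.csv")                               // made-up path
            .types(String.class, Integer.class, Double.class);
        DataSet<Tuple2<Integer, String>> b = env
            .readCsvFile("b.csv")                               // made-up path
            .types(Integer.class, String.class);

        // Joining by field index is easy with tuples...
        DataSet<?> joined = a.join(b).where(1).equalTo(0);

        // ...but I don't see how to express any of this when the columns
        // are only known once the job is running.
    }
}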

I thought about generating Java classes for my schemata at runtime
(e.g. via Javassist), but that seems like a hack, and I'd probably have
to do this for each intermediate schema as well (e.g. when a map
operation alters the schema). I haven't tried this avenue yet, so I'm
not certain it would actually work, and even less certain that it would
be a nice and maintainable solution.
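To make it concrete, this is the kind of thing I had in mind: a rough,
untested sketch where the class name, field names and field types would
come from the runtime schema:

import javassist.ClassPool;
import javassist.CtClass;
import javassist.CtField;

public class SchemaClassGenerator {
    // Rough idea: turn a runtime schema (field names + types) into a POJO class.
    public static Class<?> generate(String className,
                                    String[] fieldNames,
                                    Class<?>[] fieldTypes) throws Exception {
        ClassPool pool = ClassPool.getDefault();
        CtClass ct = pool.makeClass(className);
        for (int i = 0; i < fieldNames.length; i++) {
            // Public fields plus the default constructor should make it a usable POJO.
            ct.addField(CtField.make(
                "public " + fieldTypes[i].getName() + " " + fieldNames[i] + ";", ct));
        }
        return ct.toClass();
    }
}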

Can anyone suggest a nice way to deal with this kind of use case? I can
prepare a more complete, runnable example if that would help.

Thanks,
Johann
