Hi all,

I'm currently evaluating a use case where I have to deal with wide (about 50-60 columns, i.e. definitely more than the 25 fields supported by the Tuple types), structured data from CSV files, with a schema that is potentially generated dynamically at runtime or inferred automatically from the CSV files. SparkSQL works very well for this case, because I can generate or infer the schema at runtime, access fields in UDFs by index or by name (via the Row API), generate new schemata for UDF results on the fly, and use those schemata to read from and write to CSV. Obviously Spark and SparkSQL have quirks of their own, and I'd like to find a good way to do this with Flink.
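For reference, this is roughly what the Spark side looks like today, heavily simplified (the column names, the all-string types, the file path and the toUpperCase step are just placeholders, and I'm sketching it against the Spark 1.x Java API):

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class SparkDynamicSchema {

    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "dynamic-schema");
        SQLContext sqlContext = new SQLContext(sc);

        // Column names are only known at runtime, e.g. parsed from the CSV header.
        String[] columns = {"id", "name", "amount"};   // placeholder for the real 50-60 columns

        // Build the input schema dynamically.
        List<StructField> fields = new ArrayList<>();
        for (String col : columns) {
            fields.add(DataTypes.createStructField(col, DataTypes.StringType, true));
        }
        StructType schema = DataTypes.createStructType(fields);

        // Parse raw CSV lines into Rows (naive split, quoting not handled).
        JavaRDD<Row> rows = sc.textFile("input.csv")
                .map(line -> RowFactory.create((Object[]) line.split(",")));
        DataFrame df = sqlContext.createDataFrame(rows, schema);

        // Inside a map/UDF I can access fields by index or by name via the Row API
        // and emit Rows that match a result schema generated on the fly.
        StructType resultSchema = DataTypes.createStructType(new StructField[] {
                DataTypes.createStructField("id", DataTypes.StringType, true),
                DataTypes.createStructField("name_upper", DataTypes.StringType, true)
        });
        JavaRDD<Row> transformed = df.javaRDD().map(row ->
                RowFactory.create(row.getString(0),
                                  row.getString(row.fieldIndex("name")).toUpperCase()));

        sqlContext.createDataFrame(transformed, resultSchema).show();
        sc.stop();
    }
}

The important part is that neither the column names nor the result schema need to be known at compile time.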
The main limitation seems to be that I can't have DataSets of arbitrary-length, arbitrary-type (i.e. unknown at compile time) tuples. The Record API/type looks like it was meant to provide something like that, but it seems to be in the process of being deprecated and is not well supported by the DataSet API (e.g. I can't do a join on Records by field index, nor does the CsvReader API support Records), and it has no concept of field names either.

I thought about generating Java classes for my schemata at runtime (e.g. via Javassist), but that seems like a hack, and I'd probably have to do it for each intermediate schema as well (e.g. when a map operation alters the schema). I haven't tried this avenue yet, so I'm not certain it would actually work, and even less certain that it would be a nice and maintainable solution.

Can anyone suggest a good way to deal with this kind of use case? I've put a rough sketch of the fallback I have in mind below my signature; I can prepare a more complete example if that would make it clearer.

Thanks,
Johann
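P.S.: Here is a minimal sketch of the fallback I've been considering (column names, file path and the naive comma split are placeholders): a hand-rolled container that carries the column names next to the values. As far as I can tell Flink would treat this as a generic type, so I'd lose efficient serialization, and joins by field position wouldn't work, so every key has to go through a KeySelector:

import java.io.Serializable;
import java.util.Arrays;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.functions.KeySelector;

public class DynamicCsvSketch {

    // Hand-rolled stand-in for an "arbitrary-width tuple": values plus column names.
    public static class GenericRow implements Serializable {
        public String[] columnNames;   // schema, only known at runtime
        public Object[] fields;

        public GenericRow() {}

        public GenericRow(String[] columnNames, Object[] fields) {
            this.columnNames = columnNames;
            this.fields = fields;
        }

        public Object get(int i) {
            return fields[i];
        }

        public Object get(String name) {
            return fields[Arrays.asList(columnNames).indexOf(name)];
        }

        @Override
        public String toString() {
            return Arrays.toString(fields);
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Schema generated/inferred at runtime, e.g. from the CSV header.
        final String[] columns = {"id", "name", "amount"};   // placeholder for 50-60 columns

        // The CsvReader wants the field types at compile time, so I fall back to
        // readTextFile plus a naive split (quoting/escaping not handled here).
        DataSet<GenericRow> rows = env.readTextFile("input.csv")
                .map(new MapFunction<String, GenericRow>() {
                    @Override
                    public GenericRow map(String line) {
                        return new GenericRow(columns, line.split(","));
                    }
                });

        // Joins by field position don't work on such a type, so every key field
        // has to be extracted via a KeySelector.
        DataSet<GenericRow> other = rows;   // placeholder for a second input
        rows.join(other)
            .where(new KeySelector<GenericRow, String>() {
                @Override
                public String getKey(GenericRow row) {
                    return (String) row.get("id");
                }
            })
            .equalTo(new KeySelector<GenericRow, String>() {
                @Override
                public String getKey(GenericRow row) {
                    return (String) row.get("id");
                }
            })
            .first(5)
            .print();   // print() executes the program in recent Flink versions
    }
}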