On 20/01/2020 at 16:26, Jacques Nadeau wrote:
> I think it is too late in the game to make this fundamental change. It
> would be very hard to assess whether it is a no-op or has massive
> implications for existing datasets. Just among Dremio customers, in the
> last 30 days we stored more than 100mm datasets that leveraged the
> current format.
To be clear, I agree that we need to check that our various validation
and integration suites pass properly. But once that is done, and
assuming all the metadata variations are properly tested, data
variations should not pose any problem.

> I'm supportive of enforcing non-nulls on the write side but I don't
> think we should change the current read behavior.

The write side is irrelevant here, since the concern is to protect
reliably against invalid input (especially input crafted with malicious
intent). The read behaviour would be kept unchanged in the face of
*valid* input - but it would become deterministic and robust in the face
of *invalid* input - which it isn't today.

Of course, we can hand-write all the NULL checks on the read side. My
concern is not the one-time cost of doing so, but the long-term
fragility of such a strategy (every refactor or format addition is a
threat to the robustness of the IPC reader). I don't think a potential
long-standing history of security issues in Arrow would help adoption.
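To illustrate the fragility, here is a minimal self-contained C++
sketch (all type and function names are hypothetical stand-ins, not
Arrow's actual reader code) of what the hand-written NULL-checking
strategy looks like:

    #include <cstdint>
    #include <iostream>

    // Stand-ins for Flatbuffers-generated accessors: optional fields
    // come back as null pointers when absent from the (untrusted)
    // input buffer.
    struct FieldNodes { int64_t length; };
    struct BufferMeta { int64_t length; };
    struct RecordBatchHeader {
      const FieldNodes* nodes;    // may be null in invalid input
      const BufferMeta* buffers;  // may be null in invalid input
    };

    // Every pointer must be checked before dereference; each field
    // added to the format means remembering yet another such check.
    bool ReadRecordBatch(const RecordBatchHeader* header) {
      if (header == nullptr) return false;
      if (header->nodes == nullptr) return false;
      if (header->buffers == nullptr) return false;
      // ... actual decoding. Forgetting any one check above means a
      // crash (or worse) on malicious input.
      return true;
    }

    int main() {
      RecordBatchHeader invalid{nullptr, nullptr};
      // With the checks, invalid input is rejected deterministically;
      // without them, decoding would dereference a null pointer.
      std::cout << (ReadRecordBatch(&invalid) ? "ok" : "rejected")
                << std::endl;
      return 0;
    }

Marking such fields required in the metadata schema instead would let
the generated Flatbuffers verifier reject that input once, up front,
rather than relying on every call site getting its check right.

Regards

Antoine.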