Jack, I might be wrong here, but I'll at least throw out some thoughts. If I understand correctly, the attack requires that the attacker can modify a serialized object so that deserializing it leads to arbitrary code execution. I think the best protection against that is to prevent an attacker from modifying the serialized bytes in the first place.
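To make "detect modified bytes" concrete, here is a rough sketch, not something Iceberg does today. It assumes the writer and reader share a secret key that the attacker cannot read, and the class and method names (SignedSerialization, serializeAndSign, verifyAndDeserialize) are hypothetical. The idea is to prepend an HMAC-SHA256 tag to the serialized bytes and refuse to call readObject unless the tag matches.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Arrays;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper for illustration only; not part of Iceberg.
public class SignedSerialization {
  private static final String ALGORITHM = "HmacSHA256";
  private static final int TAG_LENGTH = 32; // HMAC-SHA256 output size in bytes

  // Computes the HMAC-SHA256 tag of data under the shared key.
  private static byte[] hmac(byte[] key, byte[] data) throws GeneralSecurityException {
    Mac mac = Mac.getInstance(ALGORITHM);
    mac.init(new SecretKeySpec(key, ALGORITHM));
    return mac.doFinal(data);
  }

  // Serializes obj and returns tag || payload.
  public static byte[] serializeAndSign(byte[] key, Serializable obj)
      throws IOException, GeneralSecurityException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(obj);
    }
    byte[] payload = bytes.toByteArray();
    byte[] tag = hmac(key, payload);
    byte[] signed = new byte[TAG_LENGTH + payload.length];
    System.arraycopy(tag, 0, signed, 0, TAG_LENGTH);
    System.arraycopy(payload, 0, signed, TAG_LENGTH, payload.length);
    return signed;
  }

  // Checks the tag first; readObject is only reached for untampered bytes.
  public static Object verifyAndDeserialize(byte[] key, byte[] signed)
      throws IOException, ClassNotFoundException, GeneralSecurityException {
    if (signed.length < TAG_LENGTH) {
      throw new SecurityException("Input is too short to contain a tag");
    }
    byte[] tag = Arrays.copyOfRange(signed, 0, TAG_LENGTH);
    byte[] payload = Arrays.copyOfRange(signed, TAG_LENGTH, signed.length);
    // MessageDigest.isEqual compares the two tags without an early exit.
    if (!MessageDigest.isEqual(tag, hmac(key, payload))) {
      throw new SecurityException("Serialized bytes failed the integrity check");
    }
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(payload))) {
      return in.readObject();
    }
  }
}
```

This only helps if the key is kept away from the attacker, and it does nothing against someone who can already produce validly signed bytes, so it would complement transport security like TLS rather than replace it.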
To my knowledge, Java serialization is used in two places: first, to pass objects between nodes, like sending a task to a Spark executor, and second, to store some persistent state in Flink. Iceberg does not use Java serialization for anything in the format or in long-term storage.

For the first case, I think it is up to the distributed system passing objects between nodes to secure the content, for example by using TLS for connections between nodes. Since Java serialization is used by the processing engine, there isn't much Iceberg can do to change this, and we have to rely on Spark or Flink.

For the second case, I think our use of Java serialization to store state is very limited, but we should take a look to make sure. This is the one area where Iceberg itself made the choice to use Java serialization, so we should look into it and fix it if possible... although I'm not entirely sure how to prevent someone from swapping out the state that gets loaded.

Ryan

On Sat, Jul 17, 2021 at 2:02 AM Jack Ye <yezhao...@gmail.com> wrote:

> Hi everyone,
>
> We use Java serialization and deserialization a lot in Iceberg. I wonder
> if we have considered the potential of Java deserialization attack, where
> an attacker can replace serialized bytes to execute arbitrary code through
> the readObject method.
>
> Currently our SerializationUtil.deserializeFromBytes directly converts
> bytes to an ObjectInputStream. I know Apache commons have
> ValidatingObjectInputStream which can prevent the issue to some extent.
>
> Have we thought about this issue in the past? Are there any other
> suggestions?
>
> Best,
> Jack Ye
>

--
Ryan Blue
Tabular