I believe Flink source is the only place that uses Java serialization for checkpoint state: https://github.com/apache/iceberg/issues/1698.
@OpenInx <open...@gmail.com> already updated Flink sink to avoid the Java serialization (long time ago) On Mon, Jul 19, 2021 at 1:53 PM Jack Ye <yezhao...@gmail.com> wrote: > Yes I totally agree that the distributed system itself should make sure > the integrity of objects passing across nodes. I am more concerned about > the Flink case where some information is persisted and can be modified to > execute arbitrary code. Maybe people working on Flink can comment on this a > bit more. > > Also, this is dangerous from the perspective that I do not need to modify > any existing serialized object, but as long as I can communicate with the > node that tries to run the deserialization logic of incoming payload, it > will execute the code of getObject method even if the deserialization > itself fails. And I believe this is an issue not only for Java > serialization but also for Kryo serialization, basically any serialization > mechanisms that try to bypass using the public constructor. > > I am definitely not an expert on this, I only noticed this while adding > some features to Trino, where the community just completely avoids the use > of Java serialization for such a reason. I think some actions are needed > here: > 1. make sure we fully understand the implication of our decision to use > Java serialization, some warnings are likely needed for organizations that > tries to adopt Iceberg to their own compute > 2. introduce some mechanisms to allow only trusted classes to be > deserialized, which I saw was done to fix some attacks like > https://nvd.nist.gov/vuln/detail/CVE-2020-5413 for Kryo. I will explore > the possibility to at least leverage the ValidatingObjectInputStream. > 3. maybe allow some currently protected class constructors to be public, > so that engines at least have the option to have their own deserialization > mechanism when such an attack cannot be fully mitigated on the engine's > side. > > -Jack > > > On Mon, Jul 19, 2021 at 10:56 AM Ryan Blue <b...@tabular.io> wrote: > >> Jack, >> >> I might be incorrect here, but I'll at least throw out some thoughts. If >> I understand correctly, the attacker requires access to modify some >> serialized object so that deserialization leads to arbitrary code >> execution. I think that the best way to protect against that is to avoid >> making it possible for an attacker to modify serialized bytes. >> >> To my knowledge, Java serialization is used in two places: first, to >> serialize objects between nodes, like sending a task to a Spark executor, >> and second, to serialize some persistent state in Flink. Iceberg does not >> use Java serialization for anything in the format or long-term storage. For >> the first case, I think that it is up to the distributed system passing >> objects between nodes to secure the content, like using TLS for connections >> between nodes. Since Java serialization is used by the processing engine, >> there isn't much Iceberg could do to change this and we have to rely on >> Spark or Flink. >> >> For the second issue, I think our use of Java serialization to store >> state is very limited, but we should take a look to make sure. I think this >> is one area where Iceberg made the choice to use Java serialization, so we >> should look into it and fix it if possible... although I'm not entirely >> sure how to avoid swapping out the state that gets loaded. >> >> Ryan >> >> On Sat, Jul 17, 2021 at 2:02 AM Jack Ye <yezhao...@gmail.com> wrote: >> >>> Hi everyone, >>> >>> We use Java serialization and deserialization a lot in Iceberg. I wonder >>> if we have considered the potential of Java deserialization attack, where >>> an attacker can replace serialized bytes to execute arbitrary code through >>> the readObject method. >>> >>> Currently our SerializationUtil.deserializeFromBytes directly converts >>> bytes to an ObjectInputStream. I know Apache commons have >>> ValidatingObjectInputStream which can prevent the issue to some extent. >>> >>> Have we thought about this issue in the past? Are there any other >>> suggestions? >>> >>> Best, >>> Jack Ye >>> >> >> >> -- >> Ryan Blue >> Tabular >> >