Yes I totally agree that the distributed system itself should make sure the
integrity of objects passing across nodes. I am more concerned about the
Flink case where some information is persisted and can be modified to
execute arbitrary code. Maybe people working on Flink can comment on this a
bit more.

Also, this is dangerous from the perspective that I do not need to modify
any existing serialized object, but as long as I can communicate with the
node that tries to run the deserialization logic of incoming payload, it
will execute the code of getObject method even if the deserialization
itself fails. And I believe this is an issue not only for Java
serialization but also for Kryo serialization, basically any serialization
mechanisms that try to bypass using the public constructor.

I am definitely not an expert on this, I only noticed this while adding
some features to Trino, where the community just completely avoids the use
of Java serialization for such a reason. I think some actions are needed
here:
1. make sure we fully understand the implication of our decision to use
Java serialization, some warnings are likely needed for organizations that
tries to adopt Iceberg to their own compute
2. introduce some mechanisms to allow only trusted classes to be
deserialized, which I saw was done to fix some attacks like
https://nvd.nist.gov/vuln/detail/CVE-2020-5413 for Kryo. I will explore the
possibility to at least leverage the ValidatingObjectInputStream.
3. maybe allow some currently protected class constructors to be public, so
that engines at least have the option to have their own deserialization
mechanism when such an attack cannot be fully mitigated on the engine's
side.

-Jack


On Mon, Jul 19, 2021 at 10:56 AM Ryan Blue <b...@tabular.io> wrote:

> Jack,
>
> I might be incorrect here, but I'll at least throw out some thoughts. If I
> understand correctly, the attacker requires access to modify some
> serialized object so that deserialization leads to arbitrary code
> execution. I think that the best way to protect against that is to avoid
> making it possible for an attacker to modify serialized bytes.
>
> To my knowledge, Java serialization is used in two places: first, to
> serialize objects between nodes, like sending a task to a Spark executor,
> and second, to serialize some persistent state in Flink. Iceberg does not
> use Java serialization for anything in the format or long-term storage. For
> the first case, I think that it is up to the distributed system passing
> objects between nodes to secure the content, like using TLS for connections
> between nodes. Since Java serialization is used by the processing engine,
> there isn't much Iceberg could do to change this and we have to rely on
> Spark or Flink.
>
> For the second issue, I think our use of Java serialization to store state
> is very limited, but we should take a look to make sure. I think this is
> one area where Iceberg made the choice to use Java serialization, so we
> should look into it and fix it if possible... although I'm not entirely
> sure how to avoid swapping out the state that gets loaded.
>
> Ryan
>
> On Sat, Jul 17, 2021 at 2:02 AM Jack Ye <yezhao...@gmail.com> wrote:
>
>> Hi everyone,
>>
>> We use Java serialization and deserialization a lot in Iceberg. I wonder
>> if we have considered the potential of Java deserialization attack, where
>> an attacker can replace serialized bytes to execute arbitrary code through
>> the readObject method.
>>
>> Currently our SerializationUtil.deserializeFromBytes directly converts
>> bytes to an ObjectInputStream. I know Apache commons have
>> ValidatingObjectInputStream which can prevent the issue to some extent.
>>
>> Have we thought about this issue in the past? Are there any other
>> suggestions?
>>
>> Best,
>> Jack Ye
>>
>
>
> --
> Ryan Blue
> Tabular
>

Reply via email to