I believe Flink source is the only place that uses Java serialization for
checkpoint state: https://github.com/apache/iceberg/issues/1698.

 @OpenInx <open...@gmail.com> already updated Flink sink to avoid the Java
serialization (long time ago)

On Mon, Jul 19, 2021 at 1:53 PM Jack Ye <yezhao...@gmail.com> wrote:

> Yes I totally agree that the distributed system itself should make sure
> the integrity of objects passing across nodes. I am more concerned about
> the Flink case where some information is persisted and can be modified to
> execute arbitrary code. Maybe people working on Flink can comment on this a
> bit more.
>
> Also, this is dangerous from the perspective that I do not need to modify
> any existing serialized object, but as long as I can communicate with the
> node that tries to run the deserialization logic of incoming payload, it
> will execute the code of getObject method even if the deserialization
> itself fails. And I believe this is an issue not only for Java
> serialization but also for Kryo serialization, basically any serialization
> mechanisms that try to bypass using the public constructor.
>
> I am definitely not an expert on this, I only noticed this while adding
> some features to Trino, where the community just completely avoids the use
> of Java serialization for such a reason. I think some actions are needed
> here:
> 1. make sure we fully understand the implication of our decision to use
> Java serialization, some warnings are likely needed for organizations that
> tries to adopt Iceberg to their own compute
> 2. introduce some mechanisms to allow only trusted classes to be
> deserialized, which I saw was done to fix some attacks like
> https://nvd.nist.gov/vuln/detail/CVE-2020-5413 for Kryo. I will explore
> the possibility to at least leverage the ValidatingObjectInputStream.
> 3. maybe allow some currently protected class constructors to be public,
> so that engines at least have the option to have their own deserialization
> mechanism when such an attack cannot be fully mitigated on the engine's
> side.
>
> -Jack
>
>
> On Mon, Jul 19, 2021 at 10:56 AM Ryan Blue <b...@tabular.io> wrote:
>
>> Jack,
>>
>> I might be incorrect here, but I'll at least throw out some thoughts. If
>> I understand correctly, the attacker requires access to modify some
>> serialized object so that deserialization leads to arbitrary code
>> execution. I think that the best way to protect against that is to avoid
>> making it possible for an attacker to modify serialized bytes.
>>
>> To my knowledge, Java serialization is used in two places: first, to
>> serialize objects between nodes, like sending a task to a Spark executor,
>> and second, to serialize some persistent state in Flink. Iceberg does not
>> use Java serialization for anything in the format or long-term storage. For
>> the first case, I think that it is up to the distributed system passing
>> objects between nodes to secure the content, like using TLS for connections
>> between nodes. Since Java serialization is used by the processing engine,
>> there isn't much Iceberg could do to change this and we have to rely on
>> Spark or Flink.
>>
>> For the second issue, I think our use of Java serialization to store
>> state is very limited, but we should take a look to make sure. I think this
>> is one area where Iceberg made the choice to use Java serialization, so we
>> should look into it and fix it if possible... although I'm not entirely
>> sure how to avoid swapping out the state that gets loaded.
>>
>> Ryan
>>
>> On Sat, Jul 17, 2021 at 2:02 AM Jack Ye <yezhao...@gmail.com> wrote:
>>
>>> Hi everyone,
>>>
>>> We use Java serialization and deserialization a lot in Iceberg. I wonder
>>> if we have considered the potential of Java deserialization attack, where
>>> an attacker can replace serialized bytes to execute arbitrary code through
>>> the readObject method.
>>>
>>> Currently our SerializationUtil.deserializeFromBytes directly converts
>>> bytes to an ObjectInputStream. I know Apache commons have
>>> ValidatingObjectInputStream which can prevent the issue to some extent.
>>>
>>> Have we thought about this issue in the past? Are there any other
>>> suggestions?
>>>
>>> Best,
>>> Jack Ye
>>>
>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>

Reply via email to