Thanks, Steven. Do you think that there is a potential problem with an
attacker having access to where the state is stored and using that to
inject code? Is this something we should just update to avoid it entirely?

On Mon, Jul 19, 2021 at 3:43 PM Steven Wu <stevenz...@gmail.com> wrote:

> I believe Flink source is the only place that uses Java serialization for
> checkpoint state: https://github.com/apache/iceberg/issues/1698.
>
>  @OpenInx <open...@gmail.com> already updated Flink sink to avoid the
> Java serialization (long time ago)
>
> On Mon, Jul 19, 2021 at 1:53 PM Jack Ye <yezhao...@gmail.com> wrote:
>
>> Yes I totally agree that the distributed system itself should make sure
>> the integrity of objects passing across nodes. I am more concerned about
>> the Flink case where some information is persisted and can be modified to
>> execute arbitrary code. Maybe people working on Flink can comment on this a
>> bit more.
>>
>> Also, this is dangerous from the perspective that I do not need to modify
>> any existing serialized object, but as long as I can communicate with the
>> node that tries to run the deserialization logic of incoming payload, it
>> will execute the code of getObject method even if the deserialization
>> itself fails. And I believe this is an issue not only for Java
>> serialization but also for Kryo serialization, basically any serialization
>> mechanisms that try to bypass using the public constructor.
>>
>> I am definitely not an expert on this, I only noticed this while adding
>> some features to Trino, where the community just completely avoids the use
>> of Java serialization for such a reason. I think some actions are needed
>> here:
>> 1. make sure we fully understand the implication of our decision to use
>> Java serialization, some warnings are likely needed for organizations that
>> tries to adopt Iceberg to their own compute
>> 2. introduce some mechanisms to allow only trusted classes to be
>> deserialized, which I saw was done to fix some attacks like
>> https://nvd.nist.gov/vuln/detail/CVE-2020-5413 for Kryo. I will explore
>> the possibility to at least leverage the ValidatingObjectInputStream.
>> 3. maybe allow some currently protected class constructors to be public,
>> so that engines at least have the option to have their own deserialization
>> mechanism when such an attack cannot be fully mitigated on the engine's
>> side.
>>
>> -Jack
>>
>>
>> On Mon, Jul 19, 2021 at 10:56 AM Ryan Blue <b...@tabular.io> wrote:
>>
>>> Jack,
>>>
>>> I might be incorrect here, but I'll at least throw out some thoughts. If
>>> I understand correctly, the attacker requires access to modify some
>>> serialized object so that deserialization leads to arbitrary code
>>> execution. I think that the best way to protect against that is to avoid
>>> making it possible for an attacker to modify serialized bytes.
>>>
>>> To my knowledge, Java serialization is used in two places: first, to
>>> serialize objects between nodes, like sending a task to a Spark executor,
>>> and second, to serialize some persistent state in Flink. Iceberg does not
>>> use Java serialization for anything in the format or long-term storage. For
>>> the first case, I think that it is up to the distributed system passing
>>> objects between nodes to secure the content, like using TLS for connections
>>> between nodes. Since Java serialization is used by the processing engine,
>>> there isn't much Iceberg could do to change this and we have to rely on
>>> Spark or Flink.
>>>
>>> For the second issue, I think our use of Java serialization to store
>>> state is very limited, but we should take a look to make sure. I think this
>>> is one area where Iceberg made the choice to use Java serialization, so we
>>> should look into it and fix it if possible... although I'm not entirely
>>> sure how to avoid swapping out the state that gets loaded.
>>>
>>> Ryan
>>>
>>> On Sat, Jul 17, 2021 at 2:02 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> We use Java serialization and deserialization a lot in Iceberg. I
>>>> wonder if we have considered the potential of Java deserialization attack,
>>>> where an attacker can replace serialized bytes to execute arbitrary code
>>>> through the readObject method.
>>>>
>>>> Currently our SerializationUtil.deserializeFromBytes directly converts
>>>> bytes to an ObjectInputStream. I know Apache commons have
>>>> ValidatingObjectInputStream which can prevent the issue to some extent.
>>>>
>>>> Have we thought about this issue in the past? Are there any other
>>>> suggestions?
>>>>
>>>> Best,
>>>> Jack Ye
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>

-- 
Ryan Blue
Tabular

Reply via email to