Yeah, it is a general Java serialization wider than just Iceberg tables.
Typically Flink won't recommend Java serialization for checkpoint state, as
that won't be able to support schema evolution. Flink has built-in support
for schema evolution for Pojo or Avro data types.

On Mon, Jul 19, 2021 at 4:09 PM Ryan Blue <b...@tabular.io> wrote:

> Yes, I think so. Sounds like an unsecured bucket could lead to code
> execution running with data infrastructure privileges. While it isn't
> exactly Flink's problem, we should probably treat this like a potential
> privilege escalation issue. How does Flink handle this for other cases? I
> think it would make sense to avoid using Java serialization for checkpoint
> state for Iceberg at least, but it is probably a wider problem than just
> Iceberg tables.
>
> On Mon, Jul 19, 2021 at 3:57 PM Steven Wu <stevenz...@gmail.com> wrote:
>
>> Let's assume the Flink checkpoint state is uploaded to S3. Attacker needs
>> to be able to read from and write to S3 to manipulate the S3 files. Is this
>> the scenario we are concerned about?
>>
>> On Mon, Jul 19, 2021 at 3:51 PM Ryan Blue <b...@tabular.io> wrote:
>>
>>> Thanks, Steven. Do you think that there is a potential problem with an
>>> attacker having access to where the state is stored and using that to
>>> inject code? Is this something we should just update to avoid it entirely?
>>>
>>> On Mon, Jul 19, 2021 at 3:43 PM Steven Wu <stevenz...@gmail.com> wrote:
>>>
>>>> I believe Flink source is the only place that uses Java serialization
>>>> for checkpoint state: https://github.com/apache/iceberg/issues/1698.
>>>>
>>>>  @OpenInx <open...@gmail.com> already updated Flink sink to avoid the
>>>> Java serialization (long time ago)
>>>>
>>>> On Mon, Jul 19, 2021 at 1:53 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>>
>>>>> Yes I totally agree that the distributed system itself should make
>>>>> sure the integrity of objects passing across nodes. I am more concerned
>>>>> about the Flink case where some information is persisted and can be
>>>>> modified to execute arbitrary code. Maybe people working on Flink can
>>>>> comment on this a bit more.
>>>>>
>>>>> Also, this is dangerous from the perspective that I do not need to
>>>>> modify any existing serialized object, but as long as I can communicate
>>>>> with the node that tries to run the deserialization logic of incoming
>>>>> payload, it will execute the code of getObject method even if the
>>>>> deserialization itself fails. And I believe this is an issue not only for
>>>>> Java serialization but also for Kryo serialization, basically any
>>>>> serialization mechanisms that try to bypass using the public constructor.
>>>>>
>>>>> I am definitely not an expert on this, I only noticed this while
>>>>> adding some features to Trino, where the community just completely avoids
>>>>> the use of Java serialization for such a reason. I think some actions are
>>>>> needed here:
>>>>> 1. make sure we fully understand the implication of our decision to
>>>>> use Java serialization, some warnings are likely needed for organizations
>>>>> that tries to adopt Iceberg to their own compute
>>>>> 2. introduce some mechanisms to allow only trusted classes to be
>>>>> deserialized, which I saw was done to fix some attacks like
>>>>> https://nvd.nist.gov/vuln/detail/CVE-2020-5413 for Kryo. I will
>>>>> explore the possibility to at least leverage the
>>>>> ValidatingObjectInputStream.
>>>>> 3. maybe allow some currently protected class constructors to be
>>>>> public, so that engines at least have the option to have their own
>>>>> deserialization mechanism when such an attack cannot be fully mitigated on
>>>>> the engine's side.
>>>>>
>>>>> -Jack
>>>>>
>>>>>
>>>>> On Mon, Jul 19, 2021 at 10:56 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>
>>>>>> Jack,
>>>>>>
>>>>>> I might be incorrect here, but I'll at least throw out some thoughts.
>>>>>> If I understand correctly, the attacker requires access to modify some
>>>>>> serialized object so that deserialization leads to arbitrary code
>>>>>> execution. I think that the best way to protect against that is to avoid
>>>>>> making it possible for an attacker to modify serialized bytes.
>>>>>>
>>>>>> To my knowledge, Java serialization is used in two places: first, to
>>>>>> serialize objects between nodes, like sending a task to a Spark executor,
>>>>>> and second, to serialize some persistent state in Flink. Iceberg does not
>>>>>> use Java serialization for anything in the format or long-term storage. 
>>>>>> For
>>>>>> the first case, I think that it is up to the distributed system passing
>>>>>> objects between nodes to secure the content, like using TLS for 
>>>>>> connections
>>>>>> between nodes. Since Java serialization is used by the processing engine,
>>>>>> there isn't much Iceberg could do to change this and we have to rely on
>>>>>> Spark or Flink.
>>>>>>
>>>>>> For the second issue, I think our use of Java serialization to store
>>>>>> state is very limited, but we should take a look to make sure. I think 
>>>>>> this
>>>>>> is one area where Iceberg made the choice to use Java serialization, so 
>>>>>> we
>>>>>> should look into it and fix it if possible... although I'm not entirely
>>>>>> sure how to avoid swapping out the state that gets loaded.
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>> On Sat, Jul 17, 2021 at 2:02 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> We use Java serialization and deserialization a lot in Iceberg. I
>>>>>>> wonder if we have considered the potential of Java deserialization 
>>>>>>> attack,
>>>>>>> where an attacker can replace serialized bytes to execute arbitrary code
>>>>>>> through the readObject method.
>>>>>>>
>>>>>>> Currently our SerializationUtil.deserializeFromBytes directly
>>>>>>> converts bytes to an ObjectInputStream. I know Apache commons have
>>>>>>> ValidatingObjectInputStream which can prevent the issue to some extent.
>>>>>>>
>>>>>>> Have we thought about this issue in the past? Are there any other
>>>>>>> suggestions?
>>>>>>>
>>>>>>> Best,
>>>>>>> Jack Ye
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>>
>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
>
> --
> Ryan Blue
> Tabular
>

Reply via email to