Yes, I think so. Sounds like an unsecured bucket could lead to code
execution running with data infrastructure privileges. While it isn't
exactly Flink's problem, we should probably treat this like a potential
privilege escalation issue. How does Flink handle this for other cases? I
think it would make sense to avoid using Java serialization for checkpoint
state for Iceberg at least, but it is probably a wider problem than just
Iceberg tables.

On Mon, Jul 19, 2021 at 3:57 PM Steven Wu <stevenz...@gmail.com> wrote:

> Let's assume the Flink checkpoint state is uploaded to S3. Attacker needs
> to be able to read from and write to S3 to manipulate the S3 files. Is this
> the scenario we are concerned about?
>
> On Mon, Jul 19, 2021 at 3:51 PM Ryan Blue <b...@tabular.io> wrote:
>
>> Thanks, Steven. Do you think that there is a potential problem with an
>> attacker having access to where the state is stored and using that to
>> inject code? Is this something we should just update to avoid it entirely?
>>
>> On Mon, Jul 19, 2021 at 3:43 PM Steven Wu <stevenz...@gmail.com> wrote:
>>
>>> I believe Flink source is the only place that uses Java serialization
>>> for checkpoint state: https://github.com/apache/iceberg/issues/1698.
>>>
>>>  @OpenInx <open...@gmail.com> already updated Flink sink to avoid the
>>> Java serialization (long time ago)
>>>
>>> On Mon, Jul 19, 2021 at 1:53 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>>> Yes I totally agree that the distributed system itself should make sure
>>>> the integrity of objects passing across nodes. I am more concerned about
>>>> the Flink case where some information is persisted and can be modified to
>>>> execute arbitrary code. Maybe people working on Flink can comment on this a
>>>> bit more.
>>>>
>>>> Also, this is dangerous from the perspective that I do not need to
>>>> modify any existing serialized object, but as long as I can communicate
>>>> with the node that tries to run the deserialization logic of incoming
>>>> payload, it will execute the code of getObject method even if the
>>>> deserialization itself fails. And I believe this is an issue not only for
>>>> Java serialization but also for Kryo serialization, basically any
>>>> serialization mechanisms that try to bypass using the public constructor.
>>>>
>>>> I am definitely not an expert on this, I only noticed this while adding
>>>> some features to Trino, where the community just completely avoids the use
>>>> of Java serialization for such a reason. I think some actions are needed
>>>> here:
>>>> 1. make sure we fully understand the implication of our decision to use
>>>> Java serialization, some warnings are likely needed for organizations that
>>>> tries to adopt Iceberg to their own compute
>>>> 2. introduce some mechanisms to allow only trusted classes to be
>>>> deserialized, which I saw was done to fix some attacks like
>>>> https://nvd.nist.gov/vuln/detail/CVE-2020-5413 for Kryo. I will
>>>> explore the possibility to at least leverage the
>>>> ValidatingObjectInputStream.
>>>> 3. maybe allow some currently protected class constructors to be
>>>> public, so that engines at least have the option to have their own
>>>> deserialization mechanism when such an attack cannot be fully mitigated on
>>>> the engine's side.
>>>>
>>>> -Jack
>>>>
>>>>
>>>> On Mon, Jul 19, 2021 at 10:56 AM Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>>> Jack,
>>>>>
>>>>> I might be incorrect here, but I'll at least throw out some thoughts.
>>>>> If I understand correctly, the attacker requires access to modify some
>>>>> serialized object so that deserialization leads to arbitrary code
>>>>> execution. I think that the best way to protect against that is to avoid
>>>>> making it possible for an attacker to modify serialized bytes.
>>>>>
>>>>> To my knowledge, Java serialization is used in two places: first, to
>>>>> serialize objects between nodes, like sending a task to a Spark executor,
>>>>> and second, to serialize some persistent state in Flink. Iceberg does not
>>>>> use Java serialization for anything in the format or long-term storage. 
>>>>> For
>>>>> the first case, I think that it is up to the distributed system passing
>>>>> objects between nodes to secure the content, like using TLS for 
>>>>> connections
>>>>> between nodes. Since Java serialization is used by the processing engine,
>>>>> there isn't much Iceberg could do to change this and we have to rely on
>>>>> Spark or Flink.
>>>>>
>>>>> For the second issue, I think our use of Java serialization to store
>>>>> state is very limited, but we should take a look to make sure. I think 
>>>>> this
>>>>> is one area where Iceberg made the choice to use Java serialization, so we
>>>>> should look into it and fix it if possible... although I'm not entirely
>>>>> sure how to avoid swapping out the state that gets loaded.
>>>>>
>>>>> Ryan
>>>>>
>>>>> On Sat, Jul 17, 2021 at 2:02 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> We use Java serialization and deserialization a lot in Iceberg. I
>>>>>> wonder if we have considered the potential of Java deserialization 
>>>>>> attack,
>>>>>> where an attacker can replace serialized bytes to execute arbitrary code
>>>>>> through the readObject method.
>>>>>>
>>>>>> Currently our SerializationUtil.deserializeFromBytes directly
>>>>>> converts bytes to an ObjectInputStream. I know Apache commons have
>>>>>> ValidatingObjectInputStream which can prevent the issue to some extent.
>>>>>>
>>>>>> Have we thought about this issue in the past? Are there any other
>>>>>> suggestions?
>>>>>>
>>>>>> Best,
>>>>>> Jack Ye
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>

-- 
Ryan Blue
Tabular

Reply via email to