Yeah, it is a general Java serialization wider than just Iceberg tables. Typically Flink won't recommend Java serialization for checkpoint state, as that won't be able to support schema evolution. Flink has built-in support for schema evolution for Pojo or Avro data types.
On Mon, Jul 19, 2021 at 4:09 PM Ryan Blue <b...@tabular.io> wrote: > Yes, I think so. Sounds like an unsecured bucket could lead to code > execution running with data infrastructure privileges. While it isn't > exactly Flink's problem, we should probably treat this like a potential > privilege escalation issue. How does Flink handle this for other cases? I > think it would make sense to avoid using Java serialization for checkpoint > state for Iceberg at least, but it is probably a wider problem than just > Iceberg tables. > > On Mon, Jul 19, 2021 at 3:57 PM Steven Wu <stevenz...@gmail.com> wrote: > >> Let's assume the Flink checkpoint state is uploaded to S3. Attacker needs >> to be able to read from and write to S3 to manipulate the S3 files. Is this >> the scenario we are concerned about? >> >> On Mon, Jul 19, 2021 at 3:51 PM Ryan Blue <b...@tabular.io> wrote: >> >>> Thanks, Steven. Do you think that there is a potential problem with an >>> attacker having access to where the state is stored and using that to >>> inject code? Is this something we should just update to avoid it entirely? >>> >>> On Mon, Jul 19, 2021 at 3:43 PM Steven Wu <stevenz...@gmail.com> wrote: >>> >>>> I believe Flink source is the only place that uses Java serialization >>>> for checkpoint state: https://github.com/apache/iceberg/issues/1698. >>>> >>>> @OpenInx <open...@gmail.com> already updated Flink sink to avoid the >>>> Java serialization (long time ago) >>>> >>>> On Mon, Jul 19, 2021 at 1:53 PM Jack Ye <yezhao...@gmail.com> wrote: >>>> >>>>> Yes I totally agree that the distributed system itself should make >>>>> sure the integrity of objects passing across nodes. I am more concerned >>>>> about the Flink case where some information is persisted and can be >>>>> modified to execute arbitrary code. Maybe people working on Flink can >>>>> comment on this a bit more. >>>>> >>>>> Also, this is dangerous from the perspective that I do not need to >>>>> modify any existing serialized object, but as long as I can communicate >>>>> with the node that tries to run the deserialization logic of incoming >>>>> payload, it will execute the code of getObject method even if the >>>>> deserialization itself fails. And I believe this is an issue not only for >>>>> Java serialization but also for Kryo serialization, basically any >>>>> serialization mechanisms that try to bypass using the public constructor. >>>>> >>>>> I am definitely not an expert on this, I only noticed this while >>>>> adding some features to Trino, where the community just completely avoids >>>>> the use of Java serialization for such a reason. I think some actions are >>>>> needed here: >>>>> 1. make sure we fully understand the implication of our decision to >>>>> use Java serialization, some warnings are likely needed for organizations >>>>> that tries to adopt Iceberg to their own compute >>>>> 2. introduce some mechanisms to allow only trusted classes to be >>>>> deserialized, which I saw was done to fix some attacks like >>>>> https://nvd.nist.gov/vuln/detail/CVE-2020-5413 for Kryo. I will >>>>> explore the possibility to at least leverage the >>>>> ValidatingObjectInputStream. >>>>> 3. maybe allow some currently protected class constructors to be >>>>> public, so that engines at least have the option to have their own >>>>> deserialization mechanism when such an attack cannot be fully mitigated on >>>>> the engine's side. >>>>> >>>>> -Jack >>>>> >>>>> >>>>> On Mon, Jul 19, 2021 at 10:56 AM Ryan Blue <b...@tabular.io> wrote: >>>>> >>>>>> Jack, >>>>>> >>>>>> I might be incorrect here, but I'll at least throw out some thoughts. >>>>>> If I understand correctly, the attacker requires access to modify some >>>>>> serialized object so that deserialization leads to arbitrary code >>>>>> execution. I think that the best way to protect against that is to avoid >>>>>> making it possible for an attacker to modify serialized bytes. >>>>>> >>>>>> To my knowledge, Java serialization is used in two places: first, to >>>>>> serialize objects between nodes, like sending a task to a Spark executor, >>>>>> and second, to serialize some persistent state in Flink. Iceberg does not >>>>>> use Java serialization for anything in the format or long-term storage. >>>>>> For >>>>>> the first case, I think that it is up to the distributed system passing >>>>>> objects between nodes to secure the content, like using TLS for >>>>>> connections >>>>>> between nodes. Since Java serialization is used by the processing engine, >>>>>> there isn't much Iceberg could do to change this and we have to rely on >>>>>> Spark or Flink. >>>>>> >>>>>> For the second issue, I think our use of Java serialization to store >>>>>> state is very limited, but we should take a look to make sure. I think >>>>>> this >>>>>> is one area where Iceberg made the choice to use Java serialization, so >>>>>> we >>>>>> should look into it and fix it if possible... although I'm not entirely >>>>>> sure how to avoid swapping out the state that gets loaded. >>>>>> >>>>>> Ryan >>>>>> >>>>>> On Sat, Jul 17, 2021 at 2:02 AM Jack Ye <yezhao...@gmail.com> wrote: >>>>>> >>>>>>> Hi everyone, >>>>>>> >>>>>>> We use Java serialization and deserialization a lot in Iceberg. I >>>>>>> wonder if we have considered the potential of Java deserialization >>>>>>> attack, >>>>>>> where an attacker can replace serialized bytes to execute arbitrary code >>>>>>> through the readObject method. >>>>>>> >>>>>>> Currently our SerializationUtil.deserializeFromBytes directly >>>>>>> converts bytes to an ObjectInputStream. I know Apache commons have >>>>>>> ValidatingObjectInputStream which can prevent the issue to some extent. >>>>>>> >>>>>>> Have we thought about this issue in the past? Are there any other >>>>>>> suggestions? >>>>>>> >>>>>>> Best, >>>>>>> Jack Ye >>>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Ryan Blue >>>>>> Tabular >>>>>> >>>>> >>> >>> -- >>> Ryan Blue >>> Tabular >>> >> > > -- > Ryan Blue > Tabular >