Yes, I think so. Sounds like an unsecured bucket could lead to code execution running with data infrastructure privileges. While it isn't exactly Flink's problem, we should probably treat this like a potential privilege escalation issue. How does Flink handle this for other cases? I think it would make sense to avoid using Java serialization for checkpoint state for Iceberg at least, but it is probably a wider problem than just Iceberg tables.
On Mon, Jul 19, 2021 at 3:57 PM Steven Wu <stevenz...@gmail.com> wrote: > Let's assume the Flink checkpoint state is uploaded to S3. Attacker needs > to be able to read from and write to S3 to manipulate the S3 files. Is this > the scenario we are concerned about? > > On Mon, Jul 19, 2021 at 3:51 PM Ryan Blue <b...@tabular.io> wrote: > >> Thanks, Steven. Do you think that there is a potential problem with an >> attacker having access to where the state is stored and using that to >> inject code? Is this something we should just update to avoid it entirely? >> >> On Mon, Jul 19, 2021 at 3:43 PM Steven Wu <stevenz...@gmail.com> wrote: >> >>> I believe Flink source is the only place that uses Java serialization >>> for checkpoint state: https://github.com/apache/iceberg/issues/1698. >>> >>> @OpenInx <open...@gmail.com> already updated Flink sink to avoid the >>> Java serialization (long time ago) >>> >>> On Mon, Jul 19, 2021 at 1:53 PM Jack Ye <yezhao...@gmail.com> wrote: >>> >>>> Yes I totally agree that the distributed system itself should make sure >>>> the integrity of objects passing across nodes. I am more concerned about >>>> the Flink case where some information is persisted and can be modified to >>>> execute arbitrary code. Maybe people working on Flink can comment on this a >>>> bit more. >>>> >>>> Also, this is dangerous from the perspective that I do not need to >>>> modify any existing serialized object, but as long as I can communicate >>>> with the node that tries to run the deserialization logic of incoming >>>> payload, it will execute the code of getObject method even if the >>>> deserialization itself fails. And I believe this is an issue not only for >>>> Java serialization but also for Kryo serialization, basically any >>>> serialization mechanisms that try to bypass using the public constructor. >>>> >>>> I am definitely not an expert on this, I only noticed this while adding >>>> some features to Trino, where the community just completely avoids the use >>>> of Java serialization for such a reason. I think some actions are needed >>>> here: >>>> 1. make sure we fully understand the implication of our decision to use >>>> Java serialization, some warnings are likely needed for organizations that >>>> tries to adopt Iceberg to their own compute >>>> 2. introduce some mechanisms to allow only trusted classes to be >>>> deserialized, which I saw was done to fix some attacks like >>>> https://nvd.nist.gov/vuln/detail/CVE-2020-5413 for Kryo. I will >>>> explore the possibility to at least leverage the >>>> ValidatingObjectInputStream. >>>> 3. maybe allow some currently protected class constructors to be >>>> public, so that engines at least have the option to have their own >>>> deserialization mechanism when such an attack cannot be fully mitigated on >>>> the engine's side. >>>> >>>> -Jack >>>> >>>> >>>> On Mon, Jul 19, 2021 at 10:56 AM Ryan Blue <b...@tabular.io> wrote: >>>> >>>>> Jack, >>>>> >>>>> I might be incorrect here, but I'll at least throw out some thoughts. >>>>> If I understand correctly, the attacker requires access to modify some >>>>> serialized object so that deserialization leads to arbitrary code >>>>> execution. I think that the best way to protect against that is to avoid >>>>> making it possible for an attacker to modify serialized bytes. >>>>> >>>>> To my knowledge, Java serialization is used in two places: first, to >>>>> serialize objects between nodes, like sending a task to a Spark executor, >>>>> and second, to serialize some persistent state in Flink. Iceberg does not >>>>> use Java serialization for anything in the format or long-term storage. >>>>> For >>>>> the first case, I think that it is up to the distributed system passing >>>>> objects between nodes to secure the content, like using TLS for >>>>> connections >>>>> between nodes. Since Java serialization is used by the processing engine, >>>>> there isn't much Iceberg could do to change this and we have to rely on >>>>> Spark or Flink. >>>>> >>>>> For the second issue, I think our use of Java serialization to store >>>>> state is very limited, but we should take a look to make sure. I think >>>>> this >>>>> is one area where Iceberg made the choice to use Java serialization, so we >>>>> should look into it and fix it if possible... although I'm not entirely >>>>> sure how to avoid swapping out the state that gets loaded. >>>>> >>>>> Ryan >>>>> >>>>> On Sat, Jul 17, 2021 at 2:02 AM Jack Ye <yezhao...@gmail.com> wrote: >>>>> >>>>>> Hi everyone, >>>>>> >>>>>> We use Java serialization and deserialization a lot in Iceberg. I >>>>>> wonder if we have considered the potential of Java deserialization >>>>>> attack, >>>>>> where an attacker can replace serialized bytes to execute arbitrary code >>>>>> through the readObject method. >>>>>> >>>>>> Currently our SerializationUtil.deserializeFromBytes directly >>>>>> converts bytes to an ObjectInputStream. I know Apache commons have >>>>>> ValidatingObjectInputStream which can prevent the issue to some extent. >>>>>> >>>>>> Have we thought about this issue in the past? Are there any other >>>>>> suggestions? >>>>>> >>>>>> Best, >>>>>> Jack Ye >>>>>> >>>>> >>>>> >>>>> -- >>>>> Ryan Blue >>>>> Tabular >>>>> >>>> >> >> -- >> Ryan Blue >> Tabular >> > -- Ryan Blue Tabular