Hi Alex, the situation for UBI9 doesn't look much different from Ubuntu:
registry.access.redhat.com/ubi9/ubi (redhat 9.3)
Total: 166 (UNKNOWN: 0, LOW: 138, MEDIUM: 28, HIGH: 0, CRITICAL: 0)

Full list: https://gist.github.com/merlimat/ba96b91ea49709bb218ddc3906bb9e95

--
Matteo Merli
<matteo.me...@gmail.com>

On Thu, Feb 15, 2024 at 9:10 AM Alexander Hall <ah...@teknoluxion.com.invalid> wrote:

> Reviving a previous tangent from this discussion. Using UBI9 as a base is also a great option. Some end-users use that as a base and copy the files from the pulsar and pulsar-all containers as an upstream source.
>
> -Alex H
>
> -----Original Message-----
> From: Matteo Merli <matteo.me...@gmail.com>
> Sent: Wednesday, February 14, 2024 2:01 PM
> To: david.chris...@discordapp.com.invalid
> Cc: dev@pulsar.apache.org
> Subject: Re: Re: [DISCUSS] PIP-324: Alpine Docker images
>
> Reviving the discussion thread.
>
> > For Netty, I think netty-transport-native-epoll is only built against glibc (https://netty.io/wiki/native-transports.html#using-the-linux-native-transport). Is there a workaround?
>
> Yes, there is a workaround for Netty. It works perfectly fine by including the glibc compatibility library. Same for Kinesis producer (side note: Kinesis SDK is the worst train wreck I've seen in many many years: it's a C++ binary that is spawned from Java and communicates through a pipe... anyway it works fine with the glibc compatibility lib).
>
> > Other than that, there is the DNS caching issue Lari mentioned.
>
> I think the DNS issue was already solved a few releases ago. In any case, it wouldn't affect Pulsar/BK since we use the Netty DNS client. In the same way, I believe that JDK also doesn't use the glibc-provided DNS client: that's why we configure the DNS cache directly in the JVM configuration.
>
> >> - Using a smaller base image like Alpine can save space.
> >>   The relative size of the JRE image for Alpine is about 45% smaller than the equivalent Ubuntu slim image.
> >> - The Ubuntu image has a few tens of CVEs in it, as reported by an automated container CVE scan tool, compared to 0 in Alpine.
>
> > These seem reasonable, but the true magnitude of benefit is likely lower in practice. The pulsar-all images are 2.7GB in size, so saving 166MB on the base + JRE install translates to just a 6% smaller image. Unless we expect other installed packages part of pulsar-all to gain additional space savings on Alpine, this difference seems very marginal in practice.
>
> `pulsar-all` is ready for separate discussion (I actually think we should discontinue that image).
>
> For `pulsar` image:
> * apache/pulsar:3.2.0 (which already does not include Presto anymore): 919 MB
> * alpine image wip: 505 MB
>
> There are additional ways we should explore to further reduce the image size (eg: removing unused JDK modules, Python packages, etc...)
>
> > Security-wise, I took a cursory look at the CVEs, and many of them are in libraries that aren’t used in a Pulsar deployment/are difficult to envision a practical exploit scenario. Automated scanning tool results should be taken with a grain of salt - they generate a lot of alerts, and many public container images throw off these CVE alerts nowadays. The counterargument is that only a fraction of the libraries indicated are even loaded at runtime, only some fraction of those end up potentially being exploitable, and only a smaller fraction have no fix/workaround. This isn’t to say reducing the vulnerability surface by using an image with less cruft in it is not a worthwhile endeavor -- I do think we should try to tackle it -- but I’m simply trying to be realistic about what our actual gains will be from switching to Alpine.
>
> Even though the CVEs might not be a "real" security issue, or not be exploitable in the context of Pulsar, it is really not how any security team would look at it. From their perspective, it becomes unmanageable to check and understand every single CVE to assess the potential specific threat.
>
> This is a real problem that is causing a lot of headaches to have Pulsar distribution taken seriously from a security posture perspective.
>
> Just have a glance at the security CVE issues in our last Pulsar release, released just a few days ago:
>
> apachepulsar/pulsar:3.2.0 (ubuntu 22.04)
> Total: 243 (UNKNOWN: 0, LOW: 146, MEDIUM: 93, HIGH: 4, CRITICAL: 0)
>
> Compare with Pulsar image based on Alpine:
>
> merlimat/pulsar:3.3.0-SNAPSHOT-f2a91a1 (alpine 3.19.1)
> Total: 0 (UNKNOWN: 0, LOW: 0, MEDIUM: 0, HIGH: 0, CRITICAL: 0)
>
> Full list here:
> https://gist.github.com/merlimat/ee7534992b21cae0b04c8c63f64456ff
> The above are all issues coming from Ubuntu base image.
>
> > It’s also worth mentioning we’d be moving away from other large open-source big data projects in a way. Spark [2], Flink [3], Kafka [4], Elasticsearch [5], and Trino [6] are based on Temurin/Ubuntu/ubi. In my brief search, I didn’t find familiar names of tools in the big data ecosystem with official images based on Alpine.
> >
> > Distroless would also remove almost everything from our base images, minimizing space, reducing the vulnerability surface, and by extension, reducing the CVE alerts from automated tooling. Apache Druid [7] has used Distroless for a while in their official images. We could achieve the same aims without any risk from musl/glibc, DNS quirks, or other hiccups that Alpine may have.
>
> Regarding the OpenJDK distribution, the team from Amazon Corretto publishes well-tested and supported Alpine packages.
> See https://aws.amazon.com/corretto
>
> I have created a WIP/draft PR to show the potential changes:
> https://github.com/apache/pulsar/pull/22054
>
> The image already passes all the integration tests and has been tested for a few weeks in a test cluster.
>
> I have pushed a Docker image for preview purposes:
> merlimat/pulsar:3.3.0-SNAPSHOT-f2a91a1
>
> https://hub.docker.com/layers/merlimat/pulsar/3.3.0-SNAPSHOT-f2a91a1/images/sha256-2d94832618bf30c02baa269bdf943c8f37aa5430258b7b4018f37ed120abb17a?context=explore
>
> Thanks,
> Matteo
>
> --
> Matteo Merli
> <matteo.me...@gmail.com>
>
> On Wed, Dec 20, 2023 at 12:49 PM David Christle <david.chris...@discordapp.com.invalid> wrote:
>
> > Are we sure the move to Alpine is worth the extensive performance testing and the risk of issues? Sticking with a popular glibc image like Temurin, Ubuntu/Debian, or ubi-minimal (mentioned also in this discussion) seems like a better path to me, without the risk of glibc vs musl issues. Using Distroless seems like another good potential option, as it would achieve the same aims as the Alpine move, with less potential risk.
> >
> > The DNS issues seen with Alpine are worth paying strong attention to. Someone running a Pulsar deployment using the images could have a very difficult time debugging library/glibc vs musl/DNS issues, due to their low-level nature. A fix for the DNS issue only landed less than a year ago [1]. Unless we have a compelling reason for Alpine, it may be safer to wait for more adoption/testing before choosing it for the official Pulsar images.
> >
> > The two main arguments in the PIP are:
> >
> > - Using a smaller base image like Alpine can save space. The relative size of the JRE image for Alpine is about 45% smaller than the equivalent Ubuntu slim image.
> >
> > - The Ubuntu image has a few tens of CVEs in it, as reported by an automated container CVE scan tool, compared to 0 in Alpine.
> >
> > These seem reasonable, but the true magnitude of benefit is likely lower in practice. The pulsar-all images are 2.7GB in size, so saving 166MB on the base + JRE install translates to just a 6% smaller image. Unless we expect other installed packages part of pulsar-all to gain additional space savings on Alpine, this difference seems very marginal in practice.
> >
> > Security-wise, I took a cursory look at the CVEs, and many of them are in libraries that aren’t used in a Pulsar deployment/are difficult to envision a practical exploit scenario. Automated scanning tool results should be taken with a grain of salt - they generate a lot of alerts, and many public container images throw off these CVE alerts nowadays. The counterargument is that only a fraction of the libraries indicated are even loaded at runtime, only some fraction of those end up potentially being exploitable, and only a smaller fraction have no fix/workaround. This isn’t to say reducing the vulnerability surface by using an image with less cruft in it is not a worthwhile endeavor -- I do think we should try to tackle it -- but I’m simply trying to be realistic about what our actual gains will be from switching to Alpine.
> >
> > It’s also worth mentioning we’d be moving away from other large open-source big data projects in a way. Spark [2], Flink [3], Kafka [4], Elasticsearch [5], and Trino [6] are based on Temurin/Ubuntu/ubi. In my brief search, I didn’t find familiar names of tools in the big data ecosystem with official images based on Alpine.
> >
> > Distroless would also remove almost everything from our base images, minimizing space, reducing the vulnerability surface, and by extension, reducing the CVE alerts from automated tooling. Apache Druid [7] has used Distroless for a while in their official images.
> > We could achieve the same aims without any risk from musl/glibc, DNS quirks, or other hiccups that Alpine may have.
> >
> > Regards,
> > David
> >
> > [1] https://gitlab.alpinelinux.org/alpine/tsc/-/issues/43#note_295556
> > [2] Apache Spark - Temurin - https://github.com/apache/flink-docker/tree/master/1.18
> > [3] Apache Flink - Temurin - https://github.com/apache/flink-docker/tree/master/1.18
> > [4] KIP-975: Docker Image for Apache Kafka - Temurin - https://cwiki.apache.org/confluence/display/KAFKA/KIP-975%3A+Docker+Image+for+Apache+Kafka
> > [5] Elasticsearch - Ubuntu & ubi-minimal - https://github.com/elastic/elasticsearch/blob/bdde29720a9e37224a90e5f186abbcbc73ff9351/distribution/docker/README.md
> > [6] Trino - ubi, after moving from Ubuntu - https://hub.docker.com/layers/trinodb/trino/435/images/sha256-9540a785c31c4ba9ad099ad99ae06ccd5ccca506e39b7d557effe1482309e05d
> > [7] Apache Druid - Distroless - https://github.com/apache/druid/blob/e373f6269251655f5be93ce895aee8dee8cc67dd/distribution/docker/Dockerfile#L4
> >
> > On 2023/12/13 17:06:12 Matteo Merli wrote:
> > > I don't think the compatibility for downstream users is going to be a big problem:
> > > 1. Most users don't need to modify the Pulsar image in a significant way
> > > 2. If they do, they won't be using the "latest" tag, but rather a specific version
> > > 3. Users who are dependent on the Ubuntu base image can stay on the 3.0 LTS release branch for the entire LTS lifespan
> > >
> > > I would avoid supporting 2 images at the same time because it would make it very hard to properly test them both.
> > >
> > > --
> > > Matteo Merli
> > > <mm...@apache.org>
> > >
> > > On Tue, Dec 12, 2023 at 8:57 PM Zixuan Liu <zi...@apache.org> wrote:
> > >
> > > > +1.
> > > >
> > > > It is a good idea to use the Alpine image to run Pulsar, as it is more secure.
> > > >
> > > > However, switching images may affect downstream users, and I am wondering if it is possible to provide multiple docker tags:
> > > > - latest: using the Ubuntu image
> > > > - alpine: using the Alpine image
> > > >
> > > > Thanks,
> > > > Zixuan
> > > >
> > > > On Wed, Dec 13, 2023 at 12:24, Yunze Xu <xy...@apache.org> wrote:
> > > >
> > > > > +1 to me. Alpine Linux is much more lightweight than Ubuntu.
> > > > >
> > > > > Thanks,
> > > > > Yunze
> > > > >
> > > > > On Wed, Dec 13, 2023 at 3:00 AM Matteo Merli <mm...@apache.org> wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I've created a new proposal to switch Pulsar base docker images from Ubuntu to Alpine Linux.
> > > > > >
> > > > > > Details and motivation in the PIP:
> > > > > > https://github.com/apache/pulsar/pull/21716
> > > > > >
> > > > > > Matteo
> > > > > >
> > > > > > --
> > > > > > Matteo Merli
> > > > > > <mm...@apache.org>
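
For reference, the two size comparisons quoted in this thread (166MB saved on a 2.7GB pulsar-all image, and 919 MB vs 505 MB for the pulsar image) can be checked with a few lines of arithmetic. This is only a sketch: the helper name `percent_saved` is made up for illustration, and 2.7GB is assumed to mean 2700 MB in decimal units, which the thread does not specify.

```python
def percent_saved(before_mb: float, after_mb: float) -> float:
    """Return the size reduction as a percentage of the original size."""
    return (before_mb - after_mb) / before_mb * 100.0


# Figures quoted in the thread. Assumption: 2.7GB is treated as 2700 MB.
pulsar_all_mb = 2700.0       # pulsar-all image size
base_jre_saving_mb = 166.0   # saving on base + JRE install
pulsar_ubuntu_mb = 919.0     # apache/pulsar:3.2.0
pulsar_alpine_mb = 505.0     # Alpine work-in-progress image

pulsar_all_pct = percent_saved(pulsar_all_mb, pulsar_all_mb - base_jre_saving_mb)
pulsar_pct = percent_saved(pulsar_ubuntu_mb, pulsar_alpine_mb)

print(f"pulsar-all: {pulsar_all_pct:.1f}% smaller")  # prints 6.1%
print(f"pulsar:     {pulsar_pct:.1f}% smaller")      # prints 45.0%
```

Under that assumption, the pulsar-all figure matches the "just a 6%" estimate, while the pulsar image without the extra connectors shrinks by roughly 45%, which is why the two sides of the discussion arrive at such different impressions of the benefit.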