Dear Spark Development Community,

Our team uses PySpark 3.5.x (currently testing 3.5.5), and we run Static Application Security Testing and Software Composition Analysis (SAST/SCA) with tools such as Checkmarx in the CI/CD pipelines of our Python projects.
We've observed that a large share of the Critical/High SCA findings flagged by Checkmarx (1,046 results in total: 174 critical, 493 high, 319 medium, 60 low) originate from the embedded pom.xml files of older Java JARs bundled with the PySpark distribution (typically under pyspark/jars/).

A prominent example involves com.fasterxml.jackson.core:jackson-databind. Checkmarx reports an old, vulnerable version (e.g., 2.3.2) because it is declared in the embedded pom.xml of a transitively included JAR, zjsonpatch-0.3.0.jar, which itself sits in .../pyspark/jars/. However, our investigation, and the contents of the pyspark/jars/ directory, confirm that PySpark 3.5.x also bundles a much newer, non-vulnerable version, jackson-databind-2.15.2.jar, in the same directory. As we understand it, Spark's classpath loading ensures that this newer, directly bundled jackson-databind-2.15.2.jar takes precedence at runtime, which effectively mitigates the risk from the older version declared by zjsonpatch. (A short reproduction sketch is appended in the P.S. below.)

While this runtime behavior mitigates the vulnerability, the discrepancy between the dependencies declared in the embedded pom.xml files of older JARs and the versions effectively used at runtime produces a large number of SCA findings that are, in practice, false positives. Investigating, documenting, and justifying each of these findings for every scan costs our team considerable manual effort.

Our question is whether the Spark project sees a possibility or a recommended approach to address this. Specifically:

1. Could the pom.xml files embedded in these older, transitively included JARs (such as zjsonpatch-0.3.0.jar) be updated or reconciled during PySpark's packaging/distribution process so that they reflect the dependency versions (such as jackson-databind) that are actually loaded at runtime, given the presence of newer, directly bundled versions?

2. Alternatively, is there a strategy for managing these bundled dependencies that would give SCA tools a clearer signal about the effective versions in use, thereby reducing this class of false-positive alerts?

Our goal is to reduce the noise from SCA scans so the team can focus on genuine vulnerabilities. Any insights, suggestions, or pointers to whether this has been considered before, or to existing mechanisms on the Spark packaging side, would be greatly appreciated.

Thank you for your time and for the fantastic work on Apache Spark.

Kind regards,
Mikalai
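
P.S. For reference, a minimal sketch of how we reproduce the discrepancy locally. It assumes a local PySpark 3.5.x installation; the jar name zjsonpatch-0.3.0.jar is the one we observed under pyspark/jars/, and the crude text scan of the embedded pom.xml is only illustrative (it is not how Checkmarx parses it):

import glob
import os
import zipfile

import pyspark

# Locate the jars bundled with the installed PySpark distribution.
jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")

# 1) The jackson-databind jar that is actually bundled (2.15.2 in our install).
print(glob.glob(os.path.join(jars_dir, "jackson-databind-*.jar")))

# 2) The version that zjsonpatch's embedded pom.xml declares -- this is what
#    the SCA tool reads and reports (2.3.2 in our Checkmarx findings).
with zipfile.ZipFile(os.path.join(jars_dir, "zjsonpatch-0.3.0.jar")) as jar:
    pom_entries = [
        name for name in jar.namelist()
        if name.startswith("META-INF/maven/") and name.endswith("pom.xml")
    ]
    for entry in pom_entries:
        print(entry)
        lines = jar.read(entry).decode("utf-8", errors="replace").splitlines()
        for i, line in enumerate(lines):
            if "jackson" in line:
                # Crude scan: print the matching dependency line plus the next
                # two lines, which typically include the declared <version>.
                for context_line in lines[i:i + 3]:
                    print("   ", context_line.strip())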
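And a second sketch for the runtime side, showing which jackson-databind the driver JVM actually loads (it uses the internal spark._jvm py4j gateway and standard Jackson/JVM APIs; the local-mode session is just for illustration):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[1]")
    .appName("jackson-runtime-version-check")
    .getOrCreate()
)

# Instantiate Jackson's ObjectMapper in the driver JVM via py4j and ask it
# which jackson-databind version is effectively on the classpath, and which
# jar that class was loaded from.
mapper = spark._jvm.com.fasterxml.jackson.databind.ObjectMapper()
print(mapper.version().toString())  # 2.15.2 in our environment, not 2.3.2
print(
    mapper.getClass()
    .getProtectionDomain()
    .getCodeSource()
    .getLocation()
    .toString()  # points at .../pyspark/jars/jackson-databind-2.15.2.jar
)

spark.stop()

Both checks only confirm what the jars directory already shows; they don't change what Checkmarx reports from the embedded pom.xml files, which is why we're asking about the packaging side.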