Dear Spark Development Community,

Our team uses PySpark (3.5.x, currently testing 3.5.5), and we integrate Static 
Application Security Testing (SAST) and Software Composition Analysis (SCA) 
tools such as Checkmarx into the CI/CD pipelines for our Python projects.

We've observed that a significant number of Critical/High SCA findings flagged 
by Checkmarx (out of 1,046 results: 174 critical, 493 high risk, 319 medium, 
60 low risk) appear to originate from parsing the embedded pom.xml files of 
older Java JARs bundled with the PySpark distribution (typically found in the 
pyspark/jars/ directory).

A prominent example we've encountered involves 
com.fasterxml.jackson.core:jackson-databind.
Checkmarx identifies an old, vulnerable version (e.g., 2.3.2) as declared in 
the embedded pom.xml of a transitively included JAR like zjsonpatch-0.3.0.jar 
(which itself is located in .../pyspark/jars/).
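
For reference, here is roughly how we reproduce the finding outside Checkmarx. 
This is only a sketch against a pip-installed PySpark layout; the exact jar 
name and the location of the Maven descriptor inside the JAR are assumptions 
based on what we see in our environment:

    import os
    import re
    import zipfile

    import pyspark

    # Jars bundled with the pip-installed PySpark distribution.
    jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")
    jar_path = os.path.join(jars_dir, "zjsonpatch-0.3.0.jar")  # example from our scan

    with zipfile.ZipFile(jar_path) as jar:
        # Embedded Maven descriptors live under META-INF/maven/<groupId>/<artifactId>/.
        for name in jar.namelist():
            if name.startswith("META-INF/maven/") and name.endswith("/pom.xml"):
                pom = jar.read(name).decode("utf-8", errors="replace")
                if "jackson-databind" in pom:
                    print(name)
                    # Naive extraction; the version may also come from a property
                    # or a parent pom, in which case this prints nothing.
                    print(re.findall(
                        r"<artifactId>jackson-databind</artifactId>\s*"
                        r"<version>[^<]+</version>", pom))

Checkmarx appears to pick up exactly this kind of declaration, regardless of 
which jackson-databind JAR is actually shipped alongside it.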

However, our investigation of the contents of the pyspark/jars/ directory 
confirms that PySpark 3.5.x also bundles a much newer, non-vulnerable version, 
jackson-databind-2.15.2.jar, directly in the same directory. We understand that 
Spark's classpath loading ensures this newer, directly bundled 
jackson-databind-2.15.2.jar takes precedence at runtime, effectively mitigating 
the risk from the older version declared by zjsonpatch.
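
As a quick illustration of what is physically shipped (again just a sketch 
against our pip-installed PySpark 3.5.5; paths may differ for other 
installation methods):

    import os

    import pyspark

    # Jackson artifacts physically present in the bundled jars directory;
    # only these files end up on Spark's runtime classpath.
    jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")
    for jar_name in sorted(os.listdir(jars_dir)):
        if "jackson" in jar_name:
            print(jar_name)
    # On our installs this lists jackson-databind-2.15.2.jar and no 2.3.x jar;
    # the old version exists only as a declaration inside zjsonpatch's pom.xml.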

While this runtime behavior correctly mitigates the vulnerability, the 
discrepancy between the dependencies declared in the embedded pom.xml files of 
older JARs and the versions actually used at runtime produces a large number 
of SCA findings that are, in practice, false positives. Investigating, 
documenting, and justifying these findings for each scan requires considerable 
manual effort from our team.

Our question is whether there is a recommended approach, from the Spark 
project's perspective, to address this.
Specifically:
Could the pom.xml files embedded within these older, transitively included 
JARs (such as zjsonpatch-0.3.0.jar) be updated or reconciled during PySpark's 
packaging/distribution process so that they reflect the dependency versions 
(such as jackson-databind) that are actually loaded at runtime, given the 
presence of the newer, directly bundled versions?

Alternatively, is there a strategy for managing these bundled dependencies in a 
way that would provide clearer signals to SCA tools about the effective 
versions in use, thereby reducing these types of false positive alerts?
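
In the meantime, the interim workaround on our side is to generate a simple 
inventory of the artifacts that are effectively bundled and attach it to each 
scan's triage notes. A rough sketch follows; the filename-based version 
parsing and the report name are our own conventions, not anything provided by 
Spark:

    import os
    import re

    import pyspark

    # Heuristic: derive "<artifact> <version>" pairs from the bundled jar filenames,
    # e.g. "jackson-databind-2.15.2.jar" -> ("jackson-databind", "2.15.2").
    JAR_NAME = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")

    jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")
    inventory = []
    for jar_name in sorted(os.listdir(jars_dir)):
        match = JAR_NAME.match(jar_name)
        if match:
            inventory.append((match.group("artifact"), match.group("version")))

    # Plain-text inventory we reference when justifying the false-positive findings.
    with open("pyspark-effective-jars.txt", "w") as report:
        for artifact, version in inventory:
            report.write(f"{artifact} {version}\n")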

Our goal is to reduce the noise from SCA scans and allow our team to focus on 
genuine vulnerabilities.
Any insights, suggestions, or information on whether this has been considered, 
or if there are existing mechanisms to mitigate this from the Spark packaging 
side, would be greatly appreciated.

Thank you for your time and the fantastic work on Apache Spark.


Kind regards,


Mikalai


