4.0.0-preview1 test report: running on Yarn

2024-06-17 Thread George Magiros
I successfully submitted and ran org.apache.spark.examples.SparkPi on Yarn
using 4.0.0-preview1.  However, I got it to work only after fixing an issue
with the Yarn nodemanagers (Hadoop v3.3.6 and v3.4.0).  Namely, the issue
was:
1. If the nodemanagers used Java 11, Yarn threw an error about not finding
the jdk.incubator.vector module.
2. If the nodemanagers used Java 17, which has the jdk.incubator.vector
module, Yarn threw a reflection error about a class not being found.

To resolve the error and successfully calculate pi,
1. I ran Java 17 on the nodemanagers, and
2. added 'export HADOOP_OPTS="--add-opens=java.base/java.lang=ALL-UNNAMED"'
to their conf/hadoop-env.sh file (see the sketch below).
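
For reference, here is a minimal sketch combining the two changes with an example
submission. The Java install locations and the examples jar filename are
illustrative assumptions, not exact details from my setup.

# On each nodemanager (running Java 17), in conf/hadoop-env.sh:
export HADOOP_OPTS="--add-opens=java.base/java.lang=ALL-UNNAMED"

# Hypothetical SparkPi submission from a Spark 4.0.0-preview1 client directory:
./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  examples/jars/spark-examples_2.13-4.0.0-preview1.jar 1000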

George


Re: [DISCUSS] Versionless Spark Programming Guide Proposal

2024-06-17 Thread Allison Wang
I'm a big +1 on this proposal. We should be able to continue improving the
programming guides to enhance their quality and make this process easier.

> Move the programming guide to the spark-website repo, to allow faster
iterations and releases

This is a great idea. It should work for the Structured Streaming programming
guides. PySpark's user guides (
https://spark.apache.org/docs/latest/api/python/user_guide/index.html) are
actually generated by Sphinx, so I am not sure whether they can also be moved
to the spark-website repo.

> I think the documentation should be version-specific - but separate from
the Spark release cadence - and can be updated multiple times after a Spark
release.

@Nimrod, here is the discussion thread to separate the Spark docs releases
from the Spark releases:
https://lists.apache.org/thread/1675rzxx5x4j2x03t9x0kfph8tlys0cx. This will
allow us to keep improving version-specific docs as well.



On Tue, Jun 11, 2024 at 4:00 PM serge rielau.com  wrote:

> I think some of the issues raised here are not really common.
> Examples should follow best practice.
> It would be odd to have an example that exploits ansi.enabled=false to
> e.g. overflow an integer.
> Instead an example that works with ansi mode will typically work perfectly
> fine in an older version, especially at the level of discussion here, which
> is towards starter guides.
> What can happen of course is that best practice in a new version is
> different from best practice in an older version.
> But even then we want to bias towards the new version to bring people
> along.
> The old "workaround" and the new best practice can be shown with a
> disclaimer regarding the version they apply to. (I.e., we version WITHIN
> the page.)
> Note that for built-in functions, for example, we already do this: we state
> when a function was introduced.
>
> IMHO the value of a unified doc tree cannot be overstated when it comes
> to searchability (SEO).
>
>
> On Jun 11, 2024, at 11:37 AM, Wenchen Fan  wrote:
>
> Shall we decouple these two decisions?
>
>- Move the programming guide to the spark-website repo, to allow
>faster iterations and releases
>- Make programming guide version-less
>
> I think the downside of moving the programming guide to the spark-website
> repo is almost negligible: you may need to have PRs in both the Spark and
> spark-website repos for major features that need to be mentioned in the
> programming guide. The release process may need more steps to build the doc
> site.
>
> We can have more discussions on version-less docs. Today we upload the full
> doc site for each maintenance release, which is a big waste as the content is
> almost the same as in the previous maintenance release. As a result, git
> operations on the spark-website repo are quite slow today, as this repo is
> too big. I think we should at least have a single programming guide for
> each feature release.
>
>
> On Tue, Jun 11, 2024 at 10:36 AM Neil Ramaswamy 
> wrote:
>
>> There are two issues and one main benefit that I see with versioned
>> programming guides:
>>
>>- *Issue 1*: We often retroactively realize that code snippets have
>>bugs and explanations are confusing (see examples: dropDuplicates,
>>dropDuplicatesWithinWatermark).
>>Without backporting to older guides, I don't think that users can have, as
>>Mridul says, "reasonable confidence that features, functionality and
>>examples mentioned will work with that released Spark version". In this
>>sense, I definitely disagree with Nimrod's position of "working on updated
>>versions and not working with old versions anyway." To have confidence in
>>versioned programming guides, we *must* have a system for backporting
>>and re-releasing.
>>- *Issue 2*: If programming guides live in the Spark website, you now
>>need maintenance releases in Spark to get those changes to production
>>(i.e. spark-website). Historically, Spark does *not* create maintenance
>>releases frequently, especially not just for a docs change. So, we'd need
>>to break precedent (this will create potentially dozens of minor releases,
>>far more than what we do today), and the person making docs changes needs
>>to rebuild the docs site and create one PR in spark-website for *every*
>>version they change. Fixing a code typo in 4 versions? You need 4
>>maintenance releases, and 4 more PRs.
>>- *Benefit 1*: versioned docs don't have to caveat what features are
>>available in prose.
>>
>>
>> Personally, I think it's fine to caveat what features are available in
>> prose. For the rare case where we have *completely* incompatible Spark
>> code (which should be exceedingly rare), we can provide different code
>> snippets. As Wenchen points out, if we *do* have 100 mutually
>> incompatible versions, we have an issue.

Re: 4.0.0-preview1 test report: running on Yarn

2024-06-17 Thread Wenchen Fan
Thanks for sharing! Yes, Spark 4.0 is built using Java 17.

On Tue, Jun 18, 2024 at 5:07 AM George Magiros  wrote:

> I successfully submitted and ran org.apache.spark.examples.SparkPi on Yarn
> using 4.0.0-preview1.  However, I got it to work only after fixing an issue
> with the Yarn nodemanagers (Hadoop v3.3.6 and v3.4.0).  Namely, the issue
> was:
> 1. If the nodemanagers used Java 11, Yarn threw an error about not finding
> the jdk.incubator.vector module.
> 2. If the nodemanagers used Java 17, which has the jdk.incubator.vector
> module, Yarn threw a reflection error about a class not being found.
>
> To resolve the error and successfully calculate pi,
> 1. I ran Java 17 on the nodemanagers, and
> 2. added 'export
> HADOOP_OPTS="--add-opens=java.base/java.lang=ALL-UNNAMED"' to their
> conf/hadoop-env.sh file.
>
> George
>
>


Re: 4.0.0-preview1 test report: running on Yarn

2024-06-17 Thread Cheng Pan
You don’t need to upgrade Java for HDFS and YARN. Just keep using Java 8 for 
Hadoop and set JAVA_HOME to Java 17 for Spark applications[1].

0. Install Java 17 on all nodes, for example, under /opt/openjdk-17

1. Modify $SPARK_CONF_DIR/spark-env.sh
export JAVA_HOME=/opt/openjdk-17

2. Modify $SPARK_CONF_DIR/spark-defaults.conf
spark.yarn.appMasterEnv.JAVA_HOME=/opt/openjdk-17
spark.executorEnv.JAVA_HOME=/opt/openjdk-17

[1] 
https://github.com/awesome-kyuubi/hadoop-testing/commit/9f7c0d7388dfc7fbe6e4658515a6c28d5ba93c8e
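
For illustration, the same two settings can also be passed per application at
submit time. This is only a sketch: it assumes the /opt/openjdk-17 path above,
and the examples jar path is hypothetical.

# Submit SparkPi with Java 17 for the AM and executors, while Hadoop stays on Java 8:
./bin/spark-submit \
  --master yarn \
  --conf spark.yarn.appMasterEnv.JAVA_HOME=/opt/openjdk-17 \
  --conf spark.executorEnv.JAVA_HOME=/opt/openjdk-17 \
  --class org.apache.spark.examples.SparkPi \
  examples/jars/spark-examples_2.13-4.0.0-preview1.jar 1000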

Thanks,
Cheng Pan


> On Jun 18, 2024, at 02:00, George Magiros  wrote:
> 
> I successfully submitted and ran org.apache.spark.examples.SparkPi on Yarn 
> using 4.0.0-preview1.  However, I got it to work only after fixing an issue
> with the Yarn nodemanagers (Hadoop v3.3.6 and v3.4.0).  Namely, the issue was:
> 1. If the nodemanagers used Java 11, Yarn threw an error about not finding
> the jdk.incubator.vector module.
> 2. If the nodemanagers used Java 17, which has the jdk.incubator.vector
> module, Yarn threw a reflection error about a class not being found.
> 
> To resolve the error and successfully calculate pi, 
> 1. I ran Java 17 on the nodemanagers, and
> 2. added 'export HADOOP_OPTS="--add-opens=java.base/java.lang=ALL-UNNAMED"' 
> to their conf/hadoop-env.sh file.
> 
> George
> 



unsubscribe

2024-06-17 Thread Cenk Ariöz
unsubscribe