Replace

> > git clone g...@github.com:apache/spark.git
> > git checkout -b spark-321 v3.2.1

with

git clone --branch branch-3.2 https://github.com/apache/spark.git

This will give you branch-3.2 as of today, which I suppose is what you call upstream:

https://github.com/apache/spark/commits/branch-3.2
and right now all tests in GitHub Actions are passing :)
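The full sequence this suggests (clone branch-3.2, build, then run the PySpark tests) might look like the sketch below; the `-Phive` profile and JDK 11 are carried over from the commands quoted later in the thread, and the `--modules` filter is just an assumption to re-run only the failing mllib suite rather than everything:

```shell
# Sketch, assuming a Unix shell, the bundled ./build/mvn, and a JDK 11 on PATH.
# Clone only the maintenance branch instead of checking out an RC tag.
git clone --branch branch-3.2 --single-branch https://github.com/apache/spark.git
cd spark

# Build Spark itself first, skipping the JVM tests (-Phive matches the quoted build).
./build/mvn -DskipTests clean package -Phive

# Re-run just the streaming-algorithms tests instead of the whole PySpark suite;
# pyspark-mllib is the module name used by the run-tests script.
./python/run-tests --python-executables=python3 --modules=pyspark-mllib
```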


On Wed, Jan 18, 2023 at 18:07 Sean Owen <sro...@gmail.com> wrote:

> Never seen those, but it's probably a difference in pandas/numpy
> versions. You can see the current CI/CD test results in GitHub Actions. But
> you want to use release versions, not an RC. 3.2.1 is not the latest
> version, and it's possible the tests were actually failing in the RC.
>
> On Wed, Jan 18, 2023, 10:57 AM Adam Chhina <amanschh...@gmail.com> wrote:
>
>> Bump,
>>
>> Just trying to see where I can find what tests are known failing for a
>> particular release, to ensure I’m building upstream correctly following the
>> build docs. I figured this would be the best place to ask as it pertains to
>> building and testing upstream (also more than happy to provide a PR for any
>> docs if required afterwards), however if there would be a more appropriate
>> place, please let me know.
>>
>> Best,
>>
>> Adam Chhina
>>
>> > On Dec 27, 2022, at 11:37 AM, Adam Chhina <amanschh...@gmail.com>
>> wrote:
>> >
>> > As part of an upgrade I was looking to run upstream PySpark unit tests
>> on `v3.2.1-rc2` before I applied some downstream patches and tested those.
>> However, I'm running into some issues with failing unit tests, which I'm
>> not sure are failing upstream or due to some step I missed in the build.
>> >
>> > The current failing tests (at least so far, since I believe the python
>> script exits on test failure):
>> > ```
>> > ======================================================================
>> > FAIL: test_train_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
>> > Test that error on test data improves as model is trained.
>> > ----------------------------------------------------------------------
>> > Traceback (most recent call last):
>> >   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 474, in test_train_prediction
>> >     eventually(condition, timeout=180.0)
>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 86, in eventually
>> >     lastValue = condition()
>> >   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 469, in condition
>> >     self.assertGreater(errors[1] - errors[-1], 2)
>> > AssertionError: 1.8960983527735014 not greater than 2
>> >
>> > ======================================================================
>> > FAIL: test_parameter_accuracy (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
>> > Test that the final value of weights is close to the desired value.
>> > ----------------------------------------------------------------------
>> > Traceback (most recent call last):
>> >   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 229, in test_parameter_accuracy
>> >     eventually(condition, timeout=60.0, catch_assertions=True)
>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 91, in eventually
>> >     raise lastValue
>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 82, in eventually
>> >     lastValue = condition()
>> >   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 226, in condition
>> >     self.assertAlmostEqual(rel, 0.1, 1)
>> > AssertionError: 0.23052813480829393 != 0.1 within 1 places (0.13052813480829392 difference)
>> >
>> > ======================================================================
>> > FAIL: test_training_and_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
>> > Test that the model improves on toy data with no. of batches
>> > ----------------------------------------------------------------------
>> > Traceback (most recent call last):
>> >   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 334, in test_training_and_prediction
>> >     eventually(condition, timeout=180.0)
>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 93, in eventually
>> >     raise AssertionError(
>> > AssertionError: Test failed due to timeout after 180 sec, with last condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74, 0.76, 0.78, 0.7, 0.78, 0.8, 0.74, 0.77, 0.75, 0.76, 0.76, 0.75, 0.78, 0.74, 0.64, 0.64, 0.71, 0.78, 0.76, 0.64, 0.68, 0.69, 0.72, 0.77
>> >
>> > ----------------------------------------------------------------------
>> > Ran 13 tests in 661.536s
>> >
>> > FAILED (failures=3, skipped=1)
>> >
>> > Had test failures in pyspark.mllib.tests.test_streaming_algorithms with /usr/local/bin/python3; see logs.
>> > ```
>> >
>> > Here's how I'm currently building Spark; I was using the
>> [building-spark](https://spark.apache.org/docs/3.2.1/building-spark.html)
>> docs as a reference.
>> > ```
>> > > git clone g...@github.com:apache/spark.git
>> > > git checkout -b spark-321 v3.2.1
>> > > ./build/mvn -DskipTests clean package -Phive
>> > > export JAVA_HOME=$(path/to/jdk/11)
>> > > ./python/run-tests
>> > ```
>> >
>> > Current Java version
>> > ```
>> > java -version
>> > openjdk version "11.0.17" 2022-10-18
>> > OpenJDK Runtime Environment Homebrew (build 11.0.17+0)
>> > OpenJDK 64-Bit Server VM Homebrew (build 11.0.17+0, mixed mode)
>> > ```
>> >
>> > Alternatively, I've also tried simply building Spark, creating a
>> python=3.9 venv, installing the requirements with `pip install -r
>> dev/requirements.txt`, and using that venv's interpreter to run the tests.
>> However, I was running into some failing pandas tests, which seemed to
>> come from a pandas version difference, since `requirements.txt` didn't
>> pin a version.
>> >
>> > I have a few questions regarding this:
>> > 1. Am I missing a build step to build Spark and run PySpark unit tests?
>> > 2. Where could I find whether an upstream test is failing for a
>> specific release?
>> > 3. Would it be possible to configure the `run-tests` script to run all
>> tests regardless of test failures?
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297
