[x] -1 Do not release this package, because it deletes all of my columns that contain only null values.
I have opened https://issues.apache.org/jira/browse/SPARK-37981 for this bug.

On Fri, 21 Jan 2022 at 21:45, Sean Owen <sro...@gmail.com> wrote:

> (Are you suggesting this is a regression, or is it a general question?
> Here we're trying to figure out whether there are critical bugs introduced
> in 3.2.1 vs 3.2.0.)
>
> On Fri, Jan 21, 2022 at 1:58 PM Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>
>> Hi, I am wondering whether this is a bug or not.
>>
>> I have a lot of JSON files, some of which have columns that are all null.
>>
>> I start Spark with:
>>
>> from pyspark import pandas as ps
>> import re
>> import numpy as np
>> import os
>> import pandas as pd
>>
>> from pyspark import SparkContext, SparkConf
>> from pyspark.sql import SparkSession
>> from pyspark.sql.functions import concat, concat_ws, lit, col, trim, expr
>> from pyspark.sql.types import StructType, StructField, StringType, IntegerType
>>
>> os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
>>
>> def get_spark_session(app_name: str, conf: SparkConf):
>>     conf.setMaster('local[*]')
>>     conf \
>>         .set('spark.driver.memory', '64g') \
>>         .set("fs.s3a.access.key", "minio") \
>>         .set("fs.s3a.secret.key", "") \
>>         .set("fs.s3a.endpoint", "http://192.168.1.127:9000") \
>>         .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
>>         .set("spark.hadoop.fs.s3a.path.style.access", "true") \
>>         .set("spark.sql.repl.eagerEval.enabled", "True") \
>>         .set("spark.sql.adaptive.enabled", "True") \
>>         .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
>>         .set("spark.sql.repl.eagerEval.maxNumRows", "10000") \
>>         .set("sc.setLogLevel", "error")
>>
>>     return SparkSession.builder.appName(app_name).config(conf=conf).getOrCreate()
>>
>> spark = get_spark_session("Falk", SparkConf())
>>
>> d3 = spark.read.option("multiline", "true").json("/home/jovyan/notebooks/falk/data/norm_test/3/*.json")
>>
>> import pyspark
>> def sparkShape(dataFrame):
>>     return (dataFrame.count(), len(dataFrame.columns))
>> pyspark.sql.dataframe.DataFrame.shape = sparkShape
>> print(d3.shape())
>>
>> (653610, 267)
>>
>> d3.write.json("d3.json")
>>
>> d3 = spark.read.json("d3.json/*.json")
>> print(d3.shape())
>>
>> (653610, 186)
>>
>> So Spark is dropping 81 columns. I think all 81 of the dropped columns
>> contain only null values.
>>
>> Is this a bug, or is it intentional?
>>
>> On Fri, 21 Jan 2022 at 04:59, huaxin gao <huaxin.ga...@gmail.com> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 3.2.1. The vote is open until 8:00pm Pacific time January 25 and passes if
>>> a majority of +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.2.1
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v3.2.1-rc2 (commit
>>> 4f25b3f71238a00508a356591553f2dfa89f8290):
>>> https://github.com/apache/spark/tree/v3.2.1-rc2
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc2-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1398/
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc2-docs/_site/
>>>
>>> The list of bug fixes going into 3.2.1 can be found at the following URL:
>>> https://s.apache.org/yu0cy
>>>
>>> This release is using the release script of the tag v3.2.1-rc2.
>>>
>>> FAQ
>>>
>>> =========================
>>> How can I help test this release?
>>> =========================
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload, running it on this release candidate, and
>>> reporting any regressions. If you're working in PySpark, you can set up a
>>> virtual env, install the current RC, and see if anything important breaks.
>>> In Java/Scala, you can add the staging repository to your project's
>>> resolvers and test with the RC (make sure to clean up the artifact cache
>>> before/after so you don't end up building with an out-of-date RC going
>>> forward).
>>>
>>> ===========================================
>>> What should happen to JIRA tickets still targeting 3.2.1?
>>> ===========================================
>>> The current list of open tickets targeted at 3.2.1 can be found at
>>> https://issues.apache.org/jira/projects/SPARK by searching for "Target
>>> Version/s" = 3.2.1. Committers should look at those and triage. Extremely
>>> important bug fixes, documentation, and API tweaks that impact
>>> compatibility should be worked on immediately. Everything else, please
>>> retarget to an appropriate release.
>>>
>>> ==================
>>> But my bug isn't fixed?
>>> ==================
>>> In order to make timely releases, we will typically not hold the release
>>> unless the bug in question is a regression from the previous release. That
>>> being said, if there is a regression that has not been correctly targeted,
>>> please ping me or a committer to help target the issue.
>>
>> --
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norge
>>
>> +47 480 94 297

--
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297
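[Editorial note] The disappearing columns reported above can be explained without a Spark writer bug: Spark's JSON writer omits null-valued fields by default (the JSON data source option `ignoreNullFields`, backed by `spark.sql.jsonGenerator.ignoreNullFields`, defaults to true), so a column that is null in every row is never written at all, and schema inference on re-read can only see keys that appear in the data. Below is a minimal plain-Python sketch of that mechanism, not Spark itself; the two-column `rows` data is made up for illustration:

```python
import json

# Hypothetical data mimicking the report: column "b" is null in every row.
rows = [{"a": 1, "b": None}, {"a": 2, "b": None}]

def write_json_lines(records, ignore_null_fields=True):
    """Mimic Spark's JSON writer: with ignoreNullFields=true (the default),
    null-valued keys are simply omitted from each output record."""
    lines = []
    for rec in records:
        if ignore_null_fields:
            rec = {k: v for k, v in rec.items() if v is not None}
        lines.append(json.dumps(rec))
    return lines

def infer_columns(lines):
    """Mimic schema inference on read: the schema is the union of the
    keys actually present in the data."""
    cols = set()
    for line in lines:
        cols.update(json.loads(line))
    return sorted(cols)

print(infer_columns(write_json_lines(rows)))                            # ['a']
print(infer_columns(write_json_lines(rows, ignore_null_fields=False)))  # ['a', 'b']
```

In Spark itself the all-null columns should survive the round trip if they are written with `d3.write.option("ignoreNullFields", "false").json("d3.json")`, or if the original schema is captured and passed explicitly on re-read instead of relying on inference.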