On Fri, Jan 21, 2022 at 6:48 PM Sean Owen <sro...@gmail.com> wrote:

> Continue on the ticket - I am not sure this is established. We would block
> a release for critical problems that are not regressions. This is not a
> data loss / 'deleting data' issue even if valid.
> You're welcome to provide feedback, but votes are for the PMC.

To be clear, users and developers are more than welcome to vote, but only
PMC votes are binding.
> On Fri, Jan 21, 2022 at 5:24 PM Bjørn Jørgensen <bjornjorgen...@gmail.com>
> wrote:
>
>> Ok, but deleting users' data without them knowing about it is never a
>> good idea. That's why I give this RC a -1.
>>
>> On Sat, Jan 22, 2022 at 12:16 AM Sean Owen <sro...@gmail.com> wrote:
>>
>>> (Bjorn - unless this is a regression, it would not block a release,
>>> even if it's a bug)
>>>
>>> On Fri, Jan 21, 2022 at 5:09 PM Bjørn Jørgensen
>>> <bjornjorgen...@gmail.com> wrote:
>>>
>>>> [x] -1 Do not release this package because it deletes all my columns
>>>> that contain only nulls.
>>>>
>>>> I have opened https://issues.apache.org/jira/browse/SPARK-37981 for
>>>> this bug.
>>>>
>>>> On Fri, Jan 21, 2022 at 9:45 PM Sean Owen <sro...@gmail.com> wrote:
>>>>
>>>>> (Are you suggesting this is a regression, or is it a general
>>>>> question? Here we're trying to figure out whether there are critical
>>>>> bugs introduced in 3.2.1 vs 3.2.0.)
>>>>>
>>>>> On Fri, Jan 21, 2022 at 1:58 PM Bjørn Jørgensen
>>>>> <bjornjorgen...@gmail.com> wrote:
>>>>>
>>>>>> Hi, I am wondering whether this is a bug or not.
>>>>>>
>>>>>> I have a lot of JSON files in which some columns contain only
>>>>>> "null" values.
>>>>>>
>>>>>> I start Spark with:
>>>>>>
>>>>>> import os
>>>>>> import re
>>>>>>
>>>>>> import numpy as np
>>>>>> import pandas as pd
>>>>>>
>>>>>> from pyspark import pandas as ps
>>>>>> from pyspark import SparkContext, SparkConf
>>>>>> from pyspark.sql import SparkSession
>>>>>> from pyspark.sql.functions import concat, concat_ws, lit, col, trim, expr
>>>>>> from pyspark.sql.types import StructType, StructField, StringType, IntegerType
>>>>>>
>>>>>> os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
>>>>>>
>>>>>> def get_spark_session(app_name: str, conf: SparkConf):
>>>>>>     conf.setMaster('local[*]')
>>>>>>     conf \
>>>>>>         .set('spark.driver.memory', '64g') \
>>>>>>         .set("fs.s3a.access.key", "minio") \
>>>>>>         .set("fs.s3a.secret.key", "") \
>>>>>>         .set("fs.s3a.endpoint", "http://192.168.1.127:9000") \
>>>>>>         .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
>>>>>>         .set("spark.hadoop.fs.s3a.path.style.access", "true") \
>>>>>>         .set("spark.sql.repl.eagerEval.enabled", "True") \
>>>>>>         .set("spark.sql.adaptive.enabled", "True") \
>>>>>>         .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
>>>>>>         .set("spark.sql.repl.eagerEval.maxNumRows", "10000")
>>>>>>     return SparkSession.builder.appName(app_name).config(conf=conf).getOrCreate()
>>>>>>
>>>>>> spark = get_spark_session("Falk", SparkConf())
>>>>>> # setLogLevel is a SparkContext method, not a conf key
>>>>>> spark.sparkContext.setLogLevel("ERROR")
>>>>>>
>>>>>> d3 = spark.read.option("multiline", "true").json(
>>>>>>     "/home/jovyan/notebooks/falk/data/norm_test/3/*.json")
>>>>>>
>>>>>> import pyspark
>>>>>>
>>>>>> # pandas-style (rows, columns) helper, monkey-patched onto DataFrame
>>>>>> def sparkShape(dataFrame):
>>>>>>     return (dataFrame.count(), len(dataFrame.columns))
>>>>>>
>>>>>> pyspark.sql.dataframe.DataFrame.shape = sparkShape
>>>>>> print(d3.shape())
>>>>>>
>>>>>> (653610, 267)
>>>>>>
>>>>>> d3.write.json("d3.json")
>>>>>>
>>>>>> d3 = spark.read.json("d3.json/*.json")
>>>>>> print(d3.shape())
>>>>>>
>>>>>> (653610, 186)
>>>>>>
>>>>>> So Spark drops 81 columns, and I believe all 81 of them contain
>>>>>> only nulls.
>>>>>>
>>>>>> Is this a bug, or is it intentional?
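>>>>>>
>>>>>> For what it's worth, here is a minimal sketch of what I suspect is
>>>>>> happening, plus two possible workarounds. It assumes the cause is
>>>>>> the JSON writer omitting null fields by default; the
>>>>>> ignoreNullFields write option and schema-on-read used below are
>>>>>> standard Spark 3.x APIs, while the paths and column names are only
>>>>>> illustrative:
>>>>>>
>>>>>> from pyspark.sql.types import StructType, StructField, LongType, StringType
>>>>>>
>>>>>> # Column "b" is entirely null.
>>>>>> schema = StructType([StructField("a", LongType()),
>>>>>>                      StructField("b", StringType())])
>>>>>> df = spark.createDataFrame([(1, None), (2, None)], schema)
>>>>>>
>>>>>> # By default the JSON writer drops null fields, so no trace of "b"
>>>>>> # is left in the files, and schema inference cannot recover it.
>>>>>> df.write.mode("overwrite").json("roundtrip.json")
>>>>>> print(spark.read.json("roundtrip.json/*.json").columns)  # ['a']
>>>>>>
>>>>>> # Workaround 1: write explicit nulls so the column survives.
>>>>>> df.write.mode("overwrite") \
>>>>>>     .option("ignoreNullFields", "false").json("roundtrip2.json")
>>>>>> print(spark.read.json("roundtrip2.json/*.json").columns)  # ['a', 'b']
>>>>>>
>>>>>> # Workaround 2: re-read with the original schema instead of inferring.
>>>>>> print(spark.read.schema(df.schema).json("roundtrip.json/*.json").columns)
>>>>>> # ['a', 'b']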
>>>>>> On Fri, Jan 21, 2022 at 4:59 AM huaxin gao <huaxin.ga...@gmail.com> wrote:
>>>>>>
>>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>>>> version 3.2.1. The vote is open until 8:00 PM Pacific time on
>>>>>>> January 25 and passes if a majority of +1 PMC votes are cast, with
>>>>>>> a minimum of 3 +1 votes.
>>>>>>>
>>>>>>> [ ] +1 Release this package as Apache Spark 3.2.1
>>>>>>> [ ] -1 Do not release this package because ...
>>>>>>>
>>>>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>>>>
>>>>>>> The tag to be voted on is v3.2.1-rc2 (commit
>>>>>>> 4f25b3f71238a00508a356591553f2dfa89f8290):
>>>>>>> https://github.com/apache/spark/tree/v3.2.1-rc2
>>>>>>>
>>>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc2-bin/
>>>>>>>
>>>>>>> Signatures used for Spark RCs can be found in this file:
>>>>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>>>
>>>>>>> The staging repository for this release can be found at:
>>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1398/
>>>>>>>
>>>>>>> The documentation corresponding to this release can be found at:
>>>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc2-docs/_site/
>>>>>>>
>>>>>>> The list of bug fixes going into 3.2.1 can be found at the following URL:
>>>>>>> https://s.apache.org/yu0cy
>>>>>>>
>>>>>>> This release is using the release script of the tag v3.2.1-rc2.
>>>>>>>
>>>>>>> FAQ
>>>>>>>
>>>>>>> =========================
>>>>>>> How can I help test this release?
>>>>>>> =========================
>>>>>>>
>>>>>>> If you are a Spark user, you can help us test this release by
>>>>>>> taking an existing Spark workload and running it on this release
>>>>>>> candidate, then reporting any regressions. If you're working in
>>>>>>> PySpark, you can set up a virtual env, install the current RC, and
>>>>>>> see if anything important breaks. In Java/Scala, you can add the
>>>>>>> staging repository to your project's resolvers and test with the RC
>>>>>>> (make sure to clean up the artifact cache before/after so you don't
>>>>>>> end up building with an out-of-date RC going forward).
>>>>>>>
>>>>>>> ===========================================
>>>>>>> What should happen to JIRA tickets still targeting 3.2.1?
>>>>>>> ===========================================
>>>>>>>
>>>>>>> The current list of open tickets targeted at 3.2.1 can be found at
>>>>>>> https://issues.apache.org/jira/projects/SPARK by searching for
>>>>>>> "Target Version/s" = 3.2.1. Committers should look at those and
>>>>>>> triage. Extremely important bug fixes, documentation, and API
>>>>>>> tweaks that impact compatibility should be worked on immediately.
>>>>>>> Everything else, please retarget to an appropriate release.
>>>>>>>
>>>>>>> ==================
>>>>>>> But my bug isn't fixed?
>>>>>>> ==================
>>>>>>>
>>>>>>> In order to make timely releases, we will typically not hold the
>>>>>>> release unless the bug in question is a regression from the
>>>>>>> previous release. That said, if there is something which is a
>>>>>>> regression that has not been correctly targeted, please ping me or
>>>>>>> a committer to help target the issue.
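>>>>>>>
>>>>>>> A sketch of the kind of PySpark smoke test described above -
>>>>>>> install the RC's pyspark package from the -bin/ directory into a
>>>>>>> fresh virtual env first; the workload below is only illustrative,
>>>>>>> not an official test:
>>>>>>>
>>>>>>> from pyspark.sql import SparkSession
>>>>>>> from pyspark.sql import functions as F
>>>>>>>
>>>>>>> spark = (SparkSession.builder
>>>>>>>          .master("local[2]")
>>>>>>>          .appName("rc-smoke-test")
>>>>>>>          .getOrCreate())
>>>>>>> assert spark.version == "3.2.1"  # confirm the RC is what's on the path
>>>>>>>
>>>>>>> # Tiny end-to-end workload: generate, aggregate, round-trip to Parquet.
>>>>>>> df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)
>>>>>>> assert df.groupBy("bucket").count().count() == 10
>>>>>>>
>>>>>>> df.write.mode("overwrite").parquet("/tmp/rc_smoke")
>>>>>>> assert spark.read.parquet("/tmp/rc_smoke").count() == 1_000_000
>>>>>>>
>>>>>>> spark.stop()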
>>>>>>
>>>>>> --
>>>>>> Bjørn Jørgensen
>>>>>> Vestre Aspehaug 4, 6010 Ålesund
>>>>>> Norge
>>>>>>
>>>>>> +47 480 94 297
>>>>>>

--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau