I closed the ticket as a duplicate of SPARK-29444.

This behavior is neither a bug nor a regression, and there is already a
documented writer (or global) option that can be used to modify it.
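
For reference, a minimal sketch of that knob, assuming the names added by
SPARK-29444 (the JSON writer option "ignoreNullFields" and the session
config "spark.sql.jsonGenerator.ignoreNullFields"), with df standing in
for any DataFrame:

    # Per write: keep null-valued fields in the generated JSON
    df.write.option("ignoreNullFields", "false").json("out.json")

    # Or globally, for the whole session
    spark.conf.set("spark.sql.jsonGenerator.ignoreNullFields", "false")
    df.write.json("out.json")

With either setting the null fields are written out explicitly, so columns
that contain only nulls should survive a write/read round trip instead of
being dropped from the generated JSON objects.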

On 1/22/22 00:47, Sean Owen wrote:
> Continue on the ticket - I am not sure this is established. We would
> block a release for critical problems that are not regressions. This is
> not a data loss / 'deleting data' issue even if valid.
> You're welcome to provide feedback but votes are for the PMC.
> 
> On Fri, Jan 21, 2022 at 5:24 PM Bjørn Jørgensen
> <bjornjorgen...@gmail.com> wrote:
> 
>     Ok, but deleting users' data without their knowledge is never a good
>     idea. That's why I am giving this RC a -1.
> 
>     On Sat, Jan 22, 2022 at 00:16, Sean Owen <sro...@gmail.com> wrote:
> 
>         (Bjorn - unless this is a regression, it would not block a
>         release, even if it's a bug)
> 
>         On Fri, Jan 21, 2022 at 5:09 PM Bjørn Jørgensen
>         <bjornjorgen...@gmail.com> wrote:
> 
> 
>                     [x] -1 Do not release this package because it
>                     deletes all my columns that contain only null values.
> 
> 
>             I have opened https://issues.apache.org/jira/browse/SPARK-37981
>             for this bug.
> 
> 
> 
> 
>             On Fri, Jan 21, 2022 at 21:45, Sean Owen <sro...@gmail.com> wrote:
> 
>                 (Are you suggesting this is a regression, or is it a
>                 general question? Here we're trying to figure out
>                 whether there are critical bugs introduced in 3.2.1 vs
>                 3.2.0.)
> 
>                 On Fri, Jan 21, 2022 at 1:58 PM Bjørn Jørgensen
>                 <bjornjorgen...@gmail.com> wrote:
> 
>                     Hi, I am wondering whether this is a bug or not.
> 
>                     I have a lot of JSON files, some of which have
>                     columns that contain only "null" values.
> 
>                     I start Spark with:
> 
>                     from pyspark import pandas as ps
>                     import re
>                     import numpy as np
>                     import os
>                     import pandas as pd
> 
>                     from pyspark import SparkContext, SparkConf
>                     from pyspark.sql import SparkSession
>                     from pyspark.sql.functions import concat, concat_ws, lit, col, trim, expr
>                     from pyspark.sql.types import StructType, StructField, StringType, IntegerType
> 
>                     os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
> 
>                     def get_spark_session(app_name: str, conf: SparkConf):
>                         conf.setMaster('local[*]')
>                         conf \
>                           .set('spark.driver.memory', '64g') \
>                           .set("fs.s3a.access.key", "minio") \
>                           .set("fs.s3a.secret.key", "") \
>                           .set("fs.s3a.endpoint", "http://192.168.1.127:9000") \
>                           .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
>                           .set("spark.hadoop.fs.s3a.path.style.access", "true") \
>                           .set("spark.sql.repl.eagerEval.enabled", "True") \
>                           .set("spark.sql.adaptive.enabled", "True") \
>                           .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
>                           .set("spark.sql.repl.eagerEval.maxNumRows", "10000") \
>                           .set("sc.setLogLevel", "error")
> 
>                         return SparkSession.builder.appName(app_name).config(conf=conf).getOrCreate()
> 
>                     spark = get_spark_session("Falk", SparkConf())
> 
>                     d3 = spark.read.option("multiline", "true").json("/home/jovyan/notebooks/falk/data/norm_test/3/*.json")
> 
>                     import pyspark
>                     def sparkShape(dataFrame):
>                         return (dataFrame.count(), len(dataFrame.columns))
>                     pyspark.sql.dataframe.DataFrame.shape = sparkShape
>                     print(d3.shape())
> 
> 
>                     (653610, 267)
> 
> 
>                     d3.write.json("d3.json")
> 
> 
>                     d3 = spark.read.json("d3.json/*.json")
> 
>                     import pyspark
>                     def sparkShape(dataFrame):
>                         return (dataFrame.count(), len(dataFrame.columns))
>                     pyspark.sql.dataframe.DataFrame.shape = sparkShape
>                     print(d3.shape())
> 
>                     (653610, 186)
> 
> 
>                     So Spark is dropping 81 columns. I think all of
>                     these 81 dropped columns contain only null values.
> 
>                     Is this a bug, or is it intentional?
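> 
>                     One quick way to check that the dropped columns
>                     really are all null (a rough sketch, not taken from
>                     the notebook above; it assumes d3 is the original
>                     267-column frame):
> 
>                     from pyspark.sql import functions as F
> 
>                     # count() ignores nulls, so a column whose non-null
>                     # count is 0 contains nothing but nulls
>                     counts = d3.select([F.count(F.col(c)).alias(c) for c in d3.columns]).first().asDict()
>                     all_null = [c for c, n in counts.items() if n == 0]
>                     print(len(all_null), "columns contain only nulls")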
> 
> 
>                     On Fri, Jan 21, 2022 at 04:59, huaxin gao
>                     <huaxin.ga...@gmail.com> wrote:
> 
> 
>                                 Please vote on releasing the following
>                                 candidate as Apache Spark version 3.2.1.
>                                 The vote is open until 8:00pm Pacific
>                                 time January 25 and passes if a majority
>                                 of +1 PMC votes are cast, with a minimum
>                                 of 3 +1 votes.
> 
>                                 [ ] +1 Release this package as Apache Spark 3.2.1
>                                 [ ] -1 Do not release this package because ...
> 
>                                 To learn more about Apache Spark, please
>                                 see http://spark.apache.org/
> 
>                                 The tag to be voted on is v3.2.1-rc2
>                                 (commit 4f25b3f71238a00508a356591553f2dfa89f8290):
>                                 https://github.com/apache/spark/tree/v3.2.1-rc2
> 
>                                 The release files, including signatures,
>                                 digests, etc. can be found at:
>                                 https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc2-bin/
> 
>                                 Signatures used for Spark RCs can be
>                                 found in this file:
>                                 https://dist.apache.org/repos/dist/dev/spark/KEYS
> 
>                                 The staging repository for this release
>                                 can be found at:
>                                 https://repository.apache.org/content/repositories/orgapachespark-1398/
> 
>                                 The documentation corresponding to this
>                                 release can be found at:
>                                 https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc2-docs/_site/
> 
>                                 The list of bug fixes going into 3.2.1
>                                 can be found at the following URL:
>                                 https://s.apache.org/yu0cy
> 
>                                 This release is using the release script
>                                 of the tag v3.2.1-rc2.
> 
>                                 FAQ
> 
>                                 =========================
>                                 How can I help test this release?
>                                 =========================
>                                 If you are a Spark user, you can help us
>                                 test this release by taking an existing
>                                 Spark workload and running it on this
>                                 release candidate, then reporting any
>                                 regressions. If you're working in
>                                 PySpark, you can set up a virtual env,
>                                 install the current RC, and see if
>                                 anything important breaks. In Java/Scala,
>                                 you can add the staging repository to
>                                 your project's resolvers and test with
>                                 the RC (make sure to clean up the
>                                 artifact cache before/after so you don't
>                                 end up building with an out-of-date RC
>                                 going forward).
> 
>                                 ===========================================
>                                 What should happen to JIRA tickets still
>                                 targeting 3.2.1?
>                                 ===========================================
>                                 The current list of open tickets targeted
>                                 at 3.2.1 can be found at
>                                 https://issues.apache.org/jira/projects/SPARK
>                                 by searching for "Target Version/s" = 3.2.1.
>                                 Committers should look at those and
>                                 triage. Extremely important bug fixes,
>                                 documentation, and API tweaks that impact
>                                 compatibility should be worked on
>                                 immediately. Everything else please
>                                 retarget to an appropriate release.
> 
>                                 ==================
>                                 But my bug isn't fixed?
>                                 ==================
>                                 In order to make timely releases, we will
>                                 typically not hold the release unless the
>                                 bug in question is a regression from the
>                                 previous release. That being said, if
>                                 there is something which is a regression
>                                 that has not been correctly targeted,
>                                 please ping me or a committer to help
>                                 target the issue.
> 
> 
> 
> 
> 
> 
>     -- 
>     Bjørn Jørgensen
>     Vestre Aspehaug 4, 6010 Ålesund
>     Norge
> 
>     +47 480 94 297
> 


-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC

