I closed the ticket as a duplicate of SPARK-29444. This behavior is neither a bug nor a regression, and there is already a documented writer (or global) option that can be used to modify it.
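For context, the option referenced above controls whether Spark's JSON writer emits fields whose value is null (the writer option `ignoreNullFields`, backed by the SQL config `spark.sql.jsonGenerator.ignoreNullFields`, added under SPARK-29444; the default is to drop them). The column loss in the thread below follows mechanically from that default plus schema inference on re-read. A minimal plain-Python sketch of the mechanism, not Spark's actual code (all names here are illustrative):

```python
import io
import json

# Two records where column "b" is null in every row.
rows = [{"a": 1, "b": None}, {"a": 2, "b": None}]

def write_json_lines(rows, drop_null_fields):
    """Serialize rows as JSON Lines, optionally skipping null-valued
    fields -- mimicking a JSON writer that ignores null fields."""
    buf = io.StringIO()
    for row in rows:
        out = {k: v for k, v in row.items()
               if not (drop_null_fields and v is None)}
        buf.write(json.dumps(out) + "\n")
    return buf.getvalue()

def infer_columns(text):
    """Infer the 'schema' as the union of keys seen in the data,
    the way a schema-inferring JSON reader does."""
    cols = set()
    for line in text.splitlines():
        cols |= json.loads(line).keys()
    return sorted(cols)

# With null fields dropped, "b" never appears in the written output,
# so schema inference on re-read cannot recover it:
assert infer_columns(write_json_lines(rows, drop_null_fields=True)) == ["a"]

# Keeping null fields preserves the column through the round trip:
assert infer_columns(write_json_lines(rows, drop_null_fields=False)) == ["a", "b"]
```

In Spark terms, writing with `df.write.option("ignoreNullFields", "false").json(path)` corresponds to the second case, so all-null columns survive the round trip.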
On 1/22/22 00:47, Sean Owen wrote:
> Continue on the ticket - I am not sure this is established. We would
> block a release for critical problems that are not regressions. This is
> not a data loss / 'deleting data' issue even if valid.
> You're welcome to provide feedback, but votes are for the PMC.
>
> On Fri, Jan 21, 2022 at 5:24 PM Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>
> Ok, but deleting users' data without them knowing it is never a good
> idea. That's why I give this RC -1.
>
> On Sat, Jan 22, 2022 at 00:16, Sean Owen <sro...@gmail.com> wrote:
>
> (Bjørn - unless this is a regression, it would not block a
> release, even if it's a bug)
>
> On Fri, Jan 21, 2022 at 5:09 PM Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>
> [x] -1 Do not release this package because it deletes
> all my columns with only null in them.
>
> I have opened https://issues.apache.org/jira/browse/SPARK-37981 for this bug.
>
> On Fri, Jan 21, 2022 at 21:45, Sean Owen <sro...@gmail.com> wrote:
>
> (Are you suggesting this is a regression, or is it a general
> question? Here we're trying to figure out whether there are
> critical bugs introduced in 3.2.1 vs 3.2.0.)
>
> On Fri, Jan 21, 2022 at 1:58 PM Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>
> Hi, I am wondering if it's a bug or not.
>
> I have a lot of JSON files, and some of their columns are all "null".
>
> I start Spark with:
>
> from pyspark import pandas as ps
> import re
> import numpy as np
> import os
> import pandas as pd
>
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import concat, concat_ws, lit, col, trim, expr
> from pyspark.sql.types import StructType, StructField, StringType, IntegerType
>
> os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
>
> def get_spark_session(app_name: str, conf: SparkConf):
>     conf.setMaster('local[*]')
>     conf \
>         .set('spark.driver.memory', '64g') \
>         .set("fs.s3a.access.key", "minio") \
>         .set("fs.s3a.secret.key", "") \
>         .set("fs.s3a.endpoint", "http://192.168.1.127:9000") \
>         .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
>         .set("spark.hadoop.fs.s3a.path.style.access", "true") \
>         .set("spark.sql.repl.eagerEval.enabled", "True") \
>         .set("spark.sql.adaptive.enabled", "True") \
>         .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
>         .set("spark.sql.repl.eagerEval.maxNumRows", "10000") \
>         .set("sc.setLogLevel", "error")
>     return SparkSession.builder.appName(app_name).config(conf=conf).getOrCreate()
>
> spark = get_spark_session("Falk", SparkConf())
>
> d3 = spark.read.option("multiline", "true").json("/home/jovyan/notebooks/falk/data/norm_test/3/*.json")
>
> import pyspark
> def sparkShape(dataFrame):
>     return (dataFrame.count(), len(dataFrame.columns))
> pyspark.sql.dataframe.DataFrame.shape = sparkShape
> print(d3.shape())
>
> (653610, 267)
>
> d3.write.json("d3.json")
>
> d3 = spark.read.json("d3.json/*.json")
> print(d3.shape())
>
> (653610, 186)
>
> So Spark is deleting 81 columns. I think that all of these 81 deleted
> columns have only null in them.
>
> Is this a bug, or has this been made on purpose?
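The 267-to-186 column drop in the round trip above happens because the writer omits null fields by default and the second read re-infers the schema from whatever fields remain in the files. Besides changing the writer option, one standard workaround is to give the reader an explicit schema instead of letting it infer one. A small stdlib sketch of that idea (illustrative plain Python, not the PySpark API):

```python
import json

# JSON Lines as produced by a writer that dropped null-valued fields:
# column "b" was all-null, so it is absent from every record.
text = '{"a": 1}\n{"a": 2}\n'

def read_with_schema(text, columns):
    """Read JSON Lines against a fixed column list, filling absent
    fields with None instead of silently narrowing the schema."""
    return [{c: rec.get(c) for c in columns}
            for rec in map(json.loads, text.splitlines())]

rows = read_with_schema(text, columns=["a", "b"])
assert rows == [{"a": 1, "b": None}, {"a": 2, "b": None}]
```

In PySpark the analogous move is `spark.read.schema(saved_schema).json(path)`, reusing the original DataFrame's schema so that all-null columns survive even when they are absent from the written files.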
>
> On Fri, Jan 21, 2022 at 04:59, huaxin gao <huaxin.ga...@gmail.com> wrote:
>
> Please vote on releasing the following candidate as Apache Spark
> version 3.2.1.
>
> The vote is open until 8:00pm Pacific time January 25 and passes if
> a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.2.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.2.1-rc2 (commit 4f25b3f71238a00508a356591553f2dfa89f8290):
> https://github.com/apache/spark/tree/v3.2.1-rc2
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc2-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1398/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc2-docs/_site/
>
> The list of bug fixes going into 3.2.1 can be found at the following URL:
> https://s.apache.org/yu0cy
>
> This release is using the release script of the tag v3.2.1-rc2.
>
> FAQ
>
> =========================
> How can I help test this release?
> =========================
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running it on this release candidate,
> then reporting any regressions. If you're working in PySpark you can
> set up a virtual env, install the current RC, and see if anything
> important breaks. In Java/Scala, you can add the staging repository to
> your project's resolvers and test with the RC (make sure to clean up
> the artifact cache before/after so you don't end up building with an
> out-of-date RC going forward).
>
> ===========================================
> What should happen to JIRA tickets still targeting 3.2.1?
> ===========================================
> The current list of open tickets targeted at 3.2.1 can be found at
> https://issues.apache.org/jira/projects/SPARK by searching for
> "Target Version/s" = 3.2.1
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else, please retarget to an
> appropriate release.
>
> ==================
> But my bug isn't fixed?
> ==================
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted, please ping me or a committer to
> help target the issue.
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297

-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC