[x] -1 Do not release this package, because it deletes all of my columns that contain only null values.
I have opened https://issues.apache.org/jira/browse/SPARK-37981 for this bug.

On Fri, 21 Jan 2022 at 21:45, Sean Owen <sro...@gmail.com> wrote:

> (Are you suggesting this is a regression, or is it a general question?
> Here we're trying to figure out whether there are critical bugs introduced
> in 3.2.1 vs 3.2.0.)
>
> On Fri, Jan 21, 2022 at 1:58 PM Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>
>> Hi, I am wondering whether this is a bug or not.
>>
>> I have a lot of JSON files, some of which have columns that are all null.
>>
>> I start Spark with:
>>
>> from pyspark import pandas as ps
>> import re
>> import numpy as np
>> import os
>> import pandas as pd
>>
>> from pyspark import SparkContext, SparkConf
>> from pyspark.sql import SparkSession
>> from pyspark.sql.functions import concat, concat_ws, lit, col, trim, expr
>> from pyspark.sql.types import StructType, StructField, StringType, IntegerType
>>
>> os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
>>
>> def get_spark_session(app_name: str, conf: SparkConf):
>>     conf.setMaster('local[*]')
>>     conf \
>>         .set('spark.driver.memory', '64g') \
>>         .set("fs.s3a.access.key", "minio") \
>>         .set("fs.s3a.secret.key", "") \
>>         .set("fs.s3a.endpoint", "http://192.168.1.127:9000") \
>>         .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
>>         .set("spark.hadoop.fs.s3a.path.style.access", "true") \
>>         .set("spark.sql.repl.eagerEval.enabled", "True") \
>>         .set("spark.sql.adaptive.enabled", "True") \
>>         .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
>>         .set("spark.sql.repl.eagerEval.maxNumRows", "10000") \
>>         .set("sc.setLogLevel", "error")
>>
>>     return SparkSession.builder.appName(app_name).config(conf=conf).getOrCreate()
>>
>> spark = get_spark_session("Falk", SparkConf())
>>
>> d3 = spark.read.option("multiline", "true").json("/home/jovyan/notebooks/falk/data/norm_test/3/*.json")
>>
>> import pyspark
>> def sparkShape(dataFrame):
>>     return (dataFrame.count(), len(dataFrame.columns))
>> pyspark.sql.dataframe.DataFrame.shape = sparkShape
>> print(d3.shape())
>>
>> (653610, 267)
>>
>> d3.write.json("d3.json")
>>
>> d3 = spark.read.json("d3.json/*.json")
>> print(d3.shape())
>>
>> (653610, 186)
>>
>> So Spark is dropping 81 columns. I think all 81 of the dropped columns
>> contain only null values.
>>
>> Is this a bug, or is it intentional?
>>
>> On Fri, 21 Jan 2022 at 04:59, huaxin gao <huaxin.ga...@gmail.com> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 3.2.1. The vote is open until 8:00pm Pacific time January 25 and passes if
>>> a majority of +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.2.1
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v3.2.1-rc2 (commit
>>> 4f25b3f71238a00508a356591553f2dfa89f8290):
>>> https://github.com/apache/spark/tree/v3.2.1-rc2
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc2-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1398/
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc2-docs/_site/
>>>
>>> The list of bug fixes going into 3.2.1 can be found at the following URL:
>>> https://s.apache.org/yu0cy
>>>
>>> This release is using the release script of the tag v3.2.1-rc2.
>>>
>>> FAQ
>>>
>>> =========================
>>> How can I help test this release?
>>> =========================
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload, running it on this release candidate, and
>>> reporting any regressions. If you're working in PySpark, you can set up a
>>> virtual env, install the current RC, and see if anything important breaks.
>>> In Java/Scala, you can add the staging repository to your project's
>>> resolvers and test with the RC (make sure to clean up the artifact cache
>>> before/after so you don't end up building with an out-of-date RC going
>>> forward).
>>>
>>> ===========================================
>>> What should happen to JIRA tickets still targeting 3.2.1?
>>> ===========================================
>>> The current list of open tickets targeted at 3.2.1 can be found at
>>> https://issues.apache.org/jira/projects/SPARK by searching for "Target
>>> Version/s" = 3.2.1. Committers should look at those and triage. Extremely
>>> important bug fixes, documentation, and API tweaks that impact
>>> compatibility should be worked on immediately. Everything else, please
>>> retarget to an appropriate release.
>>>
>>> ==================
>>> But my bug isn't fixed?
>>> ==================
>>> In order to make timely releases, we will typically not hold the release
>>> unless the bug in question is a regression from the previous release. That
>>> being said, if there is a regression that has not been correctly targeted,
>>> please ping me or a committer to help target the issue.
>>
>> --
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norge
>>
>> +47 480 94 297

--
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297
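[Editorial note] The disappearing columns reported above can be explained without a Spark writer bug: Spark's JSON writer omits null-valued fields by default (the JSON data source option `ignoreNullFields`, backed by `spark.sql.jsonGenerator.ignoreNullFields`, defaults to true), so a column that is null in every row is never written at all, and schema inference on re-read can only see keys that appear in the data. Below is a minimal plain-Python sketch of that mechanism, not Spark itself; the two-column `rows` data is made up for illustration:

```python
import json

# Hypothetical data mimicking the report: column "b" is null in every row.
rows = [{"a": 1, "b": None}, {"a": 2, "b": None}]

def write_json_lines(records, ignore_null_fields=True):
    """Mimic Spark's JSON writer: with ignoreNullFields=true (the default),
    null-valued keys are simply omitted from each output record."""
    lines = []
    for rec in records:
        if ignore_null_fields:
            rec = {k: v for k, v in rec.items() if v is not None}
        lines.append(json.dumps(rec))
    return lines

def infer_columns(lines):
    """Mimic schema inference on read: the schema is the union of the
    keys actually present in the data."""
    cols = set()
    for line in lines:
        cols.update(json.loads(line))
    return sorted(cols)

print(infer_columns(write_json_lines(rows)))                            # ['a']
print(infer_columns(write_json_lines(rows, ignore_null_fields=False)))  # ['a', 'b']
```

In Spark itself the all-null columns should survive the round trip if they are written with `d3.write.option("ignoreNullFields", "false").json("d3.json")`, or if the original schema is captured and passed explicitly on re-read instead of relying on inference.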