On Fri, Jan 21, 2022 at 6:48 PM Sean Owen <sro...@gmail.com> wrote:

> Continue on the ticket - I am not sure this is established. We would block
> a release for critical problems that are not regressions. This is not a
> data loss / 'deleting data' issue even if valid.
> You're welcome to provide feedback, but votes are for the PMC.
>
To be clear, users and developers are more than welcome to vote, but only
PMC votes are binding.

>
> On Fri, Jan 21, 2022 at 5:24 PM Bjørn Jørgensen <bjornjorgen...@gmail.com>
> wrote:
>
>> OK, but deleting users' data without their knowledge is never a good
>> idea. That's why I give this RC a -1.
>>
>> On Sat, Jan 22, 2022 at 12:16 AM Sean Owen <sro...@gmail.com> wrote:
>>
>>> (Bjorn - unless this is a regression, it would not block a release, even
>>> if it's a bug)
>>>
>>> On Fri, Jan 21, 2022 at 5:09 PM Bjørn Jørgensen <
>>> bjornjorgen...@gmail.com> wrote:
>>>
>>>> [x] -1 Do not release this package because it deletes all of my columns
>>>> that contain only null.
>>>>
>>>> I have opened https://issues.apache.org/jira/browse/SPARK-37981 for
>>>> this bug.
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Jan 21, 2022 at 9:45 PM Sean Owen <sro...@gmail.com> wrote:
>>>>
>>>>> (Are you suggesting this is a regression, or is it a general question?
>>>>> Here we're trying to figure out whether there are critical bugs introduced
>>>>> in 3.2.1 vs 3.2.0.)
>>>>>
>>>>> On Fri, Jan 21, 2022 at 1:58 PM Bjørn Jørgensen <
>>>>> bjornjorgen...@gmail.com> wrote:
>>>>>
>>>>>> Hi, I am wondering whether this is a bug or not.
>>>>>>
>>>>>> I have a lot of JSON files in which some columns contain only null.
>>>>>>
>>>>>> I start Spark with:
>>>>>>
>>>>>> from pyspark import pandas as ps
>>>>>> import re
>>>>>> import numpy as np
>>>>>> import os
>>>>>> import pandas as pd
>>>>>>
>>>>>> from pyspark import SparkContext, SparkConf
>>>>>> from pyspark.sql import SparkSession
>>>>>> from pyspark.sql.functions import concat, concat_ws, lit, col, trim, expr
>>>>>> from pyspark.sql.types import StructType, StructField, StringType, IntegerType
>>>>>>
>>>>>> os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
>>>>>>
>>>>>> def get_spark_session(app_name: str, conf: SparkConf):
>>>>>>     conf.setMaster('local[*]')
>>>>>>     conf \
>>>>>>       .set('spark.driver.memory', '64g') \
>>>>>>       .set("fs.s3a.access.key", "minio") \
>>>>>>       .set("fs.s3a.secret.key", "") \
>>>>>>       .set("fs.s3a.endpoint", "http://192.168.1.127:9000") \
>>>>>>       .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
>>>>>>       .set("spark.hadoop.fs.s3a.path.style.access", "true") \
>>>>>>       .set("spark.sql.repl.eagerEval.enabled", "True") \
>>>>>>       .set("spark.sql.adaptive.enabled", "True") \
>>>>>>       .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
>>>>>>       .set("spark.sql.repl.eagerEval.maxNumRows", "10000") \
>>>>>>       .set("sc.setLogLevel", "error")
>>>>>>
>>>>>>     return SparkSession.builder.appName(app_name).config(conf=conf).getOrCreate()
>>>>>>
>>>>>> spark = get_spark_session("Falk", SparkConf())
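>>>>>>
>>>>>> Note: "sc.setLogLevel" is not a Spark configuration key, so that last
>>>>>> .set(...) call is a no-op. The log level is set on the running context
>>>>>> instead:
>>>>>>
>>>>>> spark.sparkContext.setLogLevel("ERROR")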
>>>>>>
>>>>>> d3 = spark.read.option("multiline", "true").json("/home/jovyan/notebooks/falk/data/norm_test/3/*.json")
>>>>>>
>>>>>> import pyspark
>>>>>>
>>>>>> # Monkey-patch a pandas-style .shape onto DataFrame: (row count, column count).
>>>>>> def sparkShape(dataFrame):
>>>>>>     return (dataFrame.count(), len(dataFrame.columns))
>>>>>> pyspark.sql.dataframe.DataFrame.shape = sparkShape
>>>>>> print(d3.shape())
>>>>>>
>>>>>>
>>>>>> (653610, 267)
>>>>>>
>>>>>>
>>>>>> d3.write.json("d3.json")
>>>>>>
>>>>>>
>>>>>> d3 = spark.read.json("d3.json/*.json")
>>>>>>
>>>>>> import pyspark
>>>>>>
>>>>>> # Same .shape monkey-patch, applied again after the write/read round trip.
>>>>>> def sparkShape(dataFrame):
>>>>>>     return (dataFrame.count(), len(dataFrame.columns))
>>>>>> pyspark.sql.dataframe.DataFrame.shape = sparkShape
>>>>>> print(d3.shape())
>>>>>>
>>>>>> (653610, 186)
>>>>>>
>>>>>>
>>>>>> So Spark dropped 81 columns. I believe all 81 of the dropped columns
>>>>>> contained only null.
>>>>>>
>>>>>> Is this a bug, or is it intentional?
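>>>>>>
>>>>>> A quick way to check that claim, as a sketch (F.count() counts only
>>>>>> non-null values, so a count of 0 means the column is entirely null):
>>>>>>
>>>>>> from pyspark.sql import functions as F
>>>>>>
>>>>>> # Count non-null values per column; 0 means the column holds only null.
>>>>>> null_counts = d3.select([F.count(d3[c]).alias(c) for c in d3.columns]).first()
>>>>>> all_null_cols = [c for c in d3.columns if null_counts[c] == 0]
>>>>>> print(len(all_null_cols))  # 81, if the suspicion above is right
>>>>>>
>>>>>> And if the writer is the culprit, Spark's JSON generator omits null fields
>>>>>> on write by default (spark.sql.jsonGenerator.ignoreNullFields defaults to
>>>>>> true), so the columns should survive the round trip with that disabled:
>>>>>>
>>>>>> # Keep explicit nulls in the output JSON so all-null columns round-trip.
>>>>>> d3.write.option("ignoreNullFields", "false").json("d3_keep_nulls.json")
>>>>>> d3_back = spark.read.json("d3_keep_nulls.json/*.json")
>>>>>> print((d3_back.count(), len(d3_back.columns)))  # expected: (653610, 267)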
>>>>>>
>>>>>>
>>>>>> On Fri, Jan 21, 2022 at 4:59 AM huaxin gao <huaxin.ga...@gmail.com> wrote:
>>>>>>
>>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>>>> version 3.2.1. The vote is open until 8:00pm Pacific time January 25 and
>>>>>>> passes if a majority of +1 PMC votes are cast, with a minimum of 3 +1
>>>>>>> votes.
>>>>>>>
>>>>>>> [ ] +1 Release this package as Apache Spark 3.2.1
>>>>>>> [ ] -1 Do not release this package because ...
>>>>>>>
>>>>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>>>>
>>>>>>> The tag to be voted on is v3.2.1-rc2
>>>>>>> (commit 4f25b3f71238a00508a356591553f2dfa89f8290):
>>>>>>> https://github.com/apache/spark/tree/v3.2.1-rc2
>>>>>>>
>>>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc2-bin/
>>>>>>>
>>>>>>> Signatures used for Spark RCs can be found in this file:
>>>>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>>>
>>>>>>> The staging repository for this release can be found at:
>>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1398/
>>>>>>>
>>>>>>> The documentation corresponding to this release can be found at:
>>>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc2-docs/_site/
>>>>>>>
>>>>>>> The list of bug fixes going into 3.2.1 can be found at the following URL:
>>>>>>> https://s.apache.org/yu0cy
>>>>>>>
>>>>>>> This release is using the release script of the tag v3.2.1-rc2.
>>>>>>>
>>>>>>> FAQ
>>>>>>>
>>>>>>> =========================
>>>>>>> How can I help test this release?
>>>>>>> =========================
>>>>>>>
>>>>>>> If you are a Spark user, you can help us test this release by taking an
>>>>>>> existing Spark workload, running it on this release candidate, and then
>>>>>>> reporting any regressions. If you're working in PySpark, you can set up a
>>>>>>> virtual env, install the current RC, and see if anything important breaks;
>>>>>>> in Java/Scala, you can add the staging repository to your project's
>>>>>>> resolvers and test with the RC (make sure to clean up the artifact cache
>>>>>>> before/after so you don't end up building with an out-of-date RC going
>>>>>>> forward).
>>>>>>>
>>>>>>> ===========================================
>>>>>>> What should happen to JIRA tickets still targeting 3.2.1?
>>>>>>> ===========================================
>>>>>>>
>>>>>>> The current list of open tickets targeted at 3.2.1 can be found at
>>>>>>> https://issues.apache.org/jira/projects/SPARK by searching for
>>>>>>> "Target Version/s" = 3.2.1. Committers should look at those and triage.
>>>>>>> Extremely important bug fixes, documentation, and API tweaks that impact
>>>>>>> compatibility should be worked on immediately. Everything else, please
>>>>>>> retarget to an appropriate release.
>>>>>>>
>>>>>>> ==================
>>>>>>> But my bug isn't fixed?
>>>>>>> ==================
>>>>>>>
>>>>>>> In order to make timely releases, we will typically not hold the release
>>>>>>> unless the bug in question is a regression from the previous release.
>>>>>>> That being said, if there is something which is a regression that has not
>>>>>>> been correctly targeted, please ping me or a committer to help target the
>>>>>>> issue.
>>>>>>>
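>>>>>>> A minimal PySpark smoke test along these lines, as a sketch (it assumes
>>>>>>> the pyspark tarball from the -bin/ URL above has been pip-installed into
>>>>>>> a fresh virtual env):
>>>>>>>
>>>>>>> from pyspark.sql import SparkSession
>>>>>>>
>>>>>>> # Start a local session against whatever pyspark is on the Python path.
>>>>>>> spark = SparkSession.builder.master("local[*]").appName("rc-smoke").getOrCreate()
>>>>>>> assert spark.version == "3.2.1", spark.version  # fails if an older install leaked in
>>>>>>> spark.range(10).selectExpr("sum(id)").show()    # trivial workload sanity check
>>>>>>> spark.stop()
>>>>>>>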
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Bjørn Jørgensen
>>>>>> Vestre Aspehaug 4, 6010 Ålesund
>>>>>> Norge
>>>>>>
>>>>>> +47 480 94 297
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Bjørn Jørgensen
>>>> Vestre Aspehaug 4, 6010 Ålesund
>>>> Norge
>>>>
>>>> +47 480 94 297
>>>>
>>>
>>
>> --
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norge
>>
>> +47 480 94 297
>>
>
--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
