Hi, I am wondering if it's a bug or not.

I do have a lot of json files, where they have some columns that are all
"null" on.

I start spark with

from pyspark import pandas as ps
import re
import numpy as np
import os
import pandas as pd

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, concat_ws, lit, col, trim, expr
from pyspark.sql.types import StructType, StructField,
StringType,IntegerType

os.environ["PYARROW_IGNORE_TIMEZONE"]="1"

def get_spark_session(app_name: str, conf: SparkConf):
    conf.setMaster('local[*]')
    conf \
      .set('spark.driver.memory', '64g')\
      .set("fs.s3a.access.key", "minio") \
      .set("fs.s3a.secret.key", "") \
      .set("fs.s3a.endpoint", "http://192.168.1.127:9000";) \
      .set("spark.hadoop.fs.s3a.impl",
"org.apache.hadoop.fs.s3a.S3AFileSystem") \
      .set("spark.hadoop.fs.s3a.path.style.access", "true") \
      .set("spark.sql.repl.eagerEval.enabled", "True") \
      .set("spark.sql.adaptive.enabled", "True") \
      .set("spark.serializer",
"org.apache.spark.serializer.KryoSerializer") \
      .set("spark.sql.repl.eagerEval.maxNumRows", "10000") \
      .set("sc.setLogLevel", "error")

    return
SparkSession.builder.appName(app_name).config(conf=conf).getOrCreate()

spark = get_spark_session("Falk", SparkConf())

d3 =
spark.read.option("multiline","true").json("/home/jovyan/notebooks/falk/data/norm_test/3/*.json")

import pyspark
def sparkShape(dataFrame):
    return (dataFrame.count(), len(dataFrame.columns))
pyspark.sql.dataframe.DataFrame.shape = sparkShape
print(d3.shape())


(653610, 267)


d3.write.json("d3.json")


d3 = spark.read.json("d3.json/*.json")

import pyspark
def sparkShape(dataFrame):
    return (dataFrame.count(), len(dataFrame.columns))
pyspark.sql.dataframe.DataFrame.shape = sparkShape
print(d3.shape())

(653610, 186)


So spark is deleting 81 columns. I think that all of these 81 deleted
columns have only Null in them.

Is this a bug or has this been made on purpose?


fre. 21. jan. 2022 kl. 04:59 skrev huaxin gao <huaxin.ga...@gmail.com>:

> Please vote on releasing the following candidate as Apache Spark version
> 3.2.1. The vote is open until 8:00pm Pacific time January 25 and passes if
> a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1
> Release this package as Apache Spark 3.2.1[ ] -1 Do not release this
> package because ... To learn more about Apache Spark, please see
> http://spark.apache.org/ The tag to be voted on is v3.2.1-rc2 (commit
> 4f25b3f71238a00508a356591553f2dfa89f8290):
> https://github.com/apache/spark/tree/v3.2.1-rc2
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc2-bin/
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS The staging repository
> for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1398/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc2-docs/_site/
> The list of bug fixes going into 3.2.1 can be found at the following URL:
> https://s.apache.org/yu0cy
>
> This release is using the release script of the tag v3.2.1-rc2. FAQ
> ========================= How can I help test this release?
> ========================= If you are a Spark user, you can help us test
> this release by taking an existing Spark workload and running on this
> release candidate, then reporting any regressions. If you're working in
> PySpark you can set up a virtual env and install the current RC and see if
> anything important breaks, in the Java/Scala you can add the staging
> repository to your projects resolvers and test with the RC (make sure to
> clean up the artifact cache before/after so you don't end up building with
> a out of date RC going forward).
> =========================================== What should happen to JIRA
> tickets still targeting 3.2.1? ===========================================
> The current list of open tickets targeted at 3.2.1 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.2.1 Committers should look at those and triage. Extremely
> important bug fixes, documentation, and API tweaks that impact
> compatibility should be worked on immediately. Everything else please
> retarget to an appropriate release. ================== But my bug isn't
> fixed? ================== In order to make timely releases, we will
> typically not hold the release unless the bug in question is a regression
> from the previous release. That being said, if there is something which is
> a regression that has not been correctly targeted please ping me or a
> committer to help target the issue.
>


-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297

Reply via email to