See this ticket https://issues.apache.org/jira/browse/HADOOP-17201. It may
help your team.
From: Johnny Burns
Sent: Tuesday, June 22, 2021 3:41 PM
To: user@spark.apache.org
Cc: data-orchestration-team
Subject: Performance Problems Migrating to S3A Committers
Hi,
Doesn't a persist break stages?
On Thu, Aug 5, 2021, 11:40 AM Tom Graves wrote:
As Sean mentioned, it's only available at the Stage level, but you said you
don't want to shuffle, so splitting into stages doesn't help you. Without more
details, it seems like you could "hack" this by just requesting an executor
with 1 GPU (allowing 2 tasks per GPU) and 2 CPUs, and the one task would
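That request would look roughly like this at session creation (an untested
sketch using the Spark 3.x resource scheduling configs; the app name and
discovery script path are placeholders, and on YARN/standalone these have to
be set at launch time):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SharedGpuExecutors")  // placeholder app name
  .config("spark.executor.cores", "2")                // 2 CPUs per executor
  .config("spark.executor.resource.gpu.amount", "1")  // 1 GPU per executor
  .config("spark.task.resource.gpu.amount", "0.5")    // 2 tasks share each GPU
  .config("spark.executor.resource.gpu.discoveryScript",
    "/path/to/getGpusResources.sh")                   // placeholder path
  .getOrCreate()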
I am not sure why you need to create an RDD first. You can create a
DataFrame directly from the CSV file, for instance:
spark.read.format("csv").option("header","true").schema(yourSchema).load(ftpUrl)
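If you don't have yourSchema defined yet, a minimal sketch (the field names
here are placeholders for your actual columns):

import org.apache.spark.sql.types._

// Placeholder schema: replace the fields with your real columns.
val yourSchema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("name", StringType, nullable = true)
))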
-- ND
On 8/5/21 3:14 AM, igyu wrote:
val ftpUrl ="ftp://test:test@ip:21/upload/test/_temporary/
Hi,
we are trying to migrate some of the data lake pipelines to run on Spark
3.x, whereas the dependent pipelines using those tables will still be
running on Spark 2.4.x for some time to come.
Does anyone know of any issues that can happen:
1. when reading Parquet files written by Spark 3.1.x in Spark 2.4.x?
Maybe this link will help you.
https://stackoverflow.com/questions/41898144/convert-rddstring-to-rddrow-to-dataframe-spark-scala
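The pattern there boils down to something like this (a sketch reusing the
rdd and schemas from your snippet quoted below; it assumes each record
splits into exactly as many fields as the schema declares):

import org.apache.spark.sql.Row

// Turn each comma-separated record into a Row, then apply the schema.
val rowRdd = rdd.map(_._2).map(csv => Row.fromSeq(csv.split(",").toSeq))
val df = spark.createDataFrame(rowRdd, schemas)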
On Thu, Aug 5, 2021 at 12:46 PM igyu wrote:
> val ftpUrl =
> "ftp://test:test@ip:21/upload/test/_temporary/0/_temporary/task_2019124756_0002_m_00_0/*";
> val
import org.apache.spark.sql.types._

val ftpUrl =
  "ftp://test:test@ip:21/upload/test/_temporary/0/_temporary/task_2019124756_0002_m_00_0/*"
// Read each matched file as a (path, content) pair.
val rdd = spark.sparkContext.wholeTextFiles(ftpUrl)
// Keep the content and split it on commas.
val value = rdd.map(_._2).map(csv => csv.split(",").toSeq)
val schemas = StructType(List(
  StructField("id", DataTypes.StringType, true)
))