Thanks a ton for the help!
Is there a standardized way of converting an InternalRow to Rows?
I’ve tried this, but I’m getting an exception:
val encoder = RowEncoder(df.schema)
val rows = readFile(pFile).flatMap {
  case r: InternalRow   => Seq(r)
  case b: ColumnarBatch => b.rowIterator().asScala.toSeq // needs scala.collection.JavaConverters._
}
You're getting InternalRow instances. They probably have the data you want,
but the toString representation doesn't match the data for InternalRow.
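If you need external Rows (for example, to print or collect), one way to get them, assuming Spark 2.x where ExpressionEncoder.fromRow is available, is roughly:

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.StructType

// Bind the encoder to the schema once, then deserialize each InternalRow
// back to an external Row (Spark 2.x API).
def toExternalRows(schema: StructType, internal: Iterator[InternalRow]): Iterator[Row] = {
  val encoder = RowEncoder(schema).resolveAndBind()
  internal.map(encoder.fromRow)
}

In Spark 3.x the equivalent call is encoder.createDeserializer().apply(internalRow).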
On Thu, Mar 21, 2019 at 3:28 PM Long, Andrew wrote:
> Hello Friends,
>
> I’m working on a performance improvement that reads additional parquet
Hello Friends,
I’m working on a performance improvement that reads additional parquet files in
the middle of a lambda and I’m running into some issues. This is what I’d like to do:
ds.mapPartitions { x =>
  // read a parquet file and perform an operation with x
}
Here’s my current POC code but
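For reference, a minimal sketch of one way to do the read inside mapPartitions using the parquet-mr API directly, so no SparkSession is needed on the executors; the file path, the "key" column, and the final filter are placeholders rather than the original POC:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import org.apache.parquet.io.ColumnIOFactory

// Assumes an implicit Encoder for ds's element type is in scope (e.g. spark.implicits._).
ds.mapPartitions { iter =>
  val input  = HadoopInputFile.fromPath(new Path("/path/to/side.parquet"), new Configuration())
  val reader = ParquetFileReader.open(input)
  val schema = reader.getFooter.getFileMetaData.getSchema
  val keys   = scala.collection.mutable.ArrayBuffer.empty[String]
  var rowGroup = reader.readNextRowGroup()
  while (rowGroup != null) {
    val recordReader = new ColumnIOFactory()
      .getColumnIO(schema)
      .getRecordReader(rowGroup, new GroupRecordConverter(schema))
    (0L until rowGroup.getRowCount).foreach { _ =>
      keys += recordReader.read().getString("key", 0) // "key" is a placeholder column name
    }
    rowGroup = reader.readNextRowGroup()
  }
  reader.close()
  iter.filter(x => keys.contains(x.toString)) // placeholder for the real "operation with x"
}

If the side file is small, an alternative is to read it once on the driver and broadcast the rows, which avoids re-opening the file in every partition.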
Hello,
sparkMeasure is a great tool that is indeed helpful for me but
unfortunately, it doesn't measure the network communication time/cost.
It is stated as a limitation on the GitHub page:
- The currently available Spark task metrics can give you precious
quantitative information on res
I tweaked some Apache settings (MaxClients increased to fix an error I
found buried in the logs, and added 'retry' and 'acquire' to the reverse
proxy settings to hopefully combat the dreaded 502 response), restarted
httpd, and things actually seem quite snappy right now!
I'm not holding my breath,
Hello,
I need to cross my data, and I'm executing a cross join on two DataFrames.
C = A.crossJoin(B)
A has 50 records
B has 5 records
The result I'm getting with Spark 2.0 is a DataFrame C with 50 records instead
of the expected 50 × 5 = 250. Only the first row from B was added to C.
Is that a bug in Spark?
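A quick check with synthetic data of the expected Cartesian-product behavior (assuming Spark 2.1+, where DataFrame.crossJoin is available):

val A = spark.range(50).toDF("a")
val B = spark.range(5).toDF("b")
val C = A.crossJoin(B)
println(C.count()) // expected: 250, one output row per (A row, B row) pair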
Asma ZGOLLI
PhD st
While I agree with you that it would be ideal to have task-level resources
and do a deeper redesign of the scheduler, I think that can be a separate
enhancement, as was discussed earlier in the thread. That feature is useful
without GPUs. I do realize that they overlap somewhat, but I think
I understand the application-level, static, global nature
of spark.task.accelerator.gpu.count and its similarity to the
existing spark.task.cpus, but to me this feels like extending a weakness of
Spark's scheduler, not building on its strengths. That is because I
consider binding the number of core
How about using this: https://github.com/LucaCanali/sparkMeasure
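For example, a rough sketch of collecting stage-level metrics with it (the package coordinates and version below are an assumption; adjust them to your Spark and Scala versions):

// spark-shell --packages ch.cern.sparkmeasure:spark-measure_2.11:0.13
val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark)
stageMetrics.runAndMeasure {
  spark.sql("select count(*) from range(1000) cross join range(1000)").show()
}
stageMetrics.printReport() // reports shuffle read/write bytes, fetch wait time, etc.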
Sent from my iPhone
On Mar 21, 2019, at 7:46 AM, asma zgolli <zgollia...@gmail.com> wrote:
Hello,
Is there a way to get network, server, and distribution statistics from
Spark?
I'm looking for that inform
Hello,
Is there a way to get network, server, and distribution statistics from
Spark?
I'm looking for that information in order to work on network communication
performance.
Thank you very much for your help.
Kind regards,
Asma ZGOLLI
PhD student in data engineering - computer scie
The proposal here is that all your resources are static and the GPU-per-task
config is global per application, meaning you ask for a certain amount of memory,
CPU, and GPUs for every executor up front, just like you do today, and every executor
you get is that size. This means that both static and dyna
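Concretely, under that proposal every executor is requested with the same fixed shape, something like (the GPU setting name is the one proposed in this thread, not an existing Spark config):

import org.apache.spark.SparkConf

// Every executor is requested with the same fixed shape; tasks then slot into it.
val conf = new SparkConf()
  .set("spark.executor.memory", "16g")
  .set("spark.executor.cores", "4")
  .set("spark.task.cpus", "1")
  .set("spark.task.accelerator.gpu.count", "1") // proposed application-wide setting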
Thanks for the quick feedback, Maciej and Shawn!
Maciej:
The concern about confusing users with supporting multiple datetime
patterns is a valid one. The cleanest way to introduce SQL:2016 patterns
would be to drop the existing pattern support (SimpleDateFormat in the case of
Impala) and replace it w
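To illustrate the potential for confusion: the same field is spelled differently in the two pattern languages, e.g. minutes. A small comparison (the SQL:2016 tokens shown are the usual Oracle-style elements):

import java.text.SimpleDateFormat
import java.util.Date

// SimpleDateFormat (current behavior): lowercase 'mm' means minutes, uppercase 'MM' means month.
println(new SimpleDateFormat("yyyy-MM-dd HH:mm").format(new Date()))
// A SQL:2016-style template for the same value would be written "YYYY-MM-DD HH24:MI",
// so the same user-supplied pattern string can mean different things under the two schemes.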
Thanks for this SPIP.
I cannot comment on the docs, but just wanted to highlight one thing. On
page 5 of the SPIP, where dynamic resource allocation (DRA) is discussed, I see:
"For instance, if each executor consists 4 CPUs and 2 GPUs, and each task
requires 1 CPU and 1GPU, then we shall throw an error on application start
bec
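Spelling out the arithmetic behind that example (a quick sketch of the numbers, not code from the SPIP):

// Worked numbers for the quoted example.
val executorCpus = 4
val executorGpus = 2
val taskCpus     = 1
val taskGpus     = 1
val tasksByCpu   = executorCpus / taskCpus                 // 4 tasks would fit by CPU alone
val tasksByGpu   = executorGpus / taskGpus                 // but only 2 fit by GPU
val tasksPerExec = math.min(tasksByCpu, tasksByGpu)        // 2
val idleCpus     = executorCpus - tasksPerExec * taskCpus  // 2 CPUs per executor would sit idle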