Thanks for reporting it. Please open a JIRA with a test case.
Cheers,
Xiao
On Wed, May 27, 2020 at 1:42 PM Pasha Finkelshteyn <
pavel.finkelsht...@gmail.com> wrote:
> Hi folks,
>
> I'm implementing Kotlin bindings for Spark and faced a strange problem. In
> one corner case Spark works differently when whole-stage codegen is on or
> off.
Hi Randy,
Yes, I'm using parquet on both S3 and hdfs.
On Thu, 28 May, 2020, 2:38 am randy clinton wrote:
> Is the file Parquet on S3 or is it some other file format?
>
> In general I would assume that HDFS read/writes are more performant for
> spark jobs.
>
> For instance, consider how well partitioned your HDFS file is vs the S3
> file.
Yes, that's exactly how I am creating them.
Question... Are you using 'Stateful Structured Streaming', in which you have
something like this?
.mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(
  updateAcrossEvents
)
And updating the Accumulator inside 'updateAcrossEvents'?
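Something along these lines, as a minimal sketch; the Event/SessionState classes
and the "eventsSeen" accumulator name are placeholders, not taken from your job:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.GroupState
import org.apache.spark.util.LongAccumulator

case class Event(userId: String, count: Long)   // hypothetical input row
case class SessionState(total: Long)            // hypothetical per-group state

object AccumulatorSketch {
  val spark = SparkSession.builder.appName("acc-sketch").getOrCreate()
  // Registered on the driver; the executors only ever add to it.
  val eventsSeen: LongAccumulator = spark.sparkContext.longAccumulator("eventsSeen")

  def updateAcrossEvents(userId: String,
                         events: Iterator[Event],
                         state: GroupState[SessionState]): SessionState = {
    val batch = events.toSeq
    eventsSeen.add(batch.size)   // runs on executors; read the value back on the driver
    val updated = SessionState(state.getOption.map(_.total).getOrElse(0L) + batch.map(_.count).sum)
    state.update(updated)
    updated
  }
}

Note that the value of eventsSeen is only reliable on the driver once a
micro-batch has completed.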
Is the file Parquet on S3 or is it some other file format?
In general I would assume that HDFS read/writes are more performant for
spark jobs.
For instance, consider how well partitioned your HDFS file is vs the S3
file.
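For example, a rough sketch (paths and bucket names are made up) of comparing
the two sources side by side:

import org.apache.spark.sql.SparkSession

object ReadCompare {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("read-compare").getOrCreate()

    val fromHdfs = spark.read.parquet("hdfs:///data/events.parquet")     // hypothetical path
    val fromS3   = spark.read.parquet("s3a://my-bucket/events.parquet")  // hypothetical bucket

    // Partition counts hint at how well each source splits for parallel reads.
    println(s"HDFS partitions: ${fromHdfs.rdd.getNumPartitions}")
    println(s"S3 partitions:   ${fromS3.rdd.getNumPartitions}")
  }
}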
On Wed, May 27, 2020 at 1:51 PM Dark Crusader
wrote:
> Hi Jörn,
>
> Thanks for the reply. I will try to create an easier example to reproduce
> the issue.
Hi folks,
I'm implementing Kotlin bindings for Spark and faced a strange problem. In
one corner case Spark works differently when whole-stage codegen is on or
off.
Does this look like a bug or expected behavior?
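For reference, a minimal sketch of how the two modes can be compared by toggling
spark.sql.codegen.wholeStage; the query here is a placeholder, not the actual
failing case:

import org.apache.spark.sql.SparkSession

object CodegenToggle {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("codegen-toggle").getOrCreate()
    import spark.implicits._

    val df = Seq(1, 2, 3).toDF("x")   // placeholder data

    // Run the same query with whole-stage codegen on and off.
    spark.conf.set("spark.sql.codegen.wholeStage", "true")
    df.selectExpr("x + 1").show()

    spark.conf.set("spark.sql.codegen.wholeStage", "false")
    df.selectExpr("x + 1").show()
  }
}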
--
Regards,
Pasha
Big Data Tools @ JetBrains
Hi Jörn,
Thanks for the reply. I will try to create an easier example to reproduce
the issue.
I will also try your suggestion to look into the UI. Can you guide me on what
I should be looking for?
I was already using the s3a protocol to compare the times.
My hunch is that multiple reads from S3 are
Have you looked in the Spark UI to see why this is the case?
Reading from S3 can take more time; it also depends on which S3 URL scheme you
are using: s3a vs s3n vs s3.
It could help, after some calculation, to persist in-memory or on HDFS. You can
also initially load from S3, store it on HDFS, and work from there.
HD
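A minimal sketch of that last suggestion (bucket and paths are hypothetical):
load once from S3, store on HDFS, and do the repeated work against the HDFS copy.

import org.apache.spark.sql.SparkSession

object StageToHdfs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("stage-to-hdfs").getOrCreate()

    // One-time copy: read from S3 ...
    val raw = spark.read.parquet("s3a://my-bucket/input/")          // hypothetical bucket
    // ... and write to HDFS so repeated passes hit the local cluster.
    raw.write.mode("overwrite").parquet("hdfs:///staging/input/")   // hypothetical path

    // Subsequent work reads from HDFS; cache it if it is reused across actions.
    val df = spark.read.parquet("hdfs:///staging/input/").cache()
    println(df.count())
  }
}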
Hi all,
I am reading data from HDFS in the form of Parquet files (around 3 GB) and
running an algorithm from the Spark ML library.
If I create the same Spark dataframe by reading the data from S3, the same
algorithm takes considerably more time.
I don't understand why this is happening. Is this a ch
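A minimal sketch of the kind of job described above; the paths, the feature
column, and the choice of KMeans are assumptions, and only the input path
changes between the two runs:

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.sql.SparkSession

object MlOnParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("ml-on-parquet").getOrCreate()

    // Swap the path between HDFS and S3 to compare end-to-end timings.
    val df = spark.read.parquet("hdfs:///data/features.parquet")        // hypothetical path
    // val df = spark.read.parquet("s3a://my-bucket/features.parquet")  // hypothetical bucket

    val model = new KMeans().setK(10).setFeaturesCol("features").fit(df)
    println(model.summary.trainingCost)
  }
}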
Yes, I am talking about application-specific Accumulators. Actually I am
getting the values printed in my driver log as well as sent to Grafana. Not
sure where and when I saw 0 before. My deploy mode is “client” on a YARN
cluster (not local Mac) where I submit from the master node. It should work the
sa
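For reference, a minimal sketch (the accumulator name and the job itself are
made up) of registering a named accumulator and reading it back on the driver
once the action has finished:

import org.apache.spark.sql.SparkSession

object AccReadback {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("acc-readback").getOrCreate()
    val processed = spark.sparkContext.longAccumulator("processedRecords")

    // Updates happen on the executors during the action.
    spark.sparkContext.parallelize(1 to 1000).foreach(_ => processed.add(1))

    // The value is only reliable on the driver after the action completes;
    // reading it too early is a common reason to see 0.
    println(s"processedRecords = ${processed.value}")
  }
}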
No firm dates; it always depends on RC voting. Another RC is coming soon.
It is, however, looking pretty close to done.
On Wed, May 27, 2020 at 3:54 AM ARNAV NEGI SOFTWARE ARCHITECT <
negi.ar...@gmail.com> wrote:
> Hi,
>
> I am working with the Spark 3.0 preview release for large Spark jobs on
> Kubernetes, and the preview looks promising.
I have no idea.
I built a Docker image that you can find on Docker Hub, and you can do some
experiments with it by composing a cluster.
https://hub.docker.com/r/gaetanofabiano/spark
Let me know if you have news about the release.
Regards
Sent from iPhone
> On 27 May 2020, at 1
Hi,
I am working with the Spark 3.0 preview release for large Spark jobs on
Kubernetes, and the preview looks promising.
Could you let us know when Spark 3.0 GA is expected? Definitive dates will
help us plan our roadmap around Spark 3.0.
Arnav Negi / Technical Architect | Web Technology Enthusiast
negi.ar..
Hi Team,
We are using Spark on Kubernetes, through spark-on-k8s-operator. Our
application deals with multiple updateStateByKey operations. Upon
investigation, we found that the Spark application consumes a higher volume of
memory. As spark-on-k8s-operator doesn't give the option to segregate spa
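For context, a minimal sketch of an updateStateByKey pipeline (the socket
source, counting logic, and checkpoint path are assumptions); every such
operation keeps its state in executor memory and requires checkpointing, which
is where the memory pressure tends to come from.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulCounts {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("stateful-counts")
    val ssc  = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("hdfs:///checkpoints/stateful-counts")   // hypothetical path

    // State (a running count per word) lives in executor memory across batches.
    val updateCount = (values: Seq[Int], state: Option[Long]) =>
      Some(state.getOrElse(0L) + values.sum)

    ssc.socketTextStream("localhost", 9999)                 // hypothetical source
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .updateStateByKey(updateCount)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}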