Could someone please list the steps for how pipeline automation works for
PySpark-based pipelines in production?
//William
On Fri, Oct 23, 2020 at 11:24 AM Wim Van Leuven <
wim.vanleu...@highestpoint.biz> wrote:
> I think Sean is right, but in your argumentation you mention that
https://sparkbyexamples.com/spark/spark-batch-processing-produce-consume-kafka-topic/
Regards,
William R
Hi all, I've got a problem that really has me stumped. I'm running a
Structured Streaming query that reads from Kafka, performs some
transformations and stateful aggregations (using flatMapGroupsWithState),
and outputs any updated aggregates to another Kafka topic.
I'm running this job using Spark
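For context, a minimal sketch of the shape of such a job; the event and aggregate classes, broker addresses, topic names and checkpoint path below are made-up placeholders, not taken from the original report:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

// Hypothetical event and aggregate types, for illustration only.
case class Event(key: String, value: Long)
case class Agg(key: String, total: Long)

val spark = SparkSession.builder().appName("stateful-agg-sketch").getOrCreate()
import spark.implicits._

// Read events from Kafka and parse them into the Event type.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // placeholder broker
  .option("subscribe", "input-topic")                 // placeholder topic
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
  .as[(String, String)]
  .map { case (k, v) => Event(k, v.toLong) }

// Merge each micro-batch into per-key state and emit only the updated aggregates.
def update(key: String, batch: Iterator[Event], state: GroupState[Agg]): Iterator[Agg] = {
  val previous = state.getOption.getOrElse(Agg(key, 0L))
  val updated  = previous.copy(total = previous.total + batch.map(_.value).sum)
  state.update(updated)
  Iterator(updated)
}

val updates = events
  .groupByKey(_.key)
  .flatMapGroupsWithState(OutputMode.Update(), GroupStateTimeout.NoTimeout())(update)

// Write the updated aggregates back out to another Kafka topic.
val query = updates
  .select($"key", $"total".cast("string").as("value"))
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "output-topic")                    // placeholder output topic
  .option("checkpointLocation", "/tmp/checkpoints/stateful-agg")
  .outputMode("update")
  .start()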
Dennis, do you know what's taking the additional time? Is it the Spark job,
or Oozie waiting for allocation from YARN? Do you have a resource contention
issue in YARN?
On Fri, Jul 19, 2019 at 12:24 AM Bartek Dobija
wrote:
> Hi Dennis,
>
> Oozie jobs shouldn't take that long in a well configured cl
Hi Xiao,
Just reported this as JIRA SPARK-28103.
https://issues.apache.org/jira/browse/SPARK-28103
Thanks and Regards,
William
On Wed, 19 Jun 2019 at 1:35 AM, Xiao Li wrote:
> Hi, William,
>
> Thanks for reporting it. Could you open a JIRA?
>
> Cheers,
>
> Xiao
>
>
BTW, I noticed that a workaround is to create a custom rule that removes the
'empty local relation' from a union. However, I am not 100% sure it is the
right approach.
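For reference, a rough sketch of what such a rule could look like. This relies on Catalyst internals, which are not a stable API, and collapsing a Union down to a single child may additionally need a Project on top to keep the Union's output attribute ids stable; `spark` is assumed to be an active SparkSession:

import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan, Union}
import org.apache.spark.sql.catalyst.rules.Rule

// Drops Union children that are empty LocalRelations.
object PruneEmptyUnionChildren extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case u: Union =>
      val nonEmpty = u.children.filterNot {
        case lr: LocalRelation => lr.data.isEmpty
        case _                 => false
      }
      if (nonEmpty.isEmpty) u                       // everything was empty; leave the plan alone
      else if (nonEmpty.size == 1) nonEmpty.head    // may need a Project to preserve attribute ids
      else u.withNewChildren(nonEmpty)
  }
}

// Registered as an extra optimizer rule on the session:
spark.experimental.extraOptimizations ++= Seq(PruneEmptyUnionChildren)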
On Tue, Jun 18, 2019 at 11:53 PM William Wong wrote:
> Dear all,
>
> I am not sure if it is something expected o
b,c,d)) && isnotnull(id#0))
: +- Relation[id#0,val#1] parquet
+- Filter ((id#4 IN (a,b,c,d) && ((isnotnull(id#4) && (((id#4 = a) || (id#4
= b)) || (id#4 = c))) || NOT id#4 IN (a,b,c))) && isnotnull(id#4))
+- Relation[id#4,val#5] parquet
scala> spark.sq
al#5] Batched: true,
Format: Parquet, Location:
InMemoryFileIndex[file:/Users/williamwong/spark-warehouse/table2],
PartitionFilters: [], *PushedFilters: [EqualTo(id,a), IsNotNull(id)],*
ReadSchema: struct
scala>
Thanks and regards,
William
On Sat, Jun 15, 2019 at 12:13 AM William Wong wrote:
- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string,
true]))
+- *(3) Project [id#23, val#24]
+- *(3) Filter isnotnull(id#23)
+- *(3) FileScan parquet default.table2[id#23,val#24] Batched:
true, Format: Parquet, Location:
InMemoryFileIndex[file:/Users/williamwong/spark-warehouse/table2],
PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema:
struct
Appreciate if anyone has an idea on it. Many thanks.
Best regards,
William
I've been looking at the source code of the PySpark date_add function
(https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/functions.html#date_add)
and I'm wondering why the days input variable is not cast to a java column
like the start variable. This effectively means that whe
I noticed that Spark 2.4.0 implemented support for reading only committed
messages in Kafka, and was excited. Are there currently any plans to update
the Kafka output sink to support exactly-once delivery?
Thanks,
Will
r in
> number executors are preferred over larger and fewer executors.
> Changing GC algorithm
>
> http://orastack.com/spark-scaling-to-large-datasets.html
>
>
> Here are a few tips
>
>
>
>
> On Wed, Jan 9, 2019 at 1:55 PM Dillon Dukek
> wrote:
>
>> Hi
Hi there,
We've encountered Spark executor Java OOM issues for our Spark application.
Any tips on how to troubleshoot to identify what objects are occupying the
heap? In the past, dealing with JVM OOM, we've worked with analyzing heap
dumps, but we are having a hard time with locating Spark heap d
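One thing that usually helps: have the executor JVMs write a heap dump when they hit an OutOfMemoryError, so the dump can be pulled off the node and opened in a tool such as Eclipse MAT. A minimal sketch; the dump paths are placeholders and must be writable on the hosts, and the same options can be passed with --conf on spark-submit instead:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
       "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/spark-executor-dumps")
  .set("spark.driver.extraJavaOptions",
       "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/spark-driver-dumps")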
What you're seeing is the expected behavior in both cases.
One way to achieve the semantics you want in both situations is to read the
Kudu table into a DataFrame, filter it in Spark SQL down to just the rows you
want to delete, and then use that DataFrame to do the deletion. There shoul
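A rough sketch of that pattern with the kudu-spark integration; the master address, table name and filter predicate are placeholders, and older kudu-spark releases use the long-form source name org.apache.kudu.spark.kudu instead of "kudu":

import org.apache.kudu.spark.kudu._
import org.apache.spark.sql.functions.col

val kuduMaster = "kudu-master:7051"            // placeholder master address
val tableName  = "impala::default.events"      // placeholder table name

val kuduContext = new KuduContext(kuduMaster, spark.sparkContext)

// Read the table, keep only the rows to remove, and project down to the key columns.
val toDelete = spark.read
  .format("kudu")
  .option("kudu.master", kuduMaster)
  .option("kudu.table", tableName)
  .load()
  .filter(col("event_date") < "2018-01-01")    // placeholder predicate
  .select("id")                                // key column(s) only

kuduContext.deleteRows(toDelete, tableName)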
I recently upgraded a Structured Streaming application from Spark 2.2.1 ->
Spark 2.3.0. This application runs in yarn-cluster mode, and it made use of
the spark.yarn.{driver|executor}.memoryOverhead properties. I noticed the
job started crashing unexpectedly, and after doing a bunch of digging, it
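For anyone hitting the same thing: Spark 2.3 replaced those properties with spark.driver.memoryOverhead and spark.executor.memoryOverhead, and deprecated the yarn-prefixed names. A minimal sketch of the new keys (in yarn-cluster mode these are normally passed with --conf on spark-submit rather than set in application code, since the containers are sized before user code runs; the values are examples):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.driver.memoryOverhead", "1g")    // was spark.yarn.driver.memoryOverhead
  .set("spark.executor.memoryOverhead", "2g")  // was spark.yarn.executor.memoryOverhead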
I am running a Structured Streaming job (Spark 2.2.0) using EMR 5.9. The
job sources data from a Kafka topic, performs a variety of filters and
transformations, and sinks data back into a different Kafka topic.
Once per day, we stop the query in order to merge the namenode edit logs
with the fsima
It seems that doing a union on DataFrames where one side contains lit(null) or
null for a String column causes a:
java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String
cannot be cast to java.lang.String
when calling getString(i) on a Row within forEachPartition.
Stack:
Caused by: java.lang.Clas
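A workaround that has helped in similar reports is to give the null literal an explicit type before the union, so both sides agree on the column's schema. A small sketch, where left and right stand in for the two DataFrames and the column name is made up:

import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.StringType

// Typed null literal instead of an untyped lit(null).
val leftFixed = left.withColumn("name", lit(null).cast(StringType))

// Align column order before the positional union.
val unioned = leftFixed.union(right.select(leftFixed.columns.map(col): _*))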
Was there an answer to this? I get this periodically when a job has died
from an error and I run another job. I have gotten around it by going to
/var/lib/hive/metastore/metastore_db and removing the *.lck files. I am
sure this is exactly the wrong thing to do, as I imagine those lock files
exist to p
Hello,
I am trying to figure out how to correctly set config options in jupyter
when I am already provided a SparkContext and a HiveContext. I need to
increase a couple of memory allocations. My program dies indicating that I
am trying to call methods on a stopped SparkContext. I thought I had
cre
Does anyone have a link handy that describes configuring Ganglia on the mac?
quot;, "6g")
.setExecutorEnv("spark.dynamicAllocation.enabled", "true")
.setExecutorEnv("spark.executor.cores","8")
.setExecutorEnv("spark.executor.memory", "6g")
Does anyone know how to make the worker or executor use more memory?
Thanks,
William.
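One likely cause: setExecutorEnv only exports OS environment variables to the executor processes; Spark configuration keys have to go through set(...) (or spark-submit --conf / spark-defaults.conf) to take effect, and spark.executor.memory must be set before the SparkContext is created. A minimal sketch of the same settings applied as configuration (the app name is a placeholder):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("my-app")                              // placeholder app name
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.executor.cores", "8")
  .set("spark.executor.memory", "6g")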
I see. Thanks!
From: Steve Loughran <ste...@hortonworks.com>
Date: Friday, October 30, 2015 at 12:03 PM
To: William Li <a-...@expedia.com>
Cc: "Zhang, Jingyu" <jingyu.zh...@news.com.au>, user <user@spark.apache.org>
Subject: Re: Sav
Thanks for your response. My secret has a forward slash (/) in it, so it didn't work...
From: "Zhang, Jingyu" <jingyu.zh...@news.com.au>
Date: Thursday, October 29, 2015 at 5:16 PM
To: William Li <a-...@expedia.com>
Cc: user <user@spark.apache.org>
Su
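A commonly suggested way around a secret key that contains a slash is to keep the credentials out of the URI entirely and set them on the Hadoop configuration instead. A sketch with the s3n-era property names (s3a uses fs.s3a.access.key / fs.s3a.secret.key); the keys, bucket and path are placeholders, and `rdd` stands in for whatever is being saved:

sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "MY_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "MY/SECRET+KEY")

// The URI no longer needs to embed id:key@bucket:
rdd.saveAsTextFile("s3n://my-bucket/output")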
the different access, then load the data from the local file.
Are there any other options that don't require saving and re-loading files?
Thanks,
William.
your help!
William.
On 10/22/15, 1:44 PM, "Sean Owen" wrote:
>Maven, in general, does some local caching to avoid htting the repo
>every time. It's possible this is why you're not seeing 1.5.1. On the
>command line you can for example add "mvn -U ..." Not sure
sday, October 22, 2015 at 10:36 AM
To: William Li <a-...@expedia.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: Maven Repository Hosting for Spark SQL 1.5.1
I can see this artifact in pub
to contact to verify that?
Thanks,
William.
);
graph.connectedComponents().vertices
From: Robin East [mailto:robin.e...@xense.co.uk]
Sent: October 5, 2015, 19:07
To: William Saar ; user@spark.apache.org
Subject: Re: Graphx hangs and crashes on EdgeRDD creation
Have you tried using Graph.partitionBy? e.g. using
PartitionStrategy.RandomVertexCut
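A minimal sketch of that suggestion, assuming `graph` is the Graph already built from the ~20M edges; the partition count is just a guess to show the two-argument overload:

import org.apache.spark.graphx.PartitionStrategy

val partitioned = graph.partitionBy(PartitionStrategy.RandomVertexCut, 400)
val components  = partitioned.connectedComponents().vertices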
Hi,
I am trying to run a GraphX job on 20 million edges with Spark 1.5.1, but the
job seems to hang for 30 minutes on a single executor when creating the graph
and eventually crashes with "IllegalArgumentException: Size exceeds
Integer.MAX_VALUE"
I suspect this is because of partitioning proble
When submitting to YARN, you can specify two different operation modes for
the driver with the "--master" parameter: yarn-client or yarn-cluster. For
more information on submitting to YARN, see this page in the Spark docs:
http://spark.apache.org/docs/latest/running-on-yarn.html
yarn-cluster mode
Could you share your pattern matching expression that is failing?
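In case it helps, a small sketch of a conversion that handles both the Java and Scala types, since Spark SQL rows often hand back java.math.BigDecimal:

import java.math.{BigDecimal => JBigDecimal}

def asDouble(value: Any): Option[Double] = value match {
  case d: JBigDecimal           => Some(d.doubleValue())   // the java.math type
  case d: scala.math.BigDecimal => Some(d.toDouble)
  case n: java.lang.Number      => Some(n.doubleValue())   // Int, Long, Double, ...
  case _                        => None
}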
On Tue, Aug 18, 2015, 3:38 PM wrote:
> Hi all,
>
> I am trying to run a Spark job, in which I receive *java.math.BigDecimal*
> objects,
> instead of the Scala equivalents, and I am trying to convert them into
> Doubles.
> If I t
Yes, I upgraded, but I would still like to set an overall stage timeout. Does
that exist?
On Fri, Jul 31, 2015 at 1:13 PM, Ted Yu wrote:
> The referenced bug has been fixed in 1.4.0, are you able to upgrade ?
>
> Cheers
>
> On Fri, Jul 31, 2015 at 10:01 AM, William Kinney &
Hi,
I had a job that got stuck on yarn due to
https://issues.apache.org/jira/browse/SPARK-6954
It never exited properly.
Is there a way to set a timeout for a stage or all stages?
There seems to be a bit of confusion here - the OP (doing the PhD) had the
thread hijacked by someone with a similar name asking a mundane question.
It would be a shame to send someone away so rudely, who may do valuable
work on Spark.
Sashidar (not Sashid!), I'm personally interested in running g
Ted,
Thanks very much for your reply. It took me almost a week but I have
finally had a chance to implement what you noted and it appears to be
working locally. However, when I launch this onto a cluster on EC2 -- this
doesn't work reliably.
To expand, I think the issue is that some of the code w
ata. It will be helpful
> if you can paste your code so that we will understand it better.
>
> Thanks
> Best Regards
>
> On Wed, Jun 24, 2015 at 2:32 PM, William Ferrell
> wrote:
>
>> Hello -
>>
>> I am using Apache Spark 1.2.1 via pyspark. Thanks to
Hello -
I am using Apache Spark 1.2.1 via pyspark. Thanks to any developers here
for the great product!
In my use case, I am running spark jobs to extract data from some raw data.
Generally this works quite well.
However, I am noticing that for certain data sets there are certain tasks
that are
ives a few
suggestions on how to increase the number of partitions.
-Will
On Mon, Jun 15, 2015 at 5:00 PM, William Briggs wrote:
> There are a lot of variables to consider. I'm not an expert on Spark, and
> my ML knowledge is rudimentary at best, but here are some questions whose
> answers m
There are a lot of variables to consider. I'm not an expert on Spark, and
my ML knowledge is rudimentary at best, but here are some questions whose
answers might help us to help you:
- What type of Spark cluster are you running (e.g., Stand-alone, Mesos,
YARN)?
- What does the HTTP UI tel
I don't know anything about your use case, so take this with a grain of
salt, but typically if you are operating at a scale that benefits from
Spark, then you likely will not want to write your output records as
individual files into HDFS. Spark has built-in support for the Hadoop
"SequenceFile" co
Hi Lee, I'm stuck with only mobile devices for correspondence right now, so
I can't get to a shell to play with this issue - this is all supposition; I
think that the lambdas are closing over the context because it's a
constructor parameter to your Runnable class, which is why inlining the
lambdas in
Hi Lee,
You should be able to create a PairRDD using the Nonce as the key, and the
AnalyticsEvent as the value. I'm very new to Spark, but here is some
uncompilable pseudo code that may or may not help:
events.map(event => (event.getNonce, event)).reduceByKey((a, b) =>
a).map(_._2)
The above cod
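A compilable version of the same idea, with a made-up case class standing in for AnalyticsEvent:

case class AnalyticsEvent(nonce: String, payload: String)

val events = sc.parallelize(Seq(
  AnalyticsEvent("n1", "first"),
  AnalyticsEvent("n1", "duplicate"),
  AnalyticsEvent("n2", "second")))

val deduped = events
  .map(e => (e.nonce, e))      // PairRDD keyed by nonce
  .reduceByKey((a, _) => a)    // arbitrarily keep one event per nonce
  .map(_._2)                   // back to plain events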
Hi Kaspar,
This is definitely doable, but in my opinion, it's important to remember
that, at its core, Spark is based around a functional programming paradigm
- you're taking input sets of data and, by applying various
transformations, you end up with a dataset that represents your "answer".
Witho
Ah, thanks for the help! That worked great.
On Wed, Jul 30, 2014 at 10:31 AM, Zongheng Yang
wrote:
> To add to this: for this many (>= 20) machines I usually use at least
> --wait 600.
>
> On Wed, Jul 30, 2014 at 9:10 AM, Nicholas Chammas
> wrote:
> > William,
> >
I am running Spark 0.9.1 and Shark 0.9.1. Sorry I didn't include that.
On Thu, Jul 31, 2014 at 9:50 AM, William Cox wrote:
> *The Shark-specific group appears to be in moderation pause, so I'm asking
> here.*
>
> I'm running Shark/Spark on EC2. I am using Shark to qu
ng that I change to allow it to write to a S3 file
system? I've tried all sorts of different queries to write to S3. This
particular one was:
> INSERT OVERWRITE DIRECTORY 's3n://id:key@shadoop/bucket' SELECT * FROM
> table;
Thanks for your help!
-William
_ssh_tar)
> File "./spark_ec2.py", line 640, in ssh_write
> raise RuntimeError("ssh_write failed with error %s" % proc.returncode)
> RuntimeError: ssh_write failed with error 255
So I log into the EC2 console and TERMINATE that specific machine, and
re-resume. Now it finally appears to be installing software on the machines.
Any ideas why certain machines refuse SSH connections, or why the master
refuses for several minutes and then allows?
Thanks.
-William
Internally, we just use the BlockId's natural hashing
and equality to do lookups and puts, so it should work fine. However, since it
is in no way public API, it may change even in maintenance releases.
On Sun, Jul 20, 2014 at 10:25 PM, william wrote:
In Spark 0.7.3, I used SparkEnv.
Thank you, Stephen
-- Original Message --
From: "Stephen Boesch"
Sent: Monday, July 21, 2014, 11:55 AM
To: "user"
Subject: Re: What does @developerApi means?
The Javadoc seems reasonably helpful:
/**
* A lower-level, unstable API intended for developers.
*
* Developer API's mi
In Spark 0.7.3, I used SparkEnv.get.blockManager.getLocal("model") and
SparkEnv.get.blockManager.put("model", buf, StorageLevel.MEMORY_ONLY, false) to
cache a model object.
When porting to Spark 1.0.1, I found that the SparkEnv.get.blockManager.getLocal and
SparkEnv.get.blockManager.put APIs chang
Hi,
We are moving toward adopting the full Spark stack. So far, we have used
Shark to do some ETL work, which is not bad but is not perfect either. We
ended up writing UDFs, UDGFs, and UDAFs that could have been avoided if we
could use Pig. Do you have any suggestions for an ETL solution in the Spark stack?
And d
Hi,
Any comments or thoughts on the implications of the newly released feature
from Hadoop 2.3 on the centralized cache? How different is it from RDDs?
Many thanks.
Cao