unsubscribe

2023-03-06 Thread William R
unsubscribe

Unsubscribe

2022-02-14 Thread William R
Unsubscribe

unsubscribe

2021-10-16 Thread William R
unsubscribe

Re: Scala vs Python for ETL with Spark

2020-10-23 Thread William R
one please list down the steps how the pipeline automation works when it comes to Pyspark based pipelines in Production ? //William On Fri, Oct 23, 2020 at 11:24 AM Wim Van Leuven < wim.vanleu...@highestpoint.biz> wrote: > I think Sean is right, but in your argumentation you mention that >

HDP 3.1 spark Kafka dependency

2020-03-18 Thread William R
s://sparkbyexamples.com/spark/spark-batch-processing-produce-consume-kafka-topic/ Regards, William R

Structured Streaming - HDFS State Store Performance Issues

2020-01-14 Thread William Briggs
Hi all, I've got a problem that really has me stumped. I'm running a Structured Streaming query that reads from Kafka, performs some transformations and stateful aggregations (using flatMapGroupsWithState), and outputs any updated aggregates to another Kafka topic. I'm running this job using Spark
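The query described above keys on flatMapGroupsWithState, Structured Streaming's per-key stateful operator. As a minimal, pure-Python sketch of the semantics that operator provides (the event shape and the running-count update rule here are hypothetical, not taken from the original job):

```python
# Per-key stateful aggregation, in the spirit of flatMapGroupsWithState:
# each micro-batch of events updates a long-lived state map keyed by group.

def update_state(state, events):
    """Fold a batch of (key, value) events into a running total per key."""
    for key, value in events:
        state[key] = state.get(key, 0) + value
    return state

state = {}  # stands in for the streaming state store
state = update_state(state, [("a", 1), ("b", 2), ("a", 3)])  # batch 1
state = update_state(state, [("b", 1)])                      # batch 2
```

In the real job, Spark persists this state in a state store (HDFS-backed by default), which is exactly the component the thread reports performance issues with.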

Re: Spark and Oozie

2019-07-19 Thread William Shen
Dennis, do you know what’s taking the additional time? Is it the Spark Job, or oozie waiting for allocation from YARN? Do you have resource contention issue in YARN? On Fri, Jul 19, 2019 at 12:24 AM Bartek Dobija wrote: > Hi Dennis, > > Oozie jobs shouldn't take that long in a well configured cl

Re: Filter cannot be pushed via a Join

2019-06-18 Thread William Wong
Hi Xiao, Just report this with JIRA SPARK-28103. https://issues.apache.org/jira/browse/SPARK-28103 Thanks and Regards, William On Wed, 19 Jun 2019 at 1:35 AM, Xiao Li wrote: > Hi, William, > > Thanks for reporting it. Could you open a JIRA? > > Cheers, > > Xiao > >

Re: Filter cannot be pushed via a Join

2019-06-18 Thread William Wong
BTW, I noticed a workaround is creating a custom rule to remove 'empty local relation' from a union table. However, I am not 100% sure if it is the right approach. On Tue, Jun 18, 2019 at 11:53 PM William Wong wrote: > Dear all, > > I am not sure if it is something expected o

Re: Filter cannot be pushed via a Join

2019-06-18 Thread William Wong
b,c,d)) && isnotnull(id#0)) : +- Relation[id#0,val#1] parquet +- Filter ((id#4 IN (a,b,c,d) && ((isnotnull(id#4) && (((id#4 = a) || (id#4 = b)) || (id#4 = c))) || NOT id#4 IN (a,b,c))) && isnotnull(id#4)) +- Relation[id#4,val#5] parquet scala> spark.sq

Re: Filter cannot be pushed via a Join

2019-06-14 Thread William Wong
al#5] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/williamwong/spark-warehouse/table2], PartitionFilters: [], *PushedFilters: [EqualTo(id,a), IsNotNull(id)],* ReadSchema: struct scala> Thanks and regards, William On Sat, Jun 15, 2019 at 12:13 AM William Wong wrote:

Filter cannot be pushed via a Join

2019-06-14 Thread William Wong
- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true])) +- *(3) Project [id#23, val#24] +- *(3) Filter isnotnull(id#23) +- *(3) FileScan parquet default.table2[id#23,val#24] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/williamwong/spark-warehouse/table2], PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct Appreciate if anyone has an idea on it. Many thanks. Best regards, William

PysPark date_add function suggestion

2019-03-06 Thread William Creger
I've been looking at the source code of the PySpark date_add function (https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/functions.html#date_add) and I'm wondering why the days input variable is not cast to a java column like the start variable. This effectively means that whe
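To make the question concrete, here is a pure-Python sketch of what `date_add(start, days)` computes for a literal day count; the function name mirrors the Spark SQL builtin, but this is an illustration of its semantics, not the PySpark implementation:

```python
from datetime import date, timedelta

def date_add(start: date, days: int) -> date:
    # Mirrors Spark SQL's date_add for a literal offset: shift a date by
    # `days` days. Because PySpark does not wrap `days` as a Column (unlike
    # `start`), only a Python int works there -- passing a column of
    # per-row offsets fails, which is the behavior questioned above.
    return start + timedelta(days=days)

print(date_add(date(2019, 3, 6), 5))  # 2019-03-11
```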

Exactly-Once delivery with Structured Streaming and Kafka

2019-01-31 Thread William Briggs
I noticed that Spark 2.4.0 implemented support for reading only committed messages in Kafka, and was excited. Are there currently any plans to update the Kafka output sink to support exactly-once delivery? Thanks, Will

Re: Troubleshooting Spark OOM

2019-01-09 Thread William Shen
r in > number executors are preferred over larger and fewer executors. > Changing GC algorithm > > http://orastack.com/spark-scaling-to-large-datasets.html > > > Here are a few tips > > > > > On Wed, Jan 9, 2019 at 1:55 PM Dillon Dukek > wrote: > >> Hi

Troubleshooting Spark OOM

2019-01-09 Thread William Shen
Hi there, We've encountered Spark executor Java OOM issues for our Spark application. Any tips on how to troubleshoot to identify what objects are occupying the heap? In the past, dealing with JVM OOM, we've worked with analyzing heap dumps, but we are having a hard time with locating Spark heap d

Re: spark kudu issues

2018-06-20 Thread William Berkeley
What you're seeing is the expected behavior in both cases. One way to achieve the semantics you want in both situations is to read in the Kudu table to a data frame, then filter it in Spark SQL to contain just the rows you want to delete, and then use that dataframe to do the deletion. There shoul
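The suggested pattern is: read the table into a frame, filter it down to exactly the rows to delete, then hand that frame to the deletion. A pure-Python analogue, with a list of dicts standing in for the Kudu table (column names and the predicate are hypothetical):

```python
# Read -> filter -> delete-by-key, mimicking the suggested Kudu workflow.

table = [
    {"id": 1, "status": "stale"},
    {"id": 2, "status": "fresh"},
    {"id": 3, "status": "stale"},
]

# "Read" the table and filter to just the rows we want to delete.
to_delete = [row for row in table if row["status"] == "stale"]

# Apply the deletion keyed on the primary key, as a Kudu delete would be.
delete_keys = {row["id"] for row in to_delete}
table = [row for row in table if row["id"] not in delete_keys]
```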

Change in configuration settings?

2018-06-08 Thread William Briggs
I recently upgraded a Structured Streaming application from Spark 2.2.1 -> Spark 2.3.0. This application runs in yarn-cluster mode, and it made use of the spark.yarn.{driver|executor}.memoryOverhead properties. I noticed the job started crashing unexpectedly, and after doing a bunch of digging, it

Structured Streaming + Kafka - Corrupted Checkpoint Offsets / Commits

2018-01-04 Thread William Briggs
I am running a Structured Streaming job (Spark 2.2.0) using EMR 5.9. The job sources data from a Kafka topic, performs a variety of filters and transformations, and sinks data back into a different Kafka topic. Once per day, we stop the query in order to merge the namenode edit logs with the fsima

unsubscribe

2017-01-09 Thread william tellme

spark 2.0.1, union on non-null and null String dataframes causing ClassCastException UTF8String cannot be cast to java.lang.String

2016-10-06 Thread William Kinney
It seems when doing a union on a DF where one DF contains lit(null) or null for a String, causes a: java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.String when doing getString(i) on a Row within forEachPartition. Stack: Caused by: java.lang.Clas

Re: pyspark ML example not working

2016-09-29 Thread William Kupersanin
Was there an answer to this? I get this periodically when a job has died from an error and I run another job. I have gotten around it by going to /var/lib/hive/metastore/metastore_db and removing the *.lck files. I am sure this is the exact wrong thing to do as I imagine those lock files exist to p

Setting conf options in jupyter

2016-09-29 Thread William Kupersanin
Hello, I am trying to figure out how to correctly set config options in jupyter when I am already provided a SparkContext and a HiveContext. I need to increase a couple of memory allocations. My program dies indicating that I am trying to call methods on a stopped SparkContext. I thought I had cre

Monitoring Spark with Ganglia on ElCapo

2016-01-17 Thread william tellme
Does anyone have a link handy that describes configuring Ganglia on the mac? - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

Memory are not used according to setting

2015-11-04 Thread William Li
", "6g") .setExecutorEnv("spark.dynamicAllocation.enabled", "true") .setExecutorEnv("spark.executor.cores","8") .setExecutorEnv("spark.executor.memory", "6g") Any one knows how to make the worker or executor use more memory? Thanks, William.
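A likely culprit in the snippet above: `setExecutorEnv` only exports OS environment variables to executors; it does not set Spark configuration, so properties passed that way are silently ignored. These keys need to go through `SparkConf.set` (or `--conf` on spark-submit). A minimal sketch of the property names involved, with illustrative values only:

```python
# Spark *configuration properties* (not environment variables). Setting
# these via setExecutorEnv has no effect on executor sizing; they must be
# set as conf entries. Values below are illustrative, not a recommendation.
spark_conf = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.executor.cores": "8",
    "spark.executor.memory": "6g",
}

for key, value in spark_conf.items():
    print(f"--conf {key}={value}")
```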

Re: Save data to different S3

2015-10-30 Thread William Li
I see. Thanks! From: Steve Loughran mailto:ste...@hortonworks.com>> Date: Friday, October 30, 2015 at 12:03 PM To: William Li mailto:a-...@expedia.com>> Cc: "Zhang, Jingyu" mailto:jingyu.zh...@news.com.au>>, user mailto:user@spark.apache.org>> Subject: Re: Sav

Re: Save data to different S3

2015-10-30 Thread William Li
Thanks for your response. My secret has a back splash (/) so it didn't work... From: "Zhang, Jingyu" mailto:jingyu.zh...@news.com.au>> Date: Thursday, October 29, 2015 at 5:16 PM To: William Li mailto:a-...@expedia.com>> Cc: user mailto:user@spark.apache.org>> Su

Save data to different S3

2015-10-29 Thread William Li
the different access, then load the data from the local file. Is there any other options without requiring save and re-load files? Thanks, William.

Re: Maven Repository Hosting for Spark SQL 1.5.1

2015-10-22 Thread William Li
your help! William. On 10/22/15, 1:44 PM, "Sean Owen" wrote: >Maven, in general, does some local caching to avoid htting the repo >every time. It's possible this is why you're not seeing 1.5.1. On the >command line you can for example add "mvn -U ..." Not sure

Re: Maven Repository Hosting for Spark SQL 1.5.1

2015-10-22 Thread William Li
sday, October 22, 2015 at 10:36 AM To: William Li mailto:a-...@expedia.com>> Cc: "user@spark.apache.org<mailto:user@spark.apache.org>" mailto:user@spark.apache.org>> Subject: Re: Maven Repository Hosting for Spark SQL 1.5.1 I can see this artifact in pub

Maven Repository Hosting for Spark SQL 1.5.1

2015-10-22 Thread William Li
to contact to verify that? Thanks, William.

RE: Graphx hangs and crashes on EdgeRDD creation

2015-10-06 Thread William Saar
); graph.connectedComponents().vertices From: Robin East [mailto:robin.e...@xense.co.uk] Sent: 5 October 2015 19:07 To: William Saar ; user@spark.apache.org Subject: Re: Graphx hangs and crashes on EdgeRDD creation Have you tried using Graph.partitionBy? e.g. using PartitionStrategy.RandomVertexCut

Graphx hangs and crashes on EdgeRDD creation

2015-10-05 Thread William Saar
Hi, I am trying to run a GraphX job on 20 million edges with Spark 1.5.1, but the job seems to hang for 30 minutes on a single executor when creating the graph and eventually crashes with "IllegalArgumentException: Size exceeds Integer.MAX_VALUE" I suspect this is because of partitioning proble

Re: How to automatically relaunch a Driver program after crashes?

2015-08-19 Thread William Briggs
When submitting to YARN, you can specify two different operation modes for the driver with the "--master" parameter: yarn-client or yarn-cluster. For more information on submitting to YARN, see this page in the Spark docs: http://spark.apache.org/docs/latest/running-on-yarn.html yarn-cluster mode

Re: Scala: How to match a java object????

2015-08-18 Thread William Briggs
Could you share your pattern matching expression that is failing? On Tue, Aug 18, 2015, 3:38 PM wrote: > Hi all, > > I am trying to run a spark job, in which I receive *java.math.BigDecimal* > objects, > instead of the scala equivalents, and I am trying to convert them into > Doubles. > If I t

Re: Setting a stage timeout

2015-08-04 Thread William Kinney
Yes I upgraded but I would still like to set an overall stage timeout. Does that exist? On Fri, Jul 31, 2015 at 1:13 PM, Ted Yu wrote: > The referenced bug has been fixed in 1.4.0, are you able to upgrade ? > > Cheers > > On Fri, Jul 31, 2015 at 10:01 AM, William Kinney &

Setting a stage timeout

2015-07-31 Thread William Kinney
Hi, I had a job that got stuck on yarn due to https://issues.apache.org/jira/browse/SPARK-6954 It never exited properly. Is there a way to set a timeout for a stage or all stages?

Re: Research ideas using spark

2015-07-15 Thread William Temperley
There seems to be a bit of confusion here - the OP (doing the PhD) had the thread hijacked by someone with a similar name asking a mundane question. It would be a shame to send someone away so rudely, who may do valuable work on Spark. Sashidar (not Sashid!) I'm personally interested in running g

Re: How to timeout a task?

2015-07-03 Thread William Ferrell
Ted, Thanks very much for your reply. It took me almost a week but I have finally had a chance to implement what you noted and it appears to be working locally. However, when I launch this onto a cluster on EC2 -- this doesn't work reliably. To expand, I think the issue is that some of the code w

Re: Killing Long running tasks (stragglers)

2015-06-25 Thread William Ferrell
ata. It will be helpful > if you can paste your code so that we will understand it better. > > Thanks > Best Regards > > On Wed, Jun 24, 2015 at 2:32 PM, William Ferrell > wrote: > >> Hello - >> >> I am using Apache Spark 1.2.1 via pyspark. Thanks to

Killing Long running tasks (stragglers)

2015-06-24 Thread William Ferrell
Hello - I am using Apache Spark 1.2.1 via pyspark. Thanks to any developers here for the great product! In my use case, I am running spark jobs to extract data from some raw data. Generally this works quite well. However, I am noticing that for certain data sets there are certain tasks that are

Re: Does spark performance really scale out with multiple machines?

2015-06-15 Thread William Briggs
ives a few suggestions on how to increase the number of partitions. -Will On Mon, Jun 15, 2015 at 5:00 PM, William Briggs wrote: > There are a lot of variables to consider. I'm not an expert on Spark, and > my ML knowledge is rudimentary at best, but here are some questions whose > answers m

Re: Does spark performance really scale out with multiple machines?

2015-06-15 Thread William Briggs
There are a lot of variables to consider. I'm not an expert on Spark, and my ML knowledge is rudimentary at best, but here are some questions whose answers might help us to help you: - What type of Spark cluster are you running (e.g., Stand-alone, Mesos, YARN)? - What does the HTTP UI tel

Re: Can a Spark App run with spark-submit write pdf files to HDFS

2015-06-09 Thread William Briggs
I don't know anything about your use case, so take this with a grain of salt, but typically if you are operating at a scale that benefits from Spark, then you likely will not want to write your output records as individual files into HDFS. Spark has built-in support for the Hadoop "SequenceFile" co

Re: SparkContext & Threading

2015-06-06 Thread William Briggs
Hi Lee, I'm stuck with only mobile devices for correspondence right now, so I can't get to shell to play with this issue - this is all supposition; I think that the lambdas are closing over the context because it's a constructor parameter to your Runnable class, which is why inlining the lambdas in

Re: Deduping events using Spark

2015-06-04 Thread William Briggs
Hi Lee, You should be able to create a PairRDD using the Nonce as the key, and the AnalyticsEvent as the value. I'm very new to Spark, but here is some uncompilable pseudo code that may or may not help: events.map(event => (event.getNonce, event)).reduceByKey((a, b) => a).map(_._2) The above cod
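The reply's pseudo code can be mimicked in plain Python: key each event by its nonce, keep one event per key (reduceByKey's `(a, b) => a` keeps the first seen), then drop the key. The event values here are hypothetical placeholders:

```python
# Pure-Python rendering of the reduceByKey dedup pattern from the reply.

events = [("n1", "evt-a"), ("n2", "evt-b"), ("n1", "evt-c")]  # (nonce, event)

deduped = {}
for nonce, event in events:
    # reduceByKey((a, b) => a): the first event seen for a nonce wins.
    deduped.setdefault(nonce, event)

result = list(deduped.values())  # map(_._2): drop the nonce key
```

Note that in Spark, `(a, b) => a` keeps an arbitrary one of the duplicates (partition order is not deterministic), which is fine when the duplicates are identical payloads.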

Re: Make HTTP requests from within Spark

2015-06-03 Thread William Briggs
Hi Kaspar, This is definitely doable, but in my opinion, it's important to remember that, at its core, Spark is based around a functional programming paradigm - you're taking input sets of data and, by applying various transformations, you end up with a dataset that represents your "answer". Witho

Re: the EC2 setup script often will not allow me to SSH into my machines. Ideas?

2014-07-31 Thread William Cox
Ah, thanks for the help! That worked great. On Wed, Jul 30, 2014 at 10:31 AM, Zongheng Yang wrote: > To add to this: for this many (>= 20) machines I usually use at least > --wait 600. > > On Wed, Jul 30, 2014 at 9:10 AM, Nicholas Chammas > wrote: > > William, > >

Re: Shark/Spark running on EC2 can read from S3 bucket but cannot write to it - "Wrong FS"

2014-07-31 Thread William Cox
I am running Spark 0.9.1 and Shark 0.9.1. Sorry I didn't include that. On Thu, Jul 31, 2014 at 9:50 AM, William Cox wrote: > *The Shark-specific group appears to be in moderation pause, so I'm asking > here.* > > I'm running Shark/Spark on EC2. I am using Shark to qu

Shark/Spark running on EC2 can read from S3 bucket but cannot write to it - "Wrong FS"

2014-07-31 Thread William Cox
ng that I change to allow it to write to a S3 file system? I've tried all sorts of different queries to write to S3. This particular one was: > INSERT OVERWRITE DIRECTORY 's3n://id:key@shadoop/bucket' SELECT * FROM > table; Thanks for your help! -William

the EC2 setup script often will not allow me to SSH into my machines. Ideas?

2014-07-30 Thread William Cox
_ssh_tar) > File "./spark_ec2.py", line 640, in ssh_write > raise RuntimeError("ssh_write failed with error %s" % proc.returncode) > RuntimeError: ssh_write failed with error 255 So I log into the EC2 console and TERMINATE that specific machine, and re-resume. Now it finally appears to be installing software on the machines. Any ideas why certain machines refuse SSH connections, or why the master refuses for several minutes and then allows? Thanks. -William

Re: which kind of BlockId should I use?

2014-07-20 Thread william
Internally, we just use the BlockId's natural hashing and equality to do lookups and puts, so it should work fine. However, since it is in no way public API, it may change even in maintenance releases. On Sun, Jul 20, 2014 at 10:25 PM, william wrote: When spark is 0.7.3, I use SparkEnv.

Re: What does @developerApi mean?

2014-07-20 Thread william
thank you Stephen -- Original Message -- From: "Stephen Boesch";; Sent: Monday, July 21, 2014, 11:55 AM To: "user"; Subject: Re: What does @developerApi mean? The javaDoc seems reasonably helpful: /** * A lower-level, unstable API intended for developers. * * Developer API's mi

which kind of BlockId should I use?

2014-07-20 Thread william
When spark is 0.7.3, I use SparkEnv.get.blockManager.getLocal("model") and SparkEnv.get.blockManager.put("model", buf, StorageLevel.MEMORY_ONLY, false) to cache a model object When porting to spark 1.0.1, I found SparkEnv.get.blockManager.getLocal & SparkEnv.get.blockManager.put's APIs chang

ETL and workflow management on Spark

2014-05-22 Thread William Kang
Hi, We are moving into adopting the full stack of Spark. So far, we have used Shark to do some ETL work, which is not bad but is not prefect either. We ended writing UDF and UDGF, UDAF that can be avoided if we could use Pig. Do you have any suggestions with the ETL solution in Spark stack? And d

Hadoop 2.3 Centralized Cache vs RDD

2014-05-16 Thread William Kang
Hi, Any comments or thoughts on the implications of the newly released feature from Hadoop 2.3 on the centralized cache? How different it is from RDD? Many thanks. Cao