Announcing the Community Over Code 2025 Streaming Track

2025-04-04 Thread James Hughes
Hi all, Community Over Code, the ASF conference, will be held in Minneapolis, Minnesota, September 11-14, 2025. The call for presentations is open now through April 21, 2025. I am one of the co-chairs for th

Announcing the Community Over Code 2024 Streaming Track

2024-03-20 Thread James Hughes
Hi all, Community Over Code, the ASF conference, will be held in Denver, Colorado, October 7-10, 2024. The call for presentations is open now

Re: Spark Connect, Master, and Workers

2023-09-01 Thread James Yu
Can I simply understand Spark Connect this way: The client process is now the Spark driver? From: Brian Huynh Sent: Thursday, August 10, 2023 10:15 PM To: Kezhi Xiong Cc: user@spark.apache.org Subject: Re: Spark Connect, Master, and Workers Hi Kezhi, Yes, you

[k8s] Fail to expose custom port on executor container specified in my executor pod template

2023-06-26 Thread James Yu
ports are exposed (5005/TCP, 7078/TCP, 7079/TCP, 4040/TCP) as expected. Did I miss anything, or is this a known bug where the executor pod template is not respected in terms of port exposure? Thanks in advance for your help. James

Announcing the Community Over Code 2023 Streaming Track

2023-06-09 Thread James Hughes
Hi all, Community Over Code, the ASF conference, will be held in Halifax, Nova Scotia, October 7-10, 2023. The call for presentations is open now through July 13, 2023. I am one of the co-chairs for the stream

Re: query time comparison to several SQL engines

2022-04-07 Thread James Turton
What might be the biggest factor affecting running time here is that Drill's query execution is not fault tolerant while Spark's is. The philosophies differ: Drill's says "when you're doing interactive analytics and a node dies, killing your query as it goes, just run the query again." O

Re: Log4j 1.2.17 spark CVE

2021-12-13 Thread James Yu
Question: Spark uses log4j 1.2.17; if my application jar contains log4j 2.x and gets submitted to the Spark cluster, which version of log4j actually gets used during the Spark session? From: Sean Owen Sent: Monday, December 13, 2021 8:25 AM To: Jörn Franke Cc: P

Re: start-history-server.sh doesn't survive system reboot. Recommendation?

2021-12-08 Thread James Yu
amages arising from such loss, damage or destruction. On Wed, 8 Dec 2021 at 19:45, James Yu mailto:ja...@ispot.tv>> wrote: Just thought about another possibility which is to containerize the history server and run the container with proper restart policy. This may be the approach we will

Re: start-history-server.sh doesn't survive system reboot. Recommendation?

2021-12-08 Thread James Yu
Sent: Tuesday, December 7, 2021 1:29 PM To: James Yu Cc: user @spark Subject: Re: start-history-server.sh doesn't survive system reboot. Recommendation? The scripts just launch the processes. To make any process restart on system restart, you would need to set it up as a system service
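The reply's advice — make the history server a system service so it restarts with the machine — can be sketched as a systemd unit. This is a minimal sketch, not from the thread; the install prefix, user, and paths are assumptions to adapt:

```ini
# /etc/systemd/system/spark-history-server.service  (hypothetical paths and user)
[Unit]
Description=Apache Spark History Server
After=network.target

[Service]
Type=forking
User=spark
ExecStart=/opt/spark/sbin/start-history-server.sh
ExecStop=/opt/spark/sbin/stop-history-server.sh
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

After `systemctl daemon-reload` and `systemctl enable --now spark-history-server`, the service comes back after reboot; the containerized approach with a restart policy mentioned elsewhere in the thread achieves the same effect.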

start-history-server.sh doesn't survive system reboot. Recommendation?

2021-12-07 Thread James Yu
Hi Users, We found that the history server launched by using the "start-history-server.sh" command does not survive system reboot. Any recommendation of making it always up even after reboot? Thanks, James

Re: Performance Problems Migrating to S3A Committers

2021-08-05 Thread James Yu
See this ticket https://issues.apache.org/jira/browse/HADOOP-17201. It may help your team. From: Johnny Burns Sent: Tuesday, June 22, 2021 3:41 PM To: user@spark.apache.org Cc: data-orchestration-team Subject: Performance Problems Migrating to S3A Committers H

Re: Poor performance caused by coalesce to 1

2021-02-03 Thread James Yu
rito Sent: Wednesday, February 3, 2021 11:05 AM To: James Yu ; user Subject: Re: Poor performance caused by coalesce to 1 Coalesce is reducing the parallelization of your last stage, in your case to 1 task. So, it’s natural it will give poor performance especially with large data. If you absol

Poor performance caused by coalesce to 1

2021-02-03 Thread James Yu
is a simple and useful way to solve this kind of issue which we believe is quite common for many people. Thanks James

Re: Where do the executors get my app jar from?

2020-08-14 Thread James Yu
Henoc, Ok. That is for Yarn with HDFS. What will happen in Kubernetes as resource manager without HDFS scenario? James From: Henoc Sent: Thursday, August 13, 2020 10:45 PM To: James Yu Cc: user ; russell.spit...@gmail.com Subject: Re: Where do the executors

Where do the executors get my app jar from?

2020-08-13 Thread James Yu
hanks in advance for explanation. James

Re: [Spark ML] existence of Matrix Factorization ALS algorithm's log version

2020-07-29 Thread James Yuan
Thanks for your quick reply. I'll hack it if needed :) James -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: NoClassDefFoundError: scala/Product$class

2020-06-06 Thread James Moore
How are you depending on that org.bdgenomics.adam library? Maybe you're pulling the 2.11 version of that.
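A `NoClassDefFoundError: scala/Product$class` is the classic symptom of mixing Scala binary versions — e.g. a `_2.11` artifact on a Scala 2.12 classpath. One hedged sketch of the fix in sbt (artifact name and versions here are illustrative, not taken from the thread) is to use `%%` so sbt appends the Scala suffix matching your `scalaVersion`:

```scala
// build.sbt (sketch): %% appends the _2.12/_2.11 suffix that matches scalaVersion,
// so you cannot accidentally pull the 2.11 build onto a 2.12 classpath.
scalaVersion := "2.12.10"

libraryDependencies += "org.bdgenomics.adam" %% "adam-core-spark2" % "0.32.0"
```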

Re: Spark driver thread

2020-03-06 Thread James Yu
Pol, thanks for your reply. Actually I am running Spark apps in CLUSTER mode. Is what you said still applicable in cluster mode. Thanks in advance for your further clarification. From: Pol Santamaria Sent: Friday, March 6, 2020 12:59 AM To: James Yu Cc: user

Spark driver thread

2020-03-05 Thread James Yu
Hi, Does a Spark driver always work single-threaded? If yes, does it mean asking for more than one vCPU for the driver is wasteful? Thanks, James

[Spark SQL] dependencies to use test helpers

2019-07-24 Thread James Pirz
ependencies += "org.apache.spark" % "spark-core_2.11" % "2.4.3" libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.4.3" libraryDependencies += "org.apache.spark" % "spark-catalyst_2.11" % "2.4.3" [1] sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/PlanTest.scala Thanks, James
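For reusing Spark's test helpers such as `PlanTest`, Spark publishes its test classes as separate `tests`-classified artifacts. A hedged sbt sketch (versions mirror the snippet; the classifier trick is the standard approach, but verify the artifacts exist for your Spark version):

```scala
// build.sbt (sketch): pull Spark's test jars, which contain helpers like PlanTest
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"     % "2.4.3" % Test classifier "tests",
  "org.apache.spark" %% "spark-sql"      % "2.4.3" % Test classifier "tests",
  "org.apache.spark" %% "spark-catalyst" % "2.4.3" % Test classifier "tests"
)
```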

Re: Announcing Delta Lake 0.2.0

2019-06-20 Thread James Cotrotsios
Is there a plan to have a business catalog component for the Data Lake? If not, how would someone make a proposal to create an open source project related to that? I would be interested in building out an open source data catalog that would use the Hive metadata store as a baseline for technical met

Parallel read parquet file, write to postgresql

2018-12-03 Thread James Starks
Reading Spark doc (https://spark.apache.org/docs/latest/sql-data-sources-parquet.html). It's not mentioned how to read parquet files in parallel with SparkSession. Would --num-executors just work? Any additional parameters needed to be added to SparkSession as well? Also if I want to parallel writ

Re: Convert RDD[Iterrable[MyCaseClass]] to RDD[MyCaseClass]

2018-12-03 Thread James Starks
{ ... }.filter{ ... }.flatMap { records => records.flatMap { record => Seq(record) } } Not smart code, but it works for my case. Thanks for the advice! ‐‐‐ Original Message ‐‐‐ On Saturday, December 1, 2018 12:17 PM, Chris Teoh wrote: > Hi James, > > Try flatMap (_.toL

Re: Caused by: java.io.NotSerializableException: com.softwaremill.sttp.FollowRedirectsBackend

2018-11-30 Thread James Starks
ich should lead to spark > being able to use the serializable versions. > > That’s very much a last resort though! > > Chris > > On 30 Nov 2018, at 05:08, Koert Kuipers wrote: > >> if you only use it in the executors sometimes using lazy works >> >>

Convert RDD[Iterrable[MyCaseClass]] to RDD[MyCaseClass]

2018-11-30 Thread James Starks
When processing data, I create an instance of RDD[Iterable[MyCaseClass]] and I want to convert it to RDD[MyCaseClass] so that it can be further converted to dataset or dataframe with toDS() function. But I encounter a problem that SparkContext can not be instantiated within SparkSession.map func
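In Spark the idiomatic conversion is `rdd.flatMap(identity)` in Scala, or `rdd.flatMap(lambda it: it)` in PySpark — no `SparkContext` inside the closure is needed. The flattening semantics, shown on plain Python lists as a stand-in for the RDD:

```python
from itertools import chain

# Stand-in for RDD[Iterable[MyCaseClass]]: a list of iterables of records.
nested = [[{"id": 1}, {"id": 2}], [{"id": 3}]]

# What flatMap(identity) does: unpack each inner iterable into its elements.
flat = list(chain.from_iterable(nested))
print(flat)  # [{'id': 1}, {'id': 2}, {'id': 3}]
```

The result can then go through `toDS()`/`toDF()` as usual, since it is a flat collection of case-class records.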

Caused by: java.io.NotSerializableException: com.softwaremill.sttp.FollowRedirectsBackend

2018-11-29 Thread James Starks
This is not a problem directly caused by Spark, but it's related; thus asking here. I use Spark to read data from parquet and process some http calls with sttp (https://github.com/softwaremill/sttp). However, Spark throws Caused by: java.io.NotSerializableException: com.softwaremill.sttp.Fo

Re: Spark job's driver programe consums too much memory

2018-09-07 Thread James Starks
umentation more carefully because I believe you are a bit confused. > > regards, > > Apostolos > > On 07/09/2018 05:39 μμ, James Starks wrote: > > > Is df.write.mode(...).parquet("hdfs://..") also actions function? Checking > > doc shows that my spark d

Re: Spark job's driver programe consums too much memory

2018-09-07 Thread James Starks
park job can be reduced. Otherwise does Spark support streaming read from database (i.e. spark streaming + spark sql)? Thanks for your reply. ‐‐‐ Original Message ‐‐‐ On 7 September 2018 4:15 PM, Apostolos N. Papadopoulos wrote: > Dear James, > > - check the Spark documenta

Spark job's driver programe consums too much memory

2018-09-07 Thread James Starks
I have a Spark job that reads data from a database. By increasing the submit parameter '--driver-memory 25g' the job works without a problem locally, but not in the prod env because the prod master does not have enough capacity. So I have a few questions: - What functions such as collect() would cause the

Re: [External Sender] How to debug Spark job

2018-09-07 Thread James Starks
, Sep 7, 2018 at 5:48 AM James Starks > wrote: > >> I have a Spark job that reads from a postgresql (v9.5) table, and write >> result to parquet. The code flow is not complicated, basically >> >> case class MyCaseClass(field1: String, field2: St

How to debug Spark job

2018-09-07 Thread James Starks
I have a Spark job that reads from a postgresql (v9.5) table, and writes the result to parquet. The code flow is not complicated, basically case class MyCaseClass(field1: String, field2: String) val df = spark.read.format("jdbc")...load() df.createOrReplaceTempView(...) val newdf = spa

Re: Pass config file through spark-submit

2018-08-17 Thread James Starks
Accidentally got it working, though I don't thoroughly understand why (so far as I know, it configures things so the executor can refer to the conf file after it is copied to the executors' working dir). Basically it's a combination of the parameters --conf, --files, and --driver-class-path, instead of any s

Pass config file through spark-submit

2018-08-16 Thread James Starks
I have a config file that uses the Typesafe Config library, located on the local file system, and I want to submit that file through spark-submit so that the spark program can read customized parameters. For instance, my.app { db { host = domain.cc port = 1234 db = dbname user = myus
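The resolution reported later in the thread (a combination of --conf, --files, and --driver-class-path) can be sketched as the following invocation. File names, paths, and the main class are hypothetical; `-Dconfig.file` is the standard Typesafe Config override property:

```shell
# Sketch: ship app.conf to executors with --files, expose its directory on the
# driver classpath, and point Typesafe Config at it on the executor side.
spark-submit \
  --files /local/path/app.conf \
  --driver-class-path /local/path \
  --conf "spark.executor.extraJavaOptions=-Dconfig.file=app.conf" \
  --class my.app.Main my-app.jar
```

On the executors the file lands in the working directory, so the relative `app.conf` in `-Dconfig.file` resolves there.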

Data source jdbc does not support streamed reading

2018-08-08 Thread James Starks
Now my spark job can perform sql operations against a database table. Next I want to combine that with a streaming context, so I am switching to the readStream() function. But after job submission, spark throws Exception in thread "main" java.lang.UnsupportedOperationException: Data source jdbc does no

Re: Newbie question on how to extract column value

2018-08-07 Thread James Starks
ation on url (id, derived_data) }.show() Thanks for the advice, it's really helpful! ‐‐‐ Original Message ‐‐‐ On August 7, 2018 5:33 PM, Gourav Sengupta wrote: > Hi James, > > It is always advisable to use the latest SPARK version. That said, can you > please gi

Newbie question on how to extract column value

2018-08-07 Thread James Starks
I am very new to Spark. Just successfully setup Spark SQL connecting to postgresql database, and am able to display table with code sparkSession.sql("SELECT id, url from table_a where col_b <> '' ").show() Now I want to perform filter and map function on col_b value. In plain scala it would

RE: Any NLP library for sentiment analysis in Spark?

2017-04-11 Thread Gabriel James
Me too. Experiences and recommendations please. Gabriel From: Kevin Wang [mailto:buz...@gmail.com] Sent: Wednesday, April 12, 2017 6:11 AM To: Alonso Isidoro Roman Cc: Gaurav1809 ; user@spark.apache.org Subject: Re: Any NLP library for sentiment analysis in Spark? I am also interested

Anyone attending spark summit?

2016-10-11 Thread Andrew James
Hey, I just found a promo code for Spark Summit Europe that saves 20%. It’s "Summit16" - I love Brussels and just registered! Who’s coming with me to get their Spark on?! Cheers, Andrew

Re: Spark, Scala, and DNA sequencing

2016-07-25 Thread James McCabe
me to look into interesting open-source projects like this. James On 24/07/16 09:09, Sean Owen wrote: Also also, you may be interested in GATK, built on Spark, for genomics: https://github.com/broadinstitute/gatk On Sun, Jul 24, 2016 at 7:56 AM, Ofir Manor wrote: Hi James, BTW - if yo

Spark, Scala, and DNA sequencing

2016-07-22 Thread James McCabe
Hi! I hope this may be of use/interest to someone: Spark, a Worked Example: Speeding Up DNA Sequencing http://scala-bility.blogspot.nl/2016/07/spark-worked-example-speeding-up-dna.html James - To unsubscribe e-mail: user

Re: dataframe udf functioin will be executed twice when filter on new column created by withColumn

2016-05-11 Thread James Hammerton
This may be related to: https://issues.apache.org/jira/browse/SPARK-13773 Regards, James On 11 May 2016 at 15:49, Ted Yu wrote: > In master branch, behavior is the same. > > Suggest opening a JIRA if you haven't done so. > > On Wed, May 11, 2016 at 6:55 AM, Tony Jin

Help understanding an exception that produces multiple stack traces

2016-05-09 Thread James Casiraghi
the lazy evaluation, and that only actions will do this, but the initial stack trace seems to be showing a persist call with underlying executing work. -Thank you. -James Stack Trace: An error occurred while calling o236.persist. : org.apache.spark.sql.catalyst.errors.package$TreeNodeException

Re: Error from reading S3 in Scala

2016-05-04 Thread James Hammerton
ecause of what's said about s3:// and s3n:// here (which is why I use s3a://): https://wiki.apache.org/hadoop/AmazonS3 Regards, James > Besides that you can increase s3 speeds using the instructions mentioned > here: > https://aws.amazon.com/blogs/aws/aws-storage-update-ama

Will nested field performance improve?

2016-04-15 Thread James Aley
be keen to address that in our ETL pipeline with a flattening step. If it's a known issue that we expect will be fixed in upcoming releases, I'll hold off. Any advice greatly appreciated! Thanks, James.

Re: ML Random Forest Classifier

2016-04-13 Thread James Hammerton
Hi Ashic, Unfortunately I don't know how to work around that - I suggested this line as it looked promising (I had considered it once before deciding to use a different algorithm) but I never actually tried it. Regards, James On 13 April 2016 at 02:29, Ashic Mahtab wrote: > It looks

Re: ML Random Forest Classifier

2016-04-11 Thread James Hammerton
tegoricalFeatures, numClasses, numFeatures) > > } > > > def toOld(newModel: RandomForestClassificationModel): > OldRandomForestModel = { > > newModel.toOld > > } > > } > Regards, James On 11 April 2016 at 10:36, James Hammerton wrote: > There are met

Re: ML Random Forest Classifier

2016-04-11 Thread James Hammerton
I've not actually tried doing this myself but it looks as if it might work. Regards, James On 11 April 2016 at 10:29, Ashic Mahtab wrote: > Hello, > I'm trying to save a pipeline with a random forest classifier. If I try to > save the pipeline, it complains that the classifier i

Logistic regression throwing errors

2016-04-01 Thread James Hammerton
errors cause the learning to fail - f1 = 0. Anyone got any idea why this might happen? Regards, James

Re: Work out date column in CSV more than 6 months old (datediff or something)

2016-03-22 Thread James Hammerton
scala> "22/03/2016" < "24/02/2015" > > res4: Boolean = true > > >> scala> "22/03/2016" < "04/02/2015" > > res5: Boolean = false > > This is the correct result for a string comparison but it's not the comparison y
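The point of the thread — lexicographic comparison of dd/MM/yyyy strings gives wrong chronological answers — can be checked directly. A plain Python sketch, with the dates taken from the exchange:

```python
from datetime import datetime

def to_date(s: str) -> datetime:
    # Parse the dd/MM/yyyy strings used in the thread.
    return datetime.strptime(s, "%d/%m/%Y")

as_strings = "22/03/2016" < "24/02/2015"                  # lexicographic: True, but wrong
as_dates = to_date("22/03/2016") < to_date("24/02/2015")  # chronological: False, correct
print(as_strings, as_dates)  # True False
```

The same idea applies in Spark SQL: cast the column to a date (e.g. with `to_date` and a format pattern) before comparing, rather than comparing the raw strings.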

Re: Find all invoices more than 6 months from csv file

2016-03-22 Thread James Hammerton
e an add_month(), date_add() and date_sub() methods, the first adds a number of months to a start date (would adding a -ve number of months to the current date work?), the latter two add or subtract a specified number of days to/from a date, these are available in 1.5.0 onwards. Alternatively outside

Add org.apache.spark.mllib model .predict() method to models in org.apache.spark.ml?

2016-03-22 Thread James Hammerton
being used outside of Spark than the new models at this time. Are there any plans to add the .predict() method back to the models in the new API? Regards, James

Re: best way to do deep learning on spark ?

2016-03-20 Thread James Hammerton
In the meantime there is also deeplearning4j which integrates with Spark (for both Java and Scala): http://deeplearning4j.org/ Regards, James On 17 March 2016 at 02:32, Ulanov, Alexander wrote: > Hi Charles, > > > > There is an implementation of multilayer perceptron in S

Saving the DataFrame based RandomForestClassificationModels

2016-03-18 Thread James Hammerton
issues.apache.org/jira/browse/SPARK-11888 My question is whether there's a work around given that these bugs are unresolved at least until 2.0.0. Regards, James

Best way to process values for key in sorted order

2016-03-15 Thread James Hammerton
erved on reading the events in, this should work. Anyone know definitively if this is the case? Regards, James

Re: DataFrame .filter only seems to work when .cache is called in local mode in 1.6.0

2016-03-09 Thread James Hammerton
Hi Ted, Finally got round to creating this: https://issues.apache.org/jira/browse/SPARK-13773 I hope you don't mind me selecting you as the shepherd for this ticket. Regards, James On 7 March 2016 at 17:50, James Hammerton wrote: > Hi Ted, > > Thanks for getting back -

Re: DataFrame .filter only seems to work when .cache is called in local mode in 1.6.0

2016-03-07 Thread James Hammerton
e to choose the project. Regards, James On 7 March 2016 at 13:09, Ted Yu wrote: > Have you tried clicking on Create button from an existing Spark JIRA ? > e.g. > https://issues.apache.org/jira/browse/SPARK-4352 > > Once you're logged in, you should be able to select Spark as

Re: DataFrame .filter only seems to work when .cache is called in local mode in 1.6.0

2016-03-07 Thread James Hammerton
pache Infrastructure. There doesn't seem to be an option for me to raise an issue for Spark?! Regards, James On 4 March 2016 at 14:03, James Hammerton wrote: > Sure thing, I'll see if I can isolate this. > > Regards. > > James > > On 4 March 2016 at 12:24, Ted Yu

Spark reduce serialization question

2016-03-04 Thread James Jia
5.5MB, which is approximately 4 * 330 MB. I know I can set the driver's max result size, but I just want to confirm that this is expected behavior. Thanks! James Stage 0:==>(1 + 3) / 4]16/02/19 05:59:28 ERROR TaskSetManager:

Re: DataFrame .filter only seems to work when .cache is called in local mode in 1.6.0

2016-03-04 Thread James Hammerton
Sure thing, I'll see if I can isolate this. Regards. James On 4 March 2016 at 12:24, Ted Yu wrote: > If you can reproduce the following with a unit test, I suggest you open a > JIRA. > > Thanks > > On Mar 4, 2016, at 4:01 AM, James Hammerton wrote: > > Hi, >

DataFrame .filter only seems to work when .cache is called in local mode in 1.6.0

2016-03-04 Thread James Hammerton
rmats(NoTypeHints) > > val created = Serialization.read[GMailMessage.Created](eventJson) // > This is where the code crashes if the .cache isn't called Regards, James

Re: How to control the number of parquet files getting created under a partition ?

2016-03-02 Thread James Hammerton
r of partitions in the DataFrame by using coalesce() before saving the data. Regards, James On 1 March 2016 at 21:01, SRK wrote: > Hi, > > How can I control the number of parquet files getting created under a > partition? I have my sqlContext queries to create a table and inse

Re: Sample sql query using pyspark

2016-03-01 Thread James Barney
Maurin, I don't know the technical reason why but: try removing the 'limit 100' part of your query. I was trying to do something similar the other week and what I found is that each executor doesn't necessarily get the same 100 rows. Joins would fail or result with a bunch of nulls when keys weren

Re: How could I do this algorithm in Spark?

2016-02-24 Thread James Barney
Guillermo, I think you're after an associative algorithm where A is ultimately associated with D, correct? Jakob would be correct if that is a typo--a sort would be all that is necessary in that case. I believe you're looking for something else though, if I understand correctly. This seems like a si

Count job stalling at shuffle stage on 3.4TB input (but only 5.3GB shuffle write)

2016-02-23 Thread James Hammerton
ut=[]) > TungstenExchange hashpartitioning(objectId#0) > TungstenAggregate(key=[objectId#0], functions=[], output=[objectId#0]) > Scan > CsvRelation(,Some(s3n://gluru-research/data/events.prod.2016-02-04/extractedEventsUncompressed),false, > > ,",null,#,PERMISSIVE,COMMONS,false,false,false,StructType(StructField(objectId,StringType,true), > StructField(eventName,StringType,true), > StructField(eventJson,StringType,true), > StructField(timestampNanos,StringType,true)),false,null)[objectId#0] > > Code Generation: true > > Regards, James

Re: Is this likely to cause any problems?

2016-02-19 Thread James Hammerton
http://spark.apache.org/docs/latest/index.html) mentions EC2 but not EMR. Regards, James On 19 February 2016 at 14:25, Daniel Siegmann wrote: > With EMR supporting Spark, I don't see much reason to use the spark-ec2 > script unless it is important for you to be able to launch clus

Re: Is this likely to cause any problems?

2016-02-18 Thread James Hammerton
I have now... So far I think the issues I've had are not related to this, but I wanted to be sure in case it should be something that needs to be patched. I've had some jobs run successfully but this warning appears in the logs. Regards, James On 18 February 2016 at 12:23, Ted

Re: Is this likely to cause any problems?

2016-02-18 Thread James Hammerton
I'm fairly new to Spark. The documentation suggests using the spark-ec2 script to launch clusters in AWS, hence I used it. Would EMR offer any advantage? Regards, James On 18 February 2016 at 14:04, Gourav Sengupta wrote: > Hi, > > Just out of sheet curiosity why are you no

Is this likely to cause any problems?

2016-02-18 Thread James Hammerton
2xlarge and master is m4.large. Could this contribute to any problems running the jobs? Regards, James

pyspark.DataFrame.dropDuplicates

2016-02-12 Thread James Barney
s which row is kept and which is deleted? First to appear? Or random? I would like to guarantee that the row with the longest list itemsInPocket is kept. How can I do that? Thanks, James
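`dropDuplicates` gives no guarantee about which duplicate survives, so keeping the row with the longest list needs an explicit max-per-key — in Spark typically a window partitioned by the key, ordered by `size(itemsInPocket)` descending, plus `row_number`. The rule itself, sketched in plain Python (`itemsInPocket` is from the question; the `user` key and the data are hypothetical):

```python
rows = [
    {"user": "a", "itemsInPocket": [1]},
    {"user": "a", "itemsInPocket": [1, 2, 3]},
    {"user": "b", "itemsInPocket": [7]},
]

# Keep, per user, the row whose itemsInPocket list is longest.
best = {}
for row in rows:
    key = row["user"]
    if key not in best or len(row["itemsInPocket"]) > len(best[key]["itemsInPocket"]):
        best[key] = row

print(sorted(best.values(), key=lambda r: r["user"]))
```

With ties, this keeps the first row seen; a deterministic tiebreaker (e.g. a timestamp) would remove that ambiguity.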

Re: Extract all the values from describe

2016-02-08 Thread James Barney
Hi Arunkumar, From the Scala documentation it's recommended to use the agg function for performing any actual statistics programmatically on your data. df.describe() is meant only for data exploration. See Aggregator here: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.

Re: Recursive nested wildcard directory walking in Spark

2015-12-09 Thread James Ding
/*.gz). Any thoughts or workarounds? I’m considering using bash globbing to match files recursively and feed hundreds of thousands of arguments to spark-submit. Reasons for/against? From: Ted Yu Date: Wednesday, December 9, 2015 at 3:50 PM To: James Ding Cc: "user@spark.apache.org"

Recursive nested wildcard directory walking in Spark

2015-12-09 Thread James Ding
Hi! My name is James, and I’m working on a question that doesn’t seem to have many answers online. I was hoping spark/hadoop gurus could shed some light on this. I have a data feed on NFS that looks like /foobar/.gz Currently I have a spark scala job that calls sparkContext.textFile

Re: No suitable drivers found for postgresql

2015-11-13 Thread James Nowell
path /usr/local/share/jupyter/kernels/postgres/postgresql-9.4-1204.jdbc42.jar --executor-memory 1G --total-executor-cores 15 pyspark-shell" James On Fri, Nov 13, 2015 at 12:12 PM Krishna Sangeeth KS < kskrishnasange...@gmail.com> wrote: > ​​ > ​Hi,​ > > I have been trying t

Re: No suitable drivers found for postgresql

2015-11-13 Thread James Nowell
I recently had this same issue. Though I didn't find the cause, I was able to work around it by loading the JAR into hdfs. Once in HDFS, I used the --jars flag with the full hdfs path: --jars hdfs://{our namenode}/tmp/postgresql-9.4-1204-jdbc42.jar James On Fri, Nov 13, 2015 at 10:14 AM s

Re: Setting executors per worker - Standalone

2015-09-29 Thread James Pirz
environment tab on the Application UI > > --- > Robin East > *Spark GraphX in Action* Michael Malak and Robin East > Manning Publications Co. > http://www.manning.com/books/spark-graphx-in-action > >

Re: Setting executors per worker - Standalone

2015-09-28 Thread James Pirz
have 4 cores per worker > > > > On Tue, Sep 29, 2015 at 8:24 AM, James Pirz wrote: > >> Hi, >> >> I am using Spark 1.5 (standalone mode) on a cluster with 10 nodes while >> each machine has 12GB of RAM and 4 cores. On each machine I have one worker >&g

Setting executors per worker - Standalone

2015-09-28 Thread James Pirz
Hi, I am using Spark 1.5 (standalone mode) on a cluster with 10 nodes, where each machine has 12GB of RAM and 4 cores. On each machine I have one worker which is running one executor that grabs all 4 cores. I am interested in checking the performance with "one worker but 4 executors per machine - each

Unreachable dead objects permanently retained on heap

2015-09-25 Thread James Aley
bout this? Is there a way to reclaim this memory? Should those arrays be GC'ed when jobs finish? Any guidance greatly appreciated. Many thanks, James.

Java UDFs in GROUP BY expressions

2015-09-07 Thread James Aley
illustrate the issue. The equivalent code from Scala seems to work fine for me. Is anyone else seeing this problem? For us, the attached code fails every time on Spark 1.4.1 Thanks, James

RE: Feedback: Feature request

2015-08-28 Thread Murphy, James
This is great and much appreciated. Thank you. - Jim From: Manish Amde [mailto:manish...@gmail.com] Sent: Friday, August 28, 2015 9:20 AM To: Cody Koeninger Cc: Murphy, James; user@spark.apache.org; d...@spark.apache.org Subject: Re: Feedback: Feature request Sounds good. It's a request I

Feedback: Feature request

2015-08-26 Thread Murphy, James
Hey all, In working with the DecisionTree classifier, I found it difficult to extract rules that could easily facilitate visualization with libraries like D3. So for example, using print(model.toDebugString()), I get the following result: If (feature 0 <= -35.0) If (feature 24 <= 176.0
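One way to bridge `toDebugString()` output to D3 is to parse the indented If/Else/Predict lines into nested JSON. A sketch under the assumption that the dump has the two-way `If (...)`/`Else (...)` shape shown above; the sample tree here is made up to mirror the snippet:

```python
import json
import re

DEBUG = """\
If (feature 0 <= -35.0)
 If (feature 24 <= 176.0)
  Predict: 0.0
 Else (feature 24 > 176.0)
  Predict: 1.0
Else (feature 0 > -35.0)
 Predict: 1.0"""

def parse(lines):
    # Recursively consume one node: either a leaf ("Predict: x") or an
    # "If (...)" header followed by two child subtrees.
    line = lines.pop(0).strip()
    if line.startswith("Predict:"):
        return {"predict": float(line.split(":")[1])}
    condition = re.match(r"If \((.*)\)", line).group(1)
    left = parse(lines)   # subtree under the If branch
    lines.pop(0)          # skip the matching "Else (...)" header line
    right = parse(lines)  # subtree under the Else branch
    return {"condition": condition, "children": [left, right]}

tree = parse(DEBUG.splitlines())
print(json.dumps(tree, indent=2))
```

The resulting `{"condition": ..., "children": [...]}` shape is close to what D3 hierarchy layouts consume directly.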

Repartitioning external table in Spark sql

2015-08-18 Thread James Pirz
I am using Spark 1.4.1 , in stand-alone mode, on a cluster of 3 nodes. Using Spark sql and Hive Context, I am trying to run a simple scan query on an existing Hive table (which is an external table consisting of rows in text files stored in HDFS - it is NOT parquet, ORC or any other richer format)

Re: worker and executor memory

2015-08-14 Thread James Pirz
scheduled that way, as it is a map-only job and reading can happen in parallel. On Thu, Aug 13, 2015 at 9:10 PM, James Pirz wrote: > Hi, > > I am using Spark 1.4 on a cluster (stand-alone mode), across 3 machines, > for a workload similar to TPCH (analytical queries with multiple/multi

worker and executor memory

2015-08-13 Thread James Pirz
Hi, I am using Spark 1.4 on a cluster (stand-alone mode), across 3 machines, for a workload similar to TPCH (analytical queries with multiple/multi-way large joins and aggregations). Each machine has 12GB of Memory and 4 cores. My total data size is 150GB, stored in HDFS (stored as Hive tables), a

Re: SparkSQL: "add jar" blocks all queries

2015-08-07 Thread Wu, James C.
Hi, The issue only seems to happen when trying to access spark via the SparkSQL Thrift Server interface. Does anyone know a fix? james From: , Walt Disney mailto:james.c...@disney.com>> Date: Friday, August 7, 2015 at 12:40 PM To: "user@spark.apache.org<mailto:user@sp

SparkSQL: "add jar" blocks all queries

2015-08-07 Thread Wu, James C.
Hi, I got into a situation where a prior "add jar" command caused Spark SQL to stop working for all users. Does anyone know how to fix the issue? Regards, james From: , Walt Disney mailto:james.c...@disney.com>> Date: Friday, August 7, 2015 at 10:29 AM To: "user@spark.a

SparkSQL: remove jar added by "add jar " command from dependencies

2015-08-07 Thread Wu, James C.
get the jar removed from the dependencies so that it is not blocking all my Spark SQL queries for all sessions. Thanks, James

[POWERED BY] Please add our organization

2015-07-23 Thread Baxter, James
quality analysis and data exploration. Regards James Baxter Technology and Innovation Analyst IS&S Woodside Energy Ltd. Woodside Plaza 240 St Georges Terrace Perth WA 6000 Australia T: +61 8 9348 4218 F: +61 8 9348 6561 E: james.bax...@woodside.com.au<mailto:james.bax...@woodside.com.au>

Streaming: updating broadcast variables

2015-07-02 Thread James Cole
etter way than re-creating the JavaStreamingContext and DStreams? Thanks, James

Re: Help optimising Spark SQL query

2015-06-30 Thread James Aley
torical data to go sifting through. Turns out we're already writing our data as //, we just missed the "date=" naming convention - d'oh! At least that means a fairly simple rename script should get us out of trouble! Appreciate everyone's tips, thanks again! James. On 23

Re: Help optimising Spark SQL query

2015-06-23 Thread James Aley
se seem to have made any remarkable difference in running time for the query. I'll hook up YourKit and see if we can figure out where the CPU time is going, then post back. On 22 June 2015 at 16:01, Yin Huai wrote: > Hi James, > > Maybe it's the DISTINCT causing the issue. >

Re: Help optimising Spark SQL query

2015-06-22 Thread James Aley
Thanks for the responses, guys! Sorry, I forgot to mention that I'm using Spark 1.3.0, but I'll test with 1.4.0 and try the codegen suggestion then report back. On 22 June 2015 at 12:37, Matthew Johnson wrote: > Hi James, > > > > What version of Spark are you using

Help optimising Spark SQL query

2015-06-22 Thread James Aley
e and couldn't find anyone else having similar issues. Many thanks, James.

Re: Optimisation advice for Avro->Parquet merge job

2015-06-12 Thread James Aley
Hey Kiran, Thanks very much for the response. I left for vacation before I could try this out, but I'll experiment once I get back and let you know how it goes. Thanks! James. On 8 June 2015 at 12:34, kiran lonikar wrote: > It turns out my assumption on load and unionAll being blo

Re: spark-submit does not use hive-site.xml

2015-06-10 Thread James Pirz
to communicate with Hive metastore. > > So your program need to instantiate a > `org.apache.spark.sql.hive.HiveContext` instead. > > Cheng > > > On 6/10/15 10:19 AM, James Pirz wrote: > > I am using Spark (standalone) to run queries (from a remote client) > against d
