[SPARK-SQL] How to pass parameter when running hql script using cli?

2015-03-06 Thread James
Hello, I want to execute an hql script through the `spark-sql` command. My script contains: ``` ALTER TABLE xxx DROP PARTITION (date_key = ${hiveconf:CUR_DATE}); ``` When I execute ``` spark-sql -f script.hql -hiveconf CUR_DATE=20150119 ``` it throws an error like ``` cannot recognize input near '$

Re: [SPARK-SQL] How to pass parameter when running hql script using cli?

2015-03-08 Thread James
Hi, It still doesn't work. Is there any working instruction on how to pass a date to an hql script? Alcaid 2015-03-07 2:43 GMT+08:00 Zhan Zhang : > Do you mean “--hiveConf” (two dashes) instead of -hiveconf (one dash)? > > Thanks. > > Zhan Zhang > > On Mar 6, 2015

How to avoid using some nodes while running a spark program on yarn

2015-03-14 Thread James
Hello, I have a cluster with Spark on YARN. Currently some of its nodes are running a Spark streaming program, so their local disk space is not enough to support other applications. I wonder whether it is possible to use a blacklist to avoid these nodes when running a new Spark program? Alcaid

Re: How to avoid using some nodes while running a spark program on yarn

2015-03-14 Thread James
My hadoop version is 2.2.0, and my spark version is 1.2.0 2015-03-14 17:22 GMT+08:00 Ted Yu : > Which release of hadoop are you using ? > > Can you utilize the node labels feature ? > See YARN-2492 and YARN-796 > > Cheers > > On Sat, Mar 14, 2015 at 1:49 AM, James wrote:
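For the node-labels route Ted points at, note that YARN-796/YARN-2492 landed in Hadoop 2.6+, so hadoop 2.2.0 predates them; later Spark releases (1.6+) also grew matching settings. A forward-looking sketch, with "batch" as a purely hypothetical label name:

```scala
import org.apache.spark.SparkConf

// Assumes a YARN cluster whose "good" nodes carry the label "batch" (hypothetical).
val conf = new SparkConf()
  .set("spark.yarn.am.nodeLabelExpression", "batch")       // place the AM on labeled nodes only
  .set("spark.yarn.executor.nodeLabelExpression", "batch") // and the executors likewise
```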

Null Pointer Exception due to mapVertices function in GraphX

2015-03-15 Thread James
I have got a NullPointerException in aggregateMessages on a graph which is the output of the mapVertices function of another graph. I found the problem is that the mapVertices function did not affect all the triplets of the graph. // Initialize the graph, assign a counter to each vertex that contains the ve
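The thread's code is cut off, so as a rough illustration of the pattern being described (the counter logic and names here are hypothetical, not James's actual code), this is a mapVertices-then-aggregateMessages pipeline; explicitly cache()-ing the mapped graph before aggregating is a common workaround when the derived graph misbehaves:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx._

def counterExample(sc: SparkContext): Unit = {
  val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))
  val graph = Graph.fromEdges(edges, defaultValue = 0)

  // Assign a counter to each vertex, as the thread's mapVertices step does.
  val counted = graph.mapVertices((_, _) => 0L)
  counted.cache() // materialise the mapped graph before aggregating over it

  val msgs: VertexRDD[Long] = counted.aggregateMessages[Long](
    ctx => ctx.sendToDst(ctx.srcAttr + 1L), // sendMsg, run per triplet
    _ + _                                   // mergeMsg, combines per vertex
  )
  msgs.collect().foreach(println)
}
```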

Clean the shuffle data during iteration

2015-03-20 Thread James
Hello, Is it possible to delete the shuffle data of previous iterations once it is no longer necessary? Alcaid
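There is no direct "delete this shuffle now" call; the usual approach is to let the ContextCleaner do it by unpersisting the previous iteration's RDD and checkpointing periodically so the old lineage (and its shuffle files) becomes unreferenced. A sketch under those assumptions; `step`, the path, and the every-10 interval are hypothetical:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def iterate[T](sc: SparkContext, initial: RDD[T], step: RDD[T] => RDD[T], iterations: Int): RDD[T] = {
  sc.setCheckpointDir("hdfs:///tmp/checkpoints") // hypothetical location
  var current = initial
  for (i <- 1 to iterations) {
    val next = step(current).persist()
    if (i % 10 == 0) next.checkpoint() // truncate lineage so old shuffles become unreferenced
    next.count()                       // materialise before releasing the parent
    current.unpersist()                // lets the ContextCleaner drop the parent's blocks and shuffles
    current = next
  }
  current
}
```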

Using DIMSUM with ids

2015-04-06 Thread James
The example below illustrates how to use the DIMSUM algorithm to calculate the similarity between each pair of rows and output row pairs with cosine similarity that is not less than a threshold. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/CosineSi
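A hedged sketch of attaching ids: the matrix API only knows integer indices, so keep your own columnIndex-to-id mapping (e.g. built with zipWithIndex) and join it back against the result. Note that columnSimilarities compares columns, so orient the data with one item per column:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

def similarPairs(sc: SparkContext): Unit = {
  val rows = sc.parallelize(Seq(
    Vectors.dense(1.0, 2.0, 3.0),
    Vectors.dense(4.0, 5.0, 6.0)))
  val mat = new RowMatrix(rows)
  val sims = mat.columnSimilarities(0.1) // DIMSUM: estimate pairs with cosine >= ~0.1
  sims.entries.collect().foreach { e =>
    println(s"cols ${e.i} and ${e.j}: ${e.value}") // map i/j back to your ids here
  }
}
```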

[GraphX] aggregateMessages with active set

2015-04-07 Thread James
Hello, The old GraphX API "mapReduceTriplets" has an optional parameter "activeSetOpt: Option[(VertexRDD[_], EdgeDirection)]" that limits the input of sendMessage. However, in the new API "aggregateMessages" I could not find this option; why is it no longer offered? Alcaid

Re: [GraphX] aggregateMessages with active set

2015-04-09 Thread James
].aggregateMessagesWithActiveSet(...) > Ankur > > > On Tue, Apr 7, 2015 at 2:56 AM, James wrote: > > Hello, > > > > The old api of GraphX "mapReduceTriplets" has an optional parameter > > "activeSetOpt: Option[(VertexRDD[_]" that limit the input o

Re: [GraphX] aggregateMessages with active set

2015-04-13 Thread James
https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L237-266 > > Ankur > > > On Thu, Apr 9, 2015 at 3:21 AM, James wrote: > > In aggregateMessagesWithActiveSet, Spark still has to read all edges. It > > mea
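Based on the GraphImpl source Ankur links, a sketch of calling the method: it is private[graphx], so the caller has to be compiled into that package, and as the thread notes it still scans all edges (inactive ones merely skip sendMsg). Internal API, so this may break between releases:

```scala
// Must live in this package: aggregateMessagesWithActiveSet is private[graphx].
package org.apache.spark.graphx

import scala.reflect.ClassTag

object ActiveSetExample {
  def run[VD, ED, A: ClassTag](
      graph: Graph[VD, ED],
      active: VertexRDD[_],
      sendMsg: EdgeContext[VD, ED, A] => Unit,
      mergeMsg: (A, A) => A): VertexRDD[A] =
    graph.aggregateMessagesWithActiveSet(
      sendMsg, mergeMsg, TripletFields.All, Some((active, EdgeDirection.Either)))
}
```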

How to correctly estimate the number of partitions of a graph in GraphX

2014-11-01 Thread James
Hello, I am trying to run the Connected Components algorithm on a very big graph. In practice I found that a small number of partitions would lead to OOM, while a large number would cause various timeout exceptions. Thus I wonder how to estimate the number of partitions for a graph in GraphX? Alcaid
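One knob worth naming explicitly; the 512 below and the few-times-total-cores heuristic are assumptions for illustration, not numbers from the thread:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

def load(sc: SparkContext, path: String) =
  // Pick the edge partition count explicitly instead of inheriting it from the
  // input splits; start around a small multiple of total cores, then adjust
  // upward on OOMs and downward on shuffle/ack timeouts.
  GraphLoader.edgeListFile(sc, path, numEdgePartitions = 512)
    .partitionBy(PartitionStrategy.EdgePartition2D)
```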

Re: How to correctly estimate the number of partitions of a graph in GraphX

2014-11-01 Thread James
Ankur <http://www.ankurdave.com/> > > On Sat, Nov 1, 2014 at 10:57 PM, James wrote: > >> Hello, >> >> I am trying to run the Connected Components algorithm on a very big graph. In >> practice I found that a small number of partitions would lead to OOM, >> while a large number wou

Bug in DISK related Storage level?

2014-11-03 Thread James
Hello, I am trying to load a very large graph to run a GraphX algorithm, and the graph does not fit in memory. I found that if I use the DISK_ONLY or MEMORY_AND_DISK_SER storage level, the program will hit OOM, but if I use MEMORY_ONLY_SER, it will not. Thus I want to know what kind of differe
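For reference, a sketch of loading a graph with serialized memory-and-disk levels (parameter names are from the 1.x GraphLoader API; whether this avoids the reported OOM depends on what is actually overflowing):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.GraphLoader
import org.apache.spark.storage.StorageLevel

def loadBig(sc: SparkContext, path: String) =
  GraphLoader.edgeListFile(sc, path,
    edgeStorageLevel = StorageLevel.MEMORY_AND_DISK_SER,   // spill serialized edges to disk
    vertexStorageLevel = StorageLevel.MEMORY_AND_DISK_SER) // and vertices likewise
```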

Using graphx to calculate average distance of a big graph

2015-01-04 Thread James
Recently we have wanted to use Spark to calculate the average shortest-path distance between each reachable pair of nodes in a very big graph. Has anyone ever tried this? We hope to discuss the problem.

Re: Using graphx to calculate average distance of a big graph

2015-01-06 Thread James
shortest path is an option, you could simply > find the APSP using https://github.com/apache/spark/pull/3619 and then > take the average distance (apsp.map(_._2.toDouble).mean). > > Ankur <http://www.ankurdave.com/> > > On Sun, Jan 4, 2015 at 6:28 PM, James wrote:

Re: [Spark ML] existence of Matrix Factorization ALS algorithm's log version

2020-07-29 Thread James Yuan
Thanks for your quick reply. I'll hack it if needed :) James

Where do the executors get my app jar from?

2020-08-13 Thread James Yu
Thanks in advance for the explanation. James

Re: Where do the executors get my app jar from?

2020-08-14 Thread James Yu
Henoc, OK. That is for YARN with HDFS. What will happen in the Kubernetes-as-resource-manager scenario, without HDFS? James From: Henoc Sent: Thursday, August 13, 2020 10:45 PM To: James Yu Cc: user ; russell.spit...@gmail.com Subject: Re: Where do the executors

Poor performance caused by coalesce to 1

2021-02-03 Thread James Yu
is a simple and useful way to solve this kind of issue which we believe is quite common for many people. Thanks James

Re: Poor performance caused by coalesce to 1

2021-02-03 Thread James Yu
rito Sent: Wednesday, February 3, 2021 11:05 AM To: James Yu ; user Subject: Re: Poor performance caused by coalesce to 1 Coalesce is reducing the parallelization of your last stage, in your case to 1 task. So, it’s natural it will give poor performance especially with large data. If you absol

Re: Performance Problems Migrating to S3A Committers

2021-08-05 Thread James Yu
See this ticket https://issues.apache.org/jira/browse/HADOOP-17201. It may help your team. From: Johnny Burns Sent: Tuesday, June 22, 2021 3:41 PM To: user@spark.apache.org Cc: data-orchestration-team Subject: Performance Problems Migrating to S3A Committers H

start-history-server.sh doesn't survive system reboot. Recommendation?

2021-12-07 Thread James Yu
Hi Users, We found that the history server launched by the "start-history-server.sh" command does not survive a system reboot. Any recommendation for keeping it up even after a reboot? Thanks, James

Re: start-history-server.sh doesn't survive system reboot. Recommendation?

2021-12-08 Thread James Yu
Sent: Tuesday, December 7, 2021 1:29 PM To: James Yu Cc: user @spark Subject: Re: start-history-server.sh doesn't survive system reboot. Recommendation? The scripts just launch the processes. To make any process restart on system restart, you would need to set it up as a system service

Re: start-history-server.sh doesn't survive system reboot. Recommendation?

2021-12-08 Thread James Yu
On Wed, 8 Dec 2021 at 19:45, James Yu <ja...@ispot.tv> wrote: Just thought about another possibility, which is to containerize the history server and run the container with a proper restart policy. This may be the approach we will

Re: Log4j 1.2.17 spark CVE

2021-12-13 Thread James Yu
Question: Spark uses log4j 1.2.17; if my application jar contains log4j 2.x and gets submitted to the Spark cluster, which version of log4j actually gets used during the Spark session? From: Sean Owen Sent: Monday, December 13, 2021 8:25 AM To: Jörn Franke Cc: P
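The answer largely comes down to classpath ordering: by default Spark's own jars (and hence its bundled log4j) are seen first. A hedged sketch of the experimental flags that flip that preference; whether log4j 2.x then coexists cleanly with Spark's own logging is a separate question:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.driver.userClassPathFirst", "true")   // prefer classes from the application jar
  .set("spark.executor.userClassPathFirst", "true") // on the executors as well
```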

Re: query time comparison to several SQL engines

2022-04-07 Thread James Turton
What might be the biggest factor affecting running time here is that Drill's query execution is not fault tolerant while Spark's is. The philosophies differ: Drill's says "when you're doing interactive analytics and a node dies, killing your query as it goes, just run the query again." O

Announcing the Community Over Code 2023 Streaming Track

2023-06-09 Thread James Hughes
Hi all, Community Over Code, the ASF conference, will be held in Halifax, Nova Scotia, October 7-10, 2023. The call for presentations is open now through July 13, 2023. I am one of the co-chairs for the stream

[k8s] Fail to expose custom port on executor container specified in my executor pod template

2023-06-26 Thread James Yu
ports are exposed (5005/TCP, 7078/TCP, 7079/TCP, 4040/TCP) as expected. Did I miss anything, or is this a known bug where the executor pod template is not respected in terms of port exposure? Thanks in advance for your help. James

Re: Spark Connect, Master, and Workers

2023-09-01 Thread James Yu
Can I simply understand Spark Connect this way: The client process is now the Spark driver? From: Brian Huynh Sent: Thursday, August 10, 2023 10:15 PM To: Kezhi Xiong Cc: user@spark.apache.org Subject: Re: Spark Connect, Master, and Workers Hi Kezhi, Yes, you

Announcing the Community Over Code 2024 Streaming Track

2024-03-20 Thread James Hughes
Hi all, Community Over Code, the ASF conference, will be held in Denver, Colorado, October 7-10, 2024. The call for presentations is open now

Optimisation advice for Avro->Parquet merge job

2015-06-04 Thread James Aley
resistance for working with Avro and Parquet. Would there be any advantage to using hadoopRDD() with the appropriate Input/Output formats? Any advice or tips greatly appreciated! James.

Re: Optimisation advice for Avro->Parquet merge job

2015-06-04 Thread James Aley
confident. It seemed to help quite substantially anyway, so perhaps this just needs further tuning? * Increasing executors, RAM, etc. This doesn't make a difference by itself for this job, so I'm thinking we're already not fully utilising the resources we have in a smaller cluster. Again

Running SparkSql against Hive tables

2015-06-05 Thread James Pirz
I am pretty new to Spark; using Spark 1.3.1, I am trying to use Spark SQL to run some SQL scripts on the cluster. I realized that for better performance it is a good idea to use Parquet files. I have 2 questions regarding that: 1) If I want to use Spark SQL against *partitioned & bucketed

spark ssh to slave

2015-06-08 Thread James King
I have two hosts, 192.168.1.15 (Master) and 192.168.1.16 (Worker). These two hosts have exchanged public keys, so they have free access to each other. But when I run /sbin/start-all.sh from 192.168.1.15 I still get 192.168.1.16: Permission denied (publickey,gssapi-keyex,gssapi-with-mic). Any though

Re: spark ssh to slave

2015-06-08 Thread James King
Thanks Akhil, yes that works fine it just lets me straight in. On Mon, Jun 8, 2015 at 11:58 AM, Akhil Das wrote: > Can you do *ssh -v 192.168.1.16* from the Master machine and make sure > its able to login without password? > > Thanks > Best Regards > > On Mon, Jun 8, 2

Re: Running SparkSql against Hive tables

2015-06-08 Thread James Pirz
o) - Any suggestion or hint on how I can do that would be highly appreciated. Thnx On Sun, Jun 7, 2015 at 6:39 AM, Cheng Lian wrote: > > > On 6/6/15 9:06 AM, James Pirz wrote: > > I am pretty new to Spark, and using Spark 1.3.1, I am trying to use 'Spark > SQL' t

Re: Running SparkSql against Hive tables

2015-06-09 Thread James Pirz
t; > What you currently doing is using beeline to connect to hive, which should > work even without spark. > > Best > Ayan > > On Tue, Jun 9, 2015 at 10:42 AM, James Pirz wrote: > >> Thanks for the help! >> I am actually trying Spark SQL to run queries against table

Re: Running SparkSql against Hive tables

2015-06-09 Thread James Pirz
ing a query file with -f flag). Looking at the Spark SQL documentation, it seems that it is possible. Please correct me if I am wrong. On Mon, Jun 8, 2015 at 6:56 PM, Cheng Lian wrote: > > On 6/9/15 8:42 AM, James Pirz wrote: > > Thanks for the help! > I am actually trying Spark SQL to

spark-submit does not use hive-site.xml

2015-06-09 Thread James Pirz
I am using Spark (standalone) to run queries (from a remote client) against data in tables that are already defined/loaded in Hive. I have started metastore service in Hive successfully, and by putting hive-site.xml, with proper metastore.uri, in $SPARK_HOME/conf directory, I tried to share its co

Re: spark-submit does not use hive-site.xml

2015-06-10 Thread James Pirz
to communicate with Hive metastore. > > So your program need to instantiate a > `org.apache.spark.sql.hive.HiveContext` instead. > > Cheng > > > On 6/10/15 10:19 AM, James Pirz wrote: > > I am using Spark (standalone) to run queries (from a remote client) > against d
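Following Cheng's advice, a minimal sketch for a 1.x-era Spark; it assumes hive-site.xml is on the driver's classpath (e.g. in $SPARK_HOME/conf):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("hive-metastore-check"))
val hc = new HiveContext(sc) // picks up hive-site.xml from the classpath
hc.sql("SHOW TABLES").show() // should list the tables known to the metastore
```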

Re: Optimisation advice for Avro->Parquet merge job

2015-06-12 Thread James Aley
Hey Kiran, Thanks very much for the response. I left for vacation before I could try this out, but I'll experiment once I get back and let you know how it goes. Thanks! James. On 8 June 2015 at 12:34, kiran lonikar wrote: > It turns out my assumption on load and unionAll being blo

Help optimising Spark SQL query

2015-06-22 Thread James Aley
e and couldn't find anyone else having similar issues. Many thanks, James.

Re: Help optimising Spark SQL query

2015-06-22 Thread James Aley
Thanks for the responses, guys! Sorry, I forgot to mention that I'm using Spark 1.3.0, but I'll test with 1.4.0 and try the codegen suggestion then report back. On 22 June 2015 at 12:37, Matthew Johnson wrote: > Hi James, > > > > What version of Spark are you using

Re: Help optimising Spark SQL query

2015-06-23 Thread James Aley
se seem to have made any remarkable difference in running time for the query. I'll hook up YourKit and see if we can figure out where the CPU time is going, then post back. On 22 June 2015 at 16:01, Yin Huai wrote: > Hi James, > > Maybe it's the DISTINCT causing the issue. >

Re: Help optimising Spark SQL query

2015-06-30 Thread James Aley
historical data to go sifting through. Turns out we're already writing our data as //, we just missed the "date=" naming convention - d'oh! At least that means a fairly simple rename script should get us out of trouble! Appreciate everyone's tips, thanks again! James. On 23

Streaming: updating broadcast variables

2015-07-02 Thread James Cole
better way than re-creating the JavaStreamingContext and DStreams? Thanks, James
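A commonly suggested pattern, sketched in Scala for brevity (the Java version is analogous): re-create the broadcast from inside foreachRDD, which runs on the driver each batch, instead of rebuilding the whole streaming context. loadReferenceData and shouldRefresh are hypothetical stand-ins:

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.streaming.dstream.DStream

def withRefreshingBroadcast(
    stream: DStream[String],
    loadReferenceData: () => Map[String, Int],
    shouldRefresh: () => Boolean): Unit = {
  var ref: Broadcast[Map[String, Int]] =
    stream.context.sparkContext.broadcast(loadReferenceData())
  stream.foreachRDD { rdd =>
    if (shouldRefresh()) {           // driver-side decision, once per batch
      ref.unpersist()
      ref = rdd.sparkContext.broadcast(loadReferenceData())
    }
    val local = ref                  // capture the current broadcast for the executor closure
    rdd.foreach(r => println(local.value.getOrElse(r, -1)))
  }
}
```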

DataFrame .filter only seems to work when .cache is called in local mode in 1.6.0

2016-03-04 Thread James Hammerton
rmats(NoTypeHints) > > val created = Serialization.read[GMailMessage.Created](eventJson) // > This is where the code crashes if the .cache isn't called Regards, James

Re: DataFrame .filter only seems to work when .cache is called in local mode in 1.6.0

2016-03-04 Thread James Hammerton
Sure thing, I'll see if I can isolate this. Regards. James On 4 March 2016 at 12:24, Ted Yu wrote: > If you can reproduce the following with a unit test, I suggest you open a > JIRA. > > Thanks > > On Mar 4, 2016, at 4:01 AM, James Hammerton wrote: > > Hi, >

Spark reduce serialization question

2016-03-04 Thread James Jia
5.5MB, which is approximately 4 * 330 MB. I know I can set the driver's max result size, but I just want to confirm that this is expected behavior. Thanks! James Stage 0:==>(1 + 3) / 4]16/02/19 05:59:28 ERROR TaskSetManager:

Re: DataFrame .filter only seems to work when .cache is called in local mode in 1.6.0

2016-03-07 Thread James Hammerton
Apache Infrastructure. There doesn't seem to be an option for me to raise an issue for Spark?! Regards, James On 4 March 2016 at 14:03, James Hammerton wrote: > Sure thing, I'll see if I can isolate this. > > Regards. > > James > > On 4 March 2016 at 12:24, Ted Yu

Re: DataFrame .filter only seems to work when .cache is called in local mode in 1.6.0

2016-03-07 Thread James Hammerton
e to choose the project. Regards, James On 7 March 2016 at 13:09, Ted Yu wrote: > Have you tried clicking on Create button from an existing Spark JIRA ? > e.g. > https://issues.apache.org/jira/browse/SPARK-4352 > > Once you're logged in, you should be able to select Spark as

Re: DataFrame .filter only seems to work when .cache is called in local mode in 1.6.0

2016-03-09 Thread James Hammerton
Hi Ted, Finally got round to creating this: https://issues.apache.org/jira/browse/SPARK-13773 I hope you don't mind me selecting you as the shepherd for this ticket. Regards, James On 7 March 2016 at 17:50, James Hammerton wrote: > Hi Ted, > > Thanks for getting back -

Best way to process values for key in sorted order

2016-03-15 Thread James Hammerton
erved on reading the events in, this should work. Anyone know definitively if this is the case? Regards, James

Saving the DataFrame based RandomForestClassificationModels

2016-03-18 Thread James Hammerton
https://issues.apache.org/jira/browse/SPARK-11888 My question is whether there's a workaround given that these bugs are unresolved at least until 2.0.0. Regards, James

Re: best way to do deep learning on spark ?

2016-03-20 Thread James Hammerton
In the meantime there is also deeplearning4j which integrates with Spark (for both Java and Scala): http://deeplearning4j.org/ Regards, James On 17 March 2016 at 02:32, Ulanov, Alexander wrote: > Hi Charles, > > > > There is an implementation of multilayer perceptron in S

Add org.apache.spark.mllib model .predict() method to models in org.apache.spark.ml?

2016-03-22 Thread James Hammerton
being used outside of Spark than the new models at this time. Are there any plans to add the .predict() method back to the models in the new API? Regards, James

Re: Find all invoices more than 6 months from csv file

2016-03-22 Thread James Hammerton
There are add_months(), date_add() and date_sub() methods; the first adds a number of months to a start date (would adding a negative number of months to the current date work?), while the latter two add or subtract a specified number of days to/from a date. These are available from 1.5.0 onwards. Alternatively, outside
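A sketch of the first suggestion; the DataFrame and the "invoice_date" column are hypothetical, and add_months/current_date need 1.5.0+:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{add_months, col, current_date}

// Keep invoices dated more than 6 months before today.
def olderThanSixMonths(df: DataFrame): DataFrame =
  df.filter(col("invoice_date") < add_months(current_date(), -6))
```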

Re: Work out date column in CSV more than 6 months old (datediff or something)

2016-03-22 Thread James Hammerton
scala> "22/03/2016" < "24/02/2015" res4: Boolean = true scala> "22/03/2016" < "04/02/2015" res5: Boolean = false This is the correct result for a string comparison, but it's not the comparison y
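The underlying fix is to parse before comparing. A sketch assuming a string column "date_str" in dd/MM/yyyy format (unix_timestamp and to_date are available from 1.5):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, to_date, unix_timestamp}

// Adds a real DateType column "d" that compares chronologically, not lexically.
def withParsedDate(df: DataFrame): DataFrame =
  df.withColumn("d",
    to_date(unix_timestamp(col("date_str"), "dd/MM/yyyy").cast("timestamp")))
```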

Logistic regression throwing errors

2016-04-01 Thread James Hammerton
errors cause the learning to fail - f1 = 0. Anyone got any idea why this might happen? Regards, James

Re: ML Random Forest Classifier

2016-04-11 Thread James Hammerton
I've not actually tried doing this myself but it looks as if it might work. Regards, James On 11 April 2016 at 10:29, Ashic Mahtab wrote: > Hello, > I'm trying to save a pipeline with a random forest classifier. If I try to > save the pipeline, it complains that the classifier i

Re: ML Random Forest Classifier

2016-04-11 Thread James Hammerton
tegoricalFeatures, numClasses, numFeatures) > > } > > > def toOld(newModel: RandomForestClassificationModel): > OldRandomForestModel = { > > newModel.toOld > > } > > } > Regards, James On 11 April 2016 at 10:36, James Hammerton wrote: > There are met
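Cleaned up from the fragment quoted above, a sketch of the shim: both methods are private[ml], hence the package declaration, and as the thread says this is untested:

```scala
package org.apache.spark.ml.classification

import org.apache.spark.mllib.tree.model.{RandomForestModel => OldRandomForestModel}

object RandomForestShim {
  // Convert the ml model to the mllib one, which does have save/load in 1.6.
  def toOld(newModel: RandomForestClassificationModel): OldRandomForestModel =
    newModel.toOld

  // Rebuild the ml model from a loaded mllib model.
  def fromOld(oldModel: OldRandomForestModel,
              parent: RandomForestClassifier,
              categoricalFeatures: Map[Int, Int],
              numClasses: Int,
              numFeatures: Int): RandomForestClassificationModel =
    RandomForestClassificationModel.fromOld(
      oldModel, parent, categoricalFeatures, numClasses, numFeatures)
}
```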

Re: ML Random Forest Classifier

2016-04-13 Thread James Hammerton
Hi Ashic, Unfortunately I don't know how to work around that - I suggested this line as it looked promising (I had considered it once before deciding to use a different algorithm) but I never actually tried it. Regards, James On 13 April 2016 at 02:29, Ashic Mahtab wrote: > It looks

Will nested field performance improve?

2016-04-15 Thread James Aley
be keen to address that in our ETL pipeline with a flattening step. If it's a known issue that we expect will be fixed in upcoming releases, I'll hold off. Any advice greatly appreciated! Thanks, James.

Re: Error from reading S3 in Scala

2016-05-04 Thread James Hammerton
because of what's said about s3:// and s3n:// here (which is why I use s3a://): https://wiki.apache.org/hadoop/AmazonS3 Regards, James > Besides that you can increase s3 speeds using the instructions mentioned > here: > https://aws.amazon.com/blogs/aws/aws-storage-update-ama

Help understanding an exception that produces multiple stack traces

2016-05-09 Thread James Casiraghi
the lazy evaluation, and that only actions will do this, but the initial stack trace seems to be showing a persist call with underlying executing work. -Thank you. -James Stack Trace: An error occurred while calling o236.persist. : org.apache.spark.sql.catalyst.errors.package$TreeNodeException

Re: dataframe udf functioin will be executed twice when filter on new column created by withColumn

2016-05-11 Thread James Hammerton
This may be related to: https://issues.apache.org/jira/browse/SPARK-13773 Regards, James On 11 May 2016 at 15:49, Ted Yu wrote: > In master branch, behavior is the same. > > Suggest opening a JIRA if you haven't done so. > > On Wed, May 11, 2016 at 6:55 AM, Tony Jin

Spark, Scala, and DNA sequencing

2016-07-22 Thread James McCabe
Hi! I hope this may be of use/interest to someone: Spark, a Worked Example: Speeding Up DNA Sequencing http://scala-bility.blogspot.nl/2016/07/spark-worked-example-speeding-up-dna.html James

Re: Spark, Scala, and DNA sequencing

2016-07-25 Thread James McCabe
me to look into interesting open-source projects like this. James On 24/07/16 09:09, Sean Owen wrote: Also also, you may be interested in GATK, built on Spark, for genomics: https://github.com/broadinstitute/gatk On Sun, Jul 24, 2016 at 7:56 AM, Ofir Manor wrote: Hi James, BTW - if yo

UNSUBSCRIBE

2016-08-09 Thread James Ding
smime.p7s Description: S/MIME cryptographic signature

Re: Extract all the values from describe

2016-02-08 Thread James Barney
Hi Arunkumar, From the Scala documentation it's recommended to use the agg function for performing any actual statistics programmatically on your data. df.describe() is meant only for data exploration. See Aggregator here: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.
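A sketch of the agg() route; "price" is a hypothetical column, and stddev needs 1.6+:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, max, mean, min, stddev}

// Unlike describe()'s string table, agg() hands back values you can compute with.
def priceStats(df: DataFrame) =
  df.agg(mean(col("price")), stddev(col("price")), min(col("price")), max(col("price")))
    .first()
```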

pyspark.DataFrame.dropDuplicates

2016-02-12 Thread James Barney
s which row is kept and which is deleted? First to appear? Or random? I would like to guarantee that the row with the longest list itemsInPocket is kept. How can I do that? Thanks, James
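dropDuplicates keeps an arbitrary row per key, so the guarantee has to be built by hand. One hedged option, shown in Scala with a hypothetical grouping column "key" (the pyspark equivalent is analogous), is a window ordered by list size:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number, size}

def keepLongestList(df: DataFrame): DataFrame = {
  // Rank each key's rows by the length of itemsInPocket, longest first.
  val w = Window.partitionBy("key").orderBy(size(col("itemsInPocket")).desc)
  df.withColumn("rn", row_number().over(w))
    .filter(col("rn") === 1) // keep exactly the longest-list row per key
    .drop("rn")
}
```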

Is this likely to cause any problems?

2016-02-18 Thread James Hammerton
2xlarge and master is m4.large. Could this contribute to any problems running the jobs? Regards, James

Re: Is this likely to cause any problems?

2016-02-18 Thread James Hammerton
I'm fairly new to Spark. The documentation suggests using the spark-ec2 script to launch clusters in AWS, hence I used it. Would EMR offer any advantage? Regards, James On 18 February 2016 at 14:04, Gourav Sengupta wrote: > Hi, > > Just out of sheer curiosity, why are you no

Re: Is this likely to cause any problems?

2016-02-18 Thread James Hammerton
I have now... So far I think the issues I've had are not related to this, but I wanted to be sure in case it should be something that needs to be patched. I've had some jobs run successfully but this warning appears in the logs. Regards, James On 18 February 2016 at 12:23, Ted

Re: Is this likely to cause any problems?

2016-02-19 Thread James Hammerton
http://spark.apache.org/docs/latest/index.html) mentions EC2 but not EMR. Regards, James On 19 February 2016 at 14:25, Daniel Siegmann wrote: > With EMR supporting Spark, I don't see much reason to use the spark-ec2 > script unless it is important for you to be able to launch clus

Count job stalling at shuffle stage on 3.4TB input (but only 5.3GB shuffle write)

2016-02-23 Thread James Hammerton
ut=[]) > TungstenExchange hashpartitioning(objectId#0) > TungstenAggregate(key=[objectId#0], functions=[], output=[objectId#0]) > Scan > CsvRelation(,Some(s3n://gluru-research/data/events.prod.2016-02-04/extractedEventsUncompressed),false, > > ,",null,#,PERMISSIVE,COMMONS,false,false,false,StructType(StructField(objectId,StringType,true), > StructField(eventName,StringType,true), > StructField(eventJson,StringType,true), > StructField(timestampNanos,StringType,true)),false,null)[objectId#0] > > Code Generation: true > > Regards, James

Re: How could I do this algorithm in Spark?

2016-02-24 Thread James Barney
Guillermo, I think you're after an associative algorithm where A is ultimately associated with D, correct? Jakob would be correct if that is a typo; a sort would be all that is necessary in that case. I believe you're looking for something else though, if I understand correctly. This seems like a si

Re: Sample sql query using pyspark

2016-03-01 Thread James Barney
Maurin, I don't know the technical reason why, but try removing the 'limit 100' part of your query. I was trying to do something similar the other week, and what I found is that each executor doesn't necessarily get the same 100 rows. Joins would fail or result in a bunch of nulls when keys weren

Re: How to control the number of parquet files getting created under a partition ?

2016-03-02 Thread James Hammerton
r of partitions in the DataFrame by using coalesce() before saving the data. Regards, James On 1 March 2016 at 21:01, SRK wrote: > Hi, > > How can I control the number of parquet files getting created under a > partition? I have my sqlContext queries to create a table and inse
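A sketch of that suggestion; the 8 is arbitrary, trading output-file count against write parallelism:

```scala
import org.apache.spark.sql.DataFrame

// Cap the number of part-files by reducing partitions just before the write.
def writeFewFiles(df: DataFrame, path: String): Unit =
  df.coalesce(8).write.parquet(path)
```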

Setting executors per worker - Standalone

2015-09-28 Thread James Pirz
Hi, I am using Spark 1.5 (standalone mode) on a cluster with 10 nodes, where each machine has 12GB of RAM and 4 cores. On each machine I have one worker which is running one executor that grabs all 4 cores. I am interested to check the performance with "one worker but 4 executors per machine - each
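Since SPARK-1706 (Spark 1.4), standalone mode can run several executors per worker once spark.executor.cores is set. A sketch for the 4-core/12GB workers described; the memory split is an assumption:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.cores", "1")   // 4 executors on a 4-core worker
  .set("spark.executor.memory", "2g") // keep 4 x 2g comfortably inside 12GB
```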

Re: Setting executors per worker - Standalone

2015-09-28 Thread James Pirz
have 4 cores per worker > > > > On Tue, Sep 29, 2015 at 8:24 AM, James Pirz wrote: > >> Hi, >> >> I am using Spark 1.5 (standalone mode) on a cluster with 10 nodes while >> each machine has 12GB of RAM and 4 cores. On each machine I have one worker

Re: Setting executors per worker - Standalone

2015-09-29 Thread James Pirz
environment tab on the Application UI > > --- > Robin East > *Spark GraphX in Action* Michael Malak and Robin East > Manning Publications Co. > http://www.manning.com/books/spark-graphx-in-action > >

Re: No suitable drivers found for postgresql

2015-11-13 Thread James Nowell
I recently had this same issue. Though I didn't find the cause, I was able to work around it by loading the JAR into HDFS. Once in HDFS, I used the --jars flag with the full hdfs path: --jars hdfs://{our namenode}/tmp/postgresql-9.4-1204-jdbc42.jar James On Fri, Nov 13, 2015 at 10:14 AM s

Re: No suitable drivers found for postgresql

2015-11-13 Thread James Nowell
path /usr/local/share/jupyter/kernels/postgres/postgresql-9.4-1204.jdbc42.jar --executor-memory 1G --total-executor-cores 15 pyspark-shell" James On Fri, Nov 13, 2015 at 12:12 PM Krishna Sangeeth KS < kskrishnasange...@gmail.com> wrote: > ​​ > ​Hi,​ > > I have been trying t

Recursive nested wildcard directory walking in Spark

2015-12-09 Thread James Ding
Hi! My name is James, and I'm working on a question that doesn't seem to have a lot of answers online. I was hoping spark/hadoop gurus could shed some light on this. I have a data feed on NFS that looks like /foobar/.gz Currently I have a spark scala job that calls sparkContext.textFile

Re: Recursive nested wildcard directory walking in Spark

2015-12-09 Thread James Ding
/*.gz). Any thoughts or workarounds? I’m considering using bash globbing to match files recursively and feed hundreds of thousands of arguments to spark-submit. Reasons for/against? From: Ted Yu Date: Wednesday, December 9, 2015 at 3:50 PM To: James Ding Cc: "user@spark.apache.org"
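One workaround along these lines: expand the glob driver-side with the Hadoop FileSystem API and hand textFile a comma-separated list, instead of passing hundreds of thousands of arguments to spark-submit. The nested pattern below is hypothetical, since the original was elided:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext

def loadNested(sc: SparkContext) = {
  val fs = FileSystem.get(sc.hadoopConfiguration)
  // Expand the pattern on the driver; textFile accepts a comma-separated path list.
  val leaves = fs.globStatus(new Path("/foobar/*/*/*.gz")).map(_.getPath.toString)
  sc.textFile(leaves.mkString(","))
}
```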

[POWERED BY] Please add our organization

2015-07-23 Thread Baxter, James
quality analysis and data exploration. Regards James Baxter Technology and Innovation Analyst IS&S Woodside Energy Ltd. Woodside Plaza 240 St Georges Terrace Perth WA 6000 Australia T: +61 8 9348 4218 F: +61 8 9348 6561 E: james.bax...@woodside.com.au

worker and executor memory

2015-08-13 Thread James Pirz
Hi, I am using Spark 1.4 on a cluster (stand-alone mode), across 3 machines, for a workload similar to TPCH (analytical queries with multiple/multi-way large joins and aggregations). Each machine has 12GB of Memory and 4 cores. My total data size is 150GB, stored in HDFS (stored as Hive tables), a

Re: worker and executor memory

2015-08-14 Thread James Pirz
scheduled that way, as it is a map-only job and reading can happen in parallel. On Thu, Aug 13, 2015 at 9:10 PM, James Pirz wrote: > Hi, > > I am using Spark 1.4 on a cluster (stand-alone mode), across 3 machines, > for a workload similar to TPCH (analytical queries with multiple/multi

Repartitioning external table in Spark sql

2015-08-18 Thread James Pirz
I am using Spark 1.4.1 , in stand-alone mode, on a cluster of 3 nodes. Using Spark sql and Hive Context, I am trying to run a simple scan query on an existing Hive table (which is an external table consisting of rows in text files stored in HDFS - it is NOT parquet, ORC or any other richer format)

Feedback: Feature request

2015-08-26 Thread Murphy, James
Hey all, In working with the DecisionTree classifier, I found it difficult to extract rules that could easily facilitate visualization with libraries like D3. So for example, using print(model.toDebugString()), I get the following result: If (feature 0 <= -35.0) If (feature 24 <= 176.0
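A hedged alternative to parsing toDebugString: walk the model's public Node tree and emit nested JSON that D3's hierarchy layouts accept. This sketch handles continuous splits only; categorical splits carry categories rather than a threshold:

```scala
import org.apache.spark.mllib.tree.model.{DecisionTreeModel, Node}

// Recursively render one tree as nested JSON.
def nodeToJson(node: Node): String =
  if (node.isLeaf) {
    s"""{"predict": ${node.predict.predict}}"""
  } else {
    val split = node.split.get
    s"""{"feature": ${split.feature}, "threshold": ${split.threshold}, """ +
      s""""left": ${nodeToJson(node.leftNode.get)}, "right": ${nodeToJson(node.rightNode.get)}}"""
  }

def treeToJson(model: DecisionTreeModel): String = nodeToJson(model.topNode)
```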

RE: Feedback: Feature request

2015-08-28 Thread Murphy, James
This is great and much appreciated. Thank you. - Jim From: Manish Amde [mailto:manish...@gmail.com] Sent: Friday, August 28, 2015 9:20 AM To: Cody Koeninger Cc: Murphy, James; user@spark.apache.org; d...@spark.apache.org Subject: Re: Feedback: Feature request Sounds good. It's a request I

Java UDFs in GROUP BY expressions

2015-09-07 Thread James Aley
illustrate the issue. The equivalent code from Scala seems to work fine for me. Is anyone else seeing this problem? For us, the attached code fails every time on Spark 1.4.1 Thanks, James

Unreachable dead objects permanently retained on heap

2015-09-25 Thread James Aley
bout this? Is there a way to reclaim this memory? Should those arrays be GC'ed when jobs finish? Any guidance greatly appreciated. Many thanks, James.

Anyone attending spark summit?

2016-10-11 Thread Andrew James
Hey, I just found a promo code for Spark Summit Europe that saves 20%. It’s "Summit16" - I love Brussels and just registered! Who’s coming with me to get their Spark on?! Cheers, Andrew

RE: Any NLP library for sentiment analysis in Spark?

2017-04-11 Thread Gabriel James
Me too. Experiences and recommendations please. Gabriel From: Kevin Wang [mailto:buz...@gmail.com] Sent: Wednesday, April 12, 2017 6:11 AM To: Alonso Isidoro Roman Cc: Gaurav1809 ; user@spark.apache.org Subject: Re: Any NLP library for sentiment analysis in Spark? I am also interested

unsubscribe

2018-02-01 Thread James Casiraghi
unsubscribe

Newbie question on how to extract column value

2018-08-07 Thread James Starks
I am very new to Spark. I just successfully set up Spark SQL connecting to a postgresql database, and am able to display a table with the code sparkSession.sql("SELECT id, url from table_a where col_b <> '' ").show() Now I want to perform filter and map functions on the col_b value. In plain scala it would
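In Spark 2.x one way back into plain Scala is a typed Dataset; a sketch that assumes id is numeric and uses a hypothetical transformUrl in place of the real logic:

```scala
import org.apache.spark.sql.SparkSession

def derive(sparkSession: SparkSession): Unit = {
  import sparkSession.implicits._

  def transformUrl(url: String): String = url.toLowerCase // placeholder logic

  sparkSession.sql("SELECT id, url FROM table_a WHERE col_b <> ''")
    .as[(Long, String)]                                // typed (id, url) pairs
    .map { case (id, url) => (id, transformUrl(url)) } // ordinary Scala per row
    .toDF("id", "derived_data")
    .show()
}
```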

Re: Newbie question on how to extract column value

2018-08-07 Thread James Starks
ation on url (id, derived_data) }.show() Thanks for the advice, it's really helpful! ‐‐‐ Original Message ‐‐‐ On August 7, 2018 5:33 PM, Gourav Sengupta wrote: > Hi James, > > It is always advisable to use the latest SPARK version. That said, can you > please gi

Data source jdbc does not support streamed reading

2018-08-08 Thread James Starks
Now my spark job can perform sql operations against the database table. Next I want to combine that with a streaming context, so I am switching to the readStream() function. But after job submission, spark throws Exception in thread "main" java.lang.UnsupportedOperationException: Data source jdbc does no
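jdbc is a batch-only source. One hedged workaround is to poll with repeated batch reads keyed on a monotonically increasing column; every name below (table, column, url) is hypothetical:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// One polling pass: read the table and keep rows past the last watermark.
def pollOnce(spark: SparkSession, jdbcUrl: String, lastSeen: Long): DataFrame =
  spark.read.format("jdbc")
    .option("url", jdbcUrl)
    .option("dbtable", "events")
    .load()
    .filter(col("seq_id") > lastSeen)
```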
