Re: Broadcast Join and Inner Join giving different result on same DataFrame

2017-01-03 Thread Patrick
Hi, an update on the above question: in local[*] mode the code works fine. The broadcast size is 200MB, but on YARN the broadcast join gives an empty result. The SQL query in the UI does show the BroadcastHint, though. Thanks On Fri, Dec 30, 2016 at 9:15 PM, titli batali wrote: > Hi, > > I have two da
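
For context, a minimal sketch of forcing a broadcast join in Scala (the DataFrames and the join column "id" are illustrative placeholders, not the poster's code):

    import org.apache.spark.sql.functions.broadcast

    // Mark the small side for broadcast so Spark replicates it to every
    // executor and performs a map-side join.
    val joined = large.join(broadcast(small), Seq("id"))

A 200MB table is far above the default spark.sql.autoBroadcastJoinThreshold of 10MB, so the hint (or a raised threshold) is what makes Spark attempt the broadcast at all.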

groupByKey vs mapPartitions for efficient grouping within a Partition

2017-01-16 Thread Patrick
Hi, does groupByKey have intelligence associated with it, such that if all the keys reside in the same partition, it does not do the shuffle? Or should the user write mapPartitions (Scala groupBy code)? Which would be more efficient, and what are the memory considerations? Thanks
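
For what it's worth, groupByKey does skip the shuffle when the RDD already carries the same partitioner, since the records are provably co-located. A minimal sketch of the mapPartitions alternative, assuming a hypothetical pre-partitioned RDD of pairs:

    import org.apache.spark.HashPartitioner

    // pairs: RDD[(String, Int)], partitioned so all values of a key co-locate
    val pairs = rawPairs.partitionBy(new HashPartitioner(16))

    // Group inside each partition with no further shuffle; note that toList
    // materializes the whole partition in executor memory.
    val grouped = pairs.mapPartitions { iter =>
      iter.toList.groupBy { case (k, _) => k }.iterator
    }

Either way the memory cost is of the same order: all values for the keys in a partition must fit on one executor.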

Efficient Spark-Sql queries when only nth Column changes

2017-02-18 Thread Patrick
Hi, I have read 5 columns from Parquet into a data frame. My queries on the Parquet table are of the type below: val df1 = sqlContext.sql("select col1, col2, count(*) from table group by col1, col2") val df2 = sqlContext.sql("select col1, col3, count(*) from table group by col1, col3") val df3 = sqlContext.sql(sel

Re: Efficient Spark-Sql queries when only nth Column changes

2017-02-19 Thread Patrick
").cache > > df_base.registerTempTable("df_base") > > val df1 = sqlContext.sql("select col1, col2, count(*) from df_base group > by col1, col2") > > val df2 = // similar logic > > Yong > -- > *From:* Patrick > *Sent:*
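
A runnable version of the approach Yong is suggesting (hypothetical path; Spark 1.6-era API to match the thread): read the five columns once, cache them, and run every group-by against the cached base table so the Parquet scan is paid only once.

    val dfBase = sqlContext.read.parquet("/path/to/table")
      .select("col1", "col2", "col3", "col4", "col5")
      .cache()
    dfBase.registerTempTable("df_base")

    val df1 = sqlContext.sql("select col1, col2, count(*) from df_base group by col1, col2")
    val df2 = sqlContext.sql("select col1, col3, count(*) from df_base group by col1, col3")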

Querying on Deeply Nested JSON Structures

2017-07-15 Thread Patrick
Hi, we need to query a deeply nested JSON structure. However, the query is on a single field at a nested level, such as mean, median, or mode. I am aware of the SQL explode function: df = df_nested.withColumn('exploded', explode(top)) But this is too slow. Is there any other strategy that could give us t
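
If the target field lives in structs rather than arrays, it can be selected by dot path without any explode; a minimal sketch, assuming a hypothetical payload.stats.mean path:

    import org.apache.spark.sql.functions.col

    // Struct fields are addressable directly; explode is only required when
    // the value sits inside an array element.
    val means = dfNested.select(col("payload.stats.mean"))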

Complex JSON Handling in Spark 2.1

2017-07-24 Thread Patrick
Hi, on reading a complex JSON, Spark infers the schema as follows:

    root
     |-- header: struct (nullable = true)
     |    |-- deviceId: string (nullable = true)
     |    |-- sessionId: string (nullable = true)
     |-- payload: struct (nullable = true)
     |    |-- deviceObjects: array (nullable = true)
     |    |

Re: Complex JSON Handling in Spark 2.1

2017-07-24 Thread Patrick
To avoid confusion, the query I am referring to above is over some numeric element inside a: struct (nullable = true). On Mon, Jul 24, 2017 at 4:04 PM, Patrick wrote: > Hi, > > On reading a complex JSON, Spark infers the schema as follows: > > root > |-- header: struc

Re: Nested JSON Handling in Spark 2.1

2017-07-25 Thread Patrick
Hi, I would appreciate some suggestions on how to achieve top-level struct treatment for nested JSON when stored in Parquet format, or any other solutions for best performance using Spark 2.1. Thanks in advance. On Mon, Jul 24, 2017 at 4:11 PM, Patrick wrote: > To avoid confusion, the quer

Re: Complex types projection handling with Spark 2 SQL and Parquet

2017-07-27 Thread Patrick
Hi, I am having the same issue. Has anyone found a solution to this? When I convert the nested JSON to Parquet, I don't see the projection working correctly; it still reads all the nested structure columns. Parquet does support nested column projection. Does Spark 2 SQL provide the column project

Projection Pushdown and Predicate Pushdown in Parquet for Nested Column

2017-08-02 Thread Patrick
Hi, I would like to know whether Spark has support for projection pushdown and predicate pushdown in Parquet for nested columns. I can see two JIRA tasks with PRs: https://issues.apache.org/jira/browse/SPARK-17636 https://issues.apache.org/jira/browse/SPARK-4502 If not, are we seeing these f

Collecting Multiple Aggregation query result on one Column as collectAsMap

2017-08-28 Thread Patrick
Hi, I have two lists: - List one: contains the names of columns on which I want to do aggregate operations. - List two: contains the aggregate operations I want to perform on each column, e.g. (min, max, mean). I am trying to use the Spark 2.0 Dataset API to achieve this. Spark provides an
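
One way to do this with the DataFrame API is to build every (column, aggregate) pair into a single select, so everything runs as one job; a minimal sketch with hypothetical lists:

    import org.apache.spark.sql.functions.expr

    val cols = Seq("colA", "colB")        // list one: column names
    val aggs = Seq("min", "max", "mean")  // list two: aggregate operations

    // One expression per (column, agg) pair, aliased as "column_agg".
    val exprs = for { c <- cols; a <- aggs } yield expr(s"$a($c)").alias(s"${c}_$a")
    val row = df.select(exprs: _*).head()

    // A single Map keyed by alias, e.g. "colA_min" -> value.
    val resultMap = row.getValuesMap[Any](row.schema.fieldNames)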

Re: Collecting Multiple Aggregation query result on one Column as collectAsMap

2017-08-28 Thread Patrick
Ah, does it work with the Dataset API, or do I need to convert it to an RDD first? On Mon, Aug 28, 2017 at 10:40 PM, Georg Heiler wrote: > What about the rdd stat counter? https://spark.apache.org/docs/ > 0.6.2/api/core/spark/util/StatCounter.html > > Patrick schrieb am Mo. 28. Aug. 2

Re: Collecting Multiple Aggregation query result on one Column as collectAsMap

2017-08-28 Thread Patrick
particular column. I was thinking that if we write some custom code which does this in one action (job), that would work for me. On Tue, Aug 29, 2017 at 12:02 AM, Georg Heiler wrote: > Rdd only > Patrick schrieb am Mo. 28. Aug. 2017 um 20:13: > >> Ah, does it work with Dataset A

Builder Pattern used by Spark source code architecture

2017-09-18 Thread Patrick
Hi, a lot of Spark's code base is built on the Builder pattern, so I was wondering what benefits the Builder pattern brings to Spark. Some things that come to mind: it is easy on garbage collection and gives user-friendly APIs. Are there any other advantages with code running on dis

Out of memory Error when using Collection Accumulator Spark 2.2

2018-02-26 Thread Patrick
Hi, we were getting an OOM error when accumulating the results of each worker. We were trying to avoid collecting data to the driver node, and instead used an accumulator as per the code snippet below. Is there any Spark config to set the accumulator settings, or am I going the wrong way to collect the huge d
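
There is no accumulator setting that avoids this: every element added to a CollectionAccumulator is shipped back to, and held in, driver memory, so it OOMs much like a collect would. A minimal sketch of the pattern for reference (hypothetical column), with the usual fix being an executor-side aggregation instead:

    // All added values end up in a list on the driver.
    val acc = spark.sparkContext.collectionAccumulator[String]("results")
    df.foreach(row => acc.add(row.getString(0)))

    // Safer: reduce on the executors and bring back only the summary,
    // e.g. df.groupBy(...).agg(...) or rdd.aggregate(...).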

Spark Mllib logistic regression setWeightCol illegal argument exception

2020-01-09 Thread Patrick
Hi Spark users, I am trying to solve a class imbalance problem. I figured out that Spark supports setting a weight column in its API, but I get an IllegalArgumentException that the weight column does not exist, although it does exist in the dataset. Any recommendation on how to go about this problem? I am using the Pipeline API with Logist
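
The weight column has to exist in the exact DataFrame passed to fit(); a minimal sketch, assuming hypothetical "label"/"features" columns, that derives a weight column for the imbalance and points the estimator at it:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.sql.functions.{col, when}

    val posFraction = df.filter(col("label") === 1.0).count().toDouble / df.count()

    // Up-weight the rarer class so both classes contribute equally overall.
    val weighted = df.withColumn("classWeight",
      when(col("label") === 1.0, 1.0 - posFraction).otherwise(posFraction))

    val lr = new LogisticRegression().setWeightCol("classWeight")

    val model = new Pipeline().setStages(Array(lr)).fit(weighted)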

Building Spark 3.0.0 for Hive 1.2

2020-07-10 Thread Patrick McCarthy
on/pyspark/sql/session.py", line 191, in getOrCreate
    session._jsparkSession.sessionState().conf().setConfString(key, value)
  File "/home/pmccarthy/custom-spark-3/python/lib/py4j-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/home/pmccarthy/custom-spark-3/p

Re: Issue in parallelization of CNN model using spark

2020-07-14 Thread Patrick McCarthy
> Best regards > Mukhtaj -- Patrick McCarthy, Senior Data Scientist, Machine Learning Engineering, Dstillery, 470 Park Ave South, 17th Floor, NYC 10016

Re: [Spark SQL]: Can't write DataFrame after using explode function on multiple columns.

2020-08-03 Thread Patrick McCarthy
-- Patrick McCarthy, Senior Data Scientist, Machine Learning Engineering, Dstillery, 470 Park Ave South, 17th Floor, NYC 10016

Re: [Spark SQL]: Can't write DataFrame after using explode function on multiple columns.

2020-08-03 Thread Patrick McCarthy
ning those columns with list comprehensions forming a single select() statement makes for a smaller DAG. On Mon, Aug 3, 2020 at 10:06 AM Henrique Oliveira wrote: > Hi Patrick, thank you for your quick response. > That's exactly what I think. Actually, the result of this processing

Re: regexp_extract regex for extracting the columns from string

2020-08-10 Thread Patrick McCarthy
f function. > Apart from a UDF, is there any way to achieve it? > > Thanks > > -- > Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

Re: Hive using Spark engine vs native spark with hive integration.

2020-10-07 Thread Patrick McCarthy
rol some of the > performance features, for example things like caching/evicting etc. > > Any advice on this is much appreciated. > > Thanks, > -Manu -- Patrick McCarthy, Senior Data Scientist, Machine Learning Engineering, Dstillery, 470 Park Ave South, 17th Floor, NYC 10016

Re: [Spark Core] Vectorizing very high-dimensional data sourced in long format

2020-10-30 Thread Patrick McCarthy
) > > Are there other Spark patterns that I should attempt in order to achieve > my end goal of a vector of attributes for every entity? > > Thanks, Daniel -- Patrick McCarthy, Senior Data Scientist, Machine Learning Engineering, Dstillery, 470 Park Ave South, 17th Floor, NYC 10016

Re: Issue while installing dependencies Python Spark

2020-12-17 Thread Patrick McCarthy
ath/to/venv/bin/python3 --conf > spark.pyspark.driver.python=/path/to/venv/bin/python3 > > This did not help either. > > Kind Regards, > Sachit Murarka -- Patrick McCarthy, Senior Data Scientist, Machine Learning Engineering, Dstillery, 470 Park Ave South, 17th Floor, NYC 10016

Re: Getting error message

2020-12-17 Thread Patrick McCarthy
the > system, program starts running fine. > This error goes away on > > On Thu, 17 Dec 2020, 23:50 Patrick McCarthy, > wrote: > >> my-domain.com/192.168.166.8:63534 probably isn't a valid address on your >> network, is it? >> >> On Thu, Dec 17, 2020

Re: Getting error message

2020-12-17 Thread Patrick McCarthy
running code in a local machine that is single node machine. > > Getting into logs, it looked like the host is killed. This is happening > very frequently and I am unable to find the reason for this. > > Could low memory be the reason? > > On Fri, 18 Dec 2020, 00:11 Patrick

Re: Issue while installing dependencies Python Spark

2020-12-18 Thread Patrick McCarthy
risk? In either case you move about the same number of bytes around. On Fri, Dec 18, 2020 at 3:04 PM Sachit Murarka wrote: > Hi Patrick/Users, > > I am exploring wheel file form packages for this , as this seems simple:- > > > https://bytes.grubhub.com/managing-dependencie

Profiling options for PandasUDF (2.4.7 on yarn)

2021-05-28 Thread Patrick McCarthy
lumns of (count, row_id, column_id). It works at small scale but gets unstable as I scale up. Is there a way to profile this function in a Spark session, or am I limited to profiling on pandas data frames without Spark? -- Patrick McCarthy, Senior Data Scientist, Machine Learning Engine

[Spark Sql] Global Setting for Case-Insensitive String Compare

2022-11-12 Thread Patrick Tucci
'ABC' like 'Ab%' false Time taken: 5.439 seconds, Fetched 1 row(s) Desired behavior would be true for all of the above with the proposed case-insensitive flag set. Thanks, Patrick

RE: [Spark Sql] Global Setting for Case-Insensitive String Compare

2022-11-21 Thread Patrick Tucci
Is this the wrong list for this type of question? On 2022/11/12 16:34:48 Patrick Tucci wrote: > Hello, > > Is there a way to set string comparisons to be case-insensitive globally? I > understand LOWER() can be used, but my codebase contains 27k lines of SQL > and many string

RE: Re: [Spark Sql] Global Setting for Case-Insensitive String Compare

2022-11-22 Thread Patrick Tucci
Thanks. How would I go about formally submitting a feature request for this? On 2022/11/21 23:47:16 Andrew Melo wrote: > I think this is the right place, just a hard question :) As far as I > know, there's no "case insensitive flag", so YMMV > > On Mon, Nov 21, 20

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Patrick Tucci
Window functions don't work like traditional GROUP BYs. They allow you to partition data and pull any relevant column, whether it's used in the partition or not. I'm not sure what the syntax is for PySpark, but the standard SQL would be something like this: WITH InputData AS ( SELECT 'USA' Coun
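
The Dataset equivalent leans on the same window; a minimal sketch in Scala with hypothetical "group" and "score" columns:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, row_number}

    // Rank rows within each group, best score first, and keep only the top row.
    val w = Window.partitionBy("group").orderBy(col("score").desc)

    val best = df.withColumn("rn", row_number().over(w))
      .filter(col("rn") === 1)
      .drop("rn")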

Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Patrick Tucci
ATE TABLE operation should take more than 24x longer than a simple SELECT COUNT(*) statement. Thanks for any help. Please let me know if I can provide any additional information. Patrick [Attachment: Create Table.sql]

Re: Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Patrick Tucci
allel. The same CTAS query only took about 45 minutes. This is still a bit slower than I had hoped, but the import from bzip fully utilized all available cores. So we can give the cluster more resources if we need the process to go faster. Patrick On Mon, Jun 26, 2023 at 12:52 PM Mich Talebzadeh

Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-29 Thread Patrick Tucci
10.0.50.1:8020/user/spark/warehouse/eventclaims. Is it possible to have multiple concurrent writers to the same table with Spark SQL? Is there any way to make this work? Thanks for the help. Patrick

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Patrick Tucci
hanks, Patrick On Sun, Jul 30, 2023 at 5:30 AM Pol Santamaria wrote: > Hi Patrick, > > You can have multiple writers simultaneously writing to the same table in > HDFS by utilizing an open table format with concurrency control. Several > formats, such as Apache Hudi, Apache Iceb

Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
PILED, but no stages or tasks are executing or pending: [image: image.png] I've let the query run for as long as 30 minutes with no additional stages, progress, or errors. I'm not sure where to start troubleshooting. Thanks for your help, Patrick

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
0.50.1:1 -n hadoop -f command.sql Thanks again for your help. Patrick On Thu, Aug 10, 2023 at 2:24 PM Mich Talebzadeh wrote: > Can you run this sql query through hive itself? > > Are you using this command or similar for your thrift server? > > beeline

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
hich was part of the reason why I chose it. Thanks again for the reply, I truly appreciate your help. Patrick On Thu, Aug 10, 2023 at 3:43 PM Mich Talebzadeh wrote: > sorry host is 10.0.50.1 > > Mich Talebzadeh, > Solutions Architect/Engineering Lead > London > United Kingdom >

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-11 Thread Patrick Tucci
local-mode-and-connect-to-delta-using-jdbc Thanks again to everyone who replied for their help. Patrick On Fri, Aug 11, 2023 at 2:14 AM Mich Talebzadeh wrote: > Steve may have a valid point. You raised an issue with concurrent writes > before, if I recall correctly. Since this limitation

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Patrick Tucci
g to migrate to Delta Lake and see if that solves the issue. Thanks again for your feedback. Patrick On Fri, Aug 11, 2023 at 10:09 AM Mich Talebzadeh wrote: > Hi Patrick, > > There is not anything wrong with Hive On-premise it is the best data > warehouse there is > > Hive handles

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Patrick Tucci

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-13 Thread Patrick Tucci

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
ssage that the driver didn't have enough memory to broadcast objects. After increasing the driver memory, the query runs without issue. I hope this can be helpful to someone else in the future. Thanks again for the support, Patrick On Sun, Aug 13, 2023 at 7:52 AM Mich Talebzadeh wrote: >

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
ly to this thread if the issue comes up again (hopefully it doesn't!). Thanks again, Patrick On Thu, Aug 17, 2023 at 1:54 PM Mich Talebzadeh wrote: > Hi Patrik, > > glad that you have managed to sort this problem out. Hopefully it will go > away for good. > > Still we are i

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
acquires all available cluster resources when it starts. This is okay; as of right now, I am the only user of the cluster. If I add more users, they will also be SQL users, submitting queries through the Thrift server. Let me know if you have any other questions or thoughts. Thanks, Patrick On Thu

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
On Thu, 17 Aug 2023 at 21:01, Patrick Tucci wrote: > >> Hi Mich, >> >> Here are my config values from spark-defaults.conf: >> >> spark.eventLog.enabled true >> spark.eventLog.dir hdfs:/

Re: Spark stand-alone mode

2023-09-15 Thread Patrick Tucci
I use Spark in standalone mode. It works well, and the instructions on the site are accurate for the most part. The only thing that didn't work for me was the start-all.sh script. Instead, I use a simple script that starts the master node, then uses SSH to connect to the worker machines and start t

Re: Spark stand-alone mode

2023-09-19 Thread Patrick Tucci
Multiple applications can run at once, but you need to either configure Spark or your applications to allow that. In stand-alone mode, each application attempts to take all resources available by default. This section of the documentation has more details: https://spark.apache.org/docs/latest/spar
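
For example, a minimal sketch (illustrative values) of capping one application's share of a standalone cluster via spark.cores.max:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("spark://master:7077")
      .appName("shared-cluster-app")
      .config("spark.cores.max", "4")          // total cores this app may take
      .config("spark.executor.memory", "2g")
      .getOrCreate()

Without spark.cores.max (or a spark.deploy.defaultCores set on the master), the first application grabs every core and later submissions queue.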

Re: Spark join produce duplicate rows in resultset

2023-10-22 Thread Patrick Tucci
select rev.* to select I.*. This will show you the records from item that the join produces. If the first part of the code only returns one record, I expect you will see 4 distinct records returned here. Thanks, Patrick On Sun, Oct 22, 2023 at 1:29 AM Meena Rajani wrote: > Hello all: > &

unsubscribe

2023-11-09 Thread Duflot Patrick
unsubscribe

Re: Fully in-memory shuffles

2015-06-10 Thread Patrick Wendell
In many cases the shuffle will actually hit the OS buffer cache and not ever touch spinning disk if it is a size that is less than memory on the machine. - Patrick On Wed, Jun 10, 2015 at 5:06 PM, Corey Nolet wrote: > So with this... to help my understanding of Spark under the hood- > >

[ANNOUNCE] Announcing Spark 1.4

2015-06-11 Thread Patrick Wendell
Hi All, I'm happy to announce the availability of Spark 1.4.0! Spark 1.4.0 is the fifth release on the API-compatible 1.X line. It is Spark's largest release ever, with contributions from 210 developers and more than 1,000 commits! A huge thanks go to all of the individuals and organizations invo

Re: Fully in-memory shuffles

2015-06-11 Thread Patrick Wendell
tle because the job will fail if shuffle output exceeds memory. - Patrick On Wed, Jun 10, 2015 at 9:50 PM, Davies Liu wrote: > If you have enough memory, you can put the temporary work directory in > tempfs (in memory file system). > > On Wed, Jun 10, 2015 at 8:43 PM, Corey Nolet wro

Dynamic allocator requests -1 executors

2015-06-12 Thread Patrick Woody
Hey all, I've recently run into an issue where spark dynamicAllocation has asked for -1 executors from YARN. Unfortunately, this raises an exception that kills the executor-allocation thread and the application can't request more resources. Has anyone seen this before? It is spurious and the appl

Re: Dynamic allocator requests -1 executors

2015-06-13 Thread Patrick Woody
Hey Sandy, I'll test it out on 1.4. Do you have a bug number or PR that I could reference as well? Thanks! -Pat Sent from my iPhone > On Jun 13, 2015, at 11:38 AM, Sandy Ryza wrote: > > Hi Patrick, > > I'm noticing that you're using Spark 1.3.1. We fixed a b

Get Spark version before starting context

2015-07-04 Thread Patrick Woody
Hey all, Is it possible to reliably get the version string of a Spark cluster prior to trying to connect via the SparkContext on the client side? Most of the errors I've seen on mismatched versions have been cryptic, so it would be helpful if I could throw an exception earlier. I know it is conta

Re: Get Spark version before starting context

2015-07-04 Thread Patrick Woody
To somewhat answer my own question - it looks like an empty request to the rest API will throw an error which returns the version in JSON as well. Still not ideal though. Would there be any objection to adding a simple version endpoint to the API? On Sat, Jul 4, 2015 at 4:00 PM, Patrick Woody

SparkHub: a new community site for Apache Spark

2015-07-10 Thread Patrick Wendell
Spark community, and we welcome input from you as well! - Patrick

Announcing Spark 1.4.1!

2015-07-15 Thread Patrick Wendell
spark-release-1-4-1.html Comprehensive list of fixes - http://s.apache.org/spark-1.4.1 Thanks to the 85 developers who worked on this release! Please contact me directly for errata in the release notes. - Patrick

Streaming WriteAheadLogBasedBlockHandler disallows parallelism via StorageLevel replication factor

2016-04-13 Thread Patrick McGloin
distributed more evenly. But won't that "waste" 4 cores in our example, where one would do? Best regards, Patrick

Re: Behaviour of RDD sampling

2016-05-31 Thread Patrick Baier
plemented this, and in the end it took as long as loading the whole > data set. > So I was wondering if Spark is still loading the whole dataset from disk > and does the filtering afterwards? > If this is the case, why does Spark not push down the filtering and l

Create external table with partitions using sqlContext.createExternalTable

2016-06-14 Thread Patrick Duin
e options: Map[String, String] parameter? Thanks, Patrick

Re: Create external table with partitions using sqlContext.createExternalTable

2016-06-14 Thread Patrick Duin
" > sql(sqltext) > sql("select count(1) from test.orctype").show > > res2: org.apache.spark.sql.DataFrame = [result: string] > +---+ > |_c0| > +---+ > | 0| > +---+ > > HTH > > Dr Mich Talebzadeh > > LinkedIn: > https://ww

Spark 1.6.2 short circuit AND filter broken

2016-07-07 Thread Patrick Woody
Hey all, I hit a pretty nasty bug on 1.6.2 that I can't reproduce on 2.0. Here is the code/logical plan http://pastebin.com/ULnHd1b6. I have filterPushdown disabled, so when I call collect here it hits the Exception in my UDF before doing a null check on the input. I believe it is a symptom of ho

DataFrame partitionBy to a single Parquet file (per partition)

2016-01-14 Thread Patrick McGloin
e. I could rewrite this to do the partitioning manually (using filter with the distinct partition values for example) before calling coalesce. But is there a better way to do this using the standard Spark SQL API? Best regards, Patrick
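
One pattern that yields a single file per output partition without a global coalesce (and which the reply below also shows): repartition by the same columns used in partitionBy, so each partition's rows collapse into one task before the write. A minimal sketch with hypothetical names; repartitioning by expressions needs Spark 1.6+:

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.functions.col

    df.repartition(col("entity"), col("year"), col("month"), col("day"), col("status"))
      .write
      .partitionBy("entity", "year", "month", "day", "status")
      .mode(SaveMode.Append)
      .parquet(location)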

Re: DataFrame partitionBy to a single Parquet file (per partition)

2016-01-15 Thread Patrick McGloin
ot;entity", "year", "month", "day", > "status").write.partitionBy("entity", "year", "month", "day", > "status").mode(SaveMode.Append).parquet(s"$location") > > (Unfortunately the nam

Spark Streaming Write Ahead Log (WAL) not replaying data after restart

2016-01-21 Thread Patrick McGloin
are processed but none from the previous batch. The old WAL logs are cleared and I see log messages like this but the old data does not get processed. INFO WriteAheadLogManager : Recovered 1 write ahead log files from hdfs://myhdfsserver/user/spark/checkpoint/DStreamResilienceTest/receivedData/0 What am I doing wrong? I am using Spark 1.5.2. Best regards, Patrick

Getting top distinct strings from arraylist

2016-01-25 Thread Patrick Plaatje
strings

    JavaRDD<ArrayList<String>> transactions = sc.textFile(dataPath).map(
        new Function<String, ArrayList<String>>() {
            private static final long serialVersionUID = 1L;
            @Override
            public ArrayList<String> call(String s) {
                return Lists.newArrayList(s.split(" "));
            }
        }
    );

Any ideas? Thanks! Patrick

Re: Spark Streaming Write Ahead Log (WAL) not replaying data after restart

2016-01-26 Thread Patrick McGloin
configure-checkpointing > > On Thu, Jan 21, 2016 at 3:32 AM, Patrick McGloin < > mcgloin.patr...@gmail.com> wrote: > >> Hi all, >> >> To have a simple way of testing the Spark Streaming Write Ahead Log I >> created a very simple Custom Input Receiver, which wi

“java.io.IOException: Class not found” on long running Streaming application

2016-01-28 Thread Patrick McGloin
I am getting the exception below on a long running Spark Streaming application. The exception could occur after a few minutes, but it may also may not happen for days. This is with pretty consistent input data. I have seen this Jira ticket ( https

Understanding Spark Task failures

2016-01-28 Thread Patrick McGloin
I am trying to understand what will happen when Spark has an exception during processing, especially while streaming. If I have a small code snippet like this:

    myDStream.foreachRDD { (rdd: RDD[String]) =>
      println(s"processed => [${rdd.collect().toList}]")
      throw new Exception("User exception...

Re: Understanding Spark Task failures

2016-01-28 Thread Patrick McGloin
that app code. > > On Thu, Jan 28, 2016 at 8:51 AM, Patrick McGloin < > mcgloin.patr...@gmail.com> wrote: > >> I am trying to understand what will happen when Spark has an exception >> during processing, especially while streaming. >> >> If I have a small co

Re: Recommended storage solution for my setup (~5M items, 10KB pr.)

2016-02-04 Thread Patrick Skjennum
econd attempt, but this time around I'll ask for help:p -- mvh Patrick Skjennum On 04.02.2016 22.14, Ted Yu wrote: bq. had a hard time setting it up Mind sharing your experience in more detail :-) If you already have a hadoop cluster, it should be relatively straight forward to setup

Re: newbie unable to write to S3 403 forbidden error

2016-02-13 Thread Patrick Plaatje
Not sure if it’s related, but in our Hadoop configuration we’re also setting sc.hadoopConfiguration().set("fs.s3.impl","org.apache.hadoop.fs.s3native.NativeS3FileSystem"); Cheers, -patrick From: Andy Davidson Date: Friday, 12 February 2016 at 17:34 To: Igor Berman Cc

RE: Submitting Jobs Programmatically

2016-02-21 Thread Patrick Mi
Hi there, I had a similar problem in Java with a standalone cluster on Linux, but got it working by passing the following option: -Dspark.jars=file:/path/to/sparkapp.jar sparkapp.jar contains the launch application. Hope that helps. Regards, Patrick -Original Message- From: Arko Provo

Re: Spark UI consuming lots of memory

2015-10-27 Thread Patrick McGloin
Hi Nicholas, I think you are right about the issue relating to SPARK-11126; I'm seeing it as well. Did you find any workaround? Looking at the pull request for the fix, it doesn't look possible. Best regards, Patrick On 15 October 2015 at 19:40, Nicholas Pritchard < ni

Data in one partition after reduceByKey

2015-11-20 Thread Patrick McGloin
duceByKey results in data in only one partition. Thanks, Patrick

Re: Data in one partition after reduceByKey

2015-11-23 Thread Patrick McGloin
h + dateTime.getMinuteOfDay + dateTime.getSecondOfDay sum % numPartitions case _ => 0 } } } On 20 November 2015 at 17:17, Patrick McGloin wrote: > Hi, > > I have Spark application which contains the following segment: > > val reparitioned = rdd.repartition(16) > val

Driver Hangs before starting Job

2015-12-01 Thread Patrick Brown
still orders of magnitude larger than any processing time. Thanks, Patrick

Class weights and prediction probabilities in random forest?

2015-07-23 Thread Patrick Crenshaw
I was just wondering if there were plans to implement class weights and prediction probabilities in random forest? Is anyone working on this?

Re: Twitter4J streaming question

2015-07-23 Thread Patrick McCarthy
How can I tell if it's the sample stream or the full stream? Thanks Sent from my iPhone On Jul 23, 2015, at 4:17 PM, Enno Shioji wrote: You are probably listening to the sample stream, and THEN filtering. This means you listen to 1% of the twitter stream, and then looki

Re: Twitter4J streaming question

2015-07-23 Thread Patrick McCarthy
Ahh, makes sense. Thanks for the help. Sent from my iPhone On Jul 23, 2015, at 4:29 PM, Enno Shioji wrote: You need to pay a lot of money to get the full stream, so unless you are doing that, it's the sample stream! On Thu, Jul 23, 2015 at 9:26 PM, Patri

Re: Extremely poor predictive performance with RF in mllib

2015-08-04 Thread Patrick Lam
resInfo={}, numTrees=100, seed=422) >>> >>> rf_predict = rf.predict(predict_feat) >>> >>> rf_predict.sum() >>> 0.0 >>> >>> This code was all run back to back so I didn't change anything in >>> between. >>> Does anybody

Can't use RandomForestClassificationModel.predict(Vector v) in Scala

2016-12-15 Thread Patrick Chen
Index(), indexes, values); double result = rfModel.predict(v) But when I changed to Scala, I couldn't use the predict method in Classifier anymore. How should I use the RandomForestClassificationModel in Scala? Thanks, BR Patrick

Re: FP growth - Items in a transaction must be unique

2017-02-02 Thread Patrick Plaatje
Hi, this indicates you have duplicate products per row in your dataframe. The FP implementation only allows unique products per row, so you will need to dedupe duplicate products before running the FPGrowth algorithm. Best, Patrick From: "Devi P.V" Date: Thursday, 2 Feb
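
A minimal sketch of the dedupe, assuming the transactions are held as an RDD[Array[String]] for the mllib FPGrowth:

    import org.apache.spark.mllib.fpm.FPGrowth

    // FPGrowth rejects a basket containing the same item twice, so make
    // each transaction's items unique first.
    val deduped = transactions.map(_.distinct)

    val model = new FPGrowth()
      .setMinSupport(0.01)
      .run(deduped)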

Re: Market Basket Analysis by deploying FP Growth algorithm

2017-04-05 Thread Patrick Plaatje
the FPGrowth implementation starts spilling over to disk, and we had to increase the /tmp partition. Hope it helps. BR, -patrick On 05/04/2017, 10:29, "asethia" wrote: Hi, We are currently working on a Market Basket Analysis by deploying FP Growth algorithm o

Fwd: ERROR Dropping SparkListenerEvent

2017-04-13 Thread Patrick Gomes
the scheduler. This happens when I call the fit method of Count Vectorizer on a fairly small dataset (< 20 GB). Running on a cluster with 5 nodes (c3.8xlarge), Spark 2.1, and Hadoop 2.7. If there is anything else that would be helpful to know just let me know and I can include it. Best, Patrick

Memory problems with simple ETL in Pyspark

2017-04-14 Thread Patrick McCarthy
Hello, I'm trying to build an ETL job which takes in 30-100gb of text data and prepares it for SparkML. I don't speak Scala so I've been trying to implement in PySpark on YARN, Spark 2.1. Despite the transformations being fairly simple, the job always fails by running out of executor memory. The

Re: Memory problems with simple ETL in Pyspark

2017-04-16 Thread Patrick McCarthy
> On Sun, 16 Apr 2017 at 11:06 am, ayan guha wrote: > >> It does not look like scala vs python thing. How big is your audience >> data store? Can it be broadcasted? >> >> What is the memory footprint you are seeing? At what point yarn is >> killing? Depenedi

Maximum Partitioner size

2017-04-20 Thread Patrick GRANDJEAN
Hi, I have implemented a custom Partitioner (org.apache.spark.Partitioner) that contains a medium-sized object (some megabytes). Unfortunately Spark (2.1.0) fails with a StackOverflowError, and I suspect it is because of the size of the partitioner that needs to be serialized. My question is, wh

Structured Streaming + initialState

2017-05-05 Thread Patrick McGloin
the Spark Streaming StateSpec did). Are there any plans to add support for initial states? Or is there already a way to do so? Best regards, Patrick

Re: Structured Streaming + initialState

2017-05-06 Thread Patrick McGloin
n a database, then when initialize the GroupState, you can fetch > it from the database. > > On Fri, May 5, 2017 at 7:35 AM, Patrick McGloin > wrote: > >> Hi all, >> >> With Spark Structured Streaming, is there a possibility to set an >> "initial state"

Re: Is there a Kafka sink for Spark Structured Streaming

2017-05-19 Thread Patrick McGloin
# Write key-value data from a DataFrame to a Kafka topic specified in an option
    query = df \
      .selectExpr("CAST(userId AS STRING) AS key", "to_json(struct(*)) AS value") \
      .writeStream \
      .format("kafka") \
      .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
      .option("topic", "to

Re: GC overhead exceeded

2017-08-18 Thread Patrick Alwell
+1. What is the executor memory? You may need to adjust executor memory and cores. For the sake of simplicity: each executor can handle 5 concurrent tasks and should have 5 cores. So if your cluster has 100 cores, you'd have 20 executors. And if your cluster memory is 500GB, each executor would h
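
A worked instance of that arithmetic as submit-time configuration (illustrative numbers, with some memory held back for YARN overhead):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .config("spark.executor.instances", "20")  // 100 cores / 5 per executor
      .config("spark.executor.cores", "5")
      .config("spark.executor.memory", "21g")    // ~500GB / 20, minus overhead headroom
      .getOrCreate()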

Re: How to authenticate to ADLS from within spark job on the fly

2017-08-19 Thread Patrick Alwell
This might help; I’ve built a REST API with livyServer: https://livy.incubator.apache.org/ From: Steve Loughran Date: Saturday, August 19, 2017 at 7:05 AM To: Imtiaz Ahmed Cc: "user@spark.apache.org" Subject: Re: How to authenticate to ADLS from within spark job on the fly On 19 Aug 2017,

Re: Apache Spark: Parallelization of Multiple Machine Learning Algorithms

2017-09-05 Thread Patrick McCarthy
You might benefit from watching this JIRA issue - https://issues.apache.org/jira/browse/SPARK-19071 On Sun, Sep 3, 2017 at 5:50 PM, Timsina, Prem wrote: > Is there a way to parallelize multiple ML algorithms in Spark. My use case > is something like this: > > A) Run multiple machine learning alg

Re: CSV write to S3 failing silently with partial completion

2017-09-07 Thread Patrick Alwell
Sounds like an S3 bug. Can you replicate locally with HDFS? Try using S3a protocol too; there is a jar you can leverage like so: spark-submit --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 my_spark_program.py EMR can sometimes be buggy. :/ You could also try le

[Spark Dataframe] How can I write a correct filter so the Hive table partitions are pruned correctly

2017-09-13 Thread Patrick Duin
In our case we have a lot of partitions on the table, and the calls that return all the partitions take minutes as well as causing memory issues. Is this a bug, or is there a better way of doing the filter call? Thanks, Patrick PS: Sorry for cross-posting; I wasn't sure if the user

PySpark - Expand rows into dataframes via function

2017-10-02 Thread Patrick McCarthy
9', '161', 'ff26920a408f15613096aa7fe0ddaa57'], ['23', '239', '162', 'ff26920a408f15613096aa7fe0ddaa57'], ... I have the input lookup table in a pyspark DF, and a python function to do the conversion into the mapped output. I think to produce the full mapping I need a UDTF but this concept doesn't seem to exist in PySpark. What's the best approach to do this mapping and recombine into a new DataFrame? Thanks, Patrick
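
Scala's Dataset API exposes this shape directly as flatMap (one row in, many rows out), which is what a UDTF would do; a minimal sketch with hypothetical case classes standing in for the real schema:

    // lookupDs: Dataset[Lookup]; requires spark.implicits._ in scope
    case class Lookup(device: String, ids: String)
    case class Mapped(device: String, id: String)

    val expanded = lookupDs.flatMap { r =>
      r.ids.split(",").map(id => Mapped(r.device, id))
    }

In PySpark the same effect comes from df.rdd.flatMap(...) converted back to a DataFrame, or from an array-returning UDF followed by explode.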
