Spark checkpointing in batch mode fault tolerance problem

2025-07-05 Thread Martin Aras
Hi, I am new to Apache Spark. I created a Spark job that reads data from a MySQL database, does some processing on it, and then commits it to another table. The odd thing I faced was that Spark reads all the data from the table when I use `sparkSession.read.jdbc` and `sparkDf.rdd.map` *wai
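
Without partitioning options, `read.jdbc` returns a single-partition DataFrame, so one task scans the whole table. A minimal Scala sketch of the partitioned overload (URL, table, column, and bounds are hypothetical):

```scala
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-read").getOrCreate()
val props = new Properties()
props.setProperty("user", "dbuser")     // hypothetical credentials
props.setProperty("password", "secret")

// One bounded query per partition instead of one unbounded scan:
// a WHERE clause on the split column is generated for each partition.
val df = spark.read.jdbc(
  "jdbc:mysql://dbhost:3306/mydb", // url
  "source_table",                  // table
  "id",                            // numeric column to split on
  1L,                              // lowerBound
  1000000L,                        // upperBound
  8,                               // numPartitions
  props)
```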

4.1.0 release timeline

2025-03-13 Thread Martin Bielik
Hello everyone, I would like to ask what the estimated timeline is for the release of version 4.1.0. We are particularly interested in SPARK-51434, as it is currently blocking integration of Spark into our application. Thank you! Kind regards, Martin

Re: Write Spark Connection client application in Go

2023-09-13 Thread Martin Grund
>> df.Write().Mode("overwrite"). >> Format("parquet"). >> Save("file:///tmp/spark-connect-write-example-output.parquet") >> >> df = spark.Read().Format("parquet"). >> Load("

[Feature Request] make unix_micros() and unix_millis() available in PySpark (pyspark.sql.functions)

2022-10-14 Thread Martin
so available in PySpark. Cheers, Martin

Re: Moving to Spark 3x from Spark2

2022-09-01 Thread Martin Andersson
You should check the release notes and upgrade instructions. From: rajat kumar Sent: Thursday, September 1, 2022 12:44 To: user @spark Subject: Moving to Spark 3x from Spark2

Question regarding checkpointing with kafka structured streaming

2022-08-22 Thread Martin Andersson
I was looking around for some documentation regarding how checkpointing (or rather, delivery semantics) is done when consuming from kafka with structured streaming and I stumbled across this old documentation (that still somehow exists in latest versions) at https://spark.apache.org/docs/latest

unsubscribe

2022-08-01 Thread Martin Soch
- To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Issues getting Apache Spark

2022-05-26 Thread Martin, Michael
annot get Spark to work on my laptop. Michael Martin

Re: Spark on K8s - repeating annoying exception

2022-05-13 Thread Martin Grigorov
Hi, On Mon, May 9, 2022 at 5:57 PM Shay Elbaz wrote: > Hi all, > > I apologize for reposting this from Stack Overflow, but it got very little > attention and no comment. > > I'm using a Spark 3.2.1 image that was built from the official distribution > via `docker-image-tool.sh`, on Kubern

Idea for improving performance when reading from hive-like partition folders and specifying a filter [Spark 3.2]

2022-05-01 Thread Martin
dataframe df1 took way longer than df2. Doing some math: For simplicity let's assume that we have data of three full years, and every month has 30 days. For df1 that's *8.4 million list operations* (root: 1, year: 3, month: 36, day: 3,240, hour: 8,398,080). For df2 that's 3 list operations; but I did 4 more list operations upfront in order to come up with the relevant_folders list, so *7 list operations* in total (root: 1, year: 1, month: 1, day: 1, hour: 3). Cheers Martin
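
For reference, a sketch of the two listing strategies compared above, assuming hive-style year=/month=/day=/hour= folders and an active SparkSession named `spark` (bucket and paths hypothetical):

```scala
import spark.implicits._

// df1: point Spark at the root and filter; the file index has to list
// every leaf directory before partition pruning kicks in.
val df1 = spark.read
  .parquet("s3://bucket/events")
  .filter($"year" === 2022 && $"month" === 4 && $"day" === 30)

// df2: compute the relevant folders up front and pass them explicitly;
// basePath keeps the partition columns in the schema.
val relevantFolders = Seq("s3://bucket/events/year=2022/month=4/day=30")
val df2 = spark.read
  .option("basePath", "s3://bucket/events")
  .parquet(relevantFolders: _*)
```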

Re: Spark on K8s , some applications ended ungracefully

2022-03-31 Thread Martin Grigorov
Hi, On Thu, Mar 31, 2022 at 4:18 PM Pralabh Kumar wrote: > Hi Spark Team > > Some of my spark applications on K8s ended with the below error . These > applications though completed successfully (as per the event log > SparkListenerApplicationEnd event at the end) > still have event files with .inp

Re: spark distribution build fails

2022-03-17 Thread Martin Grigorov
Hi, For the mail archives: this error happens when the user has MAVEN_OPTS env var pre-exported. In this case ./build/mvn|sbt does not export its own MAVEN_OPTS with the -XssXYZ value, and the default one is too low and leads to the StackOverflowError On Mon, Mar 14, 2022 at 11:13 PM Bulldog20630

Re: Encoders.STRING() causing performance problems in Java application

2022-02-21 Thread martin
one string at a time by the self-built prediction pipeline (which is also using other ML techniques apart from Spark). Needs some re-factoring... Thanks again for the help. Cheers, Martin On 2022-02-18 13:41, Sean Owen wrote: That doesn't make a lot of sense. Are you profiling the driv

Re: Encoders.STRING() causing performance problems in Java application

2022-02-18 Thread martin
prediction calls instead of running them line by line on the input file. Cheers, Martin On 2022-02-18 09:41, mar...@wunderlich.com wrote: I have been able to partially fix this issue by creating a static final field (i.e. a constant) for Encoders.STRING(). This removes the bottleneck assoc

Re: Encoders.STRING() causing performance problems in Java application

2022-02-18 Thread martin
optimize this? Cheers, Martin On 2022-02-18 07:42, mar...@wunderlich.com wrote: Hello, I am working on optimising the performance of a Java ML/NLP application based on Spark / SparkNLP. For prediction, I am applying a trained model on a Spark dataset which consists of one column with only one

Encoders.STRING() causing performance problems in Java application

2022-02-17 Thread martin
whole prediction method call. So, is there a simpler and more efficient way of creating the required dataset, consisting of one column and one String row? Thanks a lot. Cheers, Martin
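
The fix eventually described in the thread was to create the encoder once and reuse it. A Scala rendering of that idea (object and column name hypothetical):

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

object Prediction {
  // Created once and reused; instantiating Encoders.STRING() per call
  // was the reported bottleneck.
  val stringEncoder: Encoder[String] = Encoders.STRING

  // Build the one-column, one-row frame the prediction call needs.
  def singleRowFrame(spark: SparkSession, text: String) =
    spark.createDataset(Seq(text))(stringEncoder).toDF("text")
}
```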

Re: how can I remove the warning message

2022-02-04 Thread Martin Grigorov
Hi, This is a JVM warning, as Sean explained. You cannot control it via loggers. You can disable it by passing --illegal-access=permit to java. Read more about it at https://softwaregarden.dev/en/posts/new-java/illegal-access-in-java-16/ On Sun, Jan 30, 2022 at 4:32 PM Sean Owen wrote: > This

Re: Spark 3.1 Json4s-native jar compatibility

2022-02-04 Thread Martin Grigorov
Hi, Amit said that he uses Spark 3.1, so the link should be https://github.com/apache/spark/blob/branch-3.1/pom.xml#L879 (3.7.0-M5) @Amit: check your classpath. Maybe there are more jars of this dependency. On Thu, Feb 3, 2022 at 10:53 PM Sean Owen wrote: > You can look it up: > https://github

Re: [EXTERNAL] Fwd: Log4j upgrade in spark binary from 1.2.17 to 2.17.1

2022-01-31 Thread Martin Grigorov
Hi, On Mon, Jan 31, 2022 at 7:57 PM KS, Rajabhupati wrote: > Thanks a lot Sean. One final question before I close the conversation: how do > we know what features will be added as part of the Spark 3.3 > version? > There will be release notes for 3.3 linked at https://spark.apache.org/

Re: Log4j 1.2.17 spark CVE

2021-12-13 Thread Martin Wunderlich
://www.bsi.bund.de/EN/Home/home_node.html Cheers, Martin On 13.12.21 at 17:02, Jörn Franke wrote: Is it in any case appropriate to use log4j 1.x, which is not maintained anymore and has other security vulnerabilities which won't be fixed anymore? On 13.12.2021 at 06:06, Sean Owen wrote: Check the

Re: [Spark] Does Spark support backward and forward compatibility?

2021-11-24 Thread Martin Wunderlich
to load a model built with Spark 2.4.4 after updating to 3.2.0. This didn't work. Cheers, Martin On 24.11.21 at 20:18, Sean Owen wrote: I think/hope that it goes without saying you can't mix Spark versions within a cluster. Forwards compatibility is something you don't genera

Re: EXT: Re: Create Dataframe from a single String in Java

2021-11-18 Thread martin
Thanks a lot, Sebastian and Vibhor. You're right, I can also call createDataset() on the Spark session. Not sure how I missed that. Cheers, Martin On 2021-11-18 12:01, Vibhor Gupta wrote: You can try something like below. It creates a dataset and then converts it into a data

Re: Create Dataframe from a single String in Java

2021-11-18 Thread martin
tions/44028677/how-to-create-a-dataframe-from-a-string Basically, what I am looking for is something simple like: Dataset myData = sparkSession.createDataFrame(textList, "text"); Any hints? Thanks a lot. Cheers, Martin

Create Dataframe from a single String in Java

2021-11-18 Thread martin
e-a-dataframe-from-a-string Basically, what I am looking for is something simple like: Dataset myData = sparkSession.createDataFrame(textList, "text"); Any hints? Thanks a lot. Cheers, Martin

Re: Using MulticlassClassificationEvaluator for NER evaluation

2021-11-11 Thread martin
xt labels, we'll need to work around it and possibly create a wrapper evaluator around the Spark standard class. Thanks a lot for the help. Cheers, Martin On 2021-11-11 13:10, Gourav Sengupta wrote: Hi Martin, okay, so you will of course need to translate the NER string output

Re: Using MulticlassClassificationEvaluator for NER evaluation

2021-11-11 Thread Martin Wunderlich
labels). Cheers, Martin On 11.11.21 at 11:39, Gourav Sengupta wrote: Hi Martin, just to confirm, you are taking the output of SPARKNLP, and then trying to feed it to SPARK ML for running algorithms on the NER output generated by SPARKNLP, right? Regards, Gourav Sengupta On Thu, Nov 11

Re: Feature (?): Setting custom parameters for a Spark MLlib pipeline

2021-11-11 Thread martin
-data? This could also be handy not just for things like versioning, but also for storing evaluation metrics together with a trained pipeline (for people who aren't using something like MLFlow yet). Cheers, Martin On 2021-10-25 14:38, Sean Owen wrote: You can write a custom Transform

Re: Using MulticlassClassificationEvaluator for NER evaluation

2021-11-10 Thread martin
would have expected the MulticlassClassificationEvaluator to be able to use the labels directly. I will try to create and propose a code change in this regard, if or when I find the time. Cheers, Martin On 2021-10-25 14:31, Sean Owen wrote: I don't think the question is representation

Using MulticlassClassificationEvaluator for NER evaluation

2021-10-25 Thread martin
apply MulticlassClassificationEvaluator to the NER task or is there maybe a better evaluator? I haven't found anything yet (neither in Spark ML nor in SparkNLP). Thanks a lot. Cheers, Martin
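
One workaround, sketched under the assumption of string-valued `label` and `prediction` columns in a DataFrame `df`: index the tags with a single fitted StringIndexerModel so both columns share one mapping, then evaluate on the indices.

```scala
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.StringIndexer

// Fit one indexer on the gold labels, then reuse the fitted model for
// the prediction column so the string-to-index mapping is identical.
val indexer = new StringIndexer()
  .setInputCol("label").setOutputCol("labelIdx")
  .fit(df)
val withLabels = indexer.transform(df)
val withBoth = indexer
  .setInputCol("prediction").setOutputCol("predIdx")
  .transform(withLabels)

val f1 = new MulticlassClassificationEvaluator()
  .setLabelCol("labelIdx").setPredictionCol("predIdx")
  .setMetricName("f1")
  .evaluate(withBoth)
```

Predicted tags that never occur among the gold labels would need extra handling; this is a sketch of the wrapper idea discussed above, not the evaluator change proposed in the thread.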

Re: Feature (?): Setting custom parameters for a Spark MLlib pipeline

2021-10-25 Thread martin
natively in Spark ML. Otherwise, I'll just create a wrapper class for the trained models. Cheers, Martin On 2021-10-24 21:16, Sonal Goyal wrote: Does MLFlow help you? https://mlflow.org/ I don't know if MLflow can save arbitrary key-value pairs and associate them with a

Feature (?): Setting custom parameters for a Spark MLlib pipeline

2021-10-20 Thread martin
does the community think about this proposal? Has it been discussed before perhaps? Any thoughts? Cheers, Martin

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Chris Martin
ris On Wed, Jun 9, 2021 at 5:17 PM Tom Barber wrote: > Yeah to test that I just set the group key to the ID in the record which > is a solr supplied UUID, which means effectively you end up with 4000 > groups now. > > On Wed, Jun 9, 2021 at 5:13 PM Chris Martin wrote: > >

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Chris Martin
One thing I would check is this line: val fetchedRdd = rdd.map(r => (r.getGroup, r)) how many distinct groups did you end up with? If there's just one then I think you might see the behaviour you observe. Chris On Wed, Jun 9, 2021 at 4:17 PM Tom Barber wrote: > Also just to follow up on th
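
A quick way to check that, reusing the `fetchedRdd` of key/record pairs from the quoted line:

```scala
// Count records per group; a single huge group explains one long task.
val groupSizes = fetchedRdd.map(_._1).countByValue()
println(s"${groupSizes.size} distinct groups, largest = ${groupSizes.values.max}")
```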

GPU job in Spark 3

2021-04-09 Thread Martin Somers
s to see what might be failing behind the scenes, any suggestions? Thanks Martin -- M

Structured Streaming together with Cassandra Queries

2018-09-22 Thread Martin Engen
Hello, I have a case where I am continuously getting a bunch of sensor data which is being stored into a Cassandra table (through Kafka). Every week or so, I want to manually enter additional data into the system - and I want this to trigger some calculations merging the manually entered data, and t

How to work around NoOffsetForPartitionException when using Spark Streaming

2018-06-01 Thread Martin Peng
e any quick way to fix the missing offset or work around this? Thanks, Martin 1/06/2018 17:11:02: ERROR:the type of error is org.apache.kafka.clients.consumer.NoOffsetForPartitionException: Undefined offset with no reset policy for partition: elasticsearchtopicrealtimereports-97 01/06/2018 17:
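
The usual remedy is to set an offset reset policy in the consumer config; a sketch for the Kafka 0.10 integration (broker, group id, and the choice of "latest" are assumptions):

```scala
import org.apache.kafka.common.serialization.StringDeserializer

// NoOffsetForPartitionException means no committed offset exists and no
// reset policy is set; auto.offset.reset supplies that policy
// ("earliest" or "latest").
val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "realtime-reports",
  "auto.offset.reset"  -> "latest"
)
```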

Re: Structured Streaming, Reading and Updating a variable

2018-05-16 Thread Martin Engen
util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) Any ideas about how to handle this error? Thanks, Martin Engen ________ From: Lalwani, Jayesh Sent: Tuesday, May 15, 2018 9:59 PM To: Marti

Structured Streaming, Reading and Updating a variable

2018-05-15 Thread Martin Engen
Hello, I'm working with Structured Streaming, and I need a method of keeping a running average based on the last 24 hours of data. To help with this, I can use Exponential Smoothing, which means I really only need to store 1 value from a previous calculation into the new, and update this variable as
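
The smoothing update itself only needs the previous value, as the post says; a minimal sketch (the alpha value is illustrative):

```scala
// s_t = alpha * x_t + (1 - alpha) * s_(t-1); only s_(t-1) is stored.
def smooth(prev: Double, x: Double, alpha: Double = 0.1): Double =
  alpha * x + (1 - alpha) * prev
```

Keeping that one value per key across micro-batches is what stateful operators such as mapGroupsWithState (Structured Streaming) or updateStateByKey (DStreams) are for.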

Re: Spark Job crash due to File Not found when shuffle intermittently

2017-07-25 Thread Martin Peng
cool~ Thanks Kang! I will check and let you know. Sorry for the delay, as there is an urgent customer issue today. Best Martin 2017-07-24 22:15 GMT-07:00 周康 : > * If the file exists but is a directory rather than a regular file, does > * not exist but cannot be created, or cannot be opened f

Re: Spark Job crash due to File Not found when shuffle intermittently

2017-07-24 Thread Martin Peng
Can anyone share some light on this issue? Thanks Martin 2017-07-21 18:58 GMT-07:00 Martin Peng : > Hi, > > I have several Spark jobs, including both batch and streaming jobs, to > process the system log and analyze them. We are using Kafka as the pipeline > to co

Spark Job crash due to File Not found when shuffle intermittently

2017-07-21 Thread Martin Peng
exceptions randomly (either after several hours of running or within just 20 minutes). Can anyone give me some suggestions about how to figure out the real root cause? (Google results are not very useful...) Thanks, Martin 00:30:04,510 WARN - 17/07/22 00:30:04 WARN TaskSetManager: Lost task 60.0 in

The stability of Spark Stream Kafka 010

2017-06-29 Thread Martin Peng
-10-integration.html Thanks Martin

stratified sampling scales poorly

2016-12-19 Thread Martin Le
with different sampling fractions (I ran experiments on a 4-node cluster)? Thank you, Martin

Re: How to use a custom filesystem provider?

2016-09-21 Thread Jean-Philippe Martin
> > There's a bit of confusion setting in here; the FileSystem implementations > spark uses are subclasses of org.apache.hadoop.fs.FileSystem; the nio > class with the same name is different. > grab the google cloud storage connector and put it on your classpath I was using the gs:// filesystem a

How to use a custom filesystem provider?

2016-09-21 Thread Jean-Philippe Martin
The full source for my example is available on github <https://github.com/jean-philippe-martin/SparkRepro>. I'm using maven to depend on gcloud-java-nio <https://mvnrepository.com/artifact/com.google.cloud/gcloud-java-nio/0.2.5>, which provides a Java FileSystem for Google Cloud

DCOS - s3

2016-08-21 Thread Martin Somers
I'm having trouble loading data from an s3 repo. Currently DCOS is running Spark 2, so I'm not sure if a code modification is needed with the upgrade. My code at the moment looks like this: sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "xxx") sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "xxx")
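
One thing worth trying on Spark 2, sketched with the s3a connector instead of the legacy s3n one (keys and bucket hypothetical; assumes `sc` and a SparkSession `spark` are in scope):

```scala
// s3a superseded s3n in the Hadoop versions shipped around Spark 2.
sc.hadoopConfiguration.set("fs.s3a.access.key", "xxx")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "xxx")
val data = spark.read.textFile("s3a://my-bucket/path/")
```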

Unsubscribe

2016-08-16 Thread Martin Serrano
Sent from my Verizon Wireless 4G LTE DROID

UNSUBSCRIBE

2016-08-09 Thread Martin Somers
-- M

Unsubscribe.

2016-08-09 Thread Martin Somers
Unsubscribe. Thanks M

Re: sampling operation for DStream

2016-08-01 Thread Martin Le
How to do that? If I put the queue inside the .transform operation, it doesn't work. On Mon, Aug 1, 2016 at 6:43 PM, Cody Koeninger wrote: > Can you keep a queue per executor in memory? > > On Mon, Aug 1, 2016 at 11:24 AM, Martin Le > wrote: > > Hi Cody and all, > >

Re: sampling operation for DStream

2016-08-01 Thread Martin Le
balanced. > > But once you've read the messages, nothing's stopping you from > filtering most of them out before doing further processing. The > dstream .transform method will let you do any filtering / sampling you > could have done on an rdd. > > On Fri, Jul 29,
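
A minimal sketch of the `.transform` suggestion quoted above (the 10% fraction is illustrative; `stream` stands for the DStream from the thread):

```scala
// Sample each micro-batch RDD independently inside transform.
val sampled = stream.transform(rdd =>
  rdd.sample(withReplacement = false, fraction = 0.1))
```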

sampling operation for DStream

2016-07-29 Thread Martin Le
sampling operation as well? If not, could you please give me a suggestion how to implement it? Thanks, Martin

Re: libraryDependencies

2016-07-26 Thread Martin Somers
'll want all of the various spark versions to be the same. > > On Tue, Jul 26, 2016 at 12:34 PM, Michael Armbrust > wrote: > >> If you are using %% (double) then you do not need _2.11. >> >> On Tue, Jul 26, 2016 at 12:18 PM, Martin Somers >> wrote: >&g

libraryDependencies

2016-07-26 Thread Martin Somers
my build file looks like libraryDependencies ++= Seq( // other dependencies here "org.apache.spark" %% "spark-core" % "1.6.2" % "provided", "org.apache.spark" %% "spark-mllib_2.11" % "1.6.0", "org.scalanlp" % "breeze_2.11" % "0.7",
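
The fixes suggested in the replies, applied: with `%%` the Scala suffix must not be spelled out, and all Spark modules should share one version. A corrected sketch (versions illustrative):

```scala
libraryDependencies ++= Seq(
  // %% appends the Scala binary version suffix automatically
  "org.apache.spark" %% "spark-core"  % "1.6.2" % "provided",
  "org.apache.spark" %% "spark-mllib" % "1.6.2" % "provided",
  "org.scalanlp"     %% "breeze"      % "0.11.2"
)
```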

sbt build under scala

2016-07-26 Thread Martin Somers
Just wondering: what is the correct way of building a Spark job using Scala? Are there any changes coming with Spark v2? I've been following this post http://www.infoobjects.com/spark-submit-with-sbt/ Then again, I've been mainly using Docker locally; what is a decent container for submitting these

SVD output within Spark

2016-07-21 Thread Martin Somers
just looking at a comparison between Matlab and Spark for svd with an input matrix N this is matlab code - yes very small matrix N = [2.5903 -0.0416 0.6023; -0.1236 2.5596 0.7629; 0.0148 -0.0693 0.2490] U = [-0.3706 -0.9284 0.0273; -0.9287 0.3708 0

Re: Spark streaming takes longer time to read json into dataframes

2016-07-16 Thread Martin Eden
Hi, I would just do a repartition on the initial direct DStream since otherwise each RDD in the stream has exactly as many partitions as you have partitions in the Kafka topic (in your case 1). That way receiving is still done in only 1 thread, but at least the processing further down is done in p
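
Sketched (target partition count illustrative; `directStream` stands for the initial direct DStream):

```scala
// One Kafka partition means one Spark partition per batch; repartition
// spreads downstream processing across the cluster.
val spread = directStream.repartition(8)
```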

SparkStreaming multiple output operations failure semantics / error propagation

2016-07-14 Thread Martin Eden
Hi, I have a Spark 1.6.2 streaming job with multiple output operations (jobs) doing idempotent changes in different repositories. The problem is that I want to somehow pass errors from one output operation to another such that in the current output operation I only update previously successful m

Re: DataFrame versus Dataset creation and usage

2016-06-28 Thread Martin Serrano
16 04:57 PM, Xinh Huynh wrote: Hi Martin, Since your schema is dynamic, how would you use Datasets? Would you know ahead of time the row type T in a Dataset[T]? One option is to start with DataFrames in the beginning of your data pipeline, figure out the field types, and then switch completely ov

Re: DataFrame versus Dataset creation and usage

2016-06-24 Thread Martin Serrano
Indeed. But I'm dealing with 1.6 for now unfortunately. On 06/24/2016 02:30 PM, Ted Yu wrote: In Spark 2.0, Dataset and DataFrame are unified. Would this simplify your use case ? On Fri, Jun 24, 2016 at 7:27 AM, Martin Serrano wrote: Hi, I'm e

DataFrame versus Dataset creation and usage

2016-06-24 Thread Martin Serrano
en passing data around. Any advice would be appreciated. Thanks, Martin

How does Spark Streaming updateStateByKey or mapWithState scale with state size?

2016-06-23 Thread Martin Eden
Hi all, It is currently difficult to understand, from the Spark docs or the materials I came across online, how the updateStateByKey and mapWithState operators in Spark Streaming scale with the size of the state and how to reason about sizing the cluster appropriately. According to this artic

S3n performance (@AaronDavidson)

2016-04-12 Thread Martin Eden
ver or are they just available under DataBricksCloud? How can we benefit from those improvements? Thanks, Martin P.S. Have not tried S3a.

Re: Direct Kafka input stream and window(…) function

2016-03-29 Thread Martin Soch
jssc.start(); jssc.awaitTermination(); } } When I run this app the pipeline is stopped (or blocked). But if I switch from direct-kafka-stream to (for instance) socket-text-stream the app works as expected. Since this is possible use-case (API allows it) I would like to know wh

Direct Kafka input stream and window(…) function

2016-03-22 Thread Martin Soch
applies when using different type of stream. Is it some known limitation of window(..) function when used with direct-Kafka-input-stream ? Java pseudo code: org.apache.spark.streaming.kafka.DirectKafkaInputDStream s; s.window(Durations.seconds(10)).print(); // the pipeline will stop Thank

Fwd: Connection failure followed by bad shuffle files during shuffle

2016-03-15 Thread Eric Martin
ase so the bug can be more easily investigated? Best, Eric Martin

Problem running JavaDirectKafkaWordCount

2016-03-12 Thread Martin Andreoni
Some help is always welcome, Thanks. -- - MARTIN ANDREONI

Frustration over Spark and Jackson

2016-02-16 Thread Martin Skøtt
version of Jackson, but that would require me to change my common code which I would like to avoid. -- Kind regards Martin

Why does predicate pushdown not work on HiveContext (concrete HiveThriftServer2) ?

2015-10-31 Thread Martin Senne
Hi all, # Program Sketch I create a HiveContext `hiveContext`. With that context, I create a DataFrame `df` from a JDBC relational table. I register the DataFrame `df` via df.registerTempTable("TESTTABLE"). I start a HiveThriftServer2 via HiveThriftServer2.startWithContext(hiveContext). The TESTTABLE

Re: Sorry, but Nabble and ML suck

2015-10-31 Thread Martin Senne
Ted, thx. Should I repost? On 31.10.2015 at 17:41, "Ted Yu" wrote: > From the result of http://search-hadoop.com/?q=spark+Martin+Senne , > Martin's post from Tuesday didn't go through. > > FYI > > On Sat, Oct 31, 2015 at 9:34 AM, Nicholas Chammas < > nicholas

Re: Sorry, but Nabble and ML suck

2015-10-31 Thread Martin Senne
. Hope to see things get improved ... Cheers, Martin On 31.10.2015 at 17:34, "Nicholas Chammas" wrote: > Nabble is an unofficial archive of this mailing list. I don't know who > runs it, but it's not Apache. There are often delays between when things > get posted

Sorry, but Nabble and ML suck

2015-10-31 Thread Martin Senne
Having written a post last Tuesday, I'm still not able to see my post under nabble. And yeah, subscription to u...@apache.spark.org was successful (rechecked a minute ago). Even more, I have no way (and no confirmation) to tell whether my post was accepted, rejected, whatever. This is very L4M3 and so 80i

Why is no predicate pushdown performed, when using Hive (HiveThriftServer2) ?

2015-10-28 Thread Martin Senne
Hi all, # Program Sketch 1. I create a HiveContext `hiveContext` 2. With that context, I create a DataFrame `df` from a JDBC relational table. 3. I register the DataFrame `df` via df.registerTempTable("TESTTABLE") 4. I start a HiveThriftServer2 via HiveThriftServer2.star
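
A compact Scala sketch of these four steps (Spark 1.x APIs; the JDBC URL and table names are hypothetical):

```scala
import java.util.Properties
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val hiveContext = new HiveContext(sc)
val df = hiveContext.read.jdbc("jdbc:mysql://dbhost/db", "SOURCE_TABLE", new Properties())
df.registerTempTable("TESTTABLE")
// TESTTABLE is now visible to JDBC/ODBC clients of the Thrift server.
HiveThriftServer2.startWithContext(hiveContext)
```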

Re: HDP 2.3 support for Spark 1.5.x

2015-09-28 Thread Fabien Martin
Hi Krishna, - Take a look at http://hortonworks.com/hadoop-tutorial/apache-spark-1-4-1-technical-preview-with-hdp/ - Or you can specify your 1.5.x jar as the Spark one using something like: --c

Re: how to handle OOMError from groupByKey

2015-09-28 Thread Fabien Martin
You can try to reduce the number of containers in order to increase their memory. 2015-09-28 9:35 GMT+02:00 Akhil Das : > You can try to increase the number of partitions to get rid of the OOM > errors. Also try to use reduceByKey instead of groupByKey. > > Thanks > Best Regards > > On Sat, Sep

Re: Small File to HDFS

2015-09-03 Thread Martin Menzel
me. Good luck Martin 2015-09-03 16:17 GMT+02:00 : > My main question in case of HAR usage is, is it possible to use Pig on it > and what about performance? > > - Original Message - > From: "Jörn Franke" > To: nib...@free.fr, user@spark.apache.org > Sent

When will window ....

2015-08-10 Thread Martin Senne
When will window functions be integrated into Spark (without HiveContext)? Sent with AquaMail for Android http://www.aqua-mail.com On 10 August 2015 at 23:04:22, Michael Armbrust wrote: You will need to use a HiveContext for window functions to work. On Mon, Aug 10, 2015 at 1:26 PM, Jerry
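
For reference, the HiveContext route the reply describes, sketched for Spark 1.4/1.5 (table and columns hypothetical):

```scala
import org.apache.spark.sql.hive.HiveContext

// A plain SQLContext rejects the OVER clause in these versions.
val hc = new HiveContext(sc)
hc.sql("""SELECT id, amount,
          ROW_NUMBER() OVER (PARTITION BY id ORDER BY amount DESC) AS rn
          FROM sales""")
```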

Re: Spark SQL DataFrame: Nullable column and filtering

2015-08-01 Thread Martin Senne
the moment. Can someone please confirm this! - Alias information is not displayed via DataFrame.printSchema (or at least I did not find a way to display it). Cheers, Martin 2015-07-31 22:51 GMT+02:00 Martin Senne : > Dear Michael, dear all, > > a minimal example is listed below.

Re: Spark SQL DataFrame: Nullable column and filtering

2015-07-31 Thread Martin Senne
e the column "x" (from the left DataFrame) with "x" (from the right DataFrame) did not work out, as I found no way to use select($"aliasname.x") programmatically. Could someone sketch the code? Any help welcome, thanks Martin

Re: Spark SQL DataFrame: Nullable column and filtering

2015-07-30 Thread Martin Senne
, 5) on an inner join. BUT I'm also interested in (1, "hello", null) as there is no counterpart in mapping (this is the left outer join part) I need to distinguish 1 and 2 because of later inserts (case 1, hello) or updates (case 2, bon). Cheers and thanks, Martin On 30.07.2015 22:58,

Re: Spark SQL DataFrame: Nullable column and filtering

2015-07-30 Thread Martin Senne
is schema modification as to make outer joins work. Cheers and thanks, Martin 2015-07-30 20:23 GMT+02:00 Michael Armbrust : > We don't yet update nullability information based on predicates as we > don't actually leverage this information in many places yet. Why do you >

Re: Spark ML Pipeline inaccessible types

2015-03-25 Thread zapletal-martin
, Martin -- Original message -- From: Peter Rudenko To: zapletal-mar...@email.cz, Sean Owen Date: 25. 3. 2015 13:28:38 Subject: Re: Spark ML Pipeline inaccessible types " Hi Martin, here are 2 possibilities to overcome this: 1) Put your logic into org.apache.spark packa

Re: Spark ML Pipeline inaccessible types

2015-03-25 Thread zapletal-martin
by the real problem I am facing. My issue is that VectorUDT is not accessible by user code and therefore it is not possible to use a custom ML pipeline with the existing Predictors (see the last two paragraphs in my first email). Best Regards, Martin -- Original message -- From

Spark ML Pipeline inaccessible types

2015-03-25 Thread zapletal-martin
not yet expected to be used in this way? Thanks, Martin

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-23 Thread Martin Goodson
Have you tried to repartition() your original data to make more partitions before you aggregate? -- Martin Goodson | VP Data Science (0)20 3397 1240 On Mon, Mar 23, 2015 at 4:12 PM, Yiannis Gkoufas wrote: > Hi Yin, > > Yes, I have set spark.executor.memory
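
Sketched (partition count illustrative; `df` stands for the original data):

```scala
// More, smaller partitions before the wide aggregation lowers
// per-task memory pressure during the shuffle.
val result = df.repartition(200).groupBy("key").count()
```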

Solving linear equations

2014-10-22 Thread Martin Enzinger
Hi, I'm wondering how to use Mllib for solving equation systems following this pattern 2*x1 + x2 + 3*x3 + ... + xn = 0 x1 + 0*x2 + 3*x3 + ... + xn = 0 .. .. 0*x1 + x2 + 0*x3 + ... + xn = 0 I definitely still have some reading to do to really understand the direct solving techn
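
MLlib has no direct linear-system solver, but for a homogeneous system A·x = 0 the null space can be read off an SVD. A sketch with Breeze, which already appears elsewhere in these threads (coefficients illustrative):

```scala
import breeze.linalg.{svd, DenseMatrix}

val a = DenseMatrix(
  (2.0, 1.0, 3.0),
  (1.0, 0.0, 3.0),
  (0.0, 1.0, 0.0))

// The right singular vector for the smallest singular value spans the
// (approximate) null space of a.
val svd.SVD(u, s, vt) = svd(a)
val x = vt(vt.rows - 1, ::).t // last row of V^T, as a column vector
```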

Re: Avoid broadcasting huge variables

2014-09-20 Thread Martin Goodson
-- Martin Goodson @martingoodson

Re: Personalized Page rank in graphx

2014-08-21 Thread Martin Liesenberg
I could take a stab at it, though I'd have some reading up on Personalized PageRank to do, before I'd be able to start coding. If that's OK, I'd get started. Best regards, Martin On 20 August 2014 23:03, Ankur Dave wrote: > At 2014-08-20 10:57:57 -0700, Mohit Singh wrote

Re: Reading from HDFS no faster than reading from S3 - how to tell if data locality respected?

2014-08-04 Thread Martin Goodson
disks is not much faster than accessing s3 across the network? -- Martin Goodson | VP Data Science (0)20 3397 1240 On Fri, Aug 1, 2014 at 10:44 AM, Martin Goodson wrote: > Hi all, > I'm consistently finding that reading from HDFS is not appreciably fa

Reading from HDFS no faster than reading from S3 - how to tell if data locality respected?

2014-08-01 Thread Martin Goodson
educe/samples/spark/1.0.0/install-spark-shark.rb and ami-version 3.1.0). -- Martin Goodson | VP Data Science (0)20 3397 1240

Job using Spark for Machine Learning

2014-07-29 Thread Martin Goodson
billion users per month and are second only to Google in the contextual advertising space (ok - a distant second!). Details here: http://grnh.se/rl8f25 -- Martin Goodson | VP Data Science (0)20 3397 1240

Re: Configuring Spark Memory

2014-07-24 Thread Martin Goodson
Great - thanks for the clarification Aaron. The offer stands for me to write some documentation and an example that covers this without leaving *any* room for ambiguity. -- Martin Goodson | VP Data Science (0)20 3397 1240 On Thu, Jul 24, 2014 at 6:09 PM, Aaron

Re: Configuring Spark Memory

2014-07-24 Thread Martin Goodson
Thank you Nishkam, I have read your code. So, for the sake of my understanding, it seems that for each spark context there is one executor per node? Can anyone confirm this? -- Martin Goodson | VP Data Science (0)20 3397 1240 On Thu, Jul 24, 2014 at 6:12 AM, Nishkam

Re: Configuring Spark Memory

2014-07-23 Thread Martin Goodson
GB used by Spark.)" Am I reading this incorrectly? Anyway our configuration is 21 machines (one master and 20 slaves) each with 60Gb. We would like to use 4 cores per machine. This is pyspark so we want to leave say 16Gb on each machine for python processes. Thanks again for the advice! --
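
A rough rendering of that sizing as configuration (all numbers are assumptions derived from the description above: 60 GB nodes, 4 cores each, ~16 GB reserved per node for Python):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "40g") // 60g node minus ~16g for Python workers and OS headroom
  .set("spark.executor.cores", "4")    // 4 cores per machine as described
```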

Configuring Spark Memory

2014-07-23 Thread Martin Goodson
this and the myriad of other memory settings available (daemon memory, worker memory etc). Perhaps a worked example could be added to the docs? I would be happy to provide some text as soon as someone can enlighten me on the technicalities! Thank you -- Martin Goodson | VP Data Science (0)20 3397 1240

Re: Problem running Spark shell (1.0.0) on EMR

2014-07-22 Thread Martin Goodson
I am also having exactly the same problem, calling using pyspark. Has anyone managed to get this script to work? -- Martin Goodson | VP Data Science (0)20 3397 1240 On Wed, Jul 16, 2014 at 2:10 PM, Ian Wilkinson wrote: > Hi, > > I’m trying to run the Spa

Re: TreeNodeException: No function to evaluate expression. type: AttributeReference, tree: id#0 on GROUP BY

2014-07-21 Thread Martin Gammelsæter
unless they are also in the group by clause > or are inside of an aggregate function. > > On Jul 18, 2014 5:12 AM, "Martin Gammelsæter" > wrote: >> >> Hi again! >> >> I am having problems when using GROUP BY on both SQLContext and >> HiveCo
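
An illustration of that rule (table and columns hypothetical):

```scala
// Every selected column must appear in GROUP BY or inside an aggregate.
// Fails: 'name' is neither grouped nor aggregated
// sqlContext.sql("SELECT id, name FROM people GROUP BY id")
sqlContext.sql("SELECT id, name, COUNT(*) AS cnt FROM people GROUP BY id, name")
```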

TreeNodeException: No function to evaluate expression. type: AttributeReference, tree: id#0 on GROUP BY

2014-07-18 Thread Martin Gammelsæter
n(Executor.scala:187) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) What am I doing wrong? -- Best regards, Martin Gammelsæter

Re: Supported SQL syntax in Spark SQL

2014-07-14 Thread Martin Gammelsæter
the cluster using spark-ec2 from the 1.0.1 release, so I’m > assuming that’s taken care of, at least in theory. > > I just spun down the clusters I had up, but I will revisit this tomorrow and > provide the information you requested. > > Nick -- Best regards, Martin Gammelsæter 92209139

"Initial job has not accepted any resources" means many things

2014-07-09 Thread Martin Gammelsæter
addJar every time the app starts up, and instead manually add the jar to the classpath of every worker), but I can't seem to find out how) -- Best regards, Martin Gammelsæter
