Re: Choice of IDE for Spark

2021-10-02 Thread Christian Pfarr
yarn container and of course your spark session. Regards, Christian \ Original message \ On 2 Oct 2021, 01:21, Holden Karau wrote: > Personally I like Jupyter notebooks for my interactive work and then once > I’ve done my exploration I

Re: Benchmarks on Spark running on Yarn versus Spark on K8s

2021-07-05 Thread Christian Pfarr
Does anyone know where the data for this benchmark was stored? Spark on YARN gets its performance from data locality via co-allocation of the YARN NodeManager and HDFS DataNode, not from the job scheduler, right? Regards, z0ltrix \ Original message

unsubscribe

2020-01-17 Thread Christian Acuña

RE: [Spark SQL]: Slow insertInto overwrite if target table has many partitions

2019-04-26 Thread van den Heever, Christian CC
Hi, how do I get the filename from textFileStream when using streaming? Thanks a mill.

RE: Does pyspark support python3.6?

2017-11-01 Thread van den Heever, Christian CC
inference. Supports YARN execution. MLlib can be used if needed. Data lineage support due to Spark usage. Cons: skills needed to maintain and build; in-memory capability can become a bottleneck if not managed; no ETL GUI. Maybe point me to an article if you have one. Thanks a mill. Christian

RE: Is Spark suited for this use case?

2017-10-15 Thread van den Heever, Christian CC
Hi, We basically have the same scenario, but worldwide. As we have bigger datasets we use OGG --> local --> Sqoop into Hadoop. By all means you can have Spark read the Oracle tables and then make changes to the data as needed, which would not be done in the Sqoop query, i.e. fraudulent detection on transac

[ANNOUNCE] Apache Bahir 2.1.0 Released

2017-02-22 Thread Christian Kadner
Apache Bahir and to download the latest release go to: http://bahir.apache.org The Apache Bahir streaming connectors are also available at: https://spark-packages.org/?q=bahir --- Best regards, Christian Kadner

[ANNOUNCE] Apache Bahir 2.0.2

2017-01-28 Thread Christian Kadner
download the latest release go to: http://bahir.apache.org The Apache Bahir streaming connectors are also available at: https://spark-packages.org/?q=bahir --- Best regards, Christian Kadner

Re: Spark SQL Nested Array of JSON with empty field

2016-06-03 Thread Christian Hellström
If that's your JSON file, then the first problem is that it's incorrectly formatted. Apart from that you can just read the JSON into a DataFrame with sqlContext.read.json() and then select directly on the DataFrame without having to register a temporary table: jsonDF.select("firstname", "address.s
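
For reference, a minimal sketch of the approach described in this reply (the file path and the nested `address.street` field are hypothetical; one JSON object per line is what Spark's JSON reader expects):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc) // `sc` is an existing SparkContext

// read.json expects one complete JSON object per line,
// not a pretty-printed multi-line file
val jsonDF = sqlContext.read.json("people.json")

// Nested fields are addressed with dot notation; no temporary table needed
jsonDF.select("firstname", "address.street").show()
```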

RE: Spark Streaming heap space out of memory

2016-05-30 Thread Dancuart, Christian
at for the problematic classes. From: Shahbaz [mailto:shahzadh...@gmail.com] Sent: 2016, May, 30 3:25 PM To: Dancuart, Christian Cc: user Subject: Re: Spark Streaming heap space out of memory Hi Christian, * What is the processing time of each of your Batch,is it exceeding 15 seconds

Re: problem about RDD map and then saveAsTextFile

2016-05-27 Thread Christian Hellström
Internally, saveAsTextFile uses saveAsHadoopFile: https://github.com/apache/spark/blob/d5911d1173fe0872f21cae6c47abf8ff479345a4/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala . The final bit in the method first creates the output path and then saves the data set. However, if there
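
The sentence truncated above presumably continues that the save fails when the path already exists. A hedged sketch of the common workaround (paths hypothetical): delete the directory explicitly before saving, since the Hadoop output format refuses to overwrite an existing directory.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

val outputPath = "hdfs:///tmp/output"

// Remove any previous output; the Hadoop output committer throws
// FileAlreadyExistsException rather than overwrite an existing directory
val fs = new Path(outputPath).getFileSystem(sc.hadoopConfiguration)
fs.delete(new Path(outputPath), true) // recursive; returns false if absent

sc.parallelize(Seq("a", "b", "c")).saveAsTextFile(outputPath)
```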

REST-API for Killing a Streaming Application

2016-03-24 Thread Christian Kurz
ess to the spark-submit OS process. Any thoughts are much appreciated, Christian

Re: Spark Streaming - stream between 2 applications

2015-11-21 Thread Christian
am I wrong? > How would you use Kafka here? > > On Fri, Nov 20, 2015 at 7:12 PM, Christian wrote: > >> Have you considered using Kafka? >> >> On Fri, Nov 20, 2015 at 6:48 AM Saiph Kappa >> wrote: >> >>> Hi, >>> >>> I have a ba

Re: Spark Streaming - stream between 2 applications

2015-11-20 Thread Christian
Have you considered using Kafka? On Fri, Nov 20, 2015 at 6:48 AM Saiph Kappa wrote: > Hi, > > I have a basic spark streaming application like this: > > « > ... > > val ssc = new StreamingContext(sparkConf, Duration(batchMillis)) > val rawStreams = (1 to numStreams).map(_ => > ssc.rawSocketStrea
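
A rough sketch of the Kafka suggestion (broker/topic names hypothetical; needs the spark-streaming-kafka artifact matching the Spark version): the producing application writes to a Kafka topic and the second application consumes it as a DStream, instead of the two talking over a raw socket.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("consumer"), Seconds(5))

// Receiver-based consumer: (ZooKeeper quorum, consumer group, topic -> #threads)
val lines = KafkaUtils
  .createStream(ssc, "zkhost:2181", "consumer-group", Map("events" -> 1))
  .map(_._2) // keep the message value, drop the key

lines.print()
ssc.start()
ssc.awaitTermination()
```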

Re: Spark RDD cache persistence

2015-11-05 Thread Christian
that Spark supports. Hopefully that gives you a place to start. On Thu, Nov 5, 2015 at 9:21 PM Deepak Sharma wrote: > Thanks Christian. > So is there any inbuilt mechanism in Spark, or API integration with other > in-memory cache products such as Redis, to load the RDD into these systems upon

Re: Spark RDD cache persistence

2015-11-05 Thread Christian
The cache gets cleared out when the job finishes. I am not aware of a way to keep the cache around between jobs. You could save it as an object file to disk and load it as an object file on your next job for speed. On Thu, Nov 5, 2015 at 6:17 PM Deepak Sharma wrote: > Hi All > I am confused on RD
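
A minimal sketch of that save/reload idea (paths hypothetical): write the RDD out at the end of one job and read it back at the start of the next, since the in-memory cache does not outlive the application.

```scala
// Job 1: compute once, then spill to disk as serialized objects
val computed = sc.textFile("hdfs:///data/input").map(_.split(","))
computed.saveAsObjectFile("hdfs:///tmp/computed-rdd")

// Job 2 (a separate application): reload without recomputing
val reloaded = sc.objectFile[Array[String]]("hdfs:///tmp/computed-rdd")
```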

Re: Spark EC2 script on Large clusters

2015-11-05 Thread Christian
Let me rephrase: EMR cost is about twice as much as the spot price, making it almost 2/3 of the overall cost. On Thu, Nov 5, 2015 at 11:50 AM Christian wrote: > Hi Jonathan, > > We are using EMR now and it's costing way too much. We do spot pricing and > the EMR add-on cost

Re: Spark EC2 script on Large clusters

2015-11-05 Thread Christian
Hi Jonathan, We are using EMR now and it's costing way too much. We do spot pricing and the EMR add-on cost is about 2/3 the price of the actual spot instance. On Thu, Nov 5, 2015 at 11:31 AM Jonathan Kelly wrote: > Christian, > > Is there anything preventing you from using E

Spark EC2 script on Large clusters

2015-11-05 Thread Christian
tarts? Thanks for your time, Christian

streaming and piping to R, sending all data in window to pipe()

2015-07-17 Thread PAULI, KEVIN CHRISTIAN [AG-Contractor/1000]
Spark newbie here, using Spark 1.3.1. I’m consuming a stream and trying to pipe the data from the entire window to R for analysis. The R algorithm needs the entire dataset from the stream (everything in the window) in order to function properly; it can’t be broken up. So I tried doing a coales
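
A hedged sketch of what this is attempting (script path and window sizes hypothetical): coalesce each window to a single partition so the entire window's data passes through one external R process via pipe().

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("pipe-to-R"), Seconds(10))
val stream = ssc.socketTextStream("localhost", 9999)

// One window = one partition = one R process fed via stdin
stream.window(Seconds(60), Seconds(60)).foreachRDD { rdd =>
  val results = rdd.coalesce(1).pipe("Rscript /path/to/analyze.R")
  results.collect().foreach(println)
}

ssc.start()
ssc.awaitTermination()
```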

Re: Super slow caching in 1.3?

2015-04-27 Thread Christian Perez
atabricks.com] > Sent: Thursday, April 16, 2015 7:23 PM > To: Evo Eftimov > Cc: Christian Perez; user > > > Subject: Re: Super slow caching in 1.3? > > > > Here are the types that we specialize, other types will be much slower. > This is only for Spark SQL, normal RDD

Re: Pyspark where do third parties libraries need to be installed under Yarn-client mode

2015-04-24 Thread Christian Perez
To run MLlib, you only need numpy on each node. For additional dependencies, you can call the spark-submit with --py-files option and add the .zip or .egg. https://spark.apache.org/docs/latest/submitting-applications.html Cheers, Christian On Fri, Apr 24, 2015 at 1:56 AM, Hoai-Thu Vuong wrote

Re: MLlib - Collaborative Filtering - trainImplicit task size

2015-04-23 Thread Christian S. Perone
> > On Tue, Apr 21, 2015 at 10:36 AM, Christian S. Perone > wrote: > > Hi Sean, thanks for the answer. I tried to call repartition() on the > input > > with many different sizes and it still continues to show that warning > > message. > > > > On Tue, Apr 21

Re: MLlib - Collaborative Filtering - trainImplicit task size

2015-04-21 Thread Christian S. Perone
maller tasks? > > On Tue, Apr 21, 2015 at 2:56 AM, Christian S. Perone > wrote: > > I keep seeing these warnings when using trainImplicit: > > > > WARN TaskSetManager: Stage 246 contains a task of very large size (208 > KB). > > The maximum recommended task size

MLlib - Collaborative Filtering - trainImplicit task size

2015-04-20 Thread Christian S. Perone
I keep seeing these warnings when using trainImplicit: WARN TaskSetManager: Stage 246 contains a task of very large size (208 KB). The maximum recommended task size is 100 KB. And then the task size starts to increase. Is this a known issue? Thanks! -- Blog
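
For context, a sketch of the repartitioning idea discussed in this thread (all numbers hypothetical): the warning concerns the size of the serialized task, and splitting the ratings into more, smaller partitions is what was tried above, though the poster reports the warning persisted.

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

val ratings = sc.textFile("hdfs:///data/ratings.csv").map { line =>
  val Array(user, item, count) = line.split(",")
  Rating(user.toInt, item.toInt, count.toDouble)
}.repartition(200) // more, smaller partitions -> smaller tasks

val model = ALS.trainImplicit(ratings, rank = 10, iterations = 10,
  lambda = 0.01, alpha = 40.0)
```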

Re: MLlib -Collaborative Filtering

2015-04-19 Thread Christian S. Perone
The easiest way to do that is to use a similarity metric between the different user factors. On Sat, Apr 18, 2015 at 7:49 AM, riginos wrote: > Is there any way that I can see the similarity table of 2 users in that > algorithm? By that I mean the similarity between 2 users > > > > -- > View this
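
A minimal sketch of that suggestion (user ids hypothetical): ALS produces no similarity table directly, but the cosine between two users' latent factor vectors works as a similarity score.

```scala
def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
  dot / norms
}

// `model` is a trained MatrixFactorizationModel, e.g. from ALS.train
val u1 = model.userFeatures.lookup(1).head
val u2 = model.userFeatures.lookup(2).head
println(cosine(u1, u2)) // 1.0 = identical taste, 0.0 = orthogonal
```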

Re: Super slow caching in 1.3?

2015-04-16 Thread Christian Perez
, Christian On Mon, Apr 6, 2015 at 6:17 PM, Michael Armbrust wrote: > Do you think you are seeing a regression from 1.2? Also, are you caching > nested data or flat rows? The in-memory caching is not really designed for > nested data and so performs pretty slowly here (it's just fallin

Super slow caching in 1.3?

2015-04-06 Thread Christian Perez
Hi all, Has anyone else noticed very slow time to cache a Parquet file? It takes 14 s per 235 MB (1 block) uncompressed node local Parquet file on M2 EC2 instances. Or are my expectations way off... Cheers, Christian -- Christian Perez Silicon Valley Data Science Data Analyst christ
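
For reference, a sketch of the pattern being timed here (path hypothetical; sqlContext.parquetFile was the Spark 1.3-era reader):

```scala
// `sqlContext` is an existing SQLContext
val df = sqlContext.parquetFile("hdfs:///data/file.parquet")
df.cache()
df.count() // materializes the cache, so this is where the cost appears
```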

Re: input size too large | Performance issues with Spark

2015-04-02 Thread Christian Perez
we want to use Spark to provide us the capability to process our >> in-memory data structure very fast as well as scale to a larger volume >> when >> required in the future. >> >> >> >> -- >> View this message in context: >> http://apache-spark-user-l

Re: persist(MEMORY_ONLY) takes lot of time

2015-04-02 Thread Christian Perez
> http://apache-spark-user-list.1001560.n3.nabble.com/persist-MEMORY-ONLY-takes-lot-of-time-tp22343.html > Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: saveAsTable broken in v1.3 DataFrames?

2015-03-20 Thread Christian Perez
Any other users interested in a feature DataFrame.saveAsExternalTable() for making _useful_ external tables in Hive, or am I the only one? Bueller? If I start a PR for this, will it be taken seriously? On Thu, Mar 19, 2015 at 9:34 AM, Christian Perez wrote: > Hi Yin, > > Thank

Re: saveAsTable broken in v1.3 DataFrames?

2015-03-19 Thread Christian Perez
the > improvement on the output of DESCRIBE statement. > > On Thu, Mar 19, 2015 at 12:11 PM, Yin Huai wrote: >> >> Hi Christian, >> >> Your table is stored correctly in Parquet format. >> >> For saveAsTable, the table created is not a Hive table, but

saveAsTable broken in v1.3 DataFrames?

2015-03-19 Thread Christian Perez
alized properly on receive. I'm tracing execution through source code... but before I get any deeper, can anyone reproduce this behavior? Cheers, Christian -- Christian Perez Silicon Valley Data Science Data Analyst christ...@svds.com @cp_phd -

SV: Pyspark Hbase scan.

2015-03-12 Thread Castberg, René Christian
Sorry, forgot to attach the traceback. Regards Rene Castberg From: Castberg, René Christian Sent: 13 March 2015 07:13 To: user@spark.apache.org Cc: gen tang Subject: SV: Pyspark Hbase scan. Hi, I have now successfully managed to test this in a local spark

SV: Pyspark Hbase scan.

2015-03-12 Thread Castberg, René Christian
From: gen tang Sent: 5 February 2015 11:38 To: Castberg, René Christian Cc: user@spark.apache.org Subject: Re: Pyspark Hbase scan. Hi, In fact, this pull https://github.com/apache/spark/pull/3920 is to do HBase scan. However, it is not merged yet. You can also take a look at the example c

New guide on how to write a Spark job in Clojure

2015-02-24 Thread Christian Betz
Hi all, Maybe some of you are interested: I wrote a new guide on how to start using Spark from Clojure. The tutorial covers * setting up a project, * doing REPL- or Test Driven Development of Spark jobs * Running Spark jobs locally. Just read it on https://gorillalabs.github.io/spa

Re: Datastore HDFS vs Cassandra

2015-02-11 Thread Christian Betz
Hi, Regarding the Cassandra data model, there's an excellent post on the eBay tech blog: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/. There's also a slideshare for this somewhere. Happy hacking Chris From: Franc Carter mailto:franc.car...@rozettatech.

Pyspark Hbase scan.

2015-02-05 Thread Castberg, René Christian
Hi, I am trying to do an HBase scan and read it into a Spark RDD using PySpark. I have successfully written data to HBase from PySpark, and been able to read a full table from HBase using the Python example code. Unfortunately I am unable to find any example code for doing an HBase scan and rea

Re: Spark 1.1 / cdh4 stuck using old hadoop client?

2014-09-16 Thread Christian Chua
Is 1.0.8 working for you? You indicated your last known good version is 1.0.0. Maybe we can track down where it broke. > On Sep 16, 2014, at 12:25 AM, Paul Wais wrote: > > Thanks Christian! I tried compiling from source but am still getting the > same hadoop client versio

Re: Spark 1.1 / cdh4 stuck using old hadoop client?

2014-09-15 Thread Christian Chua
vior where spark-submit --master yarn-cluster ... will work, but spark-submit --master yarn-client ... will fail. But on the personal build obtained from the command above, both will then work. -Christian On Sep 15, 2014, at 6:28 PM, Paul Wais wrote: > Dear List, > >

Re: K-NN by efficient sparse matrix product

2014-05-28 Thread Christian Jauvin
nished. In theory, this has complexity max(nnz(L)*log p, nnz(L)*n/p). I >> have to warn though: when I played with matrix multiplication, I was getting >> nowhere near serial performance. >> >> >> On Wed, May 28, 2014 at 11:00 AM, Christian Jauvin >> wrote:

K-NN by efficient sparse matrix product

2014-05-28 Thread Christian Jauvin
> way, i.e. each compute node can simply process a slice of the rows. Would there be a way to do something similar (or related) with Spark? Christian
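
A rough sketch of that row-slicing scheme in Spark (shapes and values hypothetical; the per-row nearest-neighbour selection would follow from these products): broadcast the smaller dense matrix B, then let each partition multiply its slice of the sparse rows of L independently.

```scala
// Sparse rows of L: (rowIndex, Seq of (colIndex, value)); here L is 2 x 4
val L = sc.parallelize(Seq(
  (0, Seq((1, 2.0), (3, 1.0))),
  (1, Seq((0, 4.0)))
))

// Dense B (4 x 2), shipped once per executor via broadcast
val B = Array(Array(1.0, 2.0), Array(3.0, 4.0), Array(5.0, 6.0), Array(7.0, 8.0))
val bB = sc.broadcast(B)

// Each partition computes its slice of rows of L * B with no shuffling
val product = L.mapValues { entries =>
  val out = new Array[Double](bB.value(0).length)
  for ((j, v) <- entries; k <- out.indices) out(k) += v * bB.value(j)(k)
  out
}
product.collect().foreach { case (i, row) => println(s"row $i: ${row.mkString(", ")}") }
```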

Re: Job aborted: Spark cluster looks down

2014-03-06 Thread Christian
Hello, has anyone found this problem before? I am sorry to insist, but I cannot guess what is happening. Should I ask on the dev mailing list? Many thanks in advance. On 05/03/2014 23:57, "Christian" wrote: > I have deployed a Spark cluster in standalone mode with 3 machin

Job aborted: Spark cluster looks down

2014-03-05 Thread Christian
0 :::192.168.1.4:57297:::* LISTEN 7543/java I am completely blocked at this, any help would be very helpful to me. Many thanks in advance. Christian spark-cperez-org.apache.spark.deploy.master.Master-1-node1.out Description: Binary data spark-cperez-org.apache.spark.dep

Re: pyspark and Python virtual enviroments

2014-03-05 Thread Christian
Thanks Bryn. On Wed, Mar 5, 2014 at 9:00 PM, Bryn Keller wrote: > Hi Christian, > > The PYSPARK_PYTHON environment variable specifies the python executable to > use for pyspark. You can put the path to a virtualenv's python executable > and it will work fine. Remember you h

pyspark and Python virtual enviroments

2014-03-05 Thread Christian
://virtualenv.readthedocs.org/en/latest/virtualenv.html Thanks in advance, Christian