Local vs Cluster

2018-09-14 Thread Aakash Basu
Hi, what is the Spark cluster equivalent of standalone's local[N]? I mean, which parameter in cluster mode takes the value N that we pass to local? Thanks, Aakash.

Re: Local vs Cluster

2018-09-14 Thread Mich Talebzadeh
Local: only one JVM, which runs on the host where you submitted the job: ${SPARK_HOME}/bin/spark-submit \ --master local[N] \ Standalone: uses Spark's own scheduler: ${SPARK_HOME}/bin/spark-submit \ --master spark:// \ Where IP_ADDRESS is the host your Spark master sta

Re: Local vs Cluster

2018-09-14 Thread Apostolos N. Papadopoulos
Hi Aakash, in the cluster you need to consider the total number of executors you are using. Please take a look at the following link for an introduction: https://spoddutur.github.io/spark-notes/distribution_of_executors_cores_and_memory_for_spark_application.html Regards, Apostolos On
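
A rough sketch of the arithmetic behind this thread, not taken from any of the replies. It assumes the usual reading that local[N] gives N worker threads in a single JVM, while on a cluster the number of tasks that can run at once is the number of executors times the cores per executor (commonly set with --num-executors and --executor-cores on YARN, or capped with --total-executor-cores under the standalone master). The function names are illustrative and are not Spark APIs.

```python
def local_task_slots(n_threads: int) -> int:
    """local[N]: one JVM with N worker threads, so N tasks can run at once."""
    return n_threads


def cluster_task_slots(num_executors: int, executor_cores: int, task_cpus: int = 1) -> int:
    """Concurrent tasks on a cluster, assuming each task needs task_cpus cores
    (spark.task.cpus, which defaults to 1)."""
    return num_executors * (executor_cores // task_cpus)


def submit_flags(num_executors: int, executor_cores: int) -> list:
    """Build the spark-submit flags that play the role of N on YARN."""
    return [
        "--num-executors", str(num_executors),
        "--executor-cores", str(executor_cores),
    ]


if __name__ == "__main__":
    # Matching local[8] with 2 executors of 4 cores each (example numbers only).
    print(local_task_slots(8), cluster_task_slots(2, 4))
    print(" ".join(submit_flags(2, 4)))
```

Under these assumptions there is no single cluster parameter that replaces N; it is the product of two settings that should be compared with it.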

Is there any open source framework that converts Cypher to SparkSQL?

2018-09-14 Thread kant kodali
Hi All, Is there any open source framework that converts Cypher to SparkSQL? Thanks!

DAGScheduler in SparkStreaming

2018-09-14 Thread Guillermo Ortiz
A question: if you use Spark Streaming, is the DAG calculated for each microbatch? Is it possible to calculate it only the first time?

Spark2 DynamicAllocation doesn't release executors that used cache

2018-09-14 Thread Sergejs Andrejevs
Hi, We're starting to use Spark2 with use cases for Dynamic Allocation. However, we noticed that it doesn't work as expected when a dataset is cached and uncached (persist and unpersist). The cluster runs with: CDH 5.15.0, Spark 2.3.0, Oracle Java 8.131. The following configs are passed to spark (as well as se
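
The configuration list in this message is cut off, so the following is only a hedged sketch of the settings that usually matter for this kind of report, not the poster's actual configuration. In Spark, executors holding cached blocks are governed by spark.dynamicAllocation.cachedExecutorIdleTimeout (infinite by default), which is separate from spark.dynamicAllocation.executorIdleTimeout (60s by default). Whether that explains the behaviour described here is an assumption. The helper names and the 600s value are illustrative.

```python
def dynamic_allocation_conf(idle_timeout: str = "60s",
                            cached_idle_timeout: str = "600s") -> dict:
    """Return an example set of dynamic-allocation properties.

    The timeout values are placeholders chosen for illustration only.
    """
    return {
        "spark.dynamicAllocation.enabled": "true",
        # The external shuffle service is normally required on YARN.
        "spark.shuffle.service.enabled": "true",
        # Applies to idle executors that hold no cached blocks.
        "spark.dynamicAllocation.executorIdleTimeout": idle_timeout,
        # Applies to idle executors that do hold cached blocks.
        "spark.dynamicAllocation.cachedExecutorIdleTimeout": cached_idle_timeout,
    }


def to_submit_args(conf: dict) -> list:
    """Turn a property dict into repeated --conf key=value arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(dynamic_allocation_conf())))
```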

Re: Unsubscribe

2018-09-14 Thread Mohan Palavancha
On Thu, Sep 13, 2018 at 7:47 PM Pekka Lehtonen wrote: > >

What is the best way for Spark to read HDF5@scale?

2018-09-14 Thread kathleen li
Hi, Is there any Spark connector for HDF5? The following link does not work anymore: https://www.hdfgroup.org/downloads/spark-connector/ Thanks, Kathleen

Re: Python Dependencies Issue on EMR

2018-09-14 Thread Patrick McCarthy
You didn't say how you're zipping the dependencies, but I'm guessing you either include .egg files or zipped up a virtualenv. In either case, the extra C stuff that scipy and pandas rely upon doesn't get included. An approach like this solved the last problem I had that seemed like this - https://

Re: StackOverflow Error when run ALS with 100 iterations

2018-09-14 Thread LeoB
Just wanted to add a comment to the Jira ticket but I don't think I have permission to do so, so answering here instead. I am encountering the same issue with a stackOverflow Exception. I would like to point out that there is a localCheckpoint

[SparkSQL] Count Distinct issue

2018-09-14 Thread Daniele Foroni
Hi all, I am having some trouble doing a count distinct over multiple columns. This is an example of my data:
+----+----+----+---+
|a   |b   |c   |d  |
+----+----+----+---+
|null|null|null|1  |
|null|null|null|2  |
|null|null|null|3  |
|null|null|null|4  |
|null|null|null|5  |
|null|null|null|
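
The message is truncated before it states the actual problem, so this is a guess at the likely cause rather than a confirmed diagnosis. In SQL, and in Spark SQL's count distinct over several columns, a row is skipped when any of the counted columns is NULL. The plain-Python sketch below emulates that rule on the five complete rows visible in the message; count_distinct is a hypothetical helper, not a Spark function.

```python
def count_distinct(rows, columns):
    """Count distinct column tuples, skipping rows with a null in any counted
    column, which mirrors SQL COUNT(DISTINCT c1, c2, ...) semantics."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if any(value is None for value in key):
            continue  # a single null excludes the whole row from the count
        seen.add(key)
    return len(seen)


# The five complete rows visible in the message: a, b, c are null and d is 1..5.
rows = [{"a": None, "b": None, "c": None, "d": d} for d in range(1, 6)]

if __name__ == "__main__":
    print(count_distinct(rows, ["a", "b", "c", "d"]))  # every row has a null
    print(count_distinct(rows, ["d"]))                 # only d is counted
```

If this is the issue, counting over all four columns gives zero for these rows even though d alone has five distinct values.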

[Spark SQL] Catalyst ScalaReflection/ExpressionEncoder fail with relocated (shaded) classes

2018-09-14 Thread johkelly
Hello, I'm trying to compile Google's timestamp.proto protobuf to a Scala case class and use it as a field in another proto-derived case class as part of a larger dataset schema. (Although the SQL date type might be preferred in a schema, I encountered this problem when I attempted to use Timestam