Re: How to connect Tableau to Databricks Spark?

2017-01-08 Thread Jörn Franke
Firewall Ports open? Hint: for security reasons you should not connect via the internet.

> On 9 Jan 2017, at 04:30, Raymond Xie wrote:
>
> I want to do some data analytics work by leveraging Databricks spark platform
> and connect my Tableau desktop to it for data visualization.
>
> Does anyo

Re: Unable to explain the job kicked off for spark.read.csv

2017-01-08 Thread Hyukjin Kwon
Oh, I mean another job would *not* happen if the schema is explicitly given.

2017-01-09 16:37 GMT+09:00 Hyukjin Kwon:
> Hi Appu,
>
> I believe that textFile and filter came from...
>
> https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/
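
A minimal sketch of what this means in practice, assuming a hypothetical file and column layout (not the thread's actual data): with an explicit schema, spark.read.csv has nothing to infer, so no extra job is launched to sample the file.

import org.apache.spark.sql.types._

// Hypothetical columns; substitute the real layout.
val schema = StructType(Seq(
  StructField("timestamp", StringType, nullable = true),
  StructField("site", StringType, nullable = true),
  StructField("requests", IntegerType, nullable = true)
))

// Schema supplied up front => no schema-inference job is kicked off.
val df = spark.read
  .option("header", "true")
  .schema(schema)
  .csv("s3://some-bucket/pageviews.csv")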

Re: Unable to explain the job kicked off for spark.read.csv

2017-01-08 Thread Hyukjin Kwon
Hi Appu,

I believe that textFile and filter came from...

https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L59-L61

It needs to read the first line even if using the header is disabled and schema inference is

Re: Docker image for Spark streaming app

2017-01-08 Thread shyla deshpande
Just want to clarify: I want to run a single-node Spark cluster in a Docker container. I want to use Spark 2.0+ and Scala. I am looking for a Docker image from Docker Hub. I want this setup for development and testing purposes. Please let me know if you know of any Docker image that can help me get sta

Re: handling of empty partitions

2017-01-08 Thread Holden Karau
Hi Georg,

Thanks for the question along with the code (as well as posting to Stack Overflow). In general, if a question is well suited for Stack Overflow it's probably better suited to the user@ list instead of the dev@ list, so I've cc'd the user@ list for you. As far as handling empty partitions wh
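
A rough sketch of one way to guard against empty partitions in mapPartitions, assuming that is the situation being asked about (data and names here are illustrative only):

// More partitions than elements guarantees some partitions are empty.
val rdd = sc.parallelize(1 to 10, numSlices = 20)

val partialSums = rdd.mapPartitions { iter =>
  if (iter.isEmpty) Iterator.empty     // skip per-partition setup entirely
  else Iterator(iter.sum)              // e.g. one partial sum per non-empty partition
}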

Re: Efficient look up in Key Pair RDD

2017-01-08 Thread ayan guha
It is a Hive construct, supported since Hive 0.10, so I would be very surprised if Spark does not support it... can't speak for Spark 2.0 (not got a chance to touch it yet :) )

On Mon, Jan 9, 2017 at 2:33 PM, Anil Langote wrote:
> Does it support in Spark Dataset 2.0 ?
>
> Thank you
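
For illustration, a hedged sketch of GROUPING SETS from Spark SQL, with a made-up table and columns (the thread's actual schema is not shown): one query computes the aggregate for several GROUP BY combinations at once.

df.createOrReplaceTempView("events")

val aggregated = spark.sql("""
  SELECT attribute_0, attribute_1, SUM(value) AS total
  FROM events
  GROUP BY attribute_0, attribute_1
  GROUPING SETS ((attribute_0), (attribute_1), (attribute_0, attribute_1))
""")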

Re: A note about MLlib's StandardScaler

2017-01-08 Thread Holden Karau
Hi Gilad,

Spark uses the sample standard deviation inside of the StandardScaler (see https://spark.apache.org/docs/2.0.2/api/scala/index.html#org.apache.spark.mllib.feature.StandardScaler ), which I think would explain the results you are seeing. I believe the scalers are intended to
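
A tiny sketch that makes the distinction concrete, on illustrative data: for the values 1, 2, 3 the sample standard deviation is sqrt(((1-2)^2 + (2-2)^2 + (3-2)^2) / (3 - 1)) = 1.0, i.e. the denominator is n - 1, not n.

import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors

val data = sc.parallelize(Seq(
  Vectors.dense(1.0),
  Vectors.dense(2.0),
  Vectors.dense(3.0)
))

// withStd scales by the sample standard deviation (n - 1 in the denominator).
val model = new StandardScaler(withMean = true, withStd = true).fit(data)
val scaled = model.transform(data).collect()
// mean = 2.0, sample std = 1.0 => scaled values -1.0, 0.0, 1.0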

How to connect Tableau to Databricks Spark?

2017-01-08 Thread Raymond Xie
I want to do some data analytics work by leveraging the Databricks Spark platform and connect my Tableau desktop to it for data visualization. Has anyone ever made it work? I've been trying to follow the instructions below but have not been successful:

https://docs.cloud.databricks.com/docs/latest/databricks_guide/01%20

Re: Efficient look up in Key Pair RDD

2017-01-08 Thread Anil Langote
Hi Ayan,

Thanks a lot for the reply. What is GROUPING SET? I did try GROUP BY with a UDAF but it doesn't perform well: one combination takes 1.5 mins, and in my use case I have 400 combinations, which will take ~400 mins. I am looking for a solution which will scale with the combinations.

Thank y

Re: Efficient look up in Key Pair RDD

2017-01-08 Thread ayan guha
Have you tried something like GROUPING SET? That seems to be the exact thing you are looking for.

On Mon, Jan 9, 2017 at 12:37 PM, Anil Langote wrote:
> Sure. Let me explain you my requirement I have an input file which has
> attributes (25) and las column is array of doubles (14500 elements

subscription

2017-01-08 Thread Raymond Xie
Sincerely yours,
Raymond

Re: Efficient look up in Key Pair RDD

2017-01-08 Thread Anil Langote
Sure. Let me explain my requirement. I have an input file which has 25 attributes, and the last column is an array of doubles (14500 elements in the original file):

Attribute_0  Attribute_1  Attribute_2  Attribute_3  DoubleArray
5            3            5            3            0.2938933463658645 0.0437040427073041 0.23002681025029648 0.18003221

Re: Efficient look up in Key Pair RDD

2017-01-08 Thread Holden Karau
To start with, caching and having a known partitioner will help a bit; then there is also the IndexedRDD project, but in general Spark might not be the best tool for the job. Have you considered having Spark output to something like memcache? What's the goal you are trying to accomplish?

On Sun,
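
A bare-bones sketch of the first suggestion (caching plus a known partitioner), on illustrative data: once the RDD is hash-partitioned and cached, lookup(key) only evaluates the single partition the key hashes to, rather than scanning all of them.

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(0 until 10000000).map(i => (i.toString, i))
val partitioned = pairs.partitionBy(new HashPartitioner(128)).cache()
partitioned.count() // materialize the cache once

val hits = partitioned.lookup("12345") // touches only one partition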

Efficient look up in Key Pair RDD

2017-01-08 Thread Anil Langote
Hi All,

I have a requirement where I wanted to build a distributed HashMap which holds 10M key-value pairs and provides very efficient lookups for each key. I tried loading the file into a JavaPairRDD and tried calling the lookup method; it is very slow. How can I achieve a very fast lookup by a gi

Re: Docker image for Spark streaming app

2017-01-08 Thread vvshvv
Hi,

I am running a Spark Streaming job using spark-jobserver via this image: https://hub.docker.com/r/depend/spark-jobserver/. It works well in standalone mode (using Mesos, the job does not make progress). The spark-jobserver that supports Spark 2.0 has a new API that is only suitable for non-streaming jobs, as f

Docker image for Spark streaming app

2017-01-08 Thread shyla deshpande
I am looking for a Docker image that I can use from Docker Hub for running a Spark Streaming app with Scala and Spark 2.0+. I am new to Docker and unable to find an image on Docker Hub that suits my needs. Please let me know if anyone is using Docker for a Spark Streaming app and share your exper

[META] What to do about "This post has NOT been accepted by the mailing list yet." ?

2017-01-08 Thread dpapathanasiou
I both subscribed to the mailing list and registered with Nabble, and I was able to post my question[1] there yesterday. While it has gotten a few views, there is no answer yet, and I also saw this warning: "This post has NOT been accepted by the mailing list yet." How do I resolve that, so the

Re: Storage history in web UI

2017-01-08 Thread Appu K
@jacek - thanks a lot for the book

@joe - looks like the REST API also exposes a few things like

/applications/[app-id]/storage/rdd
/applications/[app-id]/storage/rdd/[rdd-id]

that might perhaps be of interest to you? http://spark.apache.org/docs/latest/monitoring.html

On 9 January 2017 at 1

Re: Storage history in web UI

2017-01-08 Thread Jacek Laskowski
Hi,

A possible workaround... use SparkListener and save the results to a custom sink. After all, the web UI is a mere bag of SparkListeners + excellent visualizations.

Jacek

On 3 Jan 2017 4:14 p.m., "Joseph Naegele" wrote:
> Hi all, Is there any way to observe Storage history in Spark, i.e. which RD
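
A bare-bones sketch of that workaround, with a mutable buffer standing in for the custom sink (a real sink could be a file or a database; the class and field names below are illustrative, not from the thread):

import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}
import scala.collection.mutable.ArrayBuffer

class StorageHistoryListener extends SparkListener {
  val history = ArrayBuffer[String]()

  override def onStageCompleted(stage: SparkListenerStageCompleted): Unit = {
    // StageInfo.rddInfos carries per-RDD storage numbers for the stage.
    stage.stageInfo.rddInfos.foreach { rdd =>
      history += s"stage=${stage.stageInfo.stageId} rdd=${rdd.name} " +
        s"memSize=${rdd.memSize} diskSize=${rdd.diskSize}"
    }
  }
}

// Register it so events flow in as jobs run:
// sc.addSparkListener(new StorageHistoryListener())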

Spark UI - Puzzling “Input Size / Records” in Stage Details

2017-01-08 Thread Appu K
Was trying something basic to understand tasks, stages and shuffles a bit better in Spark. The dataset is 256 MB. Tried this in Zeppelin:

val tmpDF = spark.read
  .option("header", "true")
  .option("delimiter", ",")
  .option("inferSchema", "true")
  .csv("s3://l4b-d4t4/wikipedia/pageviews-by-secon

Unable to explain the job kicked off for spark.read.csv

2017-01-08 Thread Appu K
I was trying to create a base DataFrame in an EMR cluster from a CSV file using

val baseDF = spark.read.csv("s3://l4b-d4t4/wikipedia/pageviews-by-second-tsv")

Omitted the options to infer the schema and specify the header, just to understand what happens behind the scenes. The Spark UI shows t

Re: Spark 2.0.2, KryoSerializer, double[] is not registered.

2017-01-08 Thread Sean Owen
Yes, the class Kryo seems to register all of these primitive array classes already. That's been true for a long time. However, in KryoSerializer I see the following. I wonder why double, int and other primitive array types are missing? I can't think of a good reason. Imran, do you know? I think you

Re: Spark 2.0.2, KryoSerializer, double[] is not registered.

2017-01-08 Thread Yan Facai
Hi, Owen, it is fixed after registering manually:

conf.registerKryoClasses(Array(Class.forName("[D")))

I believe that Kryo (latest version) has supported double[]:

addDefaultSerializer(double[].class, DoubleArraySerializer.class);
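
For reference, the same registration can also be written with classOf instead of the "[D" class-name literal; a sketch covering double[] plus two sibling primitive array types:

val conf = new org.apache.spark.SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(
    classOf[Array[Double]], // double[], the class named in the error
    classOf[Array[Int]],    // int[]
    classOf[Array[Long]]    // long[]
  ))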

Re: Spark 2.0.2, KryoSerializer, double[] is not registered.

2017-01-08 Thread Sean Owen
Double[] is not of the same class as double[]. Kryo should already know how to serialize double[], but I doubt Double[] is registered. The error does seem to clearly indicate double[] though. That surprises me. Can you try manually registering it to see if that fixes it? But then I'm not sure why
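
A quick way to see the distinction Sean is drawing, in a Scala REPL (illustrative only):

classOf[Array[Double]].getName           // "[D" -> primitive double[]
classOf[Array[java.lang.Double]].getName // "[Ljava.lang.Double;" -> boxed Double[]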