RE: what is the optimized way to combine multiple dataframes into one dataframe ?

2016-11-15 Thread Shreya Agarwal
If you are reading all these datasets from files in persistent storage, functions like sc.textFile can take folders/patterns as input and read all of the files matching into the same RDD. Then you can convert it to a dataframe. When you say it is time consuming with union, how are you measuring
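
A minimal sketch of that approach, assuming the inputs are comma-separated text files; the path, pattern, and app name below are illustrative, and the column names follow the original question:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("combine-files").getOrCreate()
    import spark.implicits._

    // sc.textFile accepts directories, globs, and comma-separated lists of paths,
    // so every matching file lands in the same RDD.
    val lines = spark.sparkContext.textFile("/data/interest/*.csv")

    // Parse each line and convert the RDD into a single DataFrame.
    val df = lines.map(_.split(","))
      .map(a => (a(0), a(1), a(2).toDouble))
      .toDF("client_id", "product_id", "interest")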

RE: AVRO File size when caching in-memory

2016-11-15 Thread Shreya Agarwal
(Adding user@spark back to the discussion) Well, the CSV vs Avro difference might be simpler to explain. CSV has a lot of scope for compression. On the other hand, Avro and Parquet are already compressed and just store extra schema info, afaik. Avro and Parquet are both going to make your data smaller, par

what is the optimized way to combine multiple dataframes into one dataframe ?

2016-11-15 Thread Devi P.V
Hi all, I have 4 dataframes with three columns: client_id, product_id, interest. I want to combine these 4 dataframes into one dataframe. I used union like the following: df1.union(df2).union(df3).union(df4) But it is time consuming for big data. What is the optimized way of doing this using Spark 2.0
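
For reference, a minimal sketch of folding the union instead of chaining it by hand, assuming all four frames share the same schema (df1..df4 are the names from the question):

    // union is a narrow transformation; the fold builds the same plan
    // without writing each union out manually.
    val frames = Seq(df1, df2, df3, df4)
    val combined = frames.reduce(_ union _)

The union itself is cheap to plan; the time usually goes into whatever action is run on the combined frame afterwards.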

Re: AVRO File size when caching in-memory

2016-11-15 Thread Prithish
Anyone? On Tue, Nov 15, 2016 at 10:45 AM, Prithish wrote: > I am using 2.0.1 and databricks avro library 3.0.1. I am running this on > the latest AWS EMR release. > > On Mon, Nov 14, 2016 at 3:06 PM, Jörn Franke wrote: > >> spark version? Are you using tungsten? >> >> > On 14 Nov 2016, at 10:05

Re: Joining to a large, pre-sorted file

2016-11-15 Thread Rohit Verma
You can try coalesce on the join statement: val result = master.join(transaction, "key").coalesce(<number of partitions in master>) On Nov 15, 2016, at 8:07 PM, Stuart White wrote: It seems that the number of files could possibly get out of hand using this approach.
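
A sketch of that suggestion with the placeholder filled in; master, transaction, and "key" follow the names used in the thread, and the partition count is simply taken from master:

    // Join, then shrink the number of output partitions back to master's count.
    val result = master.join(transaction, "key")
      .coalesce(master.rdd.getNumPartitions)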

Re: Does the delegator map task of SparkLauncher need to stay alive until Spark job finishes ?

2016-11-15 Thread Elkhan Dadashov
Thanks for the clarification, Marcelo. On Tue, Nov 15, 2016 at 6:20 PM Marcelo Vanzin wrote: > On Tue, Nov 15, 2016 at 5:57 PM, Elkhan Dadashov > wrote: > > This is confusing in the sense that, the client needs to stay alive for > > Spark Job to finish successfully. > > > > Actually the client

Re: Problem submitting a spark job using yarn-client as master

2016-11-15 Thread Rohit Verma
You can set HDFS as the default filesystem: sparkSession.sparkContext().hadoopConfiguration().set("fs.defaultFS", "hdfs://master_node:8020"); Regards Rohit On Nov 16, 2016, at 3:15 AM, David Robison wrote: I am trying to submit a spark job through the yarn-client master
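
The same idea as a short Scala sketch; the app name, host, and port are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("yarn-client-job").getOrCreate()
    // Point the default filesystem at the cluster's HDFS so relative paths resolve there.
    spark.sparkContext.hadoopConfiguration.set("fs.defaultFS", "hdfs://master_node:8020")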

Re: Does the delegator map task of SparkLauncher need to stay alive until Spark job finishes ?

2016-11-15 Thread Marcelo Vanzin
On Tue, Nov 15, 2016 at 5:57 PM, Elkhan Dadashov wrote: > This is confusing in the sense that, the client needs to stay alive for > Spark Job to finish successfully. > > Actually the client can die or finish (in Yarn-cluster mode), and the spark > job will successfully finish. That's an internal

Re: Does the delegator map task of SparkLauncher need to stay alive until Spark job finishes ?

2016-11-15 Thread Elkhan Dadashov
Hi Marcelo, This part of the JavaDoc is confusing: https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java " * In *cluster mode*, this means that the client that launches the * application *must remain alive for the duration of the applica

Re: Spark-xml - OutOfMemoryError: Requested array size exceeds VM limit

2016-11-15 Thread Arun Patel
Thanks for the quick response. It's a single XML file and I am using a top-level rowTag. So, it creates only one row in a DataFrame with 5 columns. One of these columns will contain most of the data as a StructType. Is there a limit on how much data can be stored in a cell of a DataFrame? I will check with ne

Re: Spark-xml - OutOfMemoryError: Requested array size exceeds VM limit

2016-11-15 Thread Hyukjin Kwon
Hi Arun, I have a few questions. Does your XML file have a few huge documents? In the case of a row having a huge size (like 500MB), it would consume a lot of memory, because it has to hold at least a whole row to iterate, if I remember correctly. I remember this happened to me before while proc

Spark-xml - OutOfMemoryError: Requested array size exceeds VM limit

2016-11-15 Thread Arun Patel
I am trying to read an XML file which is 1GB in size. I am getting an error 'java.lang.OutOfMemoryError: Requested array size exceeds VM limit' after reading 7 partitions in local mode. In Yarn mode, it throws a 'java.lang.OutOfMemoryError: Java heap space' error after reading 3 partitions. Any su
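
For context, a hedged sketch of how such a file is typically read with spark-xml; the format name and rowTag option come from the library, while the tag and path are placeholders. Picking a rowTag that repeats many times in the document yields many small rows instead of one enormous row.

    // Requires the spark-xml package (e.g. com.databricks:spark-xml) on the classpath.
    val df = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "record")   // placeholder: a tag that repeats per logical record
      .load("/path/to/file.xml")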

Access broadcast variable from within function passed to reduceByKey

2016-11-15 Thread coolgar
For example: rows.reduceByKey(reduceKeyMapFunction) reduceKeyMapFunction(log1: Map[String, Long], log2: Map[String, Long]): Map[String,Long] = { val bcast = broadcastv.value val countFields = dbh.getCountFields val aggs: Map[String, Long] = Map() countFields.foreach { f =>
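
A self-contained sketch of the pattern, with placeholder field names (rows and sc are as in the original snippet): broadcast the lookup once on the driver, capture the Broadcast handle in the function, and call .value inside it on the executors.

    // countFieldsBcast is a Broadcast[Seq[String]]; only the handle is captured by the closure.
    val countFieldsBcast = sc.broadcast(Seq("bytes", "requests"))

    def reduceKeyMapFunction(log1: Map[String, Long], log2: Map[String, Long]): Map[String, Long] = {
      countFieldsBcast.value.map { f =>
        f -> (log1.getOrElse(f, 0L) + log2.getOrElse(f, 0L))
      }.toMap
    }

    val aggregated = rows.reduceByKey(reduceKeyMapFunction)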

Re: sbt shenanigans for a Spark-based project

2016-11-15 Thread Marco Mistroni
Uhm, I removed the mvn repo and ivy folder as well. sbt seems to kick in, but for some reason it cannot 'see' org.apache.spark:spark-mllib and therefore my compilation fails. I have temporarily fixed it by placing the spark-mllib jar in my project \lib directory; perhaps I'll try to create a brand new Spark pro
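
In case it helps, a hedged build.sbt sketch for a Spark 2.x project with MLlib; the Scala and Spark versions are examples, not taken from the thread:

    scalaVersion := "2.11.8"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"  % "2.0.1" % "provided",
      "org.apache.spark" %% "spark-sql"   % "2.0.1" % "provided",
      "org.apache.spark" %% "spark-mllib" % "2.0.1" % "provided"
    )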

Problem submitting a spark job using yarn-client as master

2016-11-15 Thread David Robison
I am trying to submit a spark job through the yarn-client master setting. The job gets created and submitted to the clients but immediately errors out. Here is the relevant portion of the log: 15:39:37,385 INFO [org.apache.spark.deploy.yarn.Client] (default task-1) Requesting a new application

Re: Spark Streaming: question on sticky session across batches ?

2016-11-15 Thread Manish Malhotra
Thanks! On Tue, Nov 15, 2016 at 1:07 AM Takeshi Yamamuro wrote: > - dev > > Hi, > > AFAIK, if you use RDDs only, you can control the partition mapping to some > extent > by using a partition key RDD[(key, data)]. > A defined partitioner distributes data into partitions depending on the > key. > A

Re: Cannot find Native Library in "cluster" deploy-mode

2016-11-15 Thread jtgenesis
This link helped solve the issue for me! http://permalink.gmane.org/gmane.comp.lang.scala.spark.user/22700

Re: Spark SQL UDF - passing map as a UDF parameter

2016-11-15 Thread Nirav Patel
Thanks. I tried the following versions. They both compile: val colmap = map(idxMap.flatMap(en => Iterator(lit(en._1), lit(en._2))).toSeq: _*) val colmap = map(idxMap.flatMap(x => x._1 :: x._2 :: Nil).toSeq.map(lit _): _*) However, they fail on a dataframe action like `show` with org.apache.spark.Spark

Re: Spark ML DataFrame API - need cosine similarity, how to convert to RDD Vectors?

2016-11-15 Thread Asher Krim
What language are you using? For Java, you might convert the dataframe to an rdd using something like this: df .toJavaRDD() .map(row -> (SparseVector)row.getAs(row.fieldIndex("columnName"))); On Tue, Nov 15, 2016 at 1:06 PM, Russell Jurney wrote: > I have two dataframes with common feat
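
A Scala sketch of the same conversion plus the mllib cosine-similarity call; df and the "features" column name are placeholders. Note that columnSimilarities() computes similarities between the *columns* of the matrix, so the data may need to be arranged accordingly.

    import org.apache.spark.ml.linalg.{Vector => MLVector}
    import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Pull the ml vectors out of the DataFrame and convert them to mllib vectors.
    val vecRdd = df.select("features").rdd
      .map(row => OldVectors.fromML(row.getAs[MLVector]("features")))

    val sims = new RowMatrix(vecRdd).columnSimilarities()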

Log-loss for multiclass classification

2016-11-15 Thread janardhan shetty
Hi, a best practice for multi-class classification is to evaluate the model by *log-loss*. Is there any JIRA or work going on to implement the same in *MulticlassClassificationEvaluator*? Currently it supports the following: "f1" (default), "weightedPrecision", "weightedRecall", "a

Spark ML DataFrame API - need cosine similarity, how to convert to RDD Vectors?

2016-11-15 Thread Russell Jurney
I have two dataframes with common feature vectors and I need to get the cosine similarity of one against the other. It looks like this is possible in the RDD based API, mllib, but not in ml. So, how do I convert my sparse dataframe vectors into something spark mllib can use? I've searched, but hav

Re: CSV to parquet preserving partitioning

2016-11-15 Thread Daniel Siegmann
Did you try unioning the datasets for each CSV into a single dataset? You may need to put the directory name into a column so you can partition by it. On Tue, Nov 15, 2016 at 8:44 AM, benoitdr wrote: > Hello, > > I'm trying to convert a bunch of csv files to parquet, with the interesting > case
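
A rough sketch of that suggestion, with placeholder paths, a placeholder "dir" column, and the assumption that the CSVs carry a header: each directory's rows get tagged with the directory name, everything is unioned, and the output is written partitioned by that column.

    import org.apache.spark.sql.functions.lit

    val dirs = Seq("dir1", "dir2")
    val all = dirs.map { d =>
        spark.read.option("header", "true").csv(s"/path/$d").withColumn("dir", lit(d))
      }.reduce(_ union _)

    all.write.partitionBy("dir").parquet("/path/parquet_out")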

Re: Spark R guidelines for non-spark functions and coxph (Cox Regression for Time-Dependent Covariates)

2016-11-15 Thread Shivaram Venkataraman
I think the answer to this depends on what granularity you want to run the algorithm on. If it's on the entire Spark DataFrame and you expect the data frame to be very large, then it isn't easy to use the existing R function. However if you want to run the algorithm on smaller subsets of the data

Re: Running stress tests on spark cluster to avoid wild-goose chase later

2016-11-15 Thread Dave Jaffe
Mich- Sparkperf from Databricks (https://github.com/databricks/spark-perf) is a good stress test, covering a wide range of Spark functionality but especially ML. I’ve tested it with Spark 1.6.0 on CDH 5.7. It may need some work for Spark 2.0. Dave Jaffe BigData Performance VMware dja...@vmware

Spark SQL and JDBC compatibility

2016-11-15 Thread herman...@teeupdata.com
Hi everyone, while reading data into Spark 2.0.0 data frames through the Calcite JDBC driver, depending on the Calcite JDBC connection property setup (lexical), sometimes the data frame query returns an empty result set and sometimes it errors out with an exception: java.sql.SQLException: Error while preparing st

GraphX updating vertex property

2016-11-15 Thread Saliya Ekanayake
Hi, I have created a property graph using GraphX. Each vertex has an integer array as a property. I'd like to update the values of these arrays without creating new graph objects. Is this possible in Spark? Thank you, Saliya -- Saliya Ekanayake, Ph.D Applied Computer Scientist Network Dynamic
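
For what it's worth, GraphX graphs are immutable, so an in-place update isn't available; the usual route is a transformation such as mapVertices, which derives a new Graph while reusing the edge structure. A minimal sketch, assuming graph holds Array[Int] vertex properties as described:

    // Produces a new Graph whose vertex arrays are transformed; the edge structure is reused.
    val updated = graph.mapVertices { (id, arr) => arr.map(_ + 1) }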

Running stress tests on spark cluster to avoid wild-goose chase later

2016-11-15 Thread Mich Talebzadeh
Hi, This is rather a broad question. We would like to run a set of stress tests against our Spark clusters to ensure that the build performs as expected before deploying the cluster. The reasoning behind this is that users were reporting some ML jobs running on two equal clusters reporting back

Re: Cannot find Native Library in "cluster" deploy-mode

2016-11-15 Thread jtgenesis
thanks for the feedback. I tried your suggestion using the --files option, but still no luck. I've also checked and made sure that the libraries the .so files need are also there. When I look at the SparkUI/Environment tab, the `spark.executorEnv.LD_LIBRARY_PATH` is pointing to the correct librari

Re: Finding a Spark Equivalent for Pandas' get_dummies

2016-11-15 Thread neil90
You can have a list of all the columns and pass it to a recursive function to fit and make the transformation.
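
A hedged sketch of the usual Spark ML analogue of get_dummies, looping over a list of columns with StringIndexer plus OneHotEncoder; df and the column names are placeholders:

    import org.apache.spark.ml.{Pipeline, PipelineStage}
    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

    val cols = Seq("color", "size")   // placeholder column names
    val stages: Array[PipelineStage] = cols.flatMap { c =>
      Seq(
        new StringIndexer().setInputCol(c).setOutputCol(s"${c}_idx"),
        new OneHotEncoder().setInputCol(s"${c}_idx").setOutputCol(s"${c}_vec")
      )
    }.toArray

    val encoded = new Pipeline().setStages(stages).fit(df).transform(df)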

SQL analyzer breakdown

2016-11-15 Thread Koert Kuipers
We see the analyzer break down almost without fail when programs get to a certain size or complexity. It starts complaining with messages along the lines of "cannot find column x#255 in list of columns that includes x#255". The workaround is to go to rdd and back. Is there a way to achieve the same (

Re: Joining to a large, pre-sorted file

2016-11-15 Thread Stuart White
It seems that the number of files could possibly get out of hand using this approach. For example, in the job that buckets and writes master, assuming we use the default number of shuffle partitions (200), and assuming that in the next job (the job where we join to transaction), we're also going t

creating a javaRDD using newAPIHadoopFile and FixedLengthInputFormat

2016-11-15 Thread David Robison
I am trying to create a Spark javaRDD using newAPIHadoopFile and the FixedLengthInputFormat. Here is my code snippet: Configuration config = new Configuration(); config.setInt(FixedLengthInputFormat.FIXED_RECORD_LENGTH, JPEG_INDEX_SIZE); config.set("fs.hdfs.impl", DistributedFileSystem.class.
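
A Scala sketch of the same setup, with the record length and path as placeholders (sc is an existing SparkContext):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{BytesWritable, LongWritable}
    import org.apache.hadoop.mapreduce.lib.input.FixedLengthInputFormat

    val conf = new Configuration(sc.hadoopConfiguration)
    FixedLengthInputFormat.setRecordLength(conf, 1024)   // placeholder record length

    // Each record comes back as (byte offset, fixed-length byte buffer).
    val records = sc.newAPIHadoopFile(
      "hdfs:///path/to/fixed-length-file",
      classOf[FixedLengthInputFormat],
      classOf[LongWritable],
      classOf[BytesWritable],
      conf)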

CSV to parquet preserving partitioning

2016-11-15 Thread benoitdr
Hello, I'm trying to convert a bunch of csv files to parquet, with the interesting case that the input csv files are already "partitioned" by directory. All the input files have the same set of columns. The input files structure looks like : /path/dir1/file1.csv /path/dir1/file2.csv /path/dir2/fi

Exclude certain data from Training Data - Mlib

2016-11-15 Thread Bhaarat Sharma
I have my data in two datasets: colors and excluded_colors. colors contains all colors; excluded_colors contains some colors that I wish to exclude from my training set. I am trying to split the data into a training and testing set and ensure that the colors in excluded_colors are not in my training set but
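
A minimal sketch of one way to do the split, assuming colors is a DataFrame with a "color" column and the exclusions can be collected into a small list; the names and values below are placeholders:

    import org.apache.spark.sql.functions.col

    val excludedColors = Seq("red", "blue")   // placeholder values

    // Keep the excluded colors out of the candidate pool before splitting.
    val eligible = colors.filter(!col("color").isin(excludedColors: _*))
    val excluded = colors.filter(col("color").isin(excludedColors: _*))

    // Random split of the eligible rows; the excluded rows can be appended to the test side.
    val Array(train, testPart) = eligible.randomSplit(Array(0.8, 0.2), seed = 42)
    val test = testPart.union(excluded)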

Re: scala.MatchError while doing BinaryClassificationMetrics

2016-11-15 Thread Bhaarat Sharma
Thanks, Nick. This worked for me. val evaluator = new BinaryClassificationEvaluator().setLabelCol("label").setRawPredictionCol("ModelProbability").setMetricName("areaUnderROC") val auROC = evaluator.evaluate(testResults) On M

Spark R guidelines for non-spark functions and coxph (Cox Regression for Time-Dependent Covariates)

2016-11-15 Thread pietrop
Hi all, I'm writing here after some intensive usage of pyspark and SparkSQL. I would like to use a well known function in the R world: coxph() from the survival package. From what I understood, I can't parallelize a function like coxph() because it isn't provided with the SparkR package. In other

Streaming - Stop when there's no more data

2016-11-15 Thread Ashic Mahtab
I'm using Spark Streaming to process a large number of files (10s of millions) from a single directory in S3. Using sparkContext.textFile or wholeTextFiles takes ages and doesn't do anything. Pointing Structured Streaming to that location seems to work, but after processing all the input, it wai

Fwd:

2016-11-15 Thread Anton Okolnychyi
Hi, I have experienced a problem using the Datasets API in Spark 1.6, while almost identical code works fine in Spark 2.0. The problem is related to encoders and custom aggregators. *Spark 1.6 (the aggregation produces an empty map):* implicit val intStringMapEncoder: Encoder[Map[Int, String]]

Simple "state machine" functionality using Scala or Python

2016-11-15 Thread Esa Heikkinen
Hello Can anyone provide a simple example of how to implement "state machine" functionality using Scala or Python in Spark? The sequence of the state machine would be like this: 1) Search for the first event of the log and its data 2) Based on the data of the first event, search for the second event of the log and it

Re: Spark Streaming: question on sticky session across batches ?

2016-11-15 Thread Takeshi Yamamuro
- dev Hi, AFAIK, if you use RDDs only, you can control the partition mapping to some extent by using a partition key RDD[(key, data)]. A defined partitioner distributes data into partitions depending on the key. As a good example to control partitions, you can see the GraphX code; https://github.
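
A tiny sketch of that idea, with made-up keys and partition count: once a pair RDD is partitioned by key, records with the same key land in the same partition.

    import org.apache.spark.HashPartitioner

    val pairs = sc.parallelize(Seq(("user-1", "a"), ("user-2", "b"), ("user-1", "c")))
    // Same key -> same partition, for as long as the partitioner is preserved downstream.
    val partitioned = pairs.partitionBy(new HashPartitioner(8))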

Re: Spark SQL UDF - passing map as a UDF parameter

2016-11-15 Thread Takeshi Yamamuro
Hi, Literal cannot handle Tuple2. Anyway, how about this? val rdd = sc.makeRDD(1 to 3).map(i => (i, 0)) map(rdd.collect.flatMap(x => x._1 :: x._2 :: Nil).map(lit _): _*) // maropu On Tue, Nov 15, 2016 at 9:33 AM, Nirav Patel wrote: > I am trying to use following API from Functions to convert

Re: How to read a Multi Line json object via Spark

2016-11-15 Thread Hyukjin Kwon
Hi Sree, There is a blog about that: http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files/ It is pretty old, but I am sure that it is helpful. Currently, the JSON datasource only supports JSON documents formatted according to http://jsonlines.org/ There is an iss
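
For the multi-line case, one common workaround in that era is sketched below (spark is the SparkSession and the path is a placeholder): read each file as a whole string with wholeTextFiles, then feed the strings to the JSON reader. Later Spark releases also added a multiLine option to the JSON reader.

    // Each element of wholeTextFiles is (path, full file content).
    val wholeFiles = spark.sparkContext.wholeTextFiles("/path/to/json-dir")
    val df = spark.read.json(wholeFiles.map { case (_, content) => content })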