Sorting the RDD

2016-03-02 Thread Angel Angel
Hello Sir/Madam, I am trying to sort an RDD using the *sortByKey* function but I am getting the following error. My code: 1) convert the RDD array into key-value pairs, 2) then sort by key, but I get the error *No implicit Ordering defined for Any* [image: Inline image 1] thanks
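
A minimal sketch of the usual fix, assuming the records are comma-separated lines (the column layout is an assumption, not taken from the post): key the pairs by a concrete type such as String or Int rather than Any, so an implicit Ordering is available for sortByKey.

    import org.apache.spark.{SparkConf, SparkContext}

    object SortByKeyExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("sort-by-key"))
        val rdd = sc.parallelize(Seq("b,2", "a,1", "c,3"))
        // Key the records by a concrete, orderable type (String here), not Any
        val pairs = rdd.map { line =>
          val cols = line.split(",")
          (cols(0), cols(1).toInt)
        }
        pairs.sortByKey().collect().foreach(println)
        sc.stop()
      }
    }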

Re: select count(*) return wrong row counts

2016-03-02 Thread Mich Talebzadeh
This works fine scala> sql("use oraclehadoop") res1: org.apache.spark.sql.DataFrame = [result: string] scala> sql("select count(1) from sales").show +---+ |_c0| +---+ |4991761| +---+ You can do "select count(*) from tablename" as it is not dynamic sql. Does it actually work? Sin

Re: Spark on Yarn with Dynamic Resource Allocation. Container always marked as failed

2016-03-02 Thread Xiaoye Sun
Hi Jeff and Prabhu, Thanks for your help. I looked deep into the nodemanager log and found an error message like this: 2016-03-02 03:13:59,692 ERROR org.apache.spark.network.shuffle.ExternalShuffleBlockResolver: error opening leveldb file file:/data/yarn/cache/yarn/nm-local-dir/registere

Re: Spark sql query taking long time

2016-03-02 Thread Ted Yu
Have you seen the thread 'Filter on a column having multiple values' where Michael gave this example ? https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/107522969592/2840265927289860/2388bac36e.html FYI On Wed, Mar 2, 2016 at 9:3

Spark sql query taking long time

2016-03-02 Thread Angel Angel
Hello Sir/Madam, I am writing an application using Spark SQL. I made a very big table using the following command *val dfCustomers1 = sc.textFile("/root/Desktop/database.txt").map(_.split(",")).map(p => Customer1(p(0),p(1).trim.toInt, p(2).trim.toInt, p(3))).toDF* Now I want to search the ad
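
A hedged sketch of querying such a table once it is registered; the column name "address" and the search value are assumptions since the original query is truncated, and sqlContext is the usual Spark 1.x SQLContext.

    // Register the DataFrame so it can be queried with SQL
    dfCustomers1.registerTempTable("customers")

    // Either filter through the DataFrame API...
    dfCustomers1.filter(dfCustomers1("address") === "some address").show()

    // ...or through SQL
    sqlContext.sql("SELECT * FROM customers WHERE address = 'some address'").show()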

Re: Fair scheduler pool details

2016-03-02 Thread Mark Hamstra
If I'm understanding you correctly, then you are correct that the fair scheduler doesn't currently do everything that you want to achieve. Fair scheduler pools currently can be configured with a minimum number of cores that they will need before accepting Tasks, but there isn't a way to restrict a

Re: Using netlib-java in Spark 1.6 on linux

2016-03-02 Thread Sean Owen
This is really more a netlib question, but I'd guess strongly that you haven't installed libgfortran on your machines. OS X doesn't need it; netlib can't provide it though. On Thu, Mar 3, 2016 at 1:06 AM, cindymc wrote: > I want to take advantage of the Breeze linear algebra libraries, built on >

Re: Spark 1.5 on Mesos

2016-03-02 Thread Tim Chen
You shouldn't need to specify --jars at all since you only have one jar. The error is pretty odd as it suggests it's trying to load /opt/spark/Example but that doesn't really seem to be anywhere in your image or command. Can you paste your stdout from the driver task launched by the cluster dispa

RE: Stage contains task of large size

2016-03-02 Thread Silvio Fiorito
One source of this could be more than you intended (or realized) getting serialized as part of your operations. What are the transformations you’re using? Are you referencing local instance variables in your driver app, as part of your transformations? You may have a large collection for insta

Re: Building a REST Service with Spark back-end

2016-03-02 Thread Benjamin Kim
I want to ask about something related to this. Does anyone know if there is or will be a command-line equivalent of the spark-shell client for Livy Spark Server or any other Spark Job Server? The reason that I am asking is that spark-shell does not handle multiple users on the same server well. Since a Spa

convert SQL multiple Join in Spark

2016-03-02 Thread Vikash Kumar
I have to write or convert below SQL query into spark/scala. Anybody can suggest how to implement this in Spark? SELECT a.PERSON_ID as RETAINED_PERSON_ID, a.PERSON_ID, a.PERSONTYPE, 'y' as HOLDOUT,
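
A hedged sketch of expressing such a query through the DataFrame API; the table names and the join key are assumptions, since the query is truncated after the SELECT list.

    import org.apache.spark.sql.functions.lit

    val a = sqlContext.table("person")        // assumed source table
    val b = sqlContext.table("person_group")  // assumed source table

    val result = a.join(b, a("PERSON_ID") === b("PERSON_ID"))
      .select(
        a("PERSON_ID").as("RETAINED_PERSON_ID"),
        a("PERSON_ID"),
        a("PERSONTYPE"),
        lit("y").as("HOLDOUT"))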

Stage contains task of large size

2016-03-02 Thread Bijuna
Spark users, We are running a Spark application in standalone mode. We see warn messages in the logs which say Stage 46 contains a task of very large size (983 KB). The maximum recommended task size is 100 KB. What is the recommended approach to fix this warning? Please let me know. Thank yo

Re: Renaming sc variable in sparkcontext throws task not serializable

2016-03-02 Thread Prashant Sharma
*This is a known issue. * https://issues.apache.org/jira/browse/SPARK-3200 Prashant Sharma On Thu, Mar 3, 2016 at 9:01 AM, Rahul Palamuttam wrote: > Thank you Jeff. > > I have filed a JIRA under the following link : > > https://issues.apache.org/jira/browse/SPARK-13634 > > For some reason th

Re: Spark on Yarn with Dynamic Resource Allocation. Container always marked as failed

2016-03-02 Thread Prabhu Joseph
Are all NodeManager services restarted after the change in yarn-site.xml? On Thu, Mar 3, 2016 at 6:00 AM, Jeff Zhang wrote: > The executor may fail to start. You need to check the executor logs, if > there's no executor log then you need to check node manager log. > > On Wed, Mar 2, 2016 at 4:26 P

Re: Renaming sc variable in sparkcontext throws task not serializable

2016-03-02 Thread Rahul Palamuttam
Thank you Jeff. I have filed a JIRA under the following link : https://issues.apache.org/jira/browse/SPARK-13634 For some reason the spark context is being pulled into the referencing environment of the closure. I also had no problems with batch jobs. On Wed, Mar 2, 2016 at 7:18 PM, Jeff Zhang

Re: Renaming sc variable in sparkcontext throws task not serializable

2016-03-02 Thread Jeff Zhang
I can reproduce it in spark-shell, but it works for a batch job. Looks like a Spark REPL issue. On Thu, Mar 3, 2016 at 10:43 AM, Rahul Palamuttam wrote: > Hi All, > > We recently came across this issue when using the spark-shell and zeppelin. > If we assign the sparkcontext variable (sc) to a new va

Re: rdd cache name

2016-03-02 Thread charles li
thanks a lot, Xinh, that's very helpful for me. On Thu, Mar 3, 2016 at 12:54 AM, Xinh Huynh wrote: > Hi Charles, > > You can set the RDD name before using it. Just do before caching: > (Scala) myRdd.setName("Charles RDD") > (Python) myRdd.setName('Charles RDD') > Reference: PySpark doc: > http:/

Renaming sc variable in sparkcontext throws task not serializable

2016-03-02 Thread Rahul Palamuttam
Hi All, We recently came across this issue when using the spark-shell and zeppelin. If we assign the sparkcontext variable (sc) to a new variable and reference another variable in an RDD lambda expression we get a task not serializable exception. The following three lines of code illustrate this
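
A sketch of the pattern being described, as it would look in the spark-shell (variable names are illustrative, not the original three lines):

    val newSc = sc                                      // alias the SparkContext to a new variable
    val offset = 10                                     // any other driver-side variable
    val rdd = newSc.parallelize(1 to 5).map(_ + offset) // references offset in the closure
    rdd.collect()                                       // throws Task not serializable in the REPL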

select count(*) return wrong row counts

2016-03-02 Thread Jesse F Chen
I am finding a strange issue with Spark SQL where "select count(*) " returns wrong row counts for certain tables. I am using TPCDS tables, so here are the actual counts: Row count

Re: Fair scheduler pool details

2016-03-02 Thread Eugene Morozov
Mark, I'm trying to configure a Spark cluster to share resources between two pools. I can do that by assigning minimal shares (it works fine), but that means a specific amount of cores is going to be wasted by just being ready to run anything. While that's better than nothing, I'd like to specify pe

Re: Spark 1.5 on Mesos

2016-03-02 Thread Ashish Soni
See below and Attached the Dockerfile to build the spark image ( between i just upgraded to 1.6 ) I am running below setup - Mesos Master - Docker Container Mesos Slave 1 - Docker Container Mesos Slave 2 - Docker Container Marathon - Docker Container Spark MESOS Dispatcher - Docker Cont

Re: Mapper side join with DataFrames API

2016-03-02 Thread Deepak Gopalakrishnan
Hello, I'm using 1.6.0 on EMR On Thu, Mar 3, 2016 at 12:34 AM, Yong Zhang wrote: > What version of Spark you are using? > > I am also trying to figure out how to do the map side join in Spark. > > In 1.5.x, there is a broadcast function in the Dataframe, and it caused > OOM for me simple test c
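
For reference, a minimal sketch of a broadcast (map-side) join with the DataFrame API as it exists in 1.5/1.6; largeDf, smallDf and the join column name are placeholders.

    import org.apache.spark.sql.functions.broadcast

    // The broadcast hint asks the planner to replicate the small side to every
    // executor instead of shuffling both sides.
    val joined = largeDf.join(broadcast(smallDf), "key")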

Re: Selecting column in dataframe created with incompatible schema causes AnalysisException

2016-03-02 Thread Michael Armbrust
Note that if you specify the schema that you expect when reading JSON you basically get the "relaxed" mode that you are asking for. Records that don't match will end up with nulls. The problem here is Spark SQL knows that the operation you are asking for is invalid given the set of data you let i
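
A small sketch of that approach, reading JSON with an explicit schema so records that don't match simply come back with nulls (field names are illustrative):

    import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

    val schema = StructType(Seq(
      StructField("id", LongType, nullable = true),
      StructField("name", StringType, nullable = true)))

    val df = sqlContext.read.schema(schema).json("/path/to/data.json")
    df.select("id", "name").show()   // fields missing from a record show up as null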

Using netlib-java in Spark 1.6 on linux

2016-03-02 Thread cindymc
I want to take advantage of the Breeze linear algebra libraries, built on netlib-java, used heavily by SparkML. I've found this amazingly time-consuming to figure out, and have only been able to do so on MacOS. I want to do same on Linux: $ uname -a Linux slc10whv 3.8.13-68.3.4.el6uek.x86_64 #2 S

Re: Does DataFrame.collect() maintain the underlying schema?

2016-03-02 Thread Mohammad Tariq
I think this could be the reason: DataFrame sorts the columns of each record lexicographically if we do a *select **. So, if we wish to maintain a specific column ordering while processing we should do *select col1, col2...* instead of select *. However, this is just what I feel. Let's wait f

Spark job on YARN ApplicationMaster DEBUG log

2016-03-02 Thread Prabhu Joseph
Hi All, I am trying to enable DEBUG logging for the Spark ApplicationMaster but it is not working. On running the Spark job, I passed -Dlog4j.configuration=file:/opt/mapr/spark/spark-1.4.1/conf/log4j.properties The log4j.properties has log4j.rootCategory=DEBUG, console The Spark executor containers have DEBUG logs but
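
A hedged sketch of one way to point the driver/ApplicationMaster JVMs at the log4j file explicitly (the path is the one quoted above; spark.driver.extraJavaOptions applies to the AM in yarn-cluster mode, spark.yarn.am.extraJavaOptions in yarn-client mode):

    spark-submit \
      --files /opt/mapr/spark/spark-1.4.1/conf/log4j.properties \
      --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
      --conf "spark.yarn.am.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
      ...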

Re: spark 1.6 new memory management - some issues with tasks not using all executors

2016-03-02 Thread Koert Kuipers
with the locality issue resolved, i am still struggling with the new memory management. i am seeing tasks on tiny amounts of data take 15 seconds, of which 14 are spent in GC. with the legacy memory management (spark.memory.useLegacyMode = true) they complete in 1 - 2 seconds. since we are perm

Re: Spark on Yarn with Dynamic Resource Allocation. Container always marked as failed

2016-03-02 Thread Jeff Zhang
The executor may fail to start. You need to check the executor logs, if there's no executor log then you need to check node manager log. On Wed, Mar 2, 2016 at 4:26 PM, Xiaoye Sun wrote: > Hi all, > > I am very new to spark and yarn. > > I am running a BroadcastTest example application using spa

Re: getPreferredLocations race condition in spark 1.6.0?

2016-03-02 Thread Andy Sloane
Done, thanks. https://issues.apache.org/jira/browse/SPARK-13631 Will continue discussion there. On Wed, Mar 2, 2016 at 4:09 PM Shixiong(Ryan) Zhu wrote: > I think it's a bug. Could you open a ticket here: > https://issues.apache.org/jira/browse/SPARK > > On Wed, Mar 2, 2016 at 3:46 PM, Andy Sl

Re: getPreferredLocations race condition in spark 1.6.0?

2016-03-02 Thread Shixiong(Ryan) Zhu
I think it's a bug. Could you open a ticket here: https://issues.apache.org/jira/browse/SPARK On Wed, Mar 2, 2016 at 3:46 PM, Andy Sloane wrote: > We are seeing something that looks a lot like a regression from spark 1.2. > When we run jobs with multiple threads, we have a crash somewhere inside

Re: Does DataFrame.collect() maintain the underlying schema?

2016-03-02 Thread Mohammad Tariq
Cool. Here is how it goes... I am reading Avro objects from a Kafka topic as a DStream, converting them into a DataFrame so that I can filter out records based on some conditions and finally do some aggregations on these filtered records. During the process I also need to tag each record based on

Re: Does DataFrame.collect() maintain the underlying schema?

2016-03-02 Thread Sainath Palla
Hi Tariq, Can you tell in brief what kind of operation you have to do? I can try helping you out with that. In general, if you are trying to use any group operations you can use window operations. On Wed, Mar 2, 2016 at 6:40 PM, Mohammad Tariq wrote: > Hi Sainath, > > Thank you for the prompt r

getPreferredLocations race condition in spark 1.6.0?

2016-03-02 Thread Andy Sloane
We are seeing something that looks a lot like a regression from spark 1.2. When we run jobs with multiple threads, we have a crash somewhere inside getPreferredLocations, as was fixed in SPARK-4454. Except now it's inside org.apache.spark.MapOutputTrackerMaster.getLocationsWithLargestOutputs instea

Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-02 Thread Nicholas Chammas
We’re veering off from the original question of this thread, but to clarify, my comment earlier was this: So in short, DataFrames are the “new RDD”—i.e. the new base structure you should be using in your Spark programs wherever possible. RDDs are not going away, and clearly in your case DataFrame

Re: Does DataFrame.collect() maintain the underlying schema?

2016-03-02 Thread Mohammad Tariq
Hi Sainath, Thank you for the prompt response! Could you please elaborate your answer a bit? I'm sorry I didn't quite get this. What kind of operations can I perform using SQLContext? It just helps us during things like DF creation, schema application etc., IMHO. Tariq, Mohamma

Re: Does DataFrame.collect() maintain the underlying schema?

2016-03-02 Thread Sainath Palla
Instead of collecting the data frame, you can try using a sqlContext on the data frame. But it depends on what kind of operations are you trying to perform. On Wed, Mar 2, 2016 at 6:21 PM, Mohammad Tariq wrote: > Hi list, > > *Scenario :* > I am creating a DStream by reading an Avro object from

Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-02 Thread Darren Govoni
Our data is made up of single text documents scraped off the web. We store these in an RDD. A DataFrame or similar structure makes no sense at that point. And the RDD is transient. So my point is: DataFrames should not replace the plain old RDD, since RDDs allow for more flexibility, and sql etc

Does DataFrame.collect() maintain the underlying schema?

2016-03-02 Thread Mohammad Tariq
Hi list, *Scenario :* I am creating a DStream by reading an Avro object from a Kafka topic and then converting it into a DataFrame to perform some operations on the data. I call DataFrame.collect() and perform the intended operation on each Row of Array[Row] returned by DataFrame.collect(). *Prob

Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-02 Thread ayan guha
+1 on all the pointers. @Darren - it would probably good idea to explain your scenario a little more in terms of structured vs un-structured datasets. Then people here can give you better input on how you can use DF. On Thu, Mar 3, 2016 at 9:43 AM, Nicholas Chammas wrote: > Plenty of people ge

Re: Mllib Logistic Regression performance relative to Mahout

2016-03-02 Thread raj.kumar
Thanks Yashwanth, Our features are a mixture of categoric and numeric features. I convert categoric-features into numeric-features with the standard techniques such as one-hot encoding. -Raj -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Mllib-Logistic-R
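
For reference, a sketch of the standard spark.ml one-hot encoding pipeline mentioned here (the DataFrame df and the column name "category" are placeholders):

    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

    // First map category strings to indices, then expand the indices to a sparse 0/1 vector
    val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex")
    val indexed = indexer.fit(df).transform(df)

    val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryVec")
    val encoded = encoder.transform(indexed)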

Re: Spark 1.5 on Mesos

2016-03-02 Thread Charles Allen
@Tim yes, this is asking about 1.5 though On Wed, Mar 2, 2016 at 2:35 PM Tim Chen wrote: > Hi Charles, > > I thought that's fixed with your patch in latest master now right? > > Ashish, yes please give me your docker image name (if it's in the public > registry) and what you've tried and I can s

Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-02 Thread Nicholas Chammas
Plenty of people get their data in Parquet, Avro, or ORC files; or from a database; or do their initial loading of un- or semi-structured data using one of the various data source libraries which help with type-/schema-inference. All of th

Re: Spark 1.5 on Mesos

2016-03-02 Thread Tim Chen
Hi Charles, I thought that's fixed with your patch in latest master now right? Ashish, yes please give me your docker image name (if it's in the public registry) and what you've tried and I can see what's wrong. I think it's most likely just the configuration of where the Spark home folder is in

Re: Spark 1.5 on Mesos

2016-03-02 Thread Charles Allen
Re: Spark on Mesos Warning regarding disk space: https://issues.apache.org/jira/browse/SPARK-12330 That's a spark flaw I encountered on a very regular basis on mesos. That and a few other annoyances are fixed in https://github.com/metamx/spark/tree/v1.5.2-mmx Here's another mild annoyance I'v

Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-02 Thread Darren Govoni
Dataframes are essentially structured tables with schemas. So where does the non typed data sit before it becomes structured if not in a traditional RDD? For us almost all the processing comes before there is structure to it. Sent from my Verizon Wireless 4G LTE smartphone Orig

Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-02 Thread Nicholas Chammas
> However, I believe, investing (or having some members of your group) learn and invest in Scala is worthwhile for few reasons. One, you will get the performance gain, especially now with Tungsten (not sure how it relates to Python, but some other knowledgeable people on the list, please chime in).

Re: Spark Streaming 1.6 mapWithState not working well with Kryo Serialization

2016-03-02 Thread Shixiong(Ryan) Zhu
See https://issues.apache.org/jira/browse/SPARK-12591 After applying the patch, it should work. However, if you want to enable "registrationRequired", you still need to register "org.apache.spark.streaming.util.OpenHashMapBasedStateMap", "org.apache.spark.streaming.util.EmptyStateMap" and "org.apa
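
A sketch of registering those internal classes by name (class names as given above); since they are private to Spark, Class.forName is used instead of a classOf reference.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrationRequired", "true")

    conf.registerKryoClasses(Array(
      Class.forName("org.apache.spark.streaming.util.OpenHashMapBasedStateMap"),
      Class.forName("org.apache.spark.streaming.util.EmptyStateMap")))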

AVRO vs Parquet

2016-03-02 Thread Timothy Spann
Which format is the best for Spark SQL ad-hoc queries and general data storage? There are lots of specialized cases, but generally accessing some but not all of the available columns with a reasonable subset of the data. I am leaning towards Parquet as it has great support in Spark. I also

Spark Streaming 1.6 mapWithState not working well with Kryo Serialization

2016-03-02 Thread Aris
Hello Spark folks and especially TD, I am using the Spark Streaming 1.6 mapWithState API, and I am trying to enforce Kryo Serialization with SparkConf.set("spark.kryo.registrationRequired", "true") However, this appears to be impossible! I registered all the classes that are my own, but I proble

Re: Spark 1.5 on Mesos

2016-03-02 Thread Ashish Soni
I have had no luck, and I would like to ask the Spark committers: will this ever be designed to run on Mesos? A Spark app as a Docker container is not working at all on Mesos. If anyone would like the code, I can send it over to have a look. Ashish On Wed, Mar 2, 2016 at 12:23 PM, Sathish Kumaran

Re: Building a REST Service with Spark back-end

2016-03-02 Thread Guru Medasani
Hi Yanlin, This is a fairly new effort and is not officially released/supported by Cloudera yet. I believe those numbers will be out once it is released. Guru Medasani gdm...@gmail.com > On Mar 2, 2016, at 10:40 AM, yanlin wang wrote: > > Did any one use Livy in real world high concurrency

Spark on Yarn with Dynamic Resource Allocation. Container always marked as failed

2016-03-02 Thread Xiaoye Sun
Hi all, I am very new to spark and yarn. I am running a BroadcastTest example application using spark 1.6.0 and Hadoop/Yarn 2.7.1. in a 5 nodes cluster. I configured my configuration files according to https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation 1. copy

Avro SerDe Issue w/ Manual Partitions?

2016-03-02 Thread Chris Miller
Hi, I have a strange issue occurring when I use manual partitions. If I create a table as follows, I am able to query the data with no problem: CREATE TABLE test1 ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.Avr

Re: Selecting column in dataframe created with incompatible schema causes AnalysisException

2016-03-02 Thread Reynold Xin
Thanks. Once you create the jira just reply to this email with the link. On Wednesday, March 2, 2016, Ewan Leith wrote: > Thanks, I'll create the JIRA for it. Happy to help contribute to a patch if > we can, not sure if my own scala skills will be up to it but perhaps one of > my colleagues' w

Re: Selecting column in dataframe created with incompatible schema causes AnalysisException

2016-03-02 Thread Ewan Leith
Thanks, I'll create the JIRA for it. Happy to help contribute to a patch if we can, not sure if my own scala skills will be up to it but perhaps one of my colleagues' will :) Ewan I don't think that exists right now, but it's definitely a good option to have. I myself have run into this issue

Re: SFTP Compressed CSV into Dataframe

2016-03-02 Thread Sumedh Wale
On Thursday 03 March 2016 12:47 AM, Benjamin Kim wrote: I wonder if anyone has opened a SFTP connection to open a remote GZIP CSV file? I am able to download the file first locally using the SFTP Client in the spark-sftp package. Then, I load the file into a dataframe using the spark-csv packag

Re: Selecting column in dataframe created with incompatible schema causes AnalysisException

2016-03-02 Thread Reynold Xin
I don't think that exists right now, but it's definitely a good option to have. I myself have run into this issue a few times. Can you create a JIRA ticket so we can track it? Would be even better if you are interested in working on a patch! Thanks. On Wed, Mar 2, 2016 at 11:51 AM, Ewan Leith w

Re: Selecting column in dataframe created with incompatible schema causes AnalysisException

2016-03-02 Thread Ewan Leith
Hi Reynold, yes that would be perfect for our use case. I assume it doesn't exist though, otherwise I really need to go re-read the docs! Thanks to both of you for replying by the way, I know you must be hugely busy. Ewan Are you looking for "relaxed" mode that simply return nulls for fields t

Re: SFTP Compressed CSV into Dataframe

2016-03-02 Thread Ewan Leith
The Apache Commons library will let you access files on an SFTP server via a Java library, no local file handling involved https://commons.apache.org/proper/commons-vfs/filesystems.html Hope this helps, Ewan I wonder if anyone has opened a SFTP connection to open a remote GZIP CSV file? I am a
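
A hedged sketch of reading such a file through commons-vfs2 on the driver; it assumes the sftp provider (JSch) is on the classpath, and the URL and credentials are placeholders.

    import java.util.zip.GZIPInputStream
    import org.apache.commons.vfs2.VFS
    import scala.io.Source

    // Open the remote file as a stream, no local copy involved
    val file = VFS.getManager.resolveFile("sftp://user:password@host/path/data.csv.gz")
    val in = new GZIPInputStream(file.getContent.getInputStream)
    val lines = Source.fromInputStream(in).getLines().toSeq  // hand these to sc.parallelize / spark-csv as needed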

Re: SFTP Compressed CSV into Dataframe

2016-03-02 Thread Holden Karau
So doing a quick look through the README & code for spark-sftp it seems that the way this connector works is by downloading the file locally on the driver program and this is not configurable - so you would probably need to find a different connector (and you probably shouldn't use spark-sftp for l

Re: Selecting column in dataframe created with incompatible schema causes AnalysisException

2016-03-02 Thread Reynold Xin
Are you looking for a "relaxed" mode that simply returns nulls for fields that don't exist or have an incompatible schema? On Wed, Mar 2, 2016 at 11:12 AM, Ewan Leith wrote: > Thanks Michael, it's not a great example really, as the data I'm working with > has some source files that do fit the sche

SFTP Compressed CSV into Dataframe

2016-03-02 Thread Benjamin Kim
I wonder if anyone has opened a SFTP connection to open a remote GZIP CSV file? I am able to download the file first locally using the SFTP Client in the spark-sftp package. Then, I load the file into a dataframe using the spark-csv package, which automatically decompresses the file. I just want

Re: Selecting column in dataframe created with incompatible schema causes AnalysisException

2016-03-02 Thread Ewan Leith
Thanks Michael, it's not a great example really, as the data I'm working with has some source files that do fit the schema, and some that don't (out of millions that do work, perhaps 10 might not). In an ideal world for us the select would probably return the valid records only. We're trying o

Re: Unit testing framework for Spark Jobs?

2016-03-02 Thread radoburansky
I am sure you have googled this: https://github.com/holdenk/spark-testing-base On Wed, Mar 2, 2016 at 6:54 PM, SRK [via Apache Spark User List] < ml-node+s1001560n2638...@n3.nabble.com> wrote: > Hi, > > What is a good unit testing framework for Spark batch/streaming jobs? I > have core spark, spa

Re: spark streaming

2016-03-02 Thread Vinti Maheshwari
Thanks Shixiong. Sure. Please find the details: Spark-version: 1.5.2 I am doing data aggregation using check pointing, not sure if this is causing issue. Also, i am using perl_kafka producer to push data to kafka and then my spark program is reading it from kafka. Not sure, if i need to use create

Re: Unit testing framework for Spark Jobs?

2016-03-02 Thread Ricardo Paiva
I use the plain and old Junit Spark batch example: import org.apache.spark.SparkConf import org.apache.spark.SparkContext import org.apache.spark.sql.SQLContext import org.junit.AfterClass import org.junit.Assert.assertEquals import org.junit.BeforeClass import org.junit.Test object TestMyCode {
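
A minimal runnable sketch along those lines, with a local-mode SparkContext created and stopped around each test (test names and assertion values are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.junit.{After, Before, Test}
    import org.junit.Assert.assertEquals

    class SimpleSparkTest {
      private var sc: SparkContext = _

      @Before def setup(): Unit = {
        // Local-mode context so the test needs no cluster
        sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("unit-test"))
      }

      @After def teardown(): Unit = {
        sc.stop()
      }

      @Test def sumsNumbers(): Unit = {
        val total = sc.parallelize(Seq(1L, 2L, 3L)).reduce(_ + _)
        assertEquals(6L, total)
      }
    }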

Re: spark streaming

2016-03-02 Thread Shixiong(Ryan) Zhu
Hey, KafkaUtils.createDirectStream doesn't need a StorageLevel as it doesn't store blocks to BlockManager. However, the error is not related to StorageLevel. It may be a bug. Could you provide more info about it? E.g., Spark version, your codes, logs. On Wed, Mar 2, 2016 at 3:02 AM, Vinti Maheshw

Re: Unit testing framework for Spark Jobs?

2016-03-02 Thread Silvio Fiorito
Please check out the following for some good resources: https://github.com/holdenk/spark-testing-base https://spark-summit.org/east-2016/events/beyond-collect-and-parallelize-for-tests/ On 3/2/16, 12:54 PM, "SRK" wrote: >Hi, > >What is a good unit testing framework for Spark batch/streami

Re: Unit testing framework for Spark Jobs?

2016-03-02 Thread Yin Yang
Cycling prior bits: http://search-hadoop.com/m/q3RTto4sby1Cd2rt&subj=Re+Unit+test+with+sqlContext On Wed, Mar 2, 2016 at 9:54 AM, SRK wrote: > Hi, > > What is a good unit testing framework for Spark batch/streaming jobs? I > have > core spark, spark sql with dataframes and streaming api getting

Re: Selecting column in dataframe created with incompatible schema causes AnalysisException

2016-03-02 Thread Michael Armbrust
-dev +user StructType(StructField(data,ArrayType(StructType(StructField( > *stuff,ArrayType(*StructType(StructField(onetype,ArrayType(StructType(StructField(id,LongType,true), > StructField(name,StringType,true)),true),true), StructField(othertype, > ArrayType(StructType(StructField(company,String

Unit testing framework for Spark Jobs?

2016-03-02 Thread SRK
Hi, What is a good unit testing framework for Spark batch/streaming jobs? I have core spark, spark sql with dataframes and streaming api getting used. Any good framework to cover unit tests for these APIs? Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabbl

Re: How to control the number of parquet files getting created under a partition ?

2016-03-02 Thread swetha kasireddy
Thanks. I tried this yesterday and it seems to be working. On Wed, Mar 2, 2016 at 1:49 AM, James Hammerton wrote: > Hi, > > Based on the behaviour I've seen using parquet, the number of partitions > in the DataFrame will determine the number of files in each parquet > partition. > > I.e. when yo
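
In other words, a sketch of the pattern (the partition count, column name and output path are placeholders): repartition before the write so each partition directory gets at most that many files.

    df.repartition(4)          // at most 4 tasks write each "date" partition directory
      .write
      .partitionBy("date")
      .parquet("/path/to/output")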

Re: Add Jars to Master/Worker classpath

2016-03-02 Thread Sumedh Wale
On Wednesday 02 March 2016 09:39 PM, Matthias Niehoff wrote: no, not to driver and executor but to the master and worker instances of the spark standalone cluster Why exactly does adding jars to driver/executor extraClassPath not

Re: please add Christchurch Apache Spark Meetup Group

2016-03-02 Thread Sean Owen
(I have the site's svn repo handy, so I just added it.) On Wed, Mar 2, 2016 at 5:16 PM, Raazesh Sainudiin wrote: > Hi, > > Please add Christchurch Apache Spark Meetup Group to the community list > here: > http://spark.apache.org/community.html > > Our Meetup URI is: > http://www.meetup.com/Chris

How to achieve nested for loop in Spark

2016-03-02 Thread Vikash Kumar
Can we implement nested for/while loop in spark? I have to convert some SQL procedure code into Spark. And it has multiple loops and processing and I want to implement this in spark. How to implement this. 1. open cursor and fetch for personType 2. open cursor and fetch for personGroup 3.
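
A hedged sketch of the usual translation: the nested cursors typically become joins plus a grouped aggregation rather than explicit loops (the DataFrame and column names follow the pseudo-steps above and are assumptions):

    val result = persons
      .join(personGroups, "personGroup")   // replaces the outer cursor over person groups
      .join(personTypes, "personType")     // replaces the inner cursor over person types
      .groupBy("personGroup", "personType")
      .count()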

Re: Spark 1.5 on Mesos

2016-03-02 Thread Sathish Kumaran Vairavelu
Try passing jar using --jars option On Wed, Mar 2, 2016 at 10:17 AM Ashish Soni wrote: > I made some progress but now i am stuck at this point , Please help as > looks like i am close to get it working > > I have everything running in docker container including mesos slave and > master > > When i

please add Christchurch Apache Spark Meetup Group

2016-03-02 Thread Raazesh Sainudiin
Hi, Please add Christchurch Apache Spark Meetup Group to the community list here: http://spark.apache.org/community.html Our Meetup URI is: http://www.meetup.com/Christchurch-Apache-Spark-Meetup/ Thanks, Raaz

Re: rdd cache name

2016-03-02 Thread Xinh Huynh
Hi Charles, You can set the RDD name before using it. Just do before caching: (Scala) myRdd.setName("Charles RDD") (Python) myRdd.setName('Charles RDD') Reference: PySpark doc: http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD Fraction cached is the percentage of partitions

Re: Building a REST Service with Spark back-end

2016-03-02 Thread yanlin wang
Did any one use Livy in real world high concurrency web app? I think it uses spark submit command line to create job... How about job server or notebook comparing with Livy? Thx, Yanlin Sent from my iPhone > On Mar 2, 2016, at 6:24 AM, Guru Medasani wrote: > > Hi Don, > > Here is another R

Re: Add Jars to Master/Worker classpath

2016-03-02 Thread Prabhu Joseph
Matthias, Can you check appending the jars in LAUNCH_CLASSPATH of spark-1.4.1/sbin/spark_class 2016-03-02 21:39 GMT+05:30 Matthias Niehoff : > no, not to driver and executor but to the master and worker instances of > the spark standalone cluster > > Am 2. März 2016 um 17:05 schrieb Igor Be

Re: Spark 1.5 on Mesos

2016-03-02 Thread Ashish Soni
I made some progress but now i am stuck at this point , Please help as looks like i am close to get it working I have everything running in docker container including mesos slave and master When i try to submit the pi example i get below error *Error: Cannot load main class from JAR file:/opt/spa

Re: Add Jars to Master/Worker classpath

2016-03-02 Thread Matthias Niehoff
no, not to driver and executor but to the master and worker instances of the spark standalone cluster Am 2. März 2016 um 17:05 schrieb Igor Berman : > spark.driver.extraClassPath > spark.executor.extraClassPath > > 2016-03-02 18:01 GMT+02:00 Matthias Niehoff < > matthias.nieh...@codecentric.de>:

Re: Add Jars to Master/Worker classpath

2016-03-02 Thread Igor Berman
spark.driver.extraClassPath spark.executor.extraClassPath 2016-03-02 18:01 GMT+02:00 Matthias Niehoff : > Hi, > > we want to add jars to the Master and Worker class path mainly for logging > reason (we have a redis appender to send logs to redis -> logstash -> > elasticsearch). > > While it is wo
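
For completeness, a sketch of how those two properties are usually set in conf/spark-defaults.conf (the jar path is a placeholder; note they affect the driver and executors, not the standalone master/worker daemons):

    spark.driver.extraClassPath    /opt/extra-libs/redis-appender.jar
    spark.executor.extraClassPath  /opt/extra-libs/redis-appender.jar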

Add Jars to Master/Worker classpath

2016-03-02 Thread Matthias Niehoff
Hi, we want to add jars to the Master and Worker class path mainly for logging reason (we have a redis appender to send logs to redis -> logstash -> elasticsearch). While it is working with setting SPARK_CLASSPATH, this solution is afaik deprecated and should not be used. Furthermore we are also

Re: EMR 4.3.0 spark 1.6 shell problem

2016-03-02 Thread Daniel Siegmann
In the past I have seen this happen when I filled up HDFS and some core nodes became unhealthy. There was no longer anywhere to replicate the data. From your command it looks like you should have 1 master and 2 core nodes in your cluster. Can you verify both the core nodes are healthy? On Wed, Ma

Re: Configuring Ports for Network Security

2016-03-02 Thread Guru Prateek Pinnadhari
Thanks for your response. End users and developers in our scenario need terminal / SSH access to the cluster. So cluster isolation from external networks is not an option. We use a Hortonworks based hadoop cluster. Knox is useful but as users also have shell access, we need iptables. Even with

org.apache.spark.sql.types.GenericArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow

2016-03-02 Thread dmt
Hi, the following error is raised using Spark 1.5.2 or 1.6.0, in stand alone mode, on my computer. Has anyone had the same problem, and do you know what might cause this exception ? Thanks in advance. /16/03/02 15:12:27 WARN TaskSetManager: Lost task 9.0 in stage 0.0 (TID 9, 192.168.1.36): java.l

Re: Configuring Ports for Network Security

2016-03-02 Thread Jörn Franke
You can make the nodes non-reachable from any computer external to the cluster. Applications can be deployed on an edge node that is connected to the cluster. Do you use Hadoop for managing the cluster? Then you may want to look at Apache Knox. > On 02 Mar 2016, at 15:14, zgpinnadhari wrote:

Re: Building a REST Service with Spark back-end

2016-03-02 Thread Guru Medasani
Hi Don, Here is another REST interface for interacting with Spark from anywhere. https://github.com/cloudera/livy Here is an example to estimate PI using Spark from Python using requests library. >>> data = { ... 'code': textwrap.dedent("""\ ... val

Configuring Ports for Network Security

2016-03-02 Thread zgpinnadhari
Hi We want to use spark in a secure cluster with iptables enabled. For this, we need a specific list of ports used by spark so that we can whitelist them. From what I could learn from - http://spark.apache.org/docs/latest/security.html#configuring-ports-for-network-security - there are several

Re: Building a REST Service with Spark back-end

2016-03-02 Thread Todd Nist
Have you looked at Apache Toree, http://toree.apache.org/. This was formerly the Spark-Kernel from IBM but contributed to apache. https://github.com/apache/incubator-toree You can find a good overview on the spark-kernel here: http://www.spark.tc/how-to-enable-interactive-applications-against-ap

Re: SparkR Count vs Take performance

2016-03-02 Thread Dirceu Semighini Filho
Thanks Sun, this explains why I was getting too many jobs running; my RDDs were empty. 2016-03-02 10:29 GMT-03:00 Sun, Rui : > This is nothing to do with object serialization/deserialization. It is > expected behavior that take(1) most likely runs slower than count() on an > empty RDD. > > This

RE: SparkR Count vs Take performance

2016-03-02 Thread Sun, Rui
This is nothing to do with object serialization/deserialization. It is expected behavior that take(1) most likely runs slower than count() on an empty RDD. This is all about the algorithm with which take() is implemented. Take() 1. Reads one partition to get the elements 2. If the fetched eleme

Re: Spark streaming: StorageLevel.MEMORY_AND_DISK_SER setting for KafkaUtils.createDirectStream

2016-03-02 Thread Vinti Maheshwari
Increasing Spark_executors_instances to 4 worked. SPARK_EXECUTOR_INSTANCES="4" #Number of workers to start (Default: 2) Regards, Vinti On Wed, Mar 2, 2016 at 4:28 AM, Vinti Maheshwari wrote: > Thanks much Saisai. Got it. > So i think increasing worker executor memory might work. Trying that.

Re: Spark streaming: StorageLevel.MEMORY_AND_DISK_SER setting for KafkaUtils.createDirectStream

2016-03-02 Thread Vinti Maheshwari
Thanks much Saisai. Got it. So i think increasing worker executor memory might work. Trying that. Regards, ~Vinti On Wed, Mar 2, 2016 at 4:21 AM, Saisai Shao wrote: > You don't have to specify the storage level for direct Kafka API, since it > doesn't require to store the input data ahead of ti

Re: Spark streaming: StorageLevel.MEMORY_AND_DISK_SER setting for KafkaUtils.createDirectStream

2016-03-02 Thread Saisai Shao
You don't have to specify the storage level for direct Kafka API, since it doesn't require to store the input data ahead of time. Only receiver-based approach could specify the storage level. Thanks Saisai On Wed, Mar 2, 2016 at 7:08 PM, Vinti Maheshwari wrote: > Hi All, > > I wanted to set *St
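
A sketch of the receiver-based API Saisai refers to, which does accept a StorageLevel (ssc is an existing StreamingContext; the ZooKeeper quorum, group id and topic map are placeholders):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.KafkaUtils

    val stream = KafkaUtils.createStream(ssc, "zk1:2181", "my-consumer-group",
      Map("mytopic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)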

Spark streaming: StorageLevel.MEMORY_AND_DISK_SER setting for KafkaUtils.createDirectStream

2016-03-02 Thread Vinti Maheshwari
Hi All, I wanted to set *StorageLevel.MEMORY_AND_DISK_SER* in my spark-streaming program as currently i am getting MetadataFetchFailedException*. *I am not sure where i should pass StorageLevel.MEMORY_AND_DISK, as it seems like createDirectStream doesn't allow to pass that parameter. val message

Re: [Spark 1.5.2]: Iterate through Dataframe columns and put it in map

2016-03-02 Thread Mohammad Tariq
Hi Divya, You could call the *collect()* method provided by the DataFrame API. This will give you an *Array[Row]*. You could then iterate over this array and create your map. Something like this : val mapOfVals = scala.collection.mutable.Map[String,String]() var rows = DataFrame.collect() rows.foreach(r
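
A completed sketch of that approach (df and the column names "key" and "value" are placeholders):

    val mapOfVals = scala.collection.mutable.Map[String, String]()
    val rows = df.collect()                 // Array[Row] pulled back to the driver
    rows.foreach { r =>
      mapOfVals += (r.getAs[String]("key") -> r.getAs[String]("value"))
    }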

spark streaming

2016-03-02 Thread Vinti Maheshwari
Hi All, I wanted to set *StorageLevel.MEMORY_AND_DISK_SER* in my spark-streaming program as currently i am getting MetadataFetchFailedException*. *I am not sure where i should pass StorageLevel.MEMORY_AND_DISK, as it seems like createDirectStream doesn't allow to pass that parameter. val message
