unsubscribe

2016-10-06 Thread शशिकांत कुलकर्णी
Regards, Shashikant

Re: [ANNOUNCE] Announcing Spark 2.0.1

2016-10-06 Thread Sean Owen
I believe Reynold mentioned he already did that. For anyone following: https://issues.apache.org/jira/browse/INFRA-12717 On Fri, Oct 7, 2016 at 1:35 AM Luciano Resende wrote: > I have created a Infra jira to track the issue with the maven artifacts > for Spark 2.0.1 > > On Wed, Oct 5, 2016 at 10

Re: [Spark][issue]Writing Hive Partitioned table

2016-10-06 Thread Mich Talebzadeh
Hi Ayan, Depends on the version of Spark you are using. Have you tried updating stats in Hive? ANALYZE TABLE ${DATABASE}.${TABLE} PARTITION (${PARTITION_NAME}) COMPUTE STATISTICS FOR COLUMNS and then do show create table ${TABLE} HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.co
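
A minimal Scala sketch of the statements Mich suggests, issued through a HiveContext; the database, table, and partition names are placeholders, and depending on the Spark version the FOR COLUMNS form may need to be run from the Hive CLI instead.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("analyze-sketch"))
    val sqlContext = new HiveContext(sc)   // points at the same metastore as Hive
    val db = "mydb"                                     // placeholder
    val table = "mytable"                               // placeholder
    val part = "partition_date='2016-10-06'"            // placeholder partition spec
    sqlContext.sql(s"ANALYZE TABLE $db.$table PARTITION ($part) COMPUTE STATISTICS")
    sqlContext.sql(s"ANALYZE TABLE $db.$table PARTITION ($part) COMPUTE STATISTICS FOR COLUMNS")
    sqlContext.sql(s"SHOW CREATE TABLE $db.$table").collect().foreach(println)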

Re: Kryo serializer slower than Java serializer for Spark Streaming

2016-10-06 Thread Rajkiran Rajkumar
Oops, realized that I didn't reply to all. Pasting snippet again. Hi Sean, Thanks for the reply. I've done the part of forcing registration of classes to the kryo serializer. The observation is in that scenario. To give a sense of the data, they are records which are serialized using thrift and re

Re: How to Disable or do minimal Logging for apache spark client Driver program?

2016-10-06 Thread kant kodali
Hi Jakob, It is the biggest question for me too, since I seem to be on a different page than everyone else whenever I say "I am also using Spark standalone mode and I don't submit jobs through the command line. I just invoke public static void main() of my driver program". Everyone keeps talking about submi

Re: Best approach for processing all files parallelly

2016-10-06 Thread ayan guha
Hi, generateFileType(filename) returns a FileType, and getSchemaFor(FileType) returns the schema for that FileType. This for loop DOES NOT process files sequentially: it creates a DataFrame over all files of the same type, and only the loop over file types runs sequentially. On Fri, Oct 7, 2016 at 12:08 AM, Arun Patel wrote: > Thanks Ayan. Cou

Re: RESTful Endpoint and Spark

2016-10-06 Thread Matei Zaharia
This is exactly what the Spark SQL Thrift server does, if you just want to access it using JDBC. Matei > On Oct 6, 2016, at 4:27 PM, Benjamin Kim wrote: > > Has anyone tried to integrate Spark with a server farm of RESTful API > endpoints or even HTTP web-servers for that matter? I know it’s
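
A hedged Scala sketch of querying that Thrift server over JDBC; it assumes the Thrift server is already running on localhost:10000 and that the Hive JDBC driver is on the classpath (host, port, credentials, and table name are placeholders).

    import java.sql.DriverManager

    // The Spark SQL Thrift server speaks the HiveServer2 protocol, so a standard hive2 JDBC URL works.
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
    val stmt = conn.createStatement()
    val rs = stmt.executeQuery("SELECT count(*) FROM some_table")  // placeholder table
    while (rs.next()) println(rs.getLong(1))
    rs.close(); stmt.close(); conn.close()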

Re: RESTful Endpoint and Spark

2016-10-06 Thread Ofer Eliassaf
There are 2 main projects for that: Livy (https://github.com/cloudera/livy) and Spark Job Server (https://github.com/spark-jobserver/spark-jobserver). On Fri, Oct 7, 2016 at 2:27 AM, Benjamin Kim wrote: > Has anyone tried to integrate Spark with a server farm of RESTful API > endpoints or even HTT

Spark SQL is slower when DataFrame is cache in Memory

2016-10-06 Thread Chin Wei Low
Hi, I am using Spark 1.6.0. I have a Spark application that creates and caches (in memory) DataFrames (around 50+, some on a single parquet file and some on a folder with a few parquet files) with the following code: val df = sqlContext.read.parquet df.persist df.count I union them to 3 DataFram

Executors under utilized

2016-10-06 Thread Pradeep Gollakota
I'm running a job that has one stage with about 60k tasks. The stage was going pretty well until many of the executors stopped running any tasks at around 35k tasks finished. It came to the point where only 4 executors are working on data, and all 4 executors are running on the same host. With about 25k t

spark 2.0.1, union on non-null and null String dataframes causing ClassCastException UTF8String cannot be cast to java.lang.String

2016-10-06 Thread William Kinney
It seems that doing a union on DataFrames where one DF contains lit(null) or null for a String column causes a java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.String when calling getString(i) on a Row within forEachPartition. Stack: Caused by: java.lang.Clas
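
One way to avoid schema mismatches of this kind (not necessarily a fix for the underlying 2.0.1 issue) is to cast the null literal explicitly so both sides of the union carry an identical StringType column; a minimal sketch with made-up data.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.lit
    import org.apache.spark.sql.types.StringType

    val spark = SparkSession.builder.appName("union-null-sketch").getOrCreate()
    import spark.implicits._
    val withName = Seq((1, "alice"), (2, "bob")).toDF("id", "name")
    // Cast the literal null so the column is a real nullable StringType rather than NullType.
    val withoutName = Seq(3, 4).toDF("id").withColumn("name", lit(null).cast(StringType))
    val unioned = withName.union(withoutName)
    unioned.foreachPartition(_.foreach(row => println(row.getString(1))))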

[Spark][issue]Writing Hive Partitioned table

2016-10-06 Thread ayan guha
Posting with correct subject. On Fri, Oct 7, 2016 at 12:37 PM, ayan guha wrote: > Hi > > Faced one issue: > > - Writing Hive Partitioned table using > > df.withColumn("partition_date",to_date(df["INTERVAL_DATE"])) > .write.partitionBy('partition_date').saveAsTable("sometable" > ,mode="overwr

Re: Best Savemode option to write Parquet file

2016-10-06 Thread Chanh Le
Hi, It depends on your case, but a shuffle is an expensive operation (unless your goal is to reduce the number of files) and it is not parallel, so it can cost you a lot of time when writing data. Regards, Chanh > On Oct 7, 2016, at 1:25 AM, Anubhav Agarwal wrote: > > Hi, > I already had the f

[no subject]

2016-10-06 Thread ayan guha
Hi Faced one issue: - Writing Hive Partitioned table using df.withColumn("partition_date",to_date(df["INTERVAL_DATE"])).write.partitionBy('partition_date').saveAsTable("sometable",mode="overwrite") - Data got written to HDFS fine. I can see the folders with partition names such as /app/somedb/

Re: Problems with new experimental Kafka Consumer for 0.10

2016-10-06 Thread Cody Koeninger
OK, so at this point, even without involving commitAsync, you're seeing consumer rebalances after a particular batch takes longer than the session timeout? Do you have a minimal code example you can share? On Tue, Oct 4, 2016 at 2:18 AM, Matthias Niehoff wrote: > Hi, > sry for the late reply. A

Re: Multiple-streaming context within a jvm

2016-10-06 Thread Cody Koeninger
Maybe I'm misunderstanding, but you can have multiple kafka topics / streams in a single streaming context. On Wed, Oct 5, 2016 at 4:13 AM, Hafiz Mujadid wrote: > Hi, > > I am trying to use multiple streaming context in one spark job. I want to > fetch users data from users topic of Kafka and pur
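
A hedged Scala sketch of Cody's point, assuming the spark-streaming-kafka-0-10 artifact and placeholder broker/topic names: one direct stream can subscribe to several topics inside a single StreamingContext.

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val ssc = new StreamingContext(new SparkConf().setAppName("multi-topic-sketch"), Seconds(10))
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",           // placeholder broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example-group")
    // One stream, two topics -- both consumed within the same StreamingContext.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("users", "purchases"), kafkaParams))
    stream.map(record => (record.topic, record.value)).print()
    ssc.start()
    ssc.awaitTermination()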

Re: [ANNOUNCE] Announcing Spark 2.0.1

2016-10-06 Thread Luciano Resende
I have created an Infra jira to track the issue with the maven artifacts for Spark 2.0.1 On Wed, Oct 5, 2016 at 10:18 PM, Shivaram Venkataraman < shiva...@eecs.berkeley.edu> wrote: > Yeah I see the apache maven repos have the 2.0.1 artifacts at > https://repository.apache.org/content/repositories/

Re: Support for uniVocity in Spark 2.x

2016-10-06 Thread Hyukjin Kwon
Yeap, there was an option to switch from Apache Commons CSV to Univocity in the external CSV library, but Univocity became the default and Apache Commons CSV was removed when the library was ported into Spark 2.0. On 7 Oct 2016 2:53 a.m., "Sean Owen" wrote: > It still uses univocity, but this is an implementation de

RESTful Endpoint and Spark

2016-10-06 Thread Benjamin Kim
Has anyone tried to integrate Spark with a server farm of RESTful API endpoints or even HTTP web-servers for that matter? I know it’s typically done using a web farm as the presentation interface, then data flows through a firewall/router to direct calls to a JDBC listener that will SELECT, INSE

Re: Spark 2.0.1 has been published?

2016-10-06 Thread Aris
Mario -- I would recommend downloading and building from source, as the repositories could be lagging behind. On Thu, Oct 6, 2016 at 4:00 PM, miliofotou wrote: > I can verify this as well. Even though I can download the 2.0.1 binary just > fine, I cannot find the 2.0.1 artifacts on mvnrepository.com or a

Re: Spark 2.0.1 has been published?

2016-10-06 Thread miliofotou
I can verify this as well. Even though I can download the 2.0.1 binary just fine, I cannot find the 2.0.1 artifacts on mvnrepository.com or any other repository. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-2-0-1-has-been-published-tp27836p27858.ht

Zombie Driver process (Standalone Cluster)

2016-10-06 Thread map reduced
Hi, I am noticing zombie driver processes running on my standalone cluster. Not sure about the reason - but node (with driver on it) restart may be a potential cause. What's interesting about these is: SparkUI doesn't recognize it as a running driver, hence no 'kill' option there. Also, if I try r

Spark Streaming Advice

2016-10-06 Thread Kevin Mellott
I'm attempting to implement a Spark Streaming application that will consume application log messages from a message broker and store the information in HDFS. During the data ingestion, we apply a custom schema to the logs, partition by application name and log date, and then save the information as

Re: How to Disable or do minimal Logging for apache spark client Driver program?

2016-10-06 Thread Mahendra Kutare
import ch.qos.logback.classic.Level; sc.setLogLevel(Level.INFO.levelStr); //Change the level to an appropriate level for your application. Mahendra about.me/mahendrakutare

Re: How to Disable or do minimal Logging for apache spark client Driver program?

2016-10-06 Thread Jakob Odersky
You can change the kind of log messages that are shown by calling "context.setLogLevel()" with an appropriate level: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN. See http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext@setLogLevel(logLevel:String):Unit for fu
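
A minimal sketch of Jakob's suggestion for a driver launched from its own main() rather than through spark-submit (app name and master are placeholders).

    import org.apache.spark.{SparkConf, SparkContext}

    object QuietDriver {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("quiet-driver").setMaster("local[*]"))
        sc.setLogLevel("ERROR")  // suppress INFO/WARN from Spark's own loggers
        // ... job body ...
        sc.stop()
      }
    }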

Re: building Spark 2.1 vs Java 1.8 on Ubuntu 16/06

2016-10-06 Thread Marco Mistroni
Thanks Fred for the pointers... so far I was only able to build 2.1 with Java 7 and no zinc. Will try the options you suggest. FYI, building with sbt ends up in an OOM even with Java 7. I will try and update this thread. Kr On 6 Oct 2016 8:58 pm, "Fred Reiss" wrote: > There's no option to prevent build/mvn from

Re: building Spark 2.1 vs Java 1.8 on Ubuntu 16/06

2016-10-06 Thread Fred Reiss
There's no option to prevent build/mvn from starting the zinc server, but you should be able to prevent the maven build from using the zinc server by changing the option at line 1935 of the master pom.xml. Note that the zinc-based compile works on my Ubuntu 16.04 box. You might be able to get zin

Re: Best Savemode option to write Parquet file

2016-10-06 Thread Anubhav Agarwal
Hi, I already had the following set:- sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false") Will add the other setting too. But my question is: am I correct in assuming Append mode shuffles all data to one node before writing? And do other modes do the same, or do all executors write

Re: How to implement a scheduling algorithm in Spark?

2016-10-06 Thread neil90
Try leveraging YARN/Mesos; they have more scheduling options than the standalone cluster manager. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-implement-a-scheduling-algorithm-in-Spark-tp27848p27856.html Sent from the Apache Spark User List mailing

Re: spark standalone with multiple workers gives a warning

2016-10-06 Thread Ofer Eliassaf
The slaves should connect to the master using the scripts in sbin... You can read about it here: http://spark.apache.org/docs/latest/spark-standalone.html On Thu, Oct 6, 2016 at 6:46 PM, Mendelson, Assaf wrote: > Hi, > > I have a spark standalone cluster. On it, I am using 3 workers per node. >

Re: Support for uniVocity in Spark 2.x

2016-10-06 Thread Sean Owen
It still uses univocity, but this is an implementation detail, so I don't think that has amounted to supporting or not supporting it. On Thu, Oct 6, 2016 at 4:00 PM Jean Georges Perrin wrote: > > The original CSV parser from Databricks had support for uniVocity in Spark > 1.x. Can someone confir

Re: Spark REST API YARN client mode is not full?

2016-10-06 Thread Vadim Semenov
It may be related to the CDH version of Spark you're using. When I use the REST API I get the YARN application id there. Try opening http://localhost:4040/api/v1/applications/0/stages On Thu, Oct 6, 2016 at 8:40 AM, Vladimir Tretyakov < vladimir.tretya...@sematext.com> wrote: > Hi, > > When I start Spark
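
For reference, a small Scala sketch that reads the monitoring REST endpoints mentioned above; it assumes the driver UI is reachable on localhost:4040 (the application id is a placeholder).

    import scala.io.Source

    // List the applications known to this driver's UI ...
    println(Source.fromURL("http://localhost:4040/api/v1/applications").mkString)
    // ... and the stages of one application (substitute the real id reported by the first call).
    println(Source.fromURL("http://localhost:4040/api/v1/applications/app-00000000000000-0000/stages").mkString)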

Re: How to implement a scheduling algorithm in Spark?

2016-10-06 Thread neil90
Assuming you're using Spark's standalone cluster manager, it uses FIFO by default. From the docs - /By default, applications submitted to the standalone mode cluster will run in FIFO (first-in-first-out) order, and each application will try to use all available nodes./ Link - http://spark.apache.org/d

How to Disable or do minimal Logging for apache spark client Driver program?

2016-10-06 Thread kant kodali
How to disable or do minimal logging for the Apache Spark client driver program? I couldn't find this information in the docs. By driver program I mean the Java program where I initialize the Spark context. It produces a lot of INFO messages, but I would like to know only when there is an error or an Exception, such

Submit job with driver options in Mesos Cluster mode

2016-10-06 Thread vonnagy
I am trying to submit a job to spark running in a Mesos cluster. We need to pass custom java options to the driver and executor for configuration, but the driver task never includes the options. Here is an example submit. GC_OPTS="-XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCTimeStam

Best practice of complicated SQL query in Spark/Hive

2016-10-06 Thread Shi Yu
Hello, I wonder what is the state-of-the-art best practice to achieve the best performance running complicated SQL queries today in 2016? I am new to this topic and have read about Hive on Tez, Spark on Hive, and Spark SQL 2.0 (it seems Spark 2.0 supports complicated nested queries). The documentation I read sugge

spark standalone with multiple workers gives a warning

2016-10-06 Thread Mendelson, Assaf
Hi, I have a spark standalone cluster. On it, I am using 3 workers per node. So I added SPARK_WORKER_INSTANCES set to 3 in spark-env.sh The problem is, that when I run spark-shell I get the following warning: WARN SparkConf: SPARK_WORKER_INSTANCES was detected (set to '3'). This is deprecated in Sp

Re: Best Savemode option to write Parquet file

2016-10-06 Thread Chanh Le
Hi Anubhav, The best way to store Parquet is to partition it by time or by a specific field that you are going to use to mark data for deletion later. In my case I partition my data by time so I can easily delete the data after 30 days. Use mode Append and disable the summary information sc.hadoopCon
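
A hedged Scala sketch of what Chanh describes -- partition by a date column, write with SaveMode.Append, and disable the Parquet summary files -- shown with the Spark 2.0 SparkSession API and placeholder paths/columns (the same settings apply to sqlContext on 1.6).

    import org.apache.spark.sql.{SaveMode, SparkSession}
    import org.apache.spark.sql.functions.to_date

    val spark = SparkSession.builder.appName("partitioned-parquet-sketch").getOrCreate()
    // Avoid writing the _metadata/_common_metadata summary files.
    spark.sparkContext.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

    val df = spark.read.json("hdfs:///landing/events")        // placeholder input
    df.withColumn("event_date", to_date(df("event_time")))    // placeholder columns
      .write
      .mode(SaveMode.Append)
      .partitionBy("event_date")
      .parquet("hdfs:///data/events")                         // old partitions can later be dropped by date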

Best Savemode option to write Parquet file

2016-10-06 Thread Anubhav Agarwal
Hi all, I have searched a bit before posting this query. Using Spark 1.6.1 Dataframe.write().format("parquet").mode(SaveMode.Append).save("location") Note:- The data in that folder can be deleted and most of the times that folder doesn't even exist. Which Savemode is the best, if necessary at all

spark stateful streaming error

2016-10-06 Thread backtrack5
I am using a pyspark stateful stream (2.0) which receives JSON from a socket. I am getting the following error when I send more than one record: if I send only one message I get a response, but if I send more than one message I get the following error. def createmd5Hash(po): data = json.

Support for uniVocity in Spark 2.x

2016-10-06 Thread Jean Georges Perrin
The original CSV parser from Databricks had support for uniVocity in Spark 1.x. Can someone confirm it has disappeared (per: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/streaming/DataStreamReader.html#csv(java.lang.String)

Re: Kryo serializer slower than Java serializer for Spark Streaming

2016-10-06 Thread Sean Owen
It depends a lot on your data. If it's a lot of custom types then Kryo doesn't have a lot of advantage, although, you want to make sure to register all your classes with kryo (and consider setting the flag that requires kryo registration to ensure it) because that can let kryo avoid writing a bunch
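
A minimal sketch of the registration Sean mentions, with a hypothetical record class; requiring registration makes Kryo fail fast on anything unregistered instead of silently writing full class names for every object.

    import org.apache.spark.{SparkConf, SparkContext}

    case class SensorRecord(id: Long, payload: Array[Byte])   // hypothetical record type

    val conf = new SparkConf()
      .setAppName("kryo-sketch")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrationRequired", "true")         // error out on unregistered classes
      .registerKryoClasses(Array(classOf[SensorRecord]))
    val sc = new SparkContext(conf)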

Re: spark 2.0.1 upgrade breaks on WAREHOUSE_PATH

2016-10-06 Thread Koert Kuipers
if the intention is to create this on the default hadoop filesystem (and not local), then maybe we can use FileSystem.getHomeDirectory()? it should return the correct home directory on the relevant FileSystem (local or hdfs). if the intention is to create this only locally, then why bother using h

Kryo serializer slower than Java serializer for Spark Streaming

2016-10-06 Thread Rajkiran Rajkumar
Hi, I am running a Spark Streaming application which reads from a Kinesis stream and processes data. The application is run on EMR. Recently, we tried moving from Java's inbuilt serializer to Kryo serializer. To quantify the performance improvement, I tried pumping 3 input records to the applic

Re: spark 2.0.1 upgrade breaks on WAREHOUSE_PATH

2016-10-06 Thread Koert Kuipers
well it seems to work if i set spark.sql.warehouse.dir to /tmp/spark-warehouse in spark-defaults, and it creates it on hdfs. however can this directory safely be shared between multiple users running jobs? if not then i need to set this per user (instead of a single setting in spark-defaults) which m
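
A minimal sketch of the workaround being discussed -- setting a per-user warehouse location when the session is built (the HDFS path is a placeholder).

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder
      .appName("warehouse-dir-sketch")
      // Per-user location instead of the default ./spark-warehouse.
      .config("spark.sql.warehouse.dir", "hdfs:///user/someuser/spark-warehouse")
      .getOrCreate()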

Spark SQL query

2016-10-06 Thread AJT
From what I have read on Spark SQL - you need to already have a dataframe which you can then query on - e.g. select * from myDataframe where Where the dataframe is either a Hive table or Avro file etc. What if you want to create a dataframe from your underlying data on the fly with input paramet

Re: Best approach for processing all files parallelly

2016-10-06 Thread Arun Patel
Thanks Ayan. Couple of questions: 1) What do the generateFileType and getSchemaFor functions look like? 2) The 'for loop' is processing files sequentially, right? My requirement is to process all files at the same time. On Thu, Oct 6, 2016 at 8:52 AM, ayan guha wrote: > Hi > > In this case, if you see, t

Re: Best approach for processing all files parallelly

2016-10-06 Thread ayan guha
Hi In this case, if you see, t[1] is NOT the file content, as I have added a "FileType" field. So, this collect is just bringing in the list of file types, should be fine On Thu, Oct 6, 2016 at 11:47 PM, Arun Patel wrote: > Thanks Ayan. I am really concerned about the collect. > > types = rdd1

Re: Best approach for processing all files parallelly

2016-10-06 Thread Arun Patel
Thanks Ayan. I am really concerned about the collect. types = rdd1.map(lambda t: t[1]).distinct().collect() This will ship all files on to the driver, right? It must be inefficient. On Thu, Oct 6, 2016 at 7:58 AM, ayan guha wrote: > Hi > > I think you are correct direction. What is missing

Spark REST API YARN client mode is not full?

2016-10-06 Thread Vladimir Tretyakov
Hi, When I start Spark v1.6 (cdh5.8.0) in Yarn client mode I see that port 4040 is available, but the UI shows nothing and the API does not return full information. I started the Spark application like this: spark-submit --master yarn-client --class org.apache.spark.examples.SparkPi /usr/lib/spark/examp

Re: Best approach for processing all files parallelly

2016-10-06 Thread ayan guha
Hi, I think you are heading in the correct direction. What is missing is: you do not need to create a DF for each file. You can group files with similar structures together (by doing some filter on the file name) and then create a DF for each type of file. Also, creating a DF on wholeTextFile seems wasteful to me. I w

Re: DataFrame Sort gives Cannot allocate a page with more than 17179869176 bytes

2016-10-06 Thread amarouni
You can get some more insights by using the Spark history server (http://spark.apache.org/docs/latest/monitoring.html); it can show you which task is failing and some other information that might help you debug the issue. On 05/10/2016 19:00, Babak Alipour wrote: > The issue seems to lie in t

Best approach for processing all files parallelly

2016-10-06 Thread Arun Patel
My Pyspark program currently identifies the list of files from a directory (using the Python Popen command with hadoop fs -ls arguments). For each file, a DataFrame is created and processed. This is sequential. How can I process all files in parallel? Please note that every file in the directory

Re: can mllib Logistic Regression package handle 10 million sparse features?

2016-10-06 Thread Nick Pentreath
I'm currently working on various performance tests for large, sparse feature spaces. For the Criteo DAC data - 45.8 million rows, 34.3 million features (categorical, extremely sparse), the time per iteration for ml.LogisticRegression is about 20-30s. This is with 4x worker nodes, 48 cores & 120GB

How to implement a scheduling algorithm in Spark?

2016-10-06 Thread anilsingh
Hello everyone, I am using Spark multi-node cluster for executing simple tasks(programs). Now I want to implement a scheduling algorithm in Spark multi-node cluster. What exactly will be the setup for the same? Can somebody guide me implementing this? -- View this message in context: http://ap

Re: spark 2.0.1 upgrade breaks on WAREHOUSE_PATH

2016-10-06 Thread Sean Owen
Yeah I see the same thing. You can fix this by setting spark.sql.warehouse.dir of course as a workaround. I restarted a conversation about it at https://github.com/apache/spark/pull/13868#pullrequestreview-3081020 I think the question is whether spark-warehouse is always supposed to be a local dir

Detected yarn-cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.

2016-10-06 Thread Saurav Sinha
Hi User, I am trying to run a spark streaming job in yarn-cluster mode. It is failing with the following code: val conf = new SparkConf().setAppName("XXX"). conf.setMaster("yarn-cluster") val ssc = new StreamingContext(conf, Seconds(properties.getProperty("batchDurationInSeconds").toInt)); org.apache.spark.Sp
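
As the exception says, the YARN cluster deploy mode should come from spark-submit rather than from the code; a hedged sketch with the master left out of the SparkConf (class, jar, and batch interval are placeholders).

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // No setMaster here -- the master is supplied at submit time, e.g.:
    //   spark-submit --master yarn-cluster --class com.example.MyStreamingJob my-job.jar
    val conf = new SparkConf().setAppName("XXX")
    val ssc = new StreamingContext(conf, Seconds(30))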

PermGen space error

2016-10-06 Thread Saurav Sinha
I am running a streaming job. It runs in local and yarn-client mode, but when I run it in yarn-cluster mode it fails with PermGen space. Can anyone help me out? Spark version: 1.5.0 Hadoop: 2.6.0 Java: 1.7 16/10/06 15:08:40 INFO scheduler.JobScheduler: Added jobs for time 147574672
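
On Java 7 the usual remedy is to raise the PermGen limit for the driver and executors; a hedged sketch of the relevant properties, with placeholder sizes. Note that in yarn-cluster mode the driver-side option has to be supplied via spark-submit or spark-defaults.conf, because the driver JVM is already running by the time application code executes.

    // Passed as --conf flags to spark-submit (or put in spark-defaults.conf), e.g.:
    //   spark-submit --master yarn-cluster \
    //     --conf spark.driver.extraJavaOptions=-XX:MaxPermSize=256m \
    //     --conf spark.executor.extraJavaOptions=-XX:MaxPermSize=256m \
    //     --class com.example.MyStreamingJob my-job.jar
    // The executor-side setting can also be made in code:
    val conf = new org.apache.spark.SparkConf()
      .setAppName("permgen-sketch")
      .set("spark.executor.extraJavaOptions", "-XX:MaxPermSize=256m")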

Re: building Spark 2.1 vs Java 1.8 on Ubuntu 16/06

2016-10-06 Thread Marco Mistroni
Thanks Fred. The build/mvn will trigger compilation using zinc and I want to avoid that, as every time I have tried it, it runs into errors while compiling spark core. How can I disable zinc by default? Kr On 5 Oct 2016 10:53 pm, "Fred Reiss" wrote: > Actually the memory options *are* required for Jav

Re: Solve system of linear equations in Spark

2016-10-06 Thread Sean Owen
Do you not just want to use linear regression? https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala Of course it requires a DataFrame-like input but that may be more natural to begin with. If the data set is small, then putting it
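
A minimal sketch of the suggestion: fit an unregularized linear model to a tiny DataFrame and read the coefficients back as the (approximate) solution. The data below is made up purely for illustration.

    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.ml.regression.LinearRegression
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("linear-system-sketch").getOrCreate()
    // Rows of (b_i, a_i): we look for x such that a_i . x ~= b_i for every row.
    val data = spark.createDataFrame(Seq(
      (6.0,  Vectors.dense(1.0, 2.0)),
      (9.0,  Vectors.dense(2.0, 3.0)),
      (12.0, Vectors.dense(3.0, 4.0))
    )).toDF("label", "features")

    val model = new LinearRegression()
      .setFitIntercept(false)   // solve A x = b with no constant term
      .setRegParam(0.0)
      .fit(data)
    println(model.coefficients)  // ~ [0.0, 3.0] for this consistent system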

Re: mesos in spark 2.0.1 - must call stop() otherwise app hangs

2016-10-06 Thread Adrian Bridgett
Thanks for the clarification Sean, much appreciated! On 06/10/2016 09:18, Sean Owen wrote: Yes you should do that. The examples, with one exception, do show this, and it's always been the intended behavior. I guess it's no surprise to me because any 'context' object in any framework generally

Re: mesos in spark 2.0.1 - must call stop() otherwise app hangs

2016-10-06 Thread Sean Owen
Yes you should do that. The examples, with one exception, do show this, and it's always been the intended behavior. I guess it's no surprise to me because any 'context' object in any framework generally has to be shutdown for reasons like this. We need to update the one example. The twist is error

Re: pyspark: sqlContext.read.text() does not work with a list of paths

2016-10-06 Thread Laurent Legrand
Hello, I've just created the issue: https://issues.apache.org/jira/browse/SPARK-17805 For the PR, I can work on it tomorrow. Laurent On 06/10/2016 at 09:29, Hyukjin Kwon wrote: It seems obviously a bug. It was introduced from my PR, https://github.com/apache/spark/commit/d37c7f7f042f79

Re: How to stop a running job

2016-10-06 Thread Richard Siebeling
I think I mean the job that Mark is talking about but that's also the thing that's being stopped by the dcos command and (hopefully) the thing that's being stopped by the dispatcher, isn't it? It would be really good if the issue (SPARK-17064) would be resolved, but for now I'll do with cancelling

Re: mesos in spark 2.0.1 - must call stop() otherwise app hangs

2016-10-06 Thread Adrian Bridgett
Just one question - what about errors? Should we be wrapping our entire code in a ...finally spark.stop() clause (as per http://spark.apache.org/docs/latest/programming-guide.html#unit-testing)? BTW the .stop() requirement was news to quite a few people here, maybe it'd be a good idea to shou
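
A minimal sketch of the wrapping being asked about, so stop() runs even when the job body throws (app name is a placeholder).

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("stop-on-error-sketch").getOrCreate()
    try {
      // ... job body ...
      spark.range(100).count()
    } finally {
      spark.stop()  // release executors (and the Mesos framework) even on failure
    }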

Re: pyspark: sqlContext.read.text() does not work with a list of paths

2016-10-06 Thread Hyukjin Kwon
It seems obviously a bug. It was introduced from my PR, https://github.com/apache/spark/commit/d37c7f7f042f7943b5b684e53cf4284c601fb347 +1 for creating a JIRA and PR. If you have any problem with this, I would like to do this quickly. On 5 Oct 2016 9:12 p.m., "Laurent Legrand" wrote: > Hello,