Spark streaming time displayed is not current system time but it is processing current messages

2016-04-13 Thread Hemalatha A
Hi, I am facing a problem in Spark Streaming. The time displayed in the Spark Streaming console is 4 days behind, i.e., April 10th, which is not the current system time of the cluster, but the job is processing current messages that are pushed right now, April 14th. Can anyone please advise what time does S

Re: Spark 1.6.0 - token renew failure

2016-04-13 Thread Jeff Zhang
It is not supported in Spark to specify both principal and proxy-user; you need to use either proxy-user or principal. It seems Spark currently only checks this among the spark-submit arguments but ignores the configuration in spark-defaults.conf: if (proxyUser != null && principal != null) { SparkSub
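
A minimal sketch of the mutual-exclusion check being described (simplified for illustration; the names and error handling here are assumptions, not the actual SparkSubmit source):

    // Simplified sketch: spark-submit rejects --proxy-user together with --principal
    def validateSubmitArgs(proxyUser: String, principal: String): Unit = {
      if (proxyUser != null && principal != null) {
        sys.error("Only one of --proxy-user or --principal can be provided.")
      }
    }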

RE: How does spark-submit handle Python scripts (and how to repeat it)?

2016-04-13 Thread Sun, Rui
In SparkSubmit, there is less work for yarn-client than for yarn-cluster. Basically it prepares some Spark configurations as system properties, for example, information on additional resources required by the application that need to be distributed to the cluster. These configurations will be used
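
A sketch of the mechanism (SparkConf loads any spark.* system property by default; the property value below is hypothetical):

    // SparkConf(loadDefaults = true) picks up -Dspark.* system properties,
    // which is how spark-submit hands prepared settings to the application
    System.setProperty("spark.yarn.dist.files", "/tmp/extra.conf") // hypothetical resource
    val conf = new org.apache.spark.SparkConf() // now contains spark.yarn.dist.files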

Memory needs when using expensive operations like groupBy

2016-04-13 Thread Divya Gehlot
Hi, I am using Spark 1.5.2 with Scala 2.10, and my Spark job keeps failing with exit code 143 in the one job where I am using unionAll and a groupBy operation on multiple columns. Please advise me on options to optimize it. The one option which I am using now is --conf spark.executor.extraJavaOp

RE: 回复: build/sbt gen-idea error

2016-04-13 Thread Yu, Yucai
Reminder: gen-idea has been removed on master. See: commit a172e11cba6f917baf5bd6c4f83dc6689932de9a Author: Luciano Resende Date: Mon Apr 4 16:55:59 2016 -0700 [SPARK-14366] Remove sbt-idea plugin ## What changes were proposed in this pull request? Remove sbt-idea plugin as i

pyspark EOFError after calling map

2016-04-13 Thread Pete Werner
Hi I am new to spark & pyspark. I am reading a small csv file (~40k rows) into a dataframe. from pyspark.sql import functions as F df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('/tmp/sm.csv') df = df.withColumn('verified', F.when(df['veri

Error starting httpd after the installation using the spark-ec2 script

2016-04-13 Thread Mohed Alibrahim
Dear All, I installed Spark 1.6.1 on Amazon EC2 using the spark-ec2 script. Everything was OK, but it failed to start httpd at the end of the installation. I followed the instructions exactly and repeated the process many times, but no luck. - [timing] rstudio setup: 00h

Re: How does spark-submit handle Python scripts (and how to repeat it)?

2016-04-13 Thread Andrei
> > Julia can pick up the env var, and set the system properties or directly fill > the configurations into a SparkConf, and then create a SparkContext That's the point: just setting master to "yarn-client" doesn't work, even in Java/Scala. E.g. the following code in *Scala*: val conf = new SparkConf
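
For reference, a minimal sketch of the approach under discussion (assuming SPARK_HOME and HADOOP_CONF_DIR are set in the environment; as noted above, setting the master alone may not be sufficient):

    import org.apache.spark.{SparkConf, SparkContext}

    // Create a yarn-client context programmatically (Spark 1.x style)
    val conf = new SparkConf().setMaster("yarn-client").setAppName("FromCode")
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 100).sum())
    sc.stop()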

Re: Strange bug: Filter problem with parenthesis

2016-04-13 Thread Michael Armbrust
You need to use `backticks` to reference columns that have non-standard characters. On Wed, Apr 13, 2016 at 6:56 AM, wrote: > Hi, > > I am debugging a program, and for some reason, a line calling the > following is failing: > > df.filter("sum(OpenAccounts) > 5").show > > It says it cannot find t
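
Applied to the failing line in the quoted message above, the fix would look like this (a sketch; df and the column name are taken from the original post):

    // Backticks make the parser treat sum(OpenAccounts) as a literal column name
    // rather than a call to the sum() function
    df.filter("`sum(OpenAccounts)` > 5").show()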

error "Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe."

2016-04-13 Thread AlexModestov
I get this error. Does anyone know what it means? Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe. : org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: org.apache.spark.storage.BlockFetchE

RE: how does sc.textFile translate regex in the input.

2016-04-13 Thread Yong Zhang
It is described in "Hadoop: The Definitive Guide", chapter 3, FilePatterns: https://www.safaribooksonline.com/library/view/hadoop-the-definitive/9781449328917/ch03.html#FilePatterns Yong From: pradeep1...@gmail.com Date: Wed, 13 Apr 2016 18:56:58 + Subject: how does sc.textFile translate regex in the

Re: Py4JJavaError: An error occurred while calling o115.parquet. _metadata is not a Parquet file (too small)

2016-04-13 Thread Mich Talebzadeh
Actually, how many tables are involved here, and what is the version of Hive used? Sorry, I have no idea about the Cloudera 5.5.1 spec. HTH Dr Mich Talebzadeh

how does sc.textFile translate regex in the input.

2016-04-13 Thread Pradeep Nayak
I am trying to understand how Spark's sc.textFile() works, and specifically how it translates paths with a regex in them. For example: files = sc.textFile("hdfs://:/file1/*/*/*/*.txt") How does it find all the sub-directories and recurse to all the leaf files? Is there a
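
As the reply above notes, this is covered by Hadoop's FilePatterns: the path is expanded as a glob (not a full regex) by the underlying FileInputFormat. A small sketch with a hypothetical host and layout:

    // Each * matches one path segment; this picks up leaf files such as
    // hdfs://namenode:8020/file1/2016/04/13/part-0.txt
    val files = sc.textFile("hdfs://namenode:8020/file1/*/*/*/*.txt")
    println(files.count())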

Py4JJavaError: An error occurred while calling o115.parquet. _metadata is not a Parquet file (too small)

2016-04-13 Thread pseudo oduesp
Hi guys, I have this error after 5 hours of processing. I make a lot of joins, 14 left joins with a small table. In the Spark UI and console log everything looks OK, but when it saves the last join I get this error: Py4JJavaError: An error occurred while calling o115.parquet. _metadata is not a Parquet f

Error starting Spark 1.6.1

2016-04-13 Thread Mohed Alibrahim
Dear All, I installed Spark 1.6.1 on Amazon EC2 using the spark-ec2 script. Everything was OK, but it failed to start httpd at the end of the installation. I followed the instructions exactly and repeated the process many times, but no luck. - [timing] rstudio setup: 00h

Re: Logging in executors

2016-04-13 Thread Carlos Rojas Matas
Hi Yong, thanks for your response. As I said in my first email, I've tried both the reference to the classpath resource (env/dev/log4j-executor.properties) and the file:// protocol. Also, the driver logging is working fine and I'm using the same kind of reference. Below is the content of my classpath

Re: Silly question...

2016-04-13 Thread Mich Talebzadeh
These are the components: java -version: java version "1.8.0_77", Java(TM) SE Runtime Environment (build 1.8.0_77-b03), Java HotSpot(TM) 64-Bit Server VM (build 25.77-b03, mixed mode). hadoop version: Hadoop 2.6.0, Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r e3496499ecb8d220fba9

RE: Logging in executors

2016-04-13 Thread Yong Zhang
Is the env/dev/log4j-executor.properties file within your jar file? Does the path match what you specified as env/dev/log4j-executor.properties? If you read the log4j document here: https://logging.apache.org/log4j/1.2/manual.html When you specify log4j.configuration=my_custom.propertie

Re: Logging in executors

2016-04-13 Thread Carlos Rojas Matas
Thanks for your response Ted. You're right, there was a typo. I changed it, now I'm executing: bin/spark-submit --master spark://localhost:7077 --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=env/dev/log4j-driver.properties" --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration

Blog: Better Feature Engineering with Spark, Solr, and Lucene Analyzers

2016-04-13 Thread Steve Rowe
FYI, I wrote functionality to enable Lucene text analysis components to be used to extract text features via a transformer in spark.ml pipelines. Non-machine-learning uses are supported too. See my blog describing the capabilities, which are included in the open-source spark-solr project:

Re: Silly question...

2016-04-13 Thread Michael Segel
Mich, are you building your own releases from the source? Which version of Scala? Again, the builds seem to be OK and working, but I don't want to hit some 'gotcha' if I can avoid it. > On Apr 13, 2016, at 7:15 AM, Mich Talebzadeh > wrote: > > Hi, > > I am not sure this helps. > > we

Re: Streaming WriteAheadLogBasedBlockHandler disallows parallelism via StorageLevel replication factor

2016-04-13 Thread Ted Yu
w.r.t. the effective storage level log, here is the JIRA which introduced it: [SPARK-4671][Streaming]Do not replicate streaming block when WAL is enabled On Wed, Apr 13, 2016 at 7:43 AM, Patrick McGloin wrote: > Hi all, > > If I am using a Custom Receiver with Storage Level set to StorageLevel.

EMR Spark log4j and metrics

2016-04-13 Thread Peter Halliday
I have an existing cluster that I stand up via Docker images and CloudFormation templates on AWS. We are moving to EMR and an AWS Data Pipeline process, and are having problems with metrics and log4j. We've sent a JSON configuration for spark-log4j and spark-metrics. The log4j file seems to be basi

Re: Spark acessing secured HDFS

2016-04-13 Thread vijikarthi
Looks like the support does not exist, unless someone can counter that, and there is an open JIRA: https://issues.apache.org/jira/browse/SPARK-12909

Re: Logging in executors

2016-04-13 Thread Ted Yu
bq. --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=env/dev/log4j-driver.properties" I think the above may have a typo: you refer to log4j-driver.properties in both arguments. FYI On Wed, Apr 13, 2016 at 8:09 AM, Carlos Rojas Matas wrote: > Hi guys, > > I'm trying to enable log

Logging in executors

2016-04-13 Thread Carlos Rojas Matas
Hi guys, I'm trying to enable logging in the executors but with no luck. According to the official documentation and several blogs, this should be done by passing "spark.executor.extraJavaOpts=-Dlog4j.configuration=[my-file]" to the spark-submit tool. I've tried both sending a reference to a clas

Streaming WriteAheadLogBasedBlockHandler disallows parallelism via StorageLevel replication factor

2016-04-13 Thread Patrick McGloin
Hi all, If I am using a Custom Receiver with Storage Level set to StorageLevel.MEMORY_ONLY_SER_2 and the WAL enabled I get this Warning in the logs: 16/04/13 14:03:15 WARN WriteAheadLogBasedBlockHandler: Storage level replication 2 is unnecessary when write ahead log is enabled, change to replic
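
The change the warning asks for, as a sketch (assuming ssc is an existing StreamingContext and MyCustomReceiver is the custom receiver in question):

    import org.apache.spark.storage.StorageLevel

    // With the WAL enabled the block is already persisted durably to the log,
    // so a single in-memory serialized copy is considered sufficient
    val stream = ssc.receiverStream(new MyCustomReceiver(StorageLevel.MEMORY_ONLY_SER))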

Re: Silly question...

2016-04-13 Thread Mich Talebzadeh
Hi, I am not sure this helps. We use Spark 1.6 and Hive 2. I also use JDBC (beeline for Hive) plus Oracle and Sybase. They all work fine. HTH Dr Mich Talebzadeh

Spark fileStream from a partitioned hive dir

2016-04-13 Thread Daniel Haviv
Hi, We have a Hive table which gets data written to it under two partition keys, day and hour. We would like to stream the incoming files, and since fileStream can only listen on one directory, we start a streaming job on the latest partition and every hour kill it and start a new one on a newer partition
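
A sketch of one iteration of that hourly scheme (table path and partition values are hypothetical; assuming ssc is a StreamingContext):

    // Listen only on the current hour's partition directory; an external
    // scheduler stops this job and starts one on the next partition
    val lines = ssc.textFileStream("hdfs:///warehouse/mytable/day=2016-04-13/hour=10")
    lines.count().print()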

Strange bug: Filter problem with parenthesis

2016-04-13 Thread Saif.A.Ellafi
Hi, I am debugging a program, and for some reason, a line calling the following is failing: df.filter("sum(OpenAccounts) > 5").show It says it cannot find the column OpenAccounts, as if it were applying the sum() function and looking for a column named like that, which does not exist. This work

Re: build/sbt gen-idea error

2016-04-13 Thread ImMr.K
Actually, the same error occurred when I ran build/sbt compile or other commands. After struggling for some time, I remembered that I use a proxy to connect to the Internet. After setting the proxy for Maven, everything seems OK. Just a reminder for those who use proxies. -- Best regards, Ze Jin -

Re: Please assist: Spark 1.5.2 / cannot find StateSpec / State

2016-04-13 Thread Matthias Niehoff
StateSpec and the mapWithState method are only available in Spark 1.6.x. 2016-04-13 11:34 GMT+02:00 Marco Mistroni : > hi all > i am trying to replicate the Streaming Wordcount example described here > > > https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/exa
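
For reference, a minimal Spark 1.6 sketch of mapWithState, modeled on the StatefulNetworkWordCount example linked in the quoted message (assuming wordDstream is a DStream[(String, Int)]):

    import org.apache.spark.streaming.{State, StateSpec}

    // Keep a running count per word; the State object holds the accumulated sum
    val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
      val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
      state.update(sum)
      (word, sum)
    }
    val stateDstream = wordDstream.mapWithState(StateSpec.function(mappingFunc))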

Problem with History Server

2016-04-13 Thread alvarobrandon
Hello: I'm using the history server to keep track of the applications I run in my cluster. I'm using Spark with YARN. When I run an application it finishes correctly, and even YARN says that it finished. This is the result of the YARN Resource Manager API: {u'app': [{u'runningContainers': -1, u'allocat

Performance Tuning | Shark 0.9.1 with Spark 1.0.2

2016-04-13 Thread N, Manjunath 3. (EXT - IN/Noida)
Hi, I am trying to improve query performance, and I am not sure how to go about this in Shark/Spark. Here is my problem: when I execute a query it is run twice, and here is a summary. First Filesink's runjob runs, and next mapPartition is executed. 1. Filesink always uses only one job; is there

Re: S3n performance (@AaronDavidson)

2016-04-13 Thread Gourav Sengupta
Hi, I stopped working with s3n a long time ago. If you are working with Parquet and writing files, s3a is the only alternative that avoids failures. Otherwise, why not just use s3://? Regards, Gourav On Wed, Apr 13, 2016 at 12:17 PM, Steve Loughran wrote: > > On 12 Apr 2016, at 22:05, Martin
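
A sketch of the switch to s3a (bucket name and credential values are hypothetical; on EMR the instance role normally supplies credentials, so the explicit keys are often unnecessary):

    // s3a is the actively maintained S3 connector in Hadoop 2.6+
    sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")
    val df = sqlContext.read.parquet("s3a://my-bucket/path/to/data/")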

Re: S3n performance (@AaronDavidson)

2016-04-13 Thread Steve Loughran
On 12 Apr 2016, at 22:05, Martin Eden wrote: Hi everyone, Running on EMR 4.3 with Spark 1.6.0 and the provided S3N native driver I manage to process approx 1TB of strings inside gzipped parquet in about 50 mins on a 20 node cluster (8 cores, 60GB RAM). That's

RE: ML Random Forest Classifier

2016-04-13 Thread Ashic Mahtab
It looks like all of that is building up to spark 2.0 (for random forests / gbts / etc.). Ah well...thanks for your help. Was interesting digging into the depths. Date: Wed, 13 Apr 2016 09:48:32 +0100 Subject: Re: ML Random Forest Classifier From: ja...@gluru.co To: as...@live.com CC: user@spark

Please assist: Spark 1.5.2 / cannot find StateSpec / State

2016-04-13 Thread Marco Mistroni
hi all, i am trying to replicate the Streaming Wordcount example described here: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala In my build.sbt i have the following dependencies. libraryDependencies += "o
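
As the reply above notes, mapWithState requires Spark 1.6.x; a sketch of the relevant build.sbt line (the exact version is chosen for illustration):

    // StateSpec / mapWithState first appeared in Spark 1.6.0
    libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.6.1" % "provided"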

Re: spark.driver.extraClassPath and export SPARK_CLASSPATH

2016-04-13 Thread AlexModestov
In "spark-defaults.conf" I wrote spark.driver.extraClassPath '/dir', or: "PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" /.../sparkling-water-1.6.1/bin/pysparkling \ --conf spark.driver.extraClassPath='/.../sqljdbc41.jar' Nothing works.

Re: ML Random Forest Classifier

2016-04-13 Thread James Hammerton
Hi Ashic, Unfortunately I don't know how to work around that - I suggested this line as it looked promising (I had considered it once before deciding to use a different algorithm) but I never actually tried it. Regards, James On 13 April 2016 at 02:29, Ashic Mahtab wrote: > It looks like the

Spark 1.6.0 - token renew failure

2016-04-13 Thread Luca Rea
Hi, I'm testing the Livy server with Hue 3.9 and Spark 1.6.0 inside a kerberized cluster (HDP 2.4). When I run the command /usr/java/jdk1.7.0_71//bin/java -Dhdp.version=2.4.0.0-169 -cp /usr/hdp/2.4.0.0-169/spark/conf/:/usr/hdp/2.4.0.0-169/spark/lib/spark-assembly-1.6.0.2.4.0.0-169-hadoop2.7.1.2.4.0

RE: Unable to Access files in Hadoop HA enabled from using Spark

2016-04-13 Thread ashesh_28
Are you running from Eclipse? If so, add the *HADOOP_CONF_DIR* path to the classpath, and then you can access your HDFS directory as below: object sparkExample { def main(args: Array[String]){ val logname = "///user/hduser/input/sample.txt" val conf = new SparkConf().setAppName("SimpleA
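
A completed sketch of that snippet (the HA nameservice name "mycluster" and the path are hypothetical; HADOOP_CONF_DIR must be on the classpath so the logical nameservice resolves):

    import org.apache.spark.{SparkConf, SparkContext}

    object SparkExample {
      def main(args: Array[String]): Unit = {
        // "mycluster" stands in for the logical HA nameservice from hdfs-site.xml
        val logname = "hdfs://mycluster/user/hduser/input/sample.txt"
        val conf = new SparkConf().setAppName("SimpleApp")
        val sc = new SparkContext(conf)
        println(sc.textFile(logname).count())
        sc.stop()
      }
    }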