hbase Put object kryo serialisation error

2015-12-09 Thread Shushant Arora
Hi, I have a JavaPairRDD pairrdd. When I do rdd.persist(StorageLevel.MEMORY_AND_DISK()) it throws an exception: com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 100 serialization trace: familyMap(org.apache.hadoop.hbase.client.Put) I have registered Put and TreeMap in sp
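For reference, a minimal sketch of the registration side, assuming Kryo is enabled through SparkConf and the HBase client jars are on the classpath; the class list is an assumption driven by the serialization trace above (the familyMap inside Put is backed by a TreeMap), not something confirmed in the original mail:

    import org.apache.spark.SparkConf
    import org.apache.hadoop.hbase.client.Put

    // Sketch only: register the classes named in the Kryo serialization trace.
    // Depending on what the persisted RDD actually holds, more classes may need registering.
    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(
        classOf[Put],
        classOf[java.util.TreeMap[_, _]]
      ))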

GLM in apache spark in MLlib

2015-12-09 Thread Arunkumar Pillai
Hi, I started using Apache Spark version 1.5.2. I'm able to see GLM using SparkR but it is not there in MLlib. Are there any plans or a roadmap for that? -- Thanks and Regards Arun

Re: Spark Stream Monitoring with Kafka Direct API

2015-12-09 Thread Saisai Shao
I think this is the right JIRA to fix this issue ( https://issues.apache.org/jira/browse/SPARK-7111). It should be in Spark 1.4. On Thu, Dec 10, 2015 at 12:32 AM, Cody Koeninger wrote: > Looks like probably > > https://issues.apache.org/jira/browse/SPARK-8701 > > so 1.5.0 > > On Wed, Dec 9, 2015

Scheduler delay in spark 1.4

2015-12-09 Thread Renu Yadav
Hi, I am working on spark 1.4. I am running a Spark job on a YARN cluster. When the number of other jobs is low my Spark job completes very smoothly, but when more small jobs run on the cluster my Spark job starts showing scheduler delay at the end of each stage. PS: I am running my sp

sortByKey not spilling to disk? (PySpark 1.3)

2015-12-09 Thread YaoPau
I'm running sortByKey on a dataset that's nearly the size of the memory I've provided to executors (I'd like to keep the amount of used memory low so other jobs can run), and I'm getting the vague "filesystem closed" error. When I re-run with more memory it runs fine. By default shouldn't sortByKey

Re: HTTP Source for Spark Streaming

2015-12-09 Thread Vijay Gharge
I am not sure if Spark natively supports this functionality. A custom poller class can query HTTP resources at a configured interval and dump them to HDFS or other stores in CSV or JSON format. Using a lambda architecture (AWS) or by invoking the sc context you can use these values for further processing On Thursday 1

Spark 1.5.2 error on quitting spark in windows 7

2015-12-09 Thread skypickle
If I start spark-shell then just quit, I get an error. scala> :q Stopping spark context. 15/12/09 23:43:32 ERROR ShutdownHookManager: Exception while deleting Spark temp dir: C:\Users\Stefan\AppData\Local\Temp\spark-68d3a813-9c55-4649-aa7a-5fc269e669e7 java.io.IOException: Failed to delete: C:\Us

IP error on starting spark-shell on windows 7

2015-12-09 Thread Stefan Karos
On starting spark-shell I see this just before the scala prompt: WARN : Your hostname, BloomBear-SSD resolves to a loopback/non-reachable address: fe80:0:0:0:0:5efe:c0a8:317%net10, but we couldn't find any external IP address! I get this error even when firewall is disabled. I also tried setting

Re: HTTP Source for Spark Streaming

2015-12-09 Thread Vijay Gharge
Not very clear. Can you elaborate on your use case? On Thursday 10 December 2015, Sourav Mazumder wrote: > Hi All, > > Currently is there a way using which one can connect to a http server to > get data as a dstream at a given frequency ? > > Or one has to write own utility for the same ? > > Rega

Re: Filtering records based on empty value of column in SparkSql

2015-12-09 Thread Gokula Krishnan D
Please refer to the link; drop() provides options for dropping rows with null values in some or all columns. Hope it also helps. https://spark.apache.org/docs/1.5.2/api/scala/index.html#org.apache.spark.sql.DataFrameNaFunctions Thanks & Regards, Gokula Krishnan (Gokul) On Wed, Dec 9, 2015 at 11:12 AM,
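A minimal sketch of that drop() call, assuming a DataFrame named df and a column named "dpid" (both placeholders); note drop() handles null (and NaN) values only, so empty strings still need the explicit filter discussed elsewhere in this thread:

    // Drop rows where the given column is null.
    val nonNullDf = df.na.drop(Seq("dpid"))

    // Or drop rows that contain a null in any column.
    val fullyPopulatedDf = df.na.drop()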

Re: SparkStreaming variable scope

2015-12-09 Thread Harsh J
> and then calling getRowID() in the lambda, because the function gets sent to the executor right? Yes, that is correct (vs. a one time evaluation, as was with your assignment earlier). On Thu, Dec 10, 2015 at 3:34 AM Pinela wrote: > Hey Bryan, > > Thank for the answer ;) I knew it was a basic

Re: how to reference aggregate columns

2015-12-09 Thread Harsh J
While the DataFrame lookups can identify that anonymous column name, SparkSql does not appear to do so. You should use an alias instead: val rows = Seq (("X", 1), ("Y", 5), ("Z", 4)) val rdd = sc.parallelize(rows) val dataFrame = rdd.toDF("user","clicks") val sumDf = dataFrame.groupBy("user").agg(
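The snippet is cut off at agg(; a sketch of how the aliasing approach could continue, assuming an existing sc and sqlContext with implicits imported (the alias name sum_clicks is an assumption, not taken from the original mail):

    import sqlContext.implicits._
    import org.apache.spark.sql.functions.{sum, max}

    val rows = Seq(("X", 1), ("Y", 5), ("Z", 4))
    val dataFrame = sc.parallelize(rows).toDF("user", "clicks")

    // Give the aggregate a stable name instead of relying on "SUM(clicks)".
    val sumDf = dataFrame.groupBy("user").agg(sum("clicks").as("sum_clicks"))

    // The alias can then be referenced in later expressions.
    sumDf.agg(max("sum_clicks")).show()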

how to reference aggregate columns

2015-12-09 Thread skaarthik oss
I am trying to process an aggregate column. The name of the column is "SUM(clicks)", which is automatically assigned when I use the SUM operator on a column named "clicks". I am trying to find the max value in this aggregated column. However, using the max operator on this aggregated column results in pars

count distinct in spark sql aggregation

2015-12-09 Thread fightf...@163.com
Hi, I have a use case that needs to get daily, weekly or monthly active user counts from the native hourly data, say a large dataset. The native datasets are constantly updated and I want to get the distinct active user count per time dimension. Can anyone show some efficient way of r
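One possible shape of the query, sketched with DataFrame aggregation under assumed names (a DataFrame called events with event_time and user_id columns, none of which come from the original mail); countDistinct gives exact counts, and approxCountDistinct is the cheaper option when an estimate is acceptable:

    import org.apache.spark.sql.functions.{col, countDistinct, to_date, date_format}

    // Daily distinct active users (column names are placeholders).
    val dailyActive = events
      .groupBy(to_date(col("event_time")).as("day"))
      .agg(countDistinct(col("user_id")).as("dau"))

    // Monthly distinct active users keyed on year-month.
    val monthlyActive = events
      .groupBy(date_format(col("event_time"), "yyyy-MM").as("month"))
      .agg(countDistinct(col("user_id")).as("mau"))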

Re: RDD.isEmpty

2015-12-09 Thread Pat Ferrel
I ran a 124M dataset on my laptop. With isEmpty it took 32 minutes; without isEmpty it took 18 minutes. All but 1.5 minutes were spent writing to Elasticsearch, which is on the same laptop. So excluding the time writing to Elasticsearch, which was nearly the same in both cases, the core Spark code took

Re: distcp suddenly broken with spark-ec2 script setup

2015-12-09 Thread Alex Gittens
BTW, yes the referenced s3 bucket does exist, and hdfs dfs -ls s3n://agittens/CFSRArawtars does list the entries, although it first prints the same warnings: 2015-12-10 00:26:53,815 WARN httpclient.RestS3Service (RestS3Service.java:performRequest(393)) - Response '/CFSRArawtars' - Unexpected res

distcp suddenly broken with spark-ec2 script setup

2015-12-09 Thread AlexG
I've been using the same method to launch my clusters then pull my data from S3 to local hdfs: $SPARKHOME/ec2/spark-ec2 -k mykey -i ~/.ssh/mykey.pem -s 29 --instance-type=r3.8xlarge --placement-group=pcavariants --copy-aws-credentials --hadoop-major-version=2 --spot-price=2.8 launch mycluster --re

StackOverflowError when writing dataframe to table

2015-12-09 Thread apu mishra . rr
The command mydataframe.write.saveAsTable(name="tablename") sometimes results in java.lang.StackOverflowError (see below for fuller error message). This is after I am able to successfully run cache() and show() methods on mydataframe. The issue is not deterministic, i.e. the same code sometimes

Re: Re: Spark RDD cache persistence

2015-12-09 Thread Calvin Jia
Hi Deepak, For persistence across Spark jobs, you can store and access the RDDs in Tachyon. Tachyon works with ramdisk which would give you similar in-memory performance you would have within a Spark job. For more information, you can take a look at the docs on Tachyon-Spark integration: http://t

Re: spark shared RDD

2015-12-09 Thread Calvin Jia
Hi Ben, Tachyon can be used to share data between spark jobs. If you specify the input to your jobs as a Tachyon path, you can leverage Tachyon's memory centric storage on reads, improving the performance when reading the same dataset multiple times. The examples on this page may be helpful: http:

Re: Release date for spark 1.6?

2015-12-09 Thread Michael Armbrust
The release date is "as soon as possible". In order to make an Apache release we must present a release candidate and have 72 hours of voting by the PMC. As soon as there are no known bugs, the vote will pass and 1.6 will be released. In the meantime, I'd love support from the community testing

Re: Saving RDDs in Tachyon

2015-12-09 Thread Calvin Jia
Hi Mark, Were you able to successfully store the RDD with Akhil's method? When you read it back as an objectFile, you will also need to specify the correct type. You can find more information about integrating Spark and Tachyon on this page: http://tachyon-project.org/documentation/Running-Spark-
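A minimal sketch of the round trip Calvin describes, assuming pairs of (String, Int) and a Tachyon master at localhost:19998 (both placeholders):

    // Save the RDD as serialized objects on Tachyon.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
    pairs.saveAsObjectFile("tachyon://localhost:19998/rdds/pairs")

    // Reading it back requires stating the element type explicitly.
    val restored = sc.objectFile[(String, Int)]("tachyon://localhost:19998/rdds/pairs")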

Re: Release date for spark 1.6?

2015-12-09 Thread Sri
Hi Ted, Thanks for the info, but there is no particular release date; from my understanding the package is in testing and no release date has been mentioned. Thanks Sri Sent from my iPhone > On 9 Dec 2015, at 21:38, Ted Yu wrote: > > See this thread: > > http://search-hadoop.com/m/q3RTtBMZpK7

Re: Sharing object/state accross transformations

2015-12-09 Thread JayKay
Does anybody have a hint for me? Maybe it's too trivial and I'm just not seeing it. Please give me some advice. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Sharing-object-state-accross-transformations-tp25544p25655.html Sent from the Apache Spark User Li

Re: SparkStreaming variable scope

2015-12-09 Thread Pinela
Hey Bryan, Thanks for the answer ;) I knew it was a basic Python/Spark-noob thing :) This also worked: def getRowID(): return datetime.now().strftime("%Y%m%d%H%M%S") and then calling getRowID() in the lambda, because the function gets sent to the executor, right? Thanks again for the quick

RE: can i process multiple batch in parallel in spark streaming

2015-12-09 Thread Tim Barthram
Have a look at creating a scheduler allocation file with fair scheduling; each pool in the allocation file uses schedulingMode FAIR with weight 1 and minShare 2. Set the following: def settingsMap = Map(("spark.scheduler.allocation.file", schedulerAllocationFile), ("spark.scheduler.mode"
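A minimal sketch of wiring those two properties into a SparkConf, assuming a fairscheduler.xml already exists on the driver and a pool name of "pool1" (the path and pool name are placeholders, not from the original mail):

    import org.apache.spark.{SparkConf, SparkContext}

    // The allocation file defines the pools; each pool described in the mail above
    // uses schedulingMode FAIR, weight 1 and minShare 2.
    val conf = new SparkConf()
      .setAppName("fair-scheduling-sketch")
      .set("spark.scheduler.mode", "FAIR")
      .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")

    val sc = new SparkContext(conf)

    // Jobs submitted from this thread go to the named pool.
    sc.setLocalProperty("spark.scheduler.pool", "pool1")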

Is Spark History Server supported for Mesos?

2015-12-09 Thread Kelvin Chu
Spark on YARN can use History Server by setting the configuration spark.yarn.historyServer.address. But, I can't find similar config for Mesos. Is History Server supported by Spark on Mesos? Thanks. Kelvin

Re: HiveContext creation failed with Kerberos

2015-12-09 Thread Neal Yin
Thanks Steve! I tried both 1.5.3 and 1.6.0 from a git clone build. 1.5.3 is still broken; 1.6.0 does work! This is probably the one: https://issues.apache.org/jira/browse/SPARK-11821 -Neal From: Steve Loughran <ste...@hortonworks.com> Date: Tuesday, December 8, 2015 at 4:09 AM To

Re: Release date for spark 1.6?

2015-12-09 Thread Ted Yu
See this thread: http://search-hadoop.com/m/q3RTtBMZpK7lEFB1/Spark+1.6.0+RC&subj=Re+VOTE+Release+Apache+Spark+1+6+0+RC1+ > On Dec 9, 2015, at 1:20 PM, "kali.tumm...@gmail.com" > wrote: > > Hi All, > > does anyone know exact release data for spark 1.6 ? > > Thanks > Sri > > > > -- > View

Re: Recursive nested wildcard directory walking in Spark

2015-12-09 Thread James Ding
I’ve set “mapreduce.input.fileinputformat.input.dir.recursive” to “true” in the SparkConf I use to instantiate SparkContext, and I confirm this at runtime in my Scala job by printing out this property, but sparkContext.textFile(“/foo/*/bar/*.gz”) still fails (so do /foo/**/bar/*.gz and /foo/*/*/bar/*.
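For comparison, a sketch that puts the flag on the SparkContext's Hadoop configuration (rather than on SparkConf) and reads through the new-API input format, which is the combination discussed in the thread Ted Yu references in his reply; /foobar is the parent directory from the original post, and whether this actually picks up the nested .gz files is exactly what is being verified here:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // Set the recursion flag where FileInputFormat reads it at job submission time.
    sc.hadoopConfiguration.set(
      "mapreduce.input.fileinputformat.input.dir.recursive", "true")

    // Point the job at the top-level directory instead of a fixed-depth glob.
    val lines = sc
      .newAPIHadoopFile("/foobar", classOf[TextInputFormat],
        classOf[LongWritable], classOf[Text])
      .map(_._2.toString)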

HTTP Source for Spark Streaming

2015-12-09 Thread Sourav Mazumder
Hi All, Currently is there a way using which one can connect to a http server to get data as a dstream at a given frequency ? Or one has to write own utility for the same ? Regards, Sourav

Release date for spark 1.6?

2015-12-09 Thread kali.tumm...@gmail.com
Hi All, does anyone know the exact release date for Spark 1.6? Thanks Sri -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Release-data-for-spark-1-6-tp25654.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --

Re: Multiple drivers, same worker

2015-12-09 Thread andresb...@gmail.com
Ok, attached you can see the jstack 2015-12-09 14:22 GMT-06:00 andresb...@gmail.com : > Sadly, no. > > The only evidence I have is the master's log which shows that the Driver > was requested: > > 15/12/09 18:25:06 INFO Master: Driver submitted > org.apache.spark.deploy.worker.DriverWrapper > 15/

Re: RegressionModelEvaluator (from jpmml) NotSerializableException when instantiated in the driver

2015-12-09 Thread Utkarsh Sengar
To add some context about JPMML and spark: "The JPMML-Evaluator library depends on JPMML-Model and Google Guava library versions that are in conflict with the ones that are bundled with Apache Spark and/or Apache Hadoop. I solved this using mvn shade plugin by using a different namespace for JPMML-

Re: can i process multiple batch in parallel in spark streaming

2015-12-09 Thread prateek arora
Hi, thanks. In my scenario the batches are independent, so is it safe to use in a production environment? Regards Prateek On Wed, Dec 9, 2015 at 11:39 AM, Ted Yu wrote: > Have you seen this thread ? > > http://search-hadoop.com/m/q3RTtgSGrobJ3Je > > On Wed, Dec 9, 2015 at 11:12 AM, prateek arora > w

Re: Recursive nested wildcard directory walking in Spark

2015-12-09 Thread Ted Yu
Have you seen this thread ? http://search-hadoop.com/m/q3RTt2uhMX1UhnCc1&subj=Re+Does+sc+newAPIHadoopFile+support+multiple+directories+or+nested+directories+ FYI On Wed, Dec 9, 2015 at 11:18 AM, James Ding wrote: > Hi! > > My name is James, and I’m working on a question there doesn’t seem to b

SparkML. RandomForest predict performance for small dataset.

2015-12-09 Thread Eugene Morozov
Hello, I'm using the RandomForest pipeline (ml package). Everything is working fine (learning models, prediction, etc.), but I'd like to tune it for the case when I predict with a small dataset. My issue is that when I apply (PipelineModel)model.transform(dataset) The model consists of the following s

Re: Multiple drivers, same worker

2015-12-09 Thread andresb...@gmail.com
Sadly, no. The only evidence I have is the master's log which shows that the Driver was requested: 15/12/09 18:25:06 INFO Master: Driver submitted org.apache.spark.deploy.worker.DriverWrapper 15/12/09 18:25:06 INFO Master: Launching driver driver-20151209182506-0164 on worker worker-2015120918153

Re: Multiple drivers, same worker

2015-12-09 Thread Ted Yu
When this happened, did you have a chance to take jstack of the stuck driver process ? Thanks On Wed, Dec 9, 2015 at 11:38 AM, andresb...@gmail.com wrote: > Forgot to mention that it doesn't happen every time, it's pretty random so > far. We've have complete days when it behaves just fine and o

Re: SparkStreaming variable scope

2015-12-09 Thread Bryan Cutler
rowid from your code is a variable in the driver, so it will be evaluated once and then only the value is sent to words.map. You probably want to have rowid be a lambda itself, so that it will get the value at the time it is evaluated. For example if I have the following: >>> data = sc.paralleli

Re: RDD.isEmpty

2015-12-09 Thread Pat Ferrel
The “Any” is required by the code it is being passed to, which is the Elasticsearch Spark index writing code. The values are actually RDD[(String, Map[String, String])]. No shuffle that I know of. RDDs are created from the output of Mahout SimilarityAnalysis.cooccurrence and are turned into RDD[

Re: RDD.isEmpty

2015-12-09 Thread Sean Owen
On Wed, Dec 9, 2015 at 7:49 PM, Pat Ferrel wrote: > The “Any” is required by the code it is being passed to, which is the > Elasticsearch Spark index writing code. The values are actually RDD[(String, > Map[String, String])] (Is it frequently a big big map by any chance?) > No shuffle that I kno

Re: can i process multiple batch in parallel in spark streaming

2015-12-09 Thread Ted Yu
Have you seen this thread ? http://search-hadoop.com/m/q3RTtgSGrobJ3Je On Wed, Dec 9, 2015 at 11:12 AM, prateek arora wrote: > Hi > > when i run my spark streaming application .. following information show on > application streaming UI. > i am using spark 1.5.0 > > > Batch Time Inp

Re: Multiple drivers, same worker

2015-12-09 Thread andresb...@gmail.com
Forgot to mention that it doesn't happen every time; it's pretty random so far. We've had complete days when it behaves just fine and others when it gets crazy. We're using Spark 1.5.2 2015-12-09 13:33 GMT-06:00 andresb...@gmail.com : > Hi everyone, > > We've been getting an issue with spark lat

Multiple drivers, same worker

2015-12-09 Thread andresb...@gmail.com
Hi everyone, We've been getting an issue with Spark lately where multiple drivers are assigned to the same worker but resources are never assigned to them and they get "stuck" forever. If I log in to the worker machine I see that the driver processes aren't really running and the worker's logs don't show

Recursive nested wildcard directory walking in Spark

2015-12-09 Thread James Ding
Hi! My name is James, and I'm working on a question that doesn't seem to have a lot of answers online. I was hoping Spark/Hadoop gurus could shed some light on this. I have a data feed on NFS that looks like /foobar/.gz Currently I have a spark scala job that calls sparkContext.textFile(

can i process multiple batch in parallel in spark streaming

2015-12-09 Thread prateek arora
Hi, when I run my Spark Streaming application the following information shows on the application streaming UI. I am using Spark 1.5.0. Batch Time Input Size Scheduling Delay (?) Processing Time (?) Status 2015/12/09 11:00:42 107 events - - que

RegressionModelEvaluator (from jpmml) NotSerializableException when instantiated in the driver

2015-12-09 Thread Utkarsh Sengar
I am trying to load a PMML file in a spark job. Instantiate it only once and pass it to the executors. But I get a NotSerializableException for org.xml.sax.helpers.LocatorImpl which is used inside jpmml. I have this class Prediction.java: public class Prediction implements Serializable { priva
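One common workaround for a non-serializable helper is to construct it on the executors instead of shipping it from the driver, for example once per partition; a rough sketch of that pattern, in which Features, Score and loadEvaluator are hypothetical stand-ins for the real JPMML types and loading code, and inputRdd is assumed to be an RDD[Features] (nothing below is taken from the original job):

    // Hypothetical placeholder types; the real code would use the JPMML evaluator classes.
    case class Features(values: Map[String, Double])
    case class Score(value: Double)

    // Runs on the executor, so the evaluator itself is never serialized.
    def loadEvaluator(): Features => Score = {
      // parse the PMML file here; dummy scoring logic stands in for the real model
      feats => Score(feats.values.values.sum)
    }

    val scored = inputRdd.mapPartitions { iter =>
      val evaluate = loadEvaluator()   // built once per partition, on the executor
      iter.map(evaluate)
    }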

Re: [Spark-1.5.2][Hadoop-2.6][Spark SQL] Cannot run queries in SQLContext, getting java.lang.NoSuchMethodError

2015-12-09 Thread Michael Armbrust
java.lang.NoSuchMethodError almost always means you have the wrong version of some library (different from what Spark was compiled with) on your classpath; in this case, the Jackson parser. On Wed, Dec 9, 2015 at 10:38 AM, Matheus Ramos wrote: > I have a Java application using *Spark SQL* (*Spa

[Spark-1.5.2][Hadoop-2.6][Spark SQL] Cannot run queries in SQLContext, getting java.lang.NoSuchMethodError

2015-12-09 Thread Matheus Ramos
​I have a Java application using *Spark SQL* (*Spark 1.5.2* using *local mode*), but I cannot execute any SQL commands without getting errors. This is the code I am executing: //confs SparkConf sparkConf = new SparkConf(); sparkConf.set("spark.master","local"); sparkConf.set("spar

Mesos scheduler obeying limit of tasks / executor

2015-12-09 Thread Charles Allen
I have a spark app in development which has relatively strict cpu/mem ratios that are required. As such, I cannot arbitrarily add CPUs to a limited memory size. The general spark cluster behaves as expected, where tasks are launched with a specified memory/cpu ratio, but the mesos scheduler seems

SparkStreaming variable scope

2015-12-09 Thread jpinela
Hi Guys, I am sure this is a simple question, but I can't find it in the docs anywhere. This reads from flume and writes to hbase (as you can see). But has a variable scope problem (I believe). I have the following code: * from pyspark.streaming import StreamingContext from pyspark.streaming.flume

Re: RDD.isEmpty

2015-12-09 Thread Sean Owen
Yes, but what is the code that generates the RDD? Is it a shuffle of something? That could cause checking for any element to be expensive, since computing the RDD at all is expensive. Look at the stages in these long-running jobs. How could isEmpty not be distributed? The driver can't know whether t

Re: RDD.isEmpty

2015-12-09 Thread Pat Ferrel
Err, compiled for Spark 1.3.1, running on 1.5.1, if that makes any difference. The Spark impl is “provided” so it should be using 1.5.1 code afaik. The code is as you see below for isEmpty, so I'm not sure what else it could be measuring since it’s the only Spark thing on the line. I can regen the

Re: RDD.isEmpty

2015-12-09 Thread Sean Owen
Are you sure it's isEmpty? and not an upstream stage? isEmpty is definitely the action here. It doesn't make sense that take(1) is so much faster since it's the "same thing". On Wed, Dec 9, 2015 at 5:11 PM, Pat Ferrel wrote: > Sure, I thought this might be a known issue. > > I have a 122M datase

Re: Content based window operation on Time-series data

2015-12-09 Thread Arun Verma
Thank you for your reply. It is a Scala and Python library. Does a similar library exist for Java? On Wed, Dec 9, 2015 at 10:26 PM, Sean Owen wrote: > CC Sandy as his https://github.com/cloudera/spark-timeseries might be > of use here. > > On Wed, Dec 9, 2015 at 4:54 PM, Arun Verma > wrote: > > Hi

Re: RDD.isEmpty

2015-12-09 Thread Pat Ferrel
Sure, I thought this might be a known issue. I have a 122M dataset, which is the trust and rating data from epinions. The data is split into two RDDs and there is an item properties RDD. The code is just trying to remove any empty RDD from the list. val esRDDs: List[RDD[(String, Map[String, Any

Re: Content based window operation on Time-series data

2015-12-09 Thread Sean Owen
CC Sandy as his https://github.com/cloudera/spark-timeseries might be of use here. On Wed, Dec 9, 2015 at 4:54 PM, Arun Verma wrote: > Hi all, > > We have RDD(main) of sorted time-series data. We want to split it into > different RDDs according to window size and then perform some aggregation > o

Content based window operation on Time-series data

2015-12-09 Thread Arun Verma
Hi all, *We have RDD(main) of sorted time-series data. We want to split it into different RDDs according to window size and then perform some aggregation operation like max, min etc. over each RDD in parallel.* If window size is w then ith RDD has data from (startTime + (i-1)*w) to (startTime + i
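A rough sketch of the split described above, assuming elements of the form (timestampMillis, value) and concrete placeholder values for startTime, w and the number of windows; the i-th RDD covers [startTime + (i-1)*w, startTime + i*w), as in the question:

    // Placeholder data: (timestampMillis, value) pairs, assumed sorted by time.
    val main = sc.parallelize(Seq((10L * 60 * 1000, 1.0), (90L * 60 * 1000, 2.5)))

    val startTime = 0L
    val w = 60 * 60 * 1000L   // one-hour windows (placeholder size)
    val numWindows = 2

    // One RDD per window via a range filter; each can then be aggregated in parallel.
    val windowRdds = (1 to numWindows).map { i =>
      val lo = startTime + (i - 1) * w
      val hi = startTime + i * w
      main.filter { case (ts, _) => ts >= lo && ts < hi }
    }

    // Example aggregation: max value per window (assumes each window is non-empty).
    val maxPerWindow = windowRdds.map(_.values.max())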

Unsubscribe

2015-12-09 Thread Michael Nolting
cancel -- *Michael Nolting* Head of Sevenval FDX *Sevenval Technologies GmbH * FRONT-END-EXPERTS SINCE 1999 Köpenicker Straße 154 | 10997 Berlin office +49 30 707 190 - 278 mail michael.nolt

Re: RDD.isEmpty

2015-12-09 Thread Sean Owen
It should at best collect 1 item to the driver. This means evaluating at least 1 element of 1 partition. I can imagine pathological cases where that's slow, but, do you have any more info? how slow is slow and what is slow? On Wed, Dec 9, 2015 at 4:41 PM, Pat Ferrel wrote: > I’m getting *huge* ex
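For context, isEmpty behaves roughly as follows in Spark 1.x (a paraphrase, not the exact source): an RDD is empty if it has no partitions or if take(1) comes back empty, which is why its cost is essentially the cost of materializing the first non-empty partition upstream:

    // Roughly equivalent to RDD.isEmpty in Spark 1.x (paraphrased sketch).
    def isEmptyLike[T](rdd: org.apache.spark.rdd.RDD[T]): Boolean =
      rdd.partitions.length == 0 || rdd.take(1).length == 0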

RDD.isEmpty

2015-12-09 Thread Pat Ferrel
I’m getting *huge* execution times on a moderately sized dataset during the RDD.isEmpty. Everything in the calculation is fast except an RDD.isEmpty calculation. I’m using Spark 1.5.1 and from researching I would expect this calculation to be linearly proportional to the number of partitions as a

Re: Spark Stream Monitoring with Kafka Direct API

2015-12-09 Thread Dan Dutrow
I'm on spark version 1.4.1. I couldn't find documentation that said it was fixed, so I thought maybe it was still an open issue. Any idea what the fix version is? On Wed, Dec 9, 2015 at 11:10 AM Cody Koeninger wrote: > Which version of spark are you on? I thought that was added to the spark > U

Re: Spark Stream Monitoring with Kafka Direct API

2015-12-09 Thread Cody Koeninger
Looks like probably https://issues.apache.org/jira/browse/SPARK-8701 so 1.5.0 On Wed, Dec 9, 2015 at 10:25 AM, Dan Dutrow wrote: > I'm on spark version 1.4.1. I couldn't find documentation that said it was > fixed, so I thought maybe it was still an open issue. Any idea what the fix > version

Re: spark data frame write.mode("append") bug

2015-12-09 Thread Seongduk Cheon
Not for sure, but I think it is a bug as of 1.5. Spark uses the LIMIT keyword to check whether a table exists. https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L48 If your database does not support the LIMIT keyword, such as S

Re: Filtering records based on empty value of column in SparkSql

2015-12-09 Thread Gokula Krishnan D
Ok, then you can change it slightly, like: [image: Inline image 1] Thanks & Regards, Gokula Krishnan (Gokul) On Wed, Dec 9, 2015 at 11:09 AM, Prashant Bhardwaj < prashant2006s...@gmail.com> wrote: > I have to do opposite of what you're doing. I have to filter non-empty > records. > > On Wed, Dec

Re: Filtering records based on empty value of column in SparkSql

2015-12-09 Thread Prashant Bhardwaj
Anyway, I got it. I have to use !== instead of ===. Thanks BTW. On Wed, Dec 9, 2015 at 9:39 PM, Prashant Bhardwaj < prashant2006s...@gmail.com> wrote: > I have to do opposite of what you're doing. I have to filter non-empty > records. > > On Wed, Dec 9, 2015 at 9:33 PM, Gokula Krishnan D > wrote:
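Putting the thread's conclusion together, a minimal sketch of the null-and-empty filter using the names from the earlier messages (req_logs and req_info.dpid); !== is the Column-level inequality in Spark 1.5, whereas plain != compares the Column objects and yields the Boolean that filter() rejects:

    // Keep records whose dpid is present and non-empty.
    val req_logs_with_dpid = req_logs.filter(
      req_logs("req_info.dpid").isNotNull && (req_logs("req_info.dpid") !== ""))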

Re: can i write only RDD transformation into hdfs or any other storage system

2015-12-09 Thread kali.tumm...@gmail.com
Hi Prateek, you mean writing Spark output to any storage system? Yes, you can. Thanks Sri -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/can-i-write-only-RDD-transformation-into-hdfs-or-any-other-storage-system-tp25637p25651.html Sent from the Apache Sp

Re: Spark Stream Monitoring with Kafka Direct API

2015-12-09 Thread Cody Koeninger
Which version of Spark are you on? I thought that was added to the Spark UI in recent versions. The Direct API doesn't have any inherent interaction with ZooKeeper. If you need the number of messages per batch and aren't on a recent enough version of Spark to see them in the UI, you can get them program

Re: Filtering records based on empty value of column in SparkSql

2015-12-09 Thread Prashant Bhardwaj
I have to do opposite of what you're doing. I have to filter non-empty records. On Wed, Dec 9, 2015 at 9:33 PM, Gokula Krishnan D wrote: > Hello Prashant - > > Can you please try like this : > > For the instance, input file name is "student_detail.txt" and > > ID,Name,Sex,Age > === >

Re: Filtering records based on empty value of column in SparkSql

2015-12-09 Thread Gokula Krishnan D
Hello Prashant - Can you please try like this : For the instance, input file name is "student_detail.txt" and ID,Name,Sex,Age === 101,Alfred,Male,30 102,Benjamin,Male,31 103,Charlie,Female,30 104,Julie,Female,30 105,Maven,Male,30 106,Dexter,Male,30 107,Lundy,Male,32 108,Rita,Female,3

Spark Stream Monitoring with Kafka Direct API

2015-12-09 Thread Dan Dutrow
Is there documentation for how to update the metrics (#messages per batch) in the Spark Streaming tab when using the Direct API? Does the Streaming tab get its information from Zookeeper or something else internally? -- Dan ✆

default parallelism and mesos executors

2015-12-09 Thread Adrian Bridgett
(resending, text only as first post on 2nd never seemed to make it) Using parallelize() on a dataset I'm only seeing two tasks rather than the number of cores in the Mesos cluster. This is with spark 1.5.1 and using the mesos coarse grained scheduler. Running pyspark in a console seems to sh

Re: is repartition very cost

2015-12-09 Thread Daniel Siegmann
Each node can have any number of partitions. Spark will try to have a node process partitions which are already on the node for best performance (if you look at the list of tasks in the UI, look under the locality level column). As a rule of thumb, you probably want 2-3 times the number of partiti

How to interpret executorRunTime?

2015-12-09 Thread Saraswat, Sandeep
Hi, I'm using Spark 1.5.1 and if I look at the JSON data for a running application, every Stage has an "executorRunTime" field associated with it which is typically a 7-digit number for the PageRank application running on a large (1.1 GB) input. Does this represent the execution-time for the st

spark data frame write.mode("append") bug

2015-12-09 Thread kali.tumm...@gmail.com
Hi Spark Contributors, I am trying to append data to a target table using the df.write.mode("append") functionality, but Spark throws a "table already exists" exception. Is there a fix scheduled in a later Spark release? I am using Spark 1.5. val sourcedfmode=sourcedf.write.mode("append") sourcedfmod

Re: Filtering records based on empty value of column in SparkSql

2015-12-09 Thread Prashant Bhardwaj
Already tried it, but I'm getting the following error: overloaded method value filter with alternatives: (conditionExpr: String)org.apache.spark.sql.DataFrame (condition: org.apache.spark.sql.Column)org.apache.spark.sql.DataFrame cannot be applied to (Boolean) Also tried: val req_logs_with_dpid = req_l

Re: Filtering records based on empty value of column in SparkSql

2015-12-09 Thread Fengdong Yu
val req_logs_with_dpid = req_logs.filter(req_logs("req_info.pid") != "" ) Azuryy Yu Sr. Infrastructure Engineer cel: 158-0164-9103 wetchat: azuryy On Wed, Dec 9, 2015 at 7:43 PM, Prashant Bhardwaj < prashant2006s...@gmail.com> wrote: > Hi > > I have two columns in my json which can have null,

ALS with repeated entries

2015-12-09 Thread Roberto Pagliari
What happens with ALS when the same pair of user/item appears more than once with either the same ratings or different ratings?

HiveContext.read.orc - buffer size not respected after setting it

2015-12-09 Thread Fabian Böhnlein
Hello everyone, I'm hitting the exception below when reading an ORC file with the default HiveContext after setting hive.exec.orc.default.buffer.size to 1517137. See below for details. Is there another buffer parameter that is relevant, or another place where I could set it? Any other ideas what's going wr

Filtering records based on empty value of column in SparkSql

2015-12-09 Thread Prashant Bhardwaj
Hi I have two columns in my json which can have null, empty and non-empty string as value. I know how to filter records which have non-null value using following: val req_logs = sqlContext.read.json(filePath) val req_logs_with_dpid = req_log.filter("req_info.dpid is not null or req_info.dpid_sha

Re: How to use collections inside foreach block

2015-12-09 Thread Ted Yu
To add onto what Rishi said, you can use foreachPartition() on result where you can save values to DB. Cheers On Wed, Dec 9, 2015 at 12:51 AM, Rishi Mishra wrote: > Your list is defined on the driver, whereas function specified in forEach > will be evaluated on each executor. > You might want t
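A rough sketch of that foreachPartition() pattern; DbConnection and openConnection are hypothetical stand-ins for the real database client, and result is assumed to be an RDD of strings:

    // Hypothetical stand-ins for a real DB client; replace with the actual driver calls.
    class DbConnection {
      def save(record: String): Unit = println(record)
      def close(): Unit = ()
    }
    def openConnection(): DbConnection = new DbConnection

    result.foreachPartition { partition =>
      val conn = openConnection()   // one connection per partition, created on the executor
      try {
        partition.foreach(conn.save)
      } finally {
        conn.close()
      }
    }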

Re: getting error while persisting in hive

2015-12-09 Thread Fengdong Yu
.write not .write() > On Dec 9, 2015, at 5:37 PM, Divya Gehlot wrote: > > Hi, > I am using spark 1.4.1 . > I am getting error when persisting spark dataframe output to hive > scala> > df.select("name","age").write().format("com.databricks.spark.csv").mode(SaveMode.Append).saveAsTable("Per
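Spelled out, the fix is just dropping the parentheses: write is a property returning a DataFrameWriter, not a method call. A sketch of the corrected line from the question below, with the same format and table names as in the original:

    import org.apache.spark.sql.SaveMode

    df.select("name", "age")
      .write   // no parentheses
      .format("com.databricks.spark.csv")
      .mode(SaveMode.Append)
      .saveAsTable("PersonHiveTable")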

getting error while persisting in hive

2015-12-09 Thread Divya Gehlot
Hi, I am using Spark 1.4.1. I am getting an error when persisting Spark dataframe output to Hive: > scala> > df.select("name","age").write().format("com.databricks.spark.csv").mode(SaveMode.Append).saveAsTable("PersonHiveTable"); > :39: error: org.apache.spark.sql.DataFrameWriter does not take > para

Re: Differences between Spark APIs for Hadoop 1.x and Hadoop 2.x in terms of performance, progress reporting and IO metrics.

2015-12-09 Thread Hyukjin Kwon
Thank you for your reply! I have already made the change locally, so changing it would be fine. I just wanted to be sure which way is correct. On 9 Dec 2015 18:20, "Fengdong Yu" wrote: > I don’t think there is performance difference between 1.x API and 2.x API. > > but it’s not a big issue

Re: Differences between Spark APIs for Hadoop 1.x and Hadoop 2.x in terms of performance, progress reporting and IO metrics.

2015-12-09 Thread Fengdong Yu
I don’t think there is a performance difference between the 1.x API and the 2.x API, but it’s not a big issue for your change, only com.databricks.hadoop.mapreduce.lib.input.XmlInputFormat.java

Differences between Spark APIs for Hadoop 1.x and Hadoop 2.x in terms of performance, progress reporting and IO metrics.

2015-12-09 Thread Hyukjin Kwon
Hi all, I am writing this email to both the user group and the dev group since this is applicable to both. I am now working on the Spark XML datasource ( https://github.com/databricks/spark-xml). This uses an InputFormat implementation which I downgraded to Hadoop 1.x for version compatibility. However, I fo

Re: How to use collections inside foreach block

2015-12-09 Thread Rishi Mishra
Your list is defined on the driver, whereas the function specified in forEach will be evaluated on each executor. You might want to add an accumulator or handle a sequence of lists from each partition. On Wed, Dec 9, 2015 at 11:19 AM, Madabhattula Rajesh Kumar < mrajaf...@gmail.com> wrote: > Hi, > > I

Re: spark-defaults.conf optimal configuration

2015-12-09 Thread cjrumble
Hello Neelesh, Thank you for the checklist for determining the correct configuration of Spark. I will go through these and let you know if I have further questions. Regards, Chris -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-defaults-conf-optim

Re: About Spark On Hbase

2015-12-09 Thread censj
Thank you! I know. > On 9 Dec 2015, at 15:59, fightf...@163.com wrote: > > If you are using maven , you can add the cloudera maven repo to the > repository in pom.xml > and add the dependency of spark-hbase. > I just found this : > http://spark-packages.org/package/nerdammer/spark-hbase-connector >

Re: Re: About Spark On Hbase

2015-12-09 Thread fightf...@163.com
If you are using Maven, you can add the Cloudera Maven repo to the repositories in pom.xml and add the dependency for spark-hbase. I just found this: http://spark-packages.org/package/nerdammer/spark-hbase-connector As Feng Dongyu recommended, you can try this also, but I had no experience of us