Re: streaming spark is writing results to S3 a good idea?

2016-02-23 Thread Sabarish Sasidharan
Writing to S3 goes over the network, so it will obviously be slower than local disk. That said, within AWS the network is pretty fast. Still, you might want to write to S3 only after a certain data threshold is reached, so that it's efficient. You might also want to use the DirectOutputCommitter as it…
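As a hedged sketch of that setup (the bucket name and committer class are placeholders; a DirectOutputCommitter implementation must come from your own classpath, it is not part of stock Spark):

    // assuming an existing SparkContext `sc` and a DStream[String] `dstream`
    dstream.foreachRDD { (rdd, time) =>
      // one S3 directory per batch; "my-bucket" is a placeholder
      rdd.saveAsTextFile(s"s3n://my-bucket/results/batch-${time.milliseconds}")
    }
    // a direct committer skips the slow copy-on-rename commit against S3
    // (hypothetical class name; supply your own implementation)
    sc.hadoopConfiguration.set("mapred.output.committer.class",
      "com.example.DirectOutputCommitter")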

Re: streaming spark is writing results to S3 a good idea?

2016-02-23 Thread Sabarish Sasidharan
And yes, storage grows on demand. No issues with that. Regards Sab On 24-Feb-2016 6:57 am, "Andy Davidson" wrote: > Currently our stream apps write results to hdfs. We are running into > problems with HDFS becoming corrupted and running out of space. It seems > like a better solution might be to

Re: Performing multiple aggregations over the same data

2016-02-23 Thread Michał Zieliński
Do you mean something like this? data.agg(sum("var1"), sum("var2"), sum("var3")) On 24 February 2016 at 01:49, Daniel Imberman wrote: > Hi guys, > > So I'm running into a speed issue where I have a dataset that needs to be > aggregated multiple times. > > Initially my team had set up three accumulators…
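Spelled out as a minimal self-contained sketch (the table name is hypothetical), the single-pass version would be:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.sum

    // assuming an existing SQLContext; one pass over the data computes all three sums
    val data = sqlContext.table("my_table") // hypothetical table name
    val Row(s1, s2, s3) = data.agg(sum("var1"), sum("var2"), sum("var3")).first()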

[Vote] : Spark-csv 1.3 + Spark 1.5.2 - Error parsing null values except String data type

2016-02-23 Thread Divya Gehlot
Hi, please vote if you have ever faced this issue. I am getting an error when parsing null values with spark-csv.
Data file:
name  age
alice 35
bob   null
peter 24
Code:
spark-shell --packages com.databricks:spark-csv_2.10:1.3.0 --master yarn-client -i /TestDivya/Spark/Testnull.scala
Testnull.scala…
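For reference, a sketch of reading such a file so that the literal string "null" becomes a SQL NULL; the nullValue option is an assumption about the spark-csv version in use (if it is unavailable, read age as a string and convert it yourself):

    // spark-csv read, treating "null" as NULL while inferring types
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .option("nullValue", "null") // assumed to be supported by this version
      .load("/TestDivya/Spark/data.csv") // hypothetical path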

Re: Using functional programming rather than SQL

2016-02-23 Thread Sabarish Sasidharan
When using SQL, your full query, including the joins, was executed in Hive (or the RDBMS) and only the results were brought into the Spark cluster. In the FP case, the data for the 3 tables is first pulled into the Spark cluster and then the join is executed. Hence the time difference. It's not immedia…

Re: About Tensor Factorization in Spark

2016-02-23 Thread Nick Pentreath
Not that I'm aware of - it would be a great addition as a Spark package! On Wed, 24 Feb 2016 at 06:33 Li Jiajia wrote: > Thanks Nick. I found this one. This library focuses on a particular > application I guess; it seems to have implemented only one tensor factorization > algorithm so far, and only for…

Re: how to interview spark developers

2016-02-23 Thread Xiao Li
This is interesting! I believe the interviewees should AT LEAST subscribe to this mailing list if they are Spark developers. Then they will know your questions before the interview. :) 2016-02-23 22:07 GMT-08:00 charles li : > hi there, we are going to recruit several spark developers, can some…

how to interview spark developers

2016-02-23 Thread charles li
Hi there, we are going to recruit several Spark developers. Can someone give some ideas on interviewing candidates, say, Spark-related problems? Great thanks. -- *--* a spark lover, a quant, a developer and a good man. http://github.com/litaotao

RE: Reindexing in graphx

2016-02-23 Thread Udbhav Agarwal
Thank you Robin for your reply. Actually I am adding a bunch of vertices to a graph in GraphX using the following method. I am facing a latency problem: the first time, adding say 400 vertices to a graph with 100,000 nodes takes around 7 seconds; the next time it takes 15 seconds. So every…

Re: Read from kafka after application is restarted

2016-02-23 Thread Chitturi Padma
Hi Vaibhav, as you said, from the second link I can figure out that it is not able to cast the class when it is trying to read from the checkpoint. Can you try explicit casting, like asInstanceOf[T], for the broadcasted value? From the bug, it looks like it affects version 1.5. Try the sample wordcoun…

Re: About Tensor Factorization in Spark

2016-02-23 Thread Li Jiajia
Thanks Nick. I found this one. This library focuses on a particular application I guess; it seems to have implemented only one tensor factorization algorithm so far, and only for three-dimensional tensors. Are there Spark-powered libraries supporting general tensors and the algorithms? Best regards! J…

Re: About Tensor Factorization in Spark

2016-02-23 Thread Nick Pentreath
There is this library that I've come across - https://github.com/FurongHuang/SpectralLDA-TensorSpark On Wed, 24 Feb 2016 at 05:50, Li Jiajia wrote: > Hi, > I wonder if there are tensor algorithms or tensor data structures > supported by Spark MLlib or GraphX. In a Spark intro slide, tensor > fac

About Tensor Factorization in Spark

2016-02-23 Thread Li Jiajia
Hi, I wonder if there are tensor algorithms or tensor data structures supported by Spark MLlib or GraphX. In a Spark intro slide, tensor factorization is mentioned as one of the algorithms in GraphX, but I didn't find it in the guide. If not, do you plan to implement them in the future?

Re: Using functional programming rather than SQL

2016-02-23 Thread Koert Kuipers
Instead of:
var s = HiveContext.sql("SELECT AMOUNT_SOLD, TIME_ID, CHANNEL_ID FROM sales")
you should be able to do something like:
val s = HiveContext.table("sales").select("AMOUNT_SOLD", "TIME_ID", "CHANNEL_ID")
It's not obvious to me why the dataframe (aka FP) version would be significantly slow…

Re: which master option to view current running job in Spark UI

2016-02-23 Thread Jeff Zhang
Viewing a running job in the Spark UI doesn't depend on which master you use. What do you mean by "I can't see the currently running jobs in the Spark web UI"? Do you see a blank Spark UI, or can you not open the Spark UI at all? On Mon, Feb 15, 2016 at 12:55 PM, Sabarish Sasidharan < sabarish.sasidha...@manthan.com> wrote: >

Re: Streaming mapWithState API has NullPointerException

2016-02-23 Thread Tathagata Das
Yes, you should be okay to test your code. :) On Mon, Feb 22, 2016 at 5:57 PM, Aris wrote: > If I build from git branch origin/branch-1.6 will I be OK to test out my > code? > > Thank you so much TD! > > Aris > > On Mon, Feb 22, 2016 at 2:48 PM, Tathagata Das < > tathagata.das1...@gmail.com> wrote:…

Re: How to join multiple tables and use subqueries in Spark SQL using sqlContext?

2016-02-23 Thread swetha kasireddy
It seems to be failing when I do something like the following in both sqlContext and hiveContext:
sqlContext.sql("SELECT ssd.savedDate from saveSessionDatesRecs ssd where ssd.partitioner in (SELECT sr1.partitioner from sparkSessionRecords1 sr1))")
On Tue, Feb 23, 2016 at 5:57 PM, swetha kasireddy…
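One workaround, since uncorrelated IN subqueries were (as far as I know) not supported by Spark SQL before 2.0, is to express the predicate as a LEFT SEMI JOIN, which the HiveContext parser does accept; a sketch using the thread's table names:

    sqlContext.sql(
      """SELECT ssd.savedDate
        |FROM saveSessionDatesRecs ssd
        |LEFT SEMI JOIN sparkSessionRecords1 sr1
        |  ON ssd.partitioner = sr1.partitioner""".stripMargin)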

Re: metrics not reported by spark-cassandra-connector

2016-02-23 Thread Sa Xiao
Hi Yin, thanks for your reply. I didn't realize there is a specific mailing list for spark-cassandra-connector. I will ask there. Thanks! -Sa On Tuesday, February 23, 2016, Yin Yang wrote: > Hi, Sa: > Have you asked on the spark-cassandra-connector mailing list ? > > Seems you would get better res…

Re: metrics not reported by spark-cassandra-connector

2016-02-23 Thread Yin Yang
Hi, Sa: Have you asked on the spark-cassandra-connector mailing list? Seems you would get a better response there. Cheers

Re: How to join multiple tables and use subqueries in Spark SQL using sqlContext?

2016-02-23 Thread swetha kasireddy
These tables are stored in HDFS as Parquet. Can sqlContext be used for the subqueries? On Tue, Feb 23, 2016 at 5:31 PM, Mich Talebzadeh < mich.talebza...@cloudtechnologypartners.co.uk> wrote: > Assuming these are all in Hive, you can either use spark-sql or > spark-shell. > > HiveContext has r…

streaming spark is writing results to S3 a good idea?

2016-02-23 Thread Andy Davidson
Currently our stream apps write results to HDFS. We are running into problems with HDFS becoming corrupted and running out of space. It seems like a better solution might be to write directly to S3. Is this a good idea? We plan to continue to write our checkpoints to HDFS. Are there any issues to…

metrics not reported by spark-cassandra-connector

2016-02-23 Thread Sa Xiao
Hi there, I am trying to enable the metrics collection by spark-cassandra-connector, following the instructions here: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/11_metrics.md However, I was not able to see any metrics reported. I'm using spark-cassandra-connector_2.10:1.…

spark-1.6.0-bin-hadoop2.6/ec2/spark-ec2 uses old version of hadoop

2016-02-23 Thread Andy Davidson
I do not have any Hadoop legacy code. My goal is to run Spark on top of HDFS. Recently I have been having HDFS corruption problems. I was also never able to access S3, even though I used --copy-aws-credentials. I noticed that by default the spark-ec2 script uses Hadoop 1.0.4. I ran help and discovered…

How to join multiple tables and use subqueries in Spark SQL using sqlContext?

2016-02-23 Thread SRK
Hi, How do I join multiple tables and use subqueries in Spark SQL using sqlContext? Can I do this using sqlContext or do I have to use HiveContext for the same? Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-join-multiple-tables-and-use-s

Performing multiple aggregations over the same data

2016-02-23 Thread Daniel Imberman
Hi guys, so I'm running into a speed issue where I have a dataset that needs to be aggregated multiple times. Initially my team had set up three accumulators and were running a single foreach loop over the data. Something along the lines of:
val accum1: Accumulable[a]
val accum2: Accumulable[b]
va…

Re: Using functional programming rather than SQL

2016-02-23 Thread Mich Talebzadeh
Hi, first, thanks everyone for the suggestions. Much appreciated. These were the original queries, written in SQL and run against spark-shell:
val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
println("\nStarted at")
HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/M…

Re: Error decompressing .gz source data files

2016-02-23 Thread rheras
Hi again, today I've tried using bzip2 files instead of gzip, but the problem is the same; I really don't understand where the problem is :(
Logs through the master web UI:
16/02/23 23:48:01 INFO compress.CodecPool: Got brand-new decompressor [.bz2]
16/02/23 23:48:01 INFO compress.CodecPool: Got…

Re: [Spark 1.5+] ReceiverTracker seems not to stop Kinesis receivers

2016-02-23 Thread Roberto Coluccio
Any chance anyone took a look at this? Thanks! On Wed, Feb 10, 2016 at 10:46 AM, Roberto Coluccio < roberto.coluc...@gmail.com> wrote: > Thanks Shixiong! > > I'm attaching the thread dumps (I printed the Spark UI after expanding all > the elements, hope that's fine) and related stderr (INFO leve…

Spark 1.5.2, DataFrame broadcast join, OOM

2016-02-23 Thread Yong Zhang
Hi, I am testing Spark 1.5.2 on a 4-node cluster with 64G of memory each; one is the master and 3 are workers. I am using standalone mode. Here is my spark-env.sh:
export SPARK_LOCAL_DIRS=/data1/spark/local,/data2/spark/local,/data3/spark/local
export SPARK_MASTER_WEBUI_PORT=8…

Re: Spark standalone peer2peer network

2016-02-23 Thread tdelacour
Thank you for your quick reply! We have been able to get the master running, log in to the web UI, and start a slave on the same machine as the master. We also see the slave appear in the master UI if the slave is running on the same computer. However, when we start a slave on a differen…

Association with remote system [akka.tcp://. . .] has failed

2016-02-23 Thread Jeff Henrikson
Hello spark users, (Apologies if a duplicate of this message just came through.) I am testing the behavior of remote job submission with ec2/spark_ec2.py in Spark distribution 1.5.2. I submit SparkPi to a remote EC2 instance with spark-submit, using the "standalone mode" (spark://) protocol. C…

Re: value from groubBy paired rdd

2016-02-23 Thread Jorge Machado
Hi Mishra, I haven’t tested anything, but:
> grouped_val = grouped.map(lambda x: (list(x[1]))).collect()
What is x[1]?
data = sc.textFile('file:///home/cloudera/LDA-Model/Pyspark/test1.csv')
header = data.first() # extract header
data = data.filter(lambda x: x != header) # filter out header
…

Apache Arrow + Spark examples?

2016-02-23 Thread Robert Towne
I have been reading some of the news this week about Apache Arrow as a new top level project. It appears to be a common data layer between Spark and other systems (Cassandra, Drill, Impala, etc). Has anyone seen any sample Spark code that integrates with Arrow? Thanks

Re: Network Spark Streaming from multiple remote hosts

2016-02-23 Thread Kevin Mellott
Hi Vinti, That example is (in my opinion) more of a tutorial and not necessarily the way you'd want to set it up for a "real world" application. I'd recommend using something like Apache Kafka, which will allow the various hosts to publish messages to a queue. Your Spark Streaming application is t

Network Spark Streaming from multiple remote hosts

2016-02-23 Thread Vinti Maheshwari
Hi all, I wrote a program for Spark Streaming in Scala. In my program, I passed 'remote-host' and 'remote-port' under socketTextStream. On the remote machine, I have one Perl script that calls the system command: echo 'data_str' | nc <> In that way, my Spark program is able to get data, b…

Association with remote system [akka.tcp://. . .] has failed

2016-02-23 Thread Jeff Henrikson
Hello spark-users, I am testing the behavior of remote job submission with ec2/spark_ec2.py in Spark distribution 1.5.2. I submit SparkPi to a remote EC2 instance with spark-submit, using the "standalone mode" (spark://) protocol. Connecting to the master via ssh works, but submission fails.

Re: Spark standalone peer2peer network

2016-02-23 Thread Gourav Sengupta
Hi, setting up passwordless SSH access to your laptop may be a personal security risk. I suppose you could install Ubuntu in VirtualBox and set the networking option to Bridged so that there are no issues. For setting up passwordless SSH, see the following options (source: http://www.michael-noll.com…

Re: Spark standalone peer2peer network

2016-02-23 Thread Robineast
Hi Thomas, I can confirm that I have had this working in the past. I'm pretty sure you don't need passwordless SSH for running a standalone cluster manually. Try the instructions at http://spark.apache.org/docs/latest/spark-standalone.html for starting a cluster manually. Do you get the m…

Re: How to get progress information of an RDD operation

2016-02-23 Thread Ted Yu
I think Ningjun was looking for a programmatic way of tracking progress. I took a look at ./core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala, but there don't seem to be fine-grained events directly reflecting what Ningjun is looking for. On Tue, Feb 23, 2016 at 11:24 AM, Kevin Me…
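A coarse programmatic alternative is counting finished tasks with a listener; a minimal sketch (task granularity only, not per-record progress):

    import java.util.concurrent.atomic.AtomicLong
    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    class ProgressListener extends SparkListener {
      private val tasksDone = new AtomicLong(0)
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val done = tasksDone.incrementAndGet()
        if (done % 100 == 0) println(s"$done tasks finished") // coarse progress
      }
    }

    sc.addSparkListener(new ProgressListener) // assuming an existing SparkContext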

Spark standalone peer2peer network

2016-02-23 Thread tdelacour
Some teammates and I are trying to create a Spark cluster across ordinary MacBooks. We were wondering if there is any precedent or guide for doing this, as our internet searches have not been particularly conclusive. So far all attempts to use standalone mode have not worked. We suspect that this h…

Re: reasonable number of executors

2016-02-23 Thread Igor Berman
http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications - there is a section related to your question. On 23 February 2016 at 16:49, Alex Dzhagriev wrote: > Hello all, > > Can someone please advise me on the pros and cons on how to allocate the >

value from groubBy paired rdd

2016-02-23 Thread Mishra, Abhishek
Hello all, I am new to Spark and Python; here is my doubt, please suggest... I have a CSV file with 2 columns, "user_id" and "status". I have read it into an RDD and then removed the header of the CSV file. Then I split each record by "," (comma) and generate a pair RDD. On that RDD I groupByKe…

Re: How to get progress information of an RDD operation

2016-02-23 Thread Kevin Mellott
Have you considered using the Spark Web UI to view progress on your job? It does a very good job of showing the progress of the overall job, as well as allowing you to drill into individual tasks and server activity. On Tue, Feb 23, 2016 at 12:53 PM, Wang, Ningjun (LNG-NPV) < ningjun.w...@lexisnexi…

How to get progress information of an RDD operation

2016-02-23 Thread Wang, Ningjun (LNG-NPV)
How can I get progress information for an RDD operation? For example:
val lines = sc.textFile("c:/temp/input.txt") // an RDD of millions of lines
lines.foreach(line => { handleLine(line) })
The input.txt contains millions of lines. The entire operation takes 6 hours. I want to print out h…

Re: Serializing collections in Datasets

2016-02-23 Thread Daniel Siegmann
Yes, I will test once 1.6.1 RC1 is released. Thanks. On Mon, Feb 22, 2016 at 6:24 PM, Michael Armbrust wrote: > I think this will be fixed in 1.6.1. Can you test when we post the first > RC? (hopefully later today) > > On Mon, Feb 22, 2016 at 1:51 PM, Daniel Siegmann < > daniel.siegm...@teamaol

Count job stalling at shuffle stage on 3.4TB input (but only 5.3GB shuffle write)

2016-02-23 Thread James Hammerton
Hi, I have been having problems processing a 3.4TB data set (uncompressed tab-separated text) containing object creation/update events from our system, one event per line. I decided to see what happens with a count of the number of events (= number of lines in the text files) and a count of the…

Re: Calculation of histogram bins and frequency in Apache spark 1.6

2016-02-23 Thread Burak Yavuz
You could use the Bucketizer transformer in Spark ML. Best, Burak On Tue, Feb 23, 2016 at 9:13 AM, Arunkumar Pillai wrote: > Hi > Is there any predefined method to calculate histogram bins and frequency > in spark. Currently I take range and find bins then count frequency using > SQL query. > >
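Two sketches, for what it's worth: RDDs of doubles already have a built-in histogram method, and the Bucketizer route looks roughly like this (column names and splits are hypothetical):

    import org.apache.spark.ml.feature.Bucketizer

    // built-in: evenly spaced bins and their counts, in one pass
    val (bins, counts) = doubleRdd.histogram(10) // assuming an RDD[Double]

    // Spark ML alternative: bucket each row, then count rows per bucket
    val bucketizer = new Bucketizer()
      .setInputCol("value")   // hypothetical column
      .setOutputCol("bucket")
      .setSplits(Array(0.0, 10.0, 20.0, 30.0))
    val histogram = bucketizer.transform(df).groupBy("bucket").count()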

Re: Use maxmind geoip lib to process ip on Spark/Spark Streaming

2016-02-23 Thread romain sagean
I realize I forgot the sbt part:
resolvers += "SnowPlow Repo" at "http://maven.snplow.com/releases/"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.3.0",
  "com.snowplowanalytics" %% "scala-maxmind-iplookups" % "0.2.0"
)
Otherwise, to process streaming logs I use logsta…

Calculation of histogram bins and frequency in Apache spark 1.6

2016-02-23 Thread Arunkumar Pillai
Hi, is there any predefined method to calculate histogram bins and frequencies in Spark? Currently I take the range, find the bins, and then count frequencies using a SQL query. Is there any better way?

Fast way to parse JSON in Spark

2016-02-23 Thread Jerry
Hi, I had a Java parser using GSON and packaged it as a Java lib (e.g. messageparserLib.jar). I use this lib in Spark Streaming to parse incoming JSON messages. This is very slow, and there is a lot of time lag in parsing/inserting messages into Cassandra. What is the fastest way to parse JSON messages in S…
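One common pattern, sketched here with GSON since that is what the thread uses, is to amortize parser construction across a partition instead of building one per message:

    import com.google.gson.JsonParser

    // one parser per partition, reused for every message in it
    val parsed = messages.mapPartitions { msgs =>
      val parser = new JsonParser()
      msgs.map(m => parser.parse(m).getAsJsonObject)
    }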

Re: Spark Streaming - graceful shutdown when stream has no more data

2016-02-23 Thread Daniel Siegmann
During testing you will typically be using some finite data. You want the stream to shut down automatically when that data has been consumed, so your test shuts down gracefully. Of course, once the code is running in production you'll want it to keep waiting for new records. So whether the stream sh…
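Once the drain condition is detected, the graceful-stop call itself is a one-liner; a minimal sketch (how you detect "no more data", e.g. a counter of consecutive empty batches, is up to you):

    // let in-flight batches finish before tearing everything down
    ssc.stop(stopSparkContext = true, stopGracefully = true)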

Re: Spark Streaming - graceful shutdown when stream has no more data

2016-02-23 Thread Ashutosh Kumar
Just out of curiosity, I would like to know why a streaming program should shut down when no new data is arriving. I think it should keep waiting for the arrival of new records. Thanks Ashutosh On Tue, Feb 23, 2016 at 9:17 PM, Hemant Bhanawat wrote: > A guess - parseRecord is returning None in some c…

Re: pandas dataframe to spark csv

2016-02-23 Thread Devesh Raj Singh
Hi, I have already imported the data using the spark-csv package. I need to convert the pandas dataframe back to a Spark dataframe and write it to a location as CSV. On Tuesday, February 23, 2016, Gourav Sengupta wrote: > Hi, > > The solution is here: https://github.com/databricks/spark-csv > > Using the above solution yo…

Re: Spark Streaming - graceful shutdown when stream has no more data

2016-02-23 Thread Hemant Bhanawat
A guess - parseRecord is returning None in some cases (probably empty lines), and then entry.get is throwing the exception. You may want to filter the None values from accessLogDStream before you run the map function over it. Hemant Bhanawat
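A minimal sketch of that filtering, assuming parseRecord returns an Option:

    // flatMap unwraps Some(...) and silently drops None, so entry.get
    // is never called on a failed parse
    val parsed = accessLogDStream.flatMap(line => parseRecord(line))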

Re: Percentile calculation in spark 1.6

2016-02-23 Thread Ted Yu
Please take a look at the following if you can utilize a Hive UDF: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUdfSuite.scala On Tue, Feb 23, 2016 at 6:28 AM, Chandeep Singh wrote: > This should help - > http://stackoverflow.com/questions/28805602/how-to-compute-percentiles-in-…
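For example, through a HiveContext the percentile_approx UDAF is available directly in SQL (the table and column names are hypothetical):

    val p95 = hiveContext
      .sql("SELECT percentile_approx(latency, 0.95) FROM events")
      .first().getDouble(0)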

Re: reasonable number of executors

2016-02-23 Thread Jorge Machado
Hi Alex, take a look here : https://blogs.aws.amazon.com/bigdata/post/Tx3RD6EISZGHQ1C/The-Impact-of-Using-Latest-Generation-Instances-for-Your-Amazon-EMR-Job Bas

reasonable number of executors

2016-02-23 Thread Alex Dzhagriev
Hello all, can someone please advise me on the pros and cons of how to allocate resources: many small-heap machines with 1 core, or a few machines with big heaps and many cores? I'm sure it depends on the data flow and there is no best-practice solution. E.g. with a bigger heap I can perform map-…

Re: Percentile calculation in spark 1.6

2016-02-23 Thread Chandeep Singh
This should help - http://stackoverflow.com/questions/28805602/how-to-compute-percentiles-in-apache-spark > On Feb 23, 2016, at 10:08 AM, Arunkumar Pillai > wrote: > > How to calculate percentile in spark

Re: Use maxmind geoip lib to process ip on Spark/Spark Streaming

2016-02-23 Thread Romain Sagean
Hi, I use MaxMind GeoIP with Spark (no streaming). To make it work you should use mapPartitions. I don't know if something similar exists for Spark Streaming. My code for reference:
def parseIP(ip: String, ipLookups: IpLookups): List[String] = {
  val lookupResult = ipLookups.performLookups(ip)
…
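The essential shape, sketched from memory of the scala-maxmind-iplookups 0.2.0 API (the IpLookups constructor arguments are an assumption), is to build one lookup object per partition rather than per record:

    import java.io.File
    import com.snowplowanalytics.maxmind.iplookups.IpLookups

    val enriched = ipRdd.mapPartitions { ips =>
      // non-serializable lookup object created once per partition
      val ipLookups = IpLookups(geoFile = Some(new File("/path/GeoLiteCity.dat")))
      ips.map(ip => (ip, ipLookups.performLookups(ip)))
    }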

Re: Option[Long] parameter in case class parsed from JSON DataFrame failing when key not present in JSON

2016-02-23 Thread Anthony Brew
Thank you Jakob, you were bang on the money. Jorge, apologies, my snippet was partial and I hadn't made it equivalent to my failing test. For reference, and for all who pass this way, here is a working solution with passing tests, without inferring a schema; it was the second test that had be…

Re: Read from kafka after application is restarted

2016-02-23 Thread Ted Yu
For the receiver approach, have you tried Ryan's workaround? By the way, I don't see the errors you faced because there was no attachment. > On Feb 23, 2016, at 3:39 AM, vaibhavrtk1 wrote: > > Hello > > I have tried with Direct API but i am getting this an error, which is being > tracked here https://…

Re: Spark Job Hanging on Join

2016-02-23 Thread Dave Moyers
Congrats! Sent from my iPad > On Feb 23, 2016, at 2:43 AM, Mohannad Ali wrote: > > Hello Everyone, > > Thanks a lot for the help. We also managed to solve it but without resorting > to spark 1.6. > > The problem we were having was because of a really bad join condition: > > ON ((a.col1 = b.

Use maxmind geoip lib to process ip on Spark/Spark Streaming

2016-02-23 Thread Zhun Shen
Hi all, currently I send nginx logs to Kafka, and then I want to use Spark Streaming to parse the logs and enrich the IP info with the GeoIP libs from MaxMind. I found this one: https://github.com/Sanoma-CDA/maxmind-geoip2-scala.git , but Spark stre…

Query Kafka Partitions from Spark SQL

2016-02-23 Thread Abhishek Anand
Is there a way to query the JSON (or any other format) data stored in Kafka using Spark SQL, by providing the offset range on each of the brokers? I just want to be able to query all the partitions in a SQL manner. Thanks! Abhi
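Something along these lines might work with the direct API's createRDD, as a hedged sketch (broker, topic, and offsets are placeholders):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // placeholder
    val ranges = Array(OffsetRange("events", 0, 0L, 1000L))         // placeholder
    // one RDD covering exactly the requested offsets per partition
    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, ranges)
    // hand the JSON payloads to Spark SQL
    sqlContext.read.json(rdd.map(_._2)).registerTempTable("kafka_events")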

Re: pandas dataframe to spark csv

2016-02-23 Thread Gourav Sengupta
Hi, the solution is here: https://github.com/databricks/spark-csv Using the above solution you can read CSV directly into a dataframe as well. Regards, Gourav On Tue, Feb 23, 2016 at 12:03 PM, Devesh Raj Singh wrote: > Hi, > > I have imported spark csv dataframe in python and read the spark…
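The write side works the same way; a sketch in Scala (the identical format string is usable from PySpark, where a pandas dataframe can first be turned back into a Spark one with sqlContext.createDataFrame):

    // spark-csv writer; the output path is a placeholder
    df.write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .save("/output/path")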

Re: Spark Streaming - graceful shutdown when stream has no more data

2016-02-23 Thread Ted Yu
Which line is line 42 in your code? When the variable lines becomes empty, you can stop your program. Cheers > On Feb 23, 2016, at 12:25 AM, Femi Anthony wrote: > > I am working on Spark Streaming API and I wish to stream a set of > pre-downloaded web log files continuously to simulate a real-t…

Reindexing in graphx

2016-02-23 Thread Udbhav Agarwal
Hi, I am trying to add vertices to a graph in GraphX and I want to do reindexing in the graph. I can see there is an option of vertices.reindex() in GraphX. But when I do graph.vertices.reindex() I am getting java.lang.IllegalArgumentException: requirement failed. Please help me know what I a…

pandas dataframe to spark csv

2016-02-23 Thread Devesh Raj Singh
Hi, I have imported a Spark CSV dataframe in Python, read the Spark data, then converted the dataframe to a pandas dataframe using toPandas(). I want to convert the pandas dataframe back to Spark CSV and write the CSV to a location. Please suggest. -- Warm regards, Devesh.

Re: Read from kafka after application is restarted

2016-02-23 Thread vaibhavrtk1
Hello, I have tried the Direct API but I am getting an error, which is being tracked here: https://issues.apache.org/jira/browse/SPARK-5594 I also tried the Receiver approach with Write Ahead Logs; then this issue comes up: https://issues.apache.org/jira/browse/SPARK-12407 In both cases it see…

Re: Accessing Web UI

2016-02-23 Thread Vasanth Bhat
Hi Gourav, the Spark version is spark-1.6.0-bin-hadoop2.6. The Java version is JDK 8. I have also tried JDK 7, but the results are the same. Thanks Vasanth On Tue, Feb 23, 2016 at 2:57 PM, Gourav Sengupta wrote: > Hi, > > This should really work out of the box, I have tr…

Re: spark 1.6 Not able to start spark

2016-02-23 Thread Steve Loughran
On 23 Feb 2016, at 08:22, Arunkumar Pillai <arunkumar1...@gmail.com> wrote:
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.reflect.InvocationTargetException
at…

Percentile calculation in spark 1.6

2016-02-23 Thread Arunkumar Pillai
How do I calculate a percentile in Spark 1.6? -- Thanks and Regards Arun

PySpark Pickle reading does not find module

2016-02-23 Thread Fabian Böhnlein
Hi all, how can I make a module/class visible to sc.pickleFile? It seems to be missing from the env after an import in the driver PySpark context. The module is available when writing, but reading in a different SparkContext than the one that wrote it fails. The imports are the same in both. Any ideas h…

Re: [Please Help] Log redirection on EMR

2016-02-23 Thread HARSH TAKKAR
Hi Sabarish, thanks for your help, I was able to get the logs from the archive. Is there a way I can adjust the archival policy? Say I want to persist the logs of the last 2 jobs on the resource manager and archive the others on the file system. On Mon, Feb 22, 2016 at 12:25 PM Sabarish Sasidharan < sabarish.sasidha.…

[Proposal] Enabling time series analysis on spark metrics

2016-02-23 Thread Karan Kumar
Hi, Spark at the moment uses the application ID to report metrics. I was thinking that we could create an option to export metrics under a user-controlled key. This would allow us to do time-series analysis on counters by dumping them into a DB such as Graphite. One of the approaches I had in min…
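For context, the existing metrics system can already ship to Graphite via conf/metrics.properties, just keyed by application ID; a sketch of that config (host and port are placeholders):

    *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
    *.sink.graphite.host=graphite.example.com
    *.sink.graphite.port=2003
    *.sink.graphite.period=10
    *.sink.graphite.unit=seconds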

Re: Accessing Web UI

2016-02-23 Thread Gourav Sengupta
Hi, this should really work out of the box; I have tried Spark installations in cluster and standalone mode on Mac, Debian, and Ubuntu boxes without any issues. Can you please let me know which version of Spark you are using? Regards, Gourav On Tue, Feb 23, 2016 at 9:02 AM, Vasanth Bhat wrote:

Re: Spark SQL is not returning records for hive bucketed tables on HDP

2016-02-23 Thread @Sanjiv Singh
Yes, it is very strange and also quite contrary to my understanding of Spark SQL on Hive tables. I am facing this issue on an HDP setup on which COMPACTION is required only once. On the other hand, the Apache setup doesn't require compaction even once. Maybe something got triggered in the metastore after compact…

Re: Spark Job Hanging on Join

2016-02-23 Thread Alonso Isidoro Roman
Thanks for sharing the know-how, guys. Alonso Isidoro Roman. Mis citas preferidas (de hoy): "Si depurar es el proceso de quitar los errores de software, entonces programar debe ser el proceso de introducirlos..." - Edsger Dijkstra My favorite quotes (today): "If debugging is the process of remo…

Re: Accessing Web UI

2016-02-23 Thread Vasanth Bhat
Hi, is there a way to provide minThreads and maxThreads for the thread pool through jetty.xml for the Jetty instance used by the Spark Web UI? I am hitting an issue very similar to the one described in http://lifelongprogrammer.blogspot.com/2014/10/jetty-insufficient-threads-configured.…

Dataset sorting

2016-02-23 Thread Oliver Beattie
Hi, unless I'm missing something, there appears to be no way to sort a Dataset without first converting it to a DataFrame. Is this something that is planned? Thanks
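A workaround in 1.6 appears to be round-tripping through the DataFrame API; a sketch (the case class and column are hypothetical):

    case class Person(name: String, age: Int)
    import sqlContext.implicits._

    // sort as a DataFrame, then recover the typed Dataset
    val sorted = ds.toDF().orderBy("age").as[Person]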

Re: Spark SQL is not returning records for hive bucketed tables on HDP

2016-02-23 Thread Varadharajan Mukundan
That's interesting. I'm not sure why the first compaction is needed but not on the subsequent inserts. Maybe it's just to create some metadata. Thanks for clarifying this :) On Tue, Feb 23, 2016 at 2:15 PM, @Sanjiv Singh wrote: > Try this, > > > hive> create table default.foo(id int) clustered by (id…

Re: Spark SQL is not returning records for hive bucketed tables on HDP

2016-02-23 Thread @Sanjiv Singh
Try this,
hive> create table default.foo(id int) clustered by (id) into 2 buckets STORED AS ORC TBLPROPERTIES ('transactional'='true');
hive> insert into default.foo values(10);
scala> sqlContext.table("default.foo").count // Gives 0, which is wrong because the data is still in delta files
Now run…
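Presumably the missing step on such a setup is triggering the compaction explicitly, along these lines:

hive> ALTER TABLE default.foo COMPACT 'major';
(wait for SHOW COMPACTIONS to report it finished)
scala> sqlContext.table("default.foo").count // should now return 1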

Re: Spark Job Hanging on Join

2016-02-23 Thread Mohannad Ali
Hello everyone, thanks a lot for the help. We also managed to solve it, without resorting to Spark 1.6. The problem we were having was because of a really bad join condition:
ON ((a.col1 = b.col1) or (a.col1 is null and b.col1 is null))
AND ((a.col2 = b.col2) or (a.col2 is null and b.col2 is…
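As an aside, Spark SQL's null-safe equality operator expresses the same intent far more compactly; a sketch:

    // <=> treats NULL <=> NULL as true, replacing the OR / IS NULL pattern
    sqlContext.sql(
      """SELECT *
        |FROM a JOIN b
        |  ON a.col1 <=> b.col1 AND a.col2 <=> b.col2""".stripMargin)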

Re: Kafka streaming receiver approach - new topic not read from beginning

2016-02-23 Thread Paul Leclercq
I successfully processed my data by resetting my topic offsets manually in ZK. If it may help someone, here are my steps. Make sure you stop all your consumers before doing this, otherwise they overwrite the new offsets you wrote:
set /consumers/{yourConsumerGroup}/offsets/{yourFancyTopic}/{partit…
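For anyone replaying this, the offset edits are done in the ZooKeeper CLI, roughly as follows (group, topic, partition, and offset are placeholders); get inspects the current offset for partition 0, and set rewinds it to 0:

    get /consumers/mygroup/offsets/mytopic/0
    set /consumers/mygroup/offsets/mytopic/0 0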

Re: Spark SQL is not returning records for hive bucketed tables on HDP

2016-02-23 Thread Varadharajan Mukundan
This is the scenario I'm mentioning. I'm not using Spark JDBC; not sure if that's different. Please walk through the commands below in the same order to understand the sequence. hive> create table default.foo(id int) clustered by (id) into 2 buckets STORED AS ORC TBLPROPERTIES ('transactional'='true…

Spark Streaming - graceful shutdown when stream has no more data

2016-02-23 Thread Femi Anthony
I am working with the Spark Streaming API and I wish to stream a set of pre-downloaded web log files continuously to simulate a real-time stream. I wrote a script that gunzips the compressed logs and pipes the output to nc on a port. The script looks like this:
BASEDIR=/home/mysuer/data/datamining/i…

Re: spark 1.6 Not able to start spark

2016-02-23 Thread Arunkumar Pillai
I'm using hadoop 2.7
Exception in thread "main" java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:131)
at org.apache.hadoop.security.Groups.<init>(Groups.java:70)
at org.apache.hadoop…

Re: Read from kafka after application is restarted

2016-02-23 Thread Gideon
Regarding the Spark Streaming receiver - can't you just use Kafka direct receivers with checkpoints? So when you restart your application it will read from where it last stopped and continue from there. Regarding limiting the number of messages - you can do that by setting spark.streaming.receiver.maxRate…
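A minimal sketch of that combination (broker list, topic, and checkpoint directory are placeholders):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    def createContext(): StreamingContext = {
      val ssc = new StreamingContext(sc, Seconds(10)) // assuming an existing SparkContext
      ssc.checkpoint("hdfs:///checkpoints/myapp")      // placeholder path
      val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, Map("metadata.broker.list" -> "broker1:9092"), Set("mytopic"))
      stream.map(_._2).print()
      ssc
    }

    // on restart, offsets are recovered from the checkpoint
    val ssc = StreamingContext.getOrCreate("hdfs:///checkpoints/myapp", createContext _)
    ssc.start()
    ssc.awaitTermination()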