Re: Possible issue for Spark SQL/DataFrame

2015-08-10 Thread Akhil Das
Isn't it space-separated data? It is not comma (,) separated nor pipe (|) separated data. Thanks Best Regards On Mon, Aug 10, 2015 at 12:06 PM, Netwaver wrote: > Hi Spark experts, > I am now using Spark 1.4.1 and trying the Spark SQL/DataFrame > API with a text file in below format

Re: Spark on YARN

2015-08-10 Thread Jem Tucker
Hi, I have looked at the UI scheduler tab and it appears my new user was allocated fewer cores than my other user; is there any way I can avoid this happening? Thanks, Jem On Sat, Aug 8, 2015 at 8:32 PM Shushant Arora wrote: > which is the scheduler on your cluster. Just check on RM UI schedul

Re: multiple dependency jars using pyspark

2015-08-10 Thread ayan guha
The easiest way should be to add both jars to SPARK_CLASSPATH as a colon-separated string. On 10 Aug 2015 06:20, "Jonathan Haddad" wrote: > I'm trying to write a simple job for PySpark 1.4 migrating data from MySQL > to Cassandra. I can work with either the MySQL JDBC jar or the Cassandra > jar sepa

Spark Streaming Restart at scheduled intervals

2015-08-10 Thread Pankaj Narang
Hi All, I am creating a Spark Twitter streaming connection in my app over a long period of time. When I have some new keywords I need to add them to the Spark streaming connection. I need to stop and start the current Twitter streaming connection in this case. I have tried Akka actor scheduling but c

Kinesis records are merged with out obvious way of separating them

2015-08-10 Thread raam
I am using Spark 1.4.1 with connector 1.4.0. When I post events slowly and they are being picked up one by one, everything runs smoothly, but when the stream starts delivering batched records there is no obvious way to separate them. Am I missing something? How do I separate the records when they are ju

question about spark streaming

2015-08-10 Thread sequoiadb
Hi guys, I have a question about Spark Streaming. There's an application that keeps sending transaction records into the Spark stream at about 50k tps. Each record represents sales information including customer id / product id / time / price columns. The application is required to monitor the change of

Differences in loading data

2015-08-10 Thread 李铖
What are the differences between loading data using JDBC and loading data using the Spark data source API? Or the differences between loading data using mongo-hadoop and loading data using the native Java driver? Which way is better?

Spark with GCS Connector - Rate limit error

2015-08-10 Thread Oren Shpigel
Hi, I'm using Spark on a Google Compute Engine cluster with the Google Cloud Storage connector (instead of HDFS, as recommended here), and get a lot of "rate limit" errors, as added below. The errors relate to temp files

How to connect to spark remotely from java

2015-08-10 Thread Zsombor Egyed
Hi! I want to know how I can connect to Hortonworks Spark from another machine. So there is an HDP 2.2 and I want to connect to this remotely via the Java API. Do you have any suggestion? Thanks! Regards, -- Egyed Zsombor, Junior Big Data Engineer Mobile: +36 70 320 65 81 | Twitter:@s

EC2 cluster doesn't work saveAsTextFile

2015-08-10 Thread Yasemin Kaya
Hi, I have an EC2 cluster and am using Spark 1.3, YARN and HDFS. When I submit locally there is no problem, but when I run on the cluster, saveAsTextFile doesn't work. It says: "User class threw exception: Output directory hdfs://172.31.42.10:54310/./weblogReadResult

Re: How to connect to spark remotely from java

2015-08-10 Thread Simon Elliston Ball
You don't connect to spark exactly. The spark client (running on your remote machine) submits jobs to the YARN cluster running on HDP. What you probably need is yarn-cluster or yarn-client with the yarn client configs as downloaded from the Ambari actions menu. Simon > On 10 Aug 2015, at 12:44

How to programmatically create, submit and report on Spark jobs?

2015-08-10 Thread mark
Hi All I need to be able to create, submit and report on Spark jobs programmatically in response to events arriving on a Kafka bus. I also need end-users to be able to create data queries that launch Spark jobs 'behind the scenes'. I would expect to use the same API for both, and be able to provi

Re: How to connect to spark remotely from java

2015-08-10 Thread Zsombor Egyed
Thank you for your response! If I understand correctly, I should have a YARN cluster on the server/HDP. I should start the YARN services (NodeManager, ResourceManager, etc.), and I also need to install Spark on my machine, write Java code, make a jar file, and submit it to the server? Am I right? On Mon, Au

Re: EC2 cluster doesn't work saveAsTextFile

2015-08-10 Thread Dean Wampler
Following Hadoop conventions, Spark won't overwrite an existing directory. You need to provide a unique output path every time you run the program, or delete or rename the target directory before you run the job. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition
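
A minimal sketch of the "delete the target first" option from the driver, assuming sc and the rdd being saved are in scope and the output path below is purely illustrative:

    import org.apache.hadoop.fs.Path

    // Remove the previous output (recursive delete) before calling saveAsTextFile.
    val outputPath = new Path("hdfs:///user/me/weblogReadResult")   // illustrative path
    val fs = outputPath.getFileSystem(sc.hadoopConfiguration)
    if (fs.exists(outputPath)) fs.delete(outputPath, true)
    rdd.saveAsTextFile(outputPath.toString)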

Re: question about spark streaming

2015-08-10 Thread Dean Wampler
Have a look at the various versions of PairDStreamFunctions.updateStateByWindow ( http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions). It supports updating running state in memory. (You can persist the state to a database/files periodica
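
A minimal sketch of keeping running state per key with updateStateByKey (the state-update API on PairDStreamFunctions), assuming a hypothetical DStream sales of (productId, price) pairs and a checkpoint directory already set on the StreamingContext:

    // Running average price per product; state is carried across batches.
    // Stateful operations require ssc.checkpoint("...") to be configured.
    case class Avg(sum: Double, count: Long)

    val avgByProduct = sales.updateStateByKey[Avg] { (newPrices: Seq[Double], state: Option[Avg]) =>
      val prev = state.getOrElse(Avg(0.0, 0L))
      Some(Avg(prev.sum + newPrices.sum, prev.count + newPrices.size))
    }
    avgByProduct.print()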

Spark Cassandra Connector issue

2015-08-10 Thread satish chandra j
HI All, Please help me to fix Spark Cassandra Connector issue, find the details below *Command:* dse spark-submit --master spark://10.246.43.15:7077 --class HelloWorld --jars ///home/missingmerch/postgresql-9.4-1201.jdbc41.jar ///home/missingmerch/etl-0.0.1-SNAPSHOT.jar *Error:* WARN 2015-08

Re: Spark Cassandra Connector issue

2015-08-10 Thread Dean Wampler
Add the other Cassandra dependencies (dse.jar, spark-cassandra-connector-java_2.10) to your --jars argument on the command line. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Typesafe @deanwampler

spark-kafka directAPI vs receivers based API

2015-08-10 Thread Mohit Durgapal
Hi All, I just wanted to know how the direct API for Spark Streaming compares with the earlier receiver-based API. Has anyone used the direct-API-based approach in production, or is it still only being used for POCs? Also, since I'm new to Spark, could anyone share a starting point from where I could find a wor

Re: Spark Streaming Restart at scheduled intervals

2015-08-10 Thread Dean Wampler
org.apache.spark.streaming.twitter.TwitterInputDStream is a small class. You could write your own that lets you change the filters at run time. Then provide a mechanism in your app, like periodic polling of a database table or file for the list of filters. Dean Wampler, Ph.D. Author: Programming S

Re: EC2 cluster doesn't work saveAsTextFile

2015-08-10 Thread Yasemin Kaya
Thanks Dean, I am giving a unique output path and every time I also delete the directory before I run the job. 2015-08-10 15:30 GMT+03:00 Dean Wampler : > Following Hadoop conventions, Spark won't overwrite an existing directory. > You need to provide a unique output path every time you run the p

Re: EC2 cluster doesn't work saveAsTextFile

2015-08-10 Thread Dean Wampler
So, just before running the job, what happens if you run the HDFS command at a shell prompt: "hdfs dfs -ls hdfs://172.31.42.10:54310/./weblogReadResult"? Does it say the path doesn't exist? Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reil

Re: spark-kafka directAPI vs receivers based API

2015-08-10 Thread Cody Koeninger
For direct stream questions: https://github.com/koeninger/kafka-exactly-once Yes, it is used in production. For general spark streaming question: http://spark.apache.org/docs/latest/streaming-programming-guide.html On Mon, Aug 10, 2015 at 7:51 AM, Mohit Durgapal wrote: > Hi All, > > I just
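
A minimal sketch of the direct approach from the linked docs, assuming Spark 1.3+ with the spark-streaming-kafka artifact on the classpath and a StreamingContext ssc in scope; the broker and topic names are illustrative:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // No receiver: each Kafka partition maps 1:1 to a partition of the generated RDDs.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")   // illustrative broker
    val topics = Set("events")                                        // illustrative topic
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)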

Re: Spark Cassandra Connector issue

2015-08-10 Thread satish chandra j
Hi, Thanks for quick input, now I am getting class not found error *Command:* dse spark-submit --master spark://10.246.43.15:7077 --class HelloWorld --jars ///home/missingmerch/postgresql-9.4-1201.jdbc41.jar ///home/missingmerch/dse.jar ///home/missingmerch/spark-cassandra-connector-java_2.10-1.1

Re: Spark Cassandra Connector issue

2015-08-10 Thread Dean Wampler
I don't know if DSE changed spark-submit, but you have to use a comma-separated list of jars to --jars. It probably looked for HelloWorld in the second one, the dse.jar file. Do this: dse spark-submit --master spark://10.246.43.15:7077 --class HelloWorld --jars /home/missingmerch/ postgresql-9.4-1

spark vs flink low memory available

2015-08-10 Thread Pa Rö
Hi community, I have built a Spark and a Flink k-means application. My test case is clustering 1 million points on a 3-node cluster. When memory becomes the bottleneck, Flink starts spilling to disk and works slowly, but it works. However, Spark loses executors if memory is full and starts them again (infinite loo

Re: Spark Maven Build

2015-08-10 Thread Benyi Wang
Never mind. Instead of setting the property in the cdh5.3.2 profile (hadoop.version = 2.5.0-cdh5.3.2) ... I have to change the property hadoop.version from 2.2.0 to 2.5.0-cdh5.3.2 in spark-parent's pom.xml. Otherwise, Maven will resolve transitive dependencies using the default version 2.2.0.

Spark Streaming dealing with broken files without dying

2015-08-10 Thread Mario Pastorelli
Hey Sparkers, I would like to use Spark Streaming in production to observe a directory and process files that are put inside it. The problem is that some of those files can be broken, leading to an IOException from the input reader. This should be fine for the framework, I think: the exception should

Re: Estimate size of Dataframe programatically

2015-08-10 Thread Srikanth
SizeEstimator.estimate(df) will not give the size of the dataframe, right? I think it will give the size of the df object. With an RDD, I sample() and collect() and sum the size of each row. If I do the same with a dataframe it will no longer be the size when represented in columnar format. I'd also like to know how spark.sq

Re: Estimate size of Dataframe programatically

2015-08-10 Thread Ted Yu
From a quick glance of SparkStrategies.scala, when statistics.sizeInBytes of the LogicalPlan is <= autoBroadcastJoinThreshold, the plan's output would be used in broadcast join as the 'build' relation. FYI On Mon, Aug 10, 2015 at 8:04 AM, Srikanth wrote: > SizeEstimator.estimate(df) will not
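
A small sketch of steering that behaviour from code, assuming a SQLContext in scope; the threshold is in bytes and -1 disables automatic broadcast joins:

    // Broadcast the smaller side of a join only when its estimated size is under ~100 MB.
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (100L * 1024 * 1024).toString)

    // Or switch automatic broadcast joins off entirely while investigating size estimates.
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")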

Re: Questions about SparkSQL join on not equality conditions

2015-08-10 Thread gen tang
Hi, I am sorry to bother again. When I do a join as follows: df = sqlContext.sql("select a.someItem, b.someItem from a full outer join b on condition1 *or* condition2") df.first() The program failed because the result size is bigger than spark.driver.maxResultSize. It is really strange, as one record is n

ClosureCleaner does not work for java code

2015-08-10 Thread Hao Ren
Consider two code snippets as the following: // Java code: abstract class Ops implements Serializable{ public abstract Integer apply(Integer x); public void doSomething(JavaRDD rdd) { rdd.map(x -> x + apply(x)) .collect() .forEach(System.out::println); } } public class

Re: How to programmatically create, submit and report on Spark jobs?

2015-08-10 Thread Ted Yu
I found SPARK-3733 which was marked dup of SPARK-4924 which went to 1.4.0 FYI On Mon, Aug 10, 2015 at 5:12 AM, mark wrote: > Hi All > > I need to be able to create, submit and report on Spark jobs > programmatically in response to events arriving on a Kafka bus. I also need > end-users to be ab

Re: multiple dependency jars using pyspark

2015-08-10 Thread Jonathan Haddad
I figured out the issue - it had to do with the Cassandra jar I had compiled. I had tested a previous version. Using --jars (comma separated) and --driver-class-path (colon separated) is working. On Mon, Aug 10, 2015 at 1:08 AM ayan guha wrote: > Easiest way should be to add both jars in SPARK
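
For reference, a sketch of that combination on the command line (jar paths and script name are illustrative); note the commas for --jars versus the colons for --driver-class-path:

    spark-submit \
      --jars /path/to/mysql-connector-java.jar,/path/to/spark-cassandra-connector.jar \
      --driver-class-path /path/to/mysql-connector-java.jar:/path/to/spark-cassandra-connector.jar \
      my_migration_job.py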

How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
We're getting the below error. Tried increasing spark.executor.memory e.g. from 1g to 2g but the below error still happens. Any recommendations? Something to do with specifying -Xmx in the submit job scripts? Thanks. Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit excee

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Cody Koeninger
That looks like it's during recovery from a checkpoint, so it'd be driver memory not executor memory. How big is the checkpoint directory that you're trying to restore from? On Mon, Aug 10, 2015 at 10:57 AM, Dmitry Goldenberg < dgoldenberg...@gmail.com> wrote: > We're getting the below error. T

Re: ClosureCleaner does not work for java code

2015-08-10 Thread Sean Owen
The difference is really that Java and Scala work differently. In Java, your anonymous subclass of Ops defined in (a method of) AbstractTest captures a reference to it. That much is 'correct' in that it's how Java is supposed to work, and AbstractTest is indeed not serializable since you didn't dec

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
Thanks, Cody, will try that. Unfortunately due to a reinstall I don't have the original checkpointing directory :( Thanks for the clarification on spark.driver.memory, I'll keep testing (at 2g things seem OK for now). On Mon, Aug 10, 2015 at 12:10 PM, Cody Koeninger wrote: > That looks like it'

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Ted Yu
I wonder during recovery from a checkpoint whether we can estimate the size of the checkpoint and compare with Runtime.getRuntime().freeMemory(). If the size of checkpoint is much bigger than free memory, log warning, etc Cheers On Mon, Aug 10, 2015 at 9:34 AM, Dmitry Goldenberg wrote: > Thank

Problem with take vs. takeSample in PySpark

2015-08-10 Thread David Montague
Hi all, I am getting some strange behavior with the RDD take function in PySpark while doing some interactive coding in an IPython notebook. I am running PySpark on Spark 1.2.0 in yarn-client mode if that is relevant. I am using sc.wholeTextFiles and pandas to load a collection of .csv files int

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
Would there be a way to chunk up/batch up the contents of the checkpointing directories as they're being processed by Spark Streaming? Is it mandatory to load the whole thing in one go? On Mon, Aug 10, 2015 at 12:42 PM, Ted Yu wrote: > I wonder during recovery from a checkpoint whether we can e

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Cody Koeninger
You need to keep a certain number of rdds around for checkpointing, based on e.g. the window size. Those would all need to be loaded at once. On Mon, Aug 10, 2015 at 11:49 AM, Dmitry Goldenberg < dgoldenberg...@gmail.com> wrote: > Would there be a way to chunk up/batch up the contents of the > c

Streaming of WordCount example

2015-08-10 Thread Mohit Anchlia
I am trying to run this program as a yarn-client. The job seems to be submitting successfully, however I don't see any process listening on the expected port on this host. https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaRecoverableNetworkWordCount.j

Re: How to create DataFrame from a binary file?

2015-08-10 Thread Umesh Kacha
Hi Bo, thanks much. Let me explain; please see the following code: JavaPairRDD pairRdd = javaSparkContext.binaryFiles("/hdfs/path/to/binfile"); JavaRDD javardd = pairRdd.values(); DataFrame binDataFrame = sqlContext.createDataFrame(javaBinRdd, PortableDataStream.class); binDataFrame.show(); //shows

Re: How to create DataFrame from a binary file?

2015-08-10 Thread Ted Yu
Umesh: Please take a look at the classes under: sql/core/src/main/scala/org/apache/spark/sql/parquet FYI On Mon, Aug 10, 2015 at 10:35 AM, Umesh Kacha wrote: > Hi Bo thanks much let me explain please see the following code > > JavaPairRDD pairRdd = > javaSparkContext.binaryFiles("/hdfs/path/to/

Kafka direct approach: blockInterval and topic partitions

2015-08-10 Thread allonsy
Hi everyone, I recently started using the new Kafka direct approach. Now, as far as I understood, each Kafka partition /is/ an RDD partition that will be processed by a single core. What I don't understand is the relation between those partitions and the blocks generated every blockInterval. For

subscribe

2015-08-10 Thread Phil Kallos
please

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Ted Yu
Looks like the workaround is to reduce the window length. Cheers On Mon, Aug 10, 2015 at 10:07 AM, Cody Koeninger wrote: > You need to keep a certain number of rdds around for checkpointing, based > on e.g. the window size. Those would all need to be loaded at once. > > On Mon, Aug 10, 2015 at 11:

Re: subscribe

2015-08-10 Thread Ted Yu
Please take a look at the first section of https://spark.apache.org/community Cheers On Mon, Aug 10, 2015 at 10:54 AM, Phil Kallos wrote: > please >

Re: Kafka direct approach: blockInterval and topic partitions

2015-08-10 Thread Cody Koeninger
There's no long-running receiver pushing blocks of messages, so blockInterval isn't relevant. Batch interval is what matters. On Mon, Aug 10, 2015 at 12:52 PM, allonsy wrote: > Hi everyone, > > I recently started using the new Kafka direct approach. > > Now, as far as I understood, each Kafka p

Re: Problem with take vs. takeSample in PySpark

2015-08-10 Thread Davies Liu
I tested this in master (1.5 release), it worked as expected (changed spark.driver.maxResultSize to 10m), >>> len(sc.range(10).map(lambda i: '*' * (1<<23) ).take(1)) 1 >>> len(sc.range(10).map(lambda i: '*' * (1<<24) ).take(1)) 15/08/10 10:45:55 ERROR TaskSetManager: Total size of serialized resul

Re: Problems getting expected results from hbase_inputformat.py

2015-08-10 Thread Eric Bless
Thank you Gen, the changes to HBaseConverters.scala look to now be returning all column qualifiers, as follows -  (u'row1', {u'qualifier': u'a', u'timestamp': u'1438716994027', u'value': u'value1', u'columnFamily': u'f1', u'type': u'Put', u'row': u'row1'}) (u'row1', {u'qualifier': u'b', u'timest

Re: Problems getting expected results from hbase_inputformat.py

2015-08-10 Thread Ted Yu
Eric: Other than HBaseConverters.scala, examples/src/main/python/hbase_inputformat.py was also updated. FYI On Mon, Aug 10, 2015 at 11:08 AM, Eric Bless wrote: > Thank you Gen, the changes to HBaseConverters.scala look to now be > returning all column qualifiers, as follows - > > (u'row1', {u'qu

Re: Kafka direct approach: blockInterval and topic partitions

2015-08-10 Thread Luca
Thank you! :) 2015-08-10 19:58 GMT+02:00 Cody Koeninger : > There's no long-running receiver pushing blocks of messages, so > blockInterval isn't relevant. > > Batch interval is what matters. > > On Mon, Aug 10, 2015 at 12:52 PM, allonsy wrote: > >> Hi everyone, >> >> I recently started using th

Re: subscribe

2015-08-10 Thread Brandon White
https://www.youtube.com/watch?v=H07zYvkNYL8 On Mon, Aug 10, 2015 at 10:55 AM, Ted Yu wrote: > Please take a look at the first section of > https://spark.apache.org/community > > Cheers > > On Mon, Aug 10, 2015 at 10:54 AM, Phil Kallos > wrote: > >> please >> > >

How to use custom Hadoop InputFormat in DataFrame?

2015-08-10 Thread unk1102
Hi, I have my own custom Hadoop InputFormat which I want to use with a DataFrame. How do we do that? I know I can use sc.hadoopFile(..), but then how do I convert it into a DataFrame? JavaPairRDD myFormatAsPairRdd = jsc.hadoopFile("hdfs://tmp/data/myformat.xyz", MyInputFormat.class, Void.class, MyRecordWritab

Re: Pagination on big table, splitting joins

2015-08-10 Thread Michael Armbrust
> > I think to use *toLocalIterator* method and something like that, but I > have doubts about memory and parallelism and sure there is a better way to > do it. > It will still run all earlier parts of the job in parallel. Only the actual retrieving of the final partitions will be serial. This i
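
A minimal sketch of that pattern, assuming df is the already-built DataFrame; the earlier stages still run in parallel on the cluster, only the hand-off to the driver is serial:

    // Pull results to the driver one partition at a time instead of one big collect().
    df.rdd.toLocalIterator.foreach { row =>
      // handle a single Row here, e.g. write it to an external system
      println(row)
    }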

Re: How to use custom Hadoop InputFormat in DataFrame?

2015-08-10 Thread Michael Armbrust
You can't create a DataFrame from an arbitrary object since we don't know how to figure out the schema. You can either create a JavaBean or manually create a row + specify the schema
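
A minimal sketch of the "row + schema" route in Scala, assuming records is the pair RDD obtained from sc.hadoopFile and that the column names, types, and extracted values below are placeholders:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

    // Turn each custom record into a Row, then tell Spark SQL what the columns are.
    val rowRdd = records.map { case (_, rec) => Row(rec.toString, 1L) }

    val schema = StructType(Seq(
      StructField("payload", StringType, nullable = true),
      StructField("count", LongType, nullable = false)))

    val df = sqlContext.createDataFrame(rowRdd, schema)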

Re: Streaming of WordCount example

2015-08-10 Thread Tathagata Das
If you are running on a cluster, the listening is occurring on one of the executors, not in the driver. On Mon, Aug 10, 2015 at 10:29 AM, Mohit Anchlia wrote: > I am trying to run this program as a yarn-client. The job seems to be > submitting successfully however I don't see any process listeni

Re: Spark inserting into parquet files with different schema

2015-08-10 Thread Michael Armbrust
Older versions of Spark (i.e. when it was still called SchemaRDD instead of DataFrame) did not support merging different parquet schema. However, Spark 1.4+ should. On Sat, Aug 8, 2015 at 8:58 PM, sim wrote: > Adam, did you find a solution for this? > > > > -- > View this message in context: >
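
A small sketch of reading the evolved files back, assuming Spark 1.4+ and an illustrative path; as I understand it, 1.4 merges compatible parquet schemas by default, while 1.5 exposes the explicit mergeSchema option shown here:

    // Read parquet parts written with different (compatible) schemas into one DataFrame.
    val merged = sqlContext.read
      .option("mergeSchema", "true")     // explicit switch in 1.5; 1.4 merges by default
      .parquet("hdfs:///data/events")    // illustrative path

    merged.printSchema()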

Re: How to use custom Hadoop InputFormat in DataFrame?

2015-08-10 Thread Umesh Kacha
Hi Michael, thanks for the reply. I know that I can create a DataFrame using a JavaBean or a StructType; I want to know how I can create a DataFrame from the above code, which is a custom Hadoop format. On Tue, Aug 11, 2015 at 12:04 AM, Michael Armbrust wrote: > You can't create a DataFrame from an arbitrary ob

Re: Spark inserting into parquet files with different schema

2015-08-10 Thread Simeon Simeonov
Michael, is there an example anywhere that demonstrates how this works with the schema changing over time? Must the Hive tables be set up as external tables outside of saveAsTable? In my experience, in 1.4.1, writing to a table with SaveMode.Append fails if the schemas don't match. Thanks, Sim

Re: Streaming of WordCount example

2015-08-10 Thread Mohit Anchlia
I am running as a yarn-client, which probably means that the program that submitted the job is where the listening is also occurring? I thought that YARN is only used to negotiate resources in yarn-client master mode. On Mon, Aug 10, 2015 at 11:34 AM, Tathagata Das wrote: > If you are running

Re: stopping spark stream app

2015-08-10 Thread Shushant Arora
Any recommendation for gracefully shutting down a Spark Streaming application? I am running it on YARN and need a way to tell it externally, either via the yarn application -kill command or some other way, but I need the current batch to be processed completely and the checkpoint to be saved before shutting do

Re: Streaming of WordCount example

2015-08-10 Thread Tathagata Das
In yarn-client mode, the driver is on the machine where you ran the spark-submit. The executors are running in the YARN cluster nodes, and the socket receiver listening on port is running in one of the executors. On Mon, Aug 10, 2015 at 11:43 AM, Mohit Anchlia wrote: > I am running as a yar

Re: ERROR ReceiverTracker: Deregistered receiver for stream 0: Stopped by driver

2015-08-10 Thread Tathagata Das
I think this may be expected. When the streaming context is stopped without the SparkContext, then the receivers are stopped by the driver. The receiver sends back the message that it has been stopped. This is being (probably incorrectly) logged with ERROR level. On Sun, Aug 9, 2015 at 12:52 AM, S

Re: Graceful shutdown for Spark Streaming

2015-08-10 Thread Michal Čizmazia
From logs, it seems that Spark Streaming does handle *kill -SIGINT* with graceful shutdown. Please could you confirm? Thanks! On 30 July 2015 at 08:19, anshu shukla wrote: > Yes I was doing same , if You mean that this is the correct way to do > Then I will verify it once more in my cas

Re: stopping spark stream app

2015-08-10 Thread Tathagata Das
In general, it is a little risky to put long running stuff in a shutdown hook as it may delay shutdown of the process which may delay other things. That said, you could try it out. A better way to explicitly shutdown gracefully is to use an RPC to signal the driver process to start shutting down,

Re: Graceful shutdown for Spark Streaming

2015-08-10 Thread Tathagata Das
Note that this is true only from Spark 1.4 where the shutdown hooks were added. On Mon, Aug 10, 2015 at 12:12 PM, Michal Čizmazia wrote: > From logs, it seems that Spark Streaming does handle *kill -SIGINT* with > graceful shutdown. > > Please could you confirm? > > Thanks! > > On 30 July 2015 a

Fw: Your Application has been Received

2015-08-10 Thread Shing Hing Man
Bar123 On Monday, 10 August 2015, 20:20, Resourcing Team wrote: Dear Shing Hing, Thank you for applying to Barclays. We have received your application and are currently reviewing your details. Updates on your progress will be emailed to you and can be accessed through your profile

Re: stopping spark stream app

2015-08-10 Thread Shushant Arora
By RPC you mean web service exposed on driver which listens and set some flag and driver checks that flag at end of each batch and if set then gracefully stop the application ? On Tue, Aug 11, 2015 at 12:43 AM, Tathagata Das wrote: > In general, it is a little risky to put long running stuff in

Java Streaming Context - File Stream use

2015-08-10 Thread Ashish Soni
Please help, as I am not sure what is incorrect with the below code; it gives me a compilation error in Eclipse. SparkConf sparkConf = new SparkConf().setMaster("local[4]").setAppName("JavaDirectKafkaWordCount"); JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Duratio

Re: Streaming of WordCount example

2015-08-10 Thread Mohit Anchlia
I've verified all the executors and I don't see a process listening on the port. However, the application seem to show as running in the yarn UI On Mon, Aug 10, 2015 at 11:56 AM, Tathagata Das wrote: > In yarn-client mode, the driver is on the machine where you ran the > spark-submit. The execut

Re: Spark inserting into parquet files with different schema

2015-08-10 Thread Michael Armbrust
What is the error you are getting? It would also be awesome if you could try with Spark 1.5 when the first preview comes out (hopefully early next week). On Mon, Aug 10, 2015 at 11:41 AM, Simeon Simeonov wrote: > Michael, is there an example anywhere that demonstrates how this works > with the

Re: Spark inserting into parquet files with different schema

2015-08-10 Thread Simeon Simeonov
Michael, please, see http://apache-spark-user-list.1001560.n3.nabble.com/Schema-evolution-in-tables-tt23999.html The exception is java.lang.RuntimeException: Relation[ ... ] org.apache.spark.sql.parquet.ParquetRelation2@83a73a05 requires that the query in the SELECT clause of the INSERT INTO/O

Re: Streaming of WordCount example

2015-08-10 Thread Tathagata Das
Is it receiving any data? If so, then it must be listening. Alternatively, to test these theories, you can locally run a Spark standalone cluster (a one-node standalone cluster on the local machine), and submit your app in client mode on that to see whether you are seeing the process listening on 999

Re: stopping spark stream app

2015-08-10 Thread Tathagata Das
1. RPC can be done in many ways, and a web service is one of many ways. An even more hacky version can be the app polling a file in a file system; if the file exists, start shutting down. 2. No need to set a flag. When you get the signal from RPC, you can just call context.stop(stopGracefully = true)
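
A minimal sketch of the marker-file variant, assuming a StreamingContext ssc already started and that the marker path and polling interval below are illustrative:

    import java.nio.file.{Files, Paths}

    // Driver-side watcher: when the marker file appears, finish in-flight batches and stop.
    val marker = Paths.get("/tmp/stop-streaming-app")   // illustrative path
    new Thread(new Runnable {
      override def run(): Unit = {
        while (!Files.exists(marker)) Thread.sleep(10000)
        ssc.stop(stopSparkContext = true, stopGracefully = true)
      }
    }).start()

    ssc.awaitTermination()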

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
"You need to keep a certain number of rdds around for checkpointing" -- that seems like a hefty expense to pay in order to achieve fault tolerance. Why does Spark persist whole RDD's of data? Shouldn't it be sufficient to just persist the offsets, to know where to resume from? Thanks. On Mon, A

Is there any external dependencies for lag() and lead() when using data frames?

2015-08-10 Thread Jerry
Hello, Using Apache Spark 1.4.1 I'm unable to use lag or lead when making queries to a data frame and I'm trying to figure out if I just have a bad setup or if this is a bug. As for the exceptions I get: when using selectExpr() with a string as an argument, I get "NoSuchElementException: key not f

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Cody Koeninger
The rdd is indeed defined by mostly just the offsets / topic partitions. On Mon, Aug 10, 2015 at 3:24 PM, Dmitry Goldenberg wrote: > "You need to keep a certain number of rdds around for checkpointing" -- > that seems like a hefty expense to pay in order to achieve fault > tolerance. Why does S

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
Well, RDDs also contain data, don't they? The question is, what can be so hefty in the checkpointing directory to cause the Spark driver to run out of memory? It seems that it makes checkpointing expensive, in terms of I/O and memory consumption. Two network hops -- to driver, then to workers. Hef

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Cody Koeninger
No, it's not like a given KafkaRDD object contains an array of messages that gets serialized with the object. Its compute method generates an iterator of messages as needed, by connecting to kafka. I don't know what was so hefty in your checkpoint directory, because you deleted it. My checkpoint

Re: Is there any external dependencies for lag() and lead() when using data frames?

2015-08-10 Thread Michael Armbrust
You will need to use a HiveContext for window functions to work. On Mon, Aug 10, 2015 at 1:26 PM, Jerry wrote: > Hello, > > Using Apache Spark 1.4.1 I'm unable to use lag or lead when making queries > to a data frame and I'm trying to figure out if I just have a bad setup or > if this is a bug.
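
A small sketch of lag over a HiveContext, assuming Spark 1.4+, with an illustrative table and columns (id, ts):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.lag

    val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
    val df = hiveCtx.table("events")   // illustrative table with columns id and ts

    // Previous timestamp per id, ordered by ts; a plain SQLContext rejects this in 1.4.x.
    val w = Window.partitionBy("id").orderBy("ts")
    df.select(df("id"), df("ts"), lag(df("ts"), 1).over(w).as("prev_ts")).show()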

Optimal way to implement a small lookup table for identifiers in an RDD

2015-08-10 Thread Mike Trienis
Hi All, I have an RDD of case class objects. scala> case class Entity( | value: String, | identifier: String | ) defined class Entity scala> Entity("hello", "id1") res25: Entity = Entity(hello,id1) During a map operation, I'd like to return a new RDD that contains all of

When will window ....

2015-08-10 Thread Martin Senne
When will window functions be integrated into Spark (without HiveContext)? Sent with AquaMail for Android http://www.aqua-mail.com On 10 August 2015 at 23:04:22, Michael Armbrust wrote: You will need to use a HiveContext for window functions to work. On Mon, Aug 10, 2015 at 1:26 PM, Jerry

Re: stopping spark stream app

2015-08-10 Thread Shushant Arora
Thanks! On Tue, Aug 11, 2015 at 1:34 AM, Tathagata Das wrote: > 1. RPC can be done in many ways, and a web service is one of many ways. A > even more hacky version can be the app polling a file in a file system, if > the file exists start shutting down. > 2. No need to set a flag. When you get

avoid duplicate due to executor failure in spark stream

2015-08-10 Thread Shushant Arora
Hi, How can I avoid duplicate processing of Kafka messages in Spark Streaming 1.3 because of executor failure? 1. Can I somehow access accumulators of the failed task in the retry task, to skip those events which were already processed by the failed task on this partition? 2. Or will I have to persist each ms

Re: Is there any external dependencies for lag() and lead() when using data frames?

2015-08-10 Thread Jerry
Thanks... looks like I now hit that bug about HiveMetaStoreClient as I now get the message about being unable to instantiate it. On a side note, does anyone know where hive-site.xml is typically located? Thanks, Jerry On Mon, Aug 10, 2015 at 2:03 PM, Michael Armbrust wrote: > You will

Re: avoid duplicate due to executor failure in spark stream

2015-08-10 Thread Cody Koeninger
http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers http://spark.apache.org/docs/latest/streaming-programming-guide.html#semantics-of-output-operations https://www.youtube.com/watch?v=fXnNEq1v3VA On Mon, Aug 10, 2015 at 4:32 PM, Shushant

collect() works, take() returns "ImportError: No module named iter"

2015-08-10 Thread YaoPau
I'm running Spark 1.3 on CDH 5.4.4, and trying to set up Spark to run via iPython Notebook. I'm getting collect() to work just fine, but take() errors. (I'm having issues with collect() on other datasets ... but take() seems to break every time I run it.) My code is below. Any thoughts? >> sc

Re: collect() works, take() returns "ImportError: No module named iter"

2015-08-10 Thread Davies Liu
Is it possible that you have Python 2.7 on the driver, but Python 2.6 on the workers? PySpark requires that you have the same minor version of Python in both driver and worker. In PySpark 1.4+, it will do this check before running any tasks. On Mon, Aug 10, 2015 at 2:53 PM, YaoPau wrote: > I'm runn

Re: collect() works, take() returns "ImportError: No module named iter"

2015-08-10 Thread Ruslan Dautkhanov
There was a similar problem reported before on this list. Weird Python errors like this generally mean you have different versions of Python on the nodes of your cluster. Can you check that? From the error stack you use 2.7.10 | Anaconda 2.3.0, while the OS/CDH version of Python is probably 2.6. --

Re: Do I really need to build Spark for Hive/Thrift Server support?

2015-08-10 Thread roni
Hi All, Any explanation for this? As Reece said I can do operations with hive but - val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) -- gives error. I have already created spark ec2 cluster with the spark-ec2 script. How can I build it again? Thanks _Roni On Tue, Jul 28, 2015 at

Random Forest and StringIndexer in pyspark ML Pipeline

2015-08-10 Thread pkphlam
Hi, If I understand the RandomForest model in the ML Pipeline implementation in the ml package correctly, I have to first run my outcome label variable through the StringIndexer, even if my labels are numeric. The StringIndexer then converts the labels into numeric indices based on frequency of th

Re: can't start master node on a standalone environment

2015-08-10 Thread pradyumnad
It seems like the jars are missing. Can you post the full log? Expand the "..6 more" in the stack trace. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/can-t-start-master-node-on-a-standalone-environment-tp24160p24201.html Sent from the Apache Spark User List mailing list archive

Re: Streaming of WordCount example

2015-08-10 Thread Mohit Anchlia
I am using the same exact code: https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaRecoverableNetworkWordCount.java Submitting like this: yarn: /opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/bin/spark-submit --class org.sony.spark.stream

Re: Twitter live Streaming

2015-08-10 Thread pradyumnad
The Streaming API, as the name says, gives out the live stream of tweets which are posted right then. If you would like to get old tweets, use the REST API from Twitter. Twitter4j is the Twitter library that I use and suggest for the task. -- View this message in context: http://apache-spark-user

Re: Streaming of WordCount example

2015-08-10 Thread Mohit Anchlia
I do see this message: 15/08/10 19:19:12 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources On Mon, Aug 10, 2015 at 4:15 PM, Mohit Anchlia wrote: > > I am using the same exact code: > > > http

Re: TFIDF Transformation

2015-08-10 Thread pradyumnad
If you want to convert the hash back to a word, the very idea defeats the purpose of hashing. You could keep a mapping from words to their hashes, but that wouldn't be good. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/TFIDF-Transformation-tp24086p24203.html Sent from the Apach

Re: Streaming of WordCount example

2015-08-10 Thread Tathagata Das
1. When you are running locally, make sure the "master" in the SparkConf reflects that and is not somehow set to "yarn-client" 2. You may not be getting any resources from YARN at all, so no executors, so no receiver running. That is why I asked the most basic question - Is it receiving data? That
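
A minimal sketch of the local setup referred to here; a socket receiver occupies one core, so the local master needs at least two threads (the app name and batch interval are illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // "local[2]": one thread for the socket receiver, at least one left for processing.
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCountTest")
    val ssc = new StreamingContext(conf, Seconds(1))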
