RE: Refresh table

2015-08-10 Thread Cheng, Hao
Refreshing a table only works for Spark SQL DataSource tables, in my understanding; apparently you're running a Hive table here. Can you try to create a table like: |CREATE TEMPORARY TABLE parquetTable (a int, b string) |USING org.apache.spark.sql.parquet.DefaultSource |OPTIO
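
A minimal sketch of the datasource-table route Hao describes, assuming a SQLContext/HiveContext named sqlContext and a hypothetical HDFS path; REFRESH TABLE (SPARK-5833) applies to tables backed by the datasource API:

    // Register the parquet directory as a datasource table rather than a Hive table
    sqlContext.sql(
      """CREATE TEMPORARY TABLE parquet_table
        |USING org.apache.spark.sql.parquet
        |OPTIONS (path 'hdfs:///test_table')""".stripMargin)

    // After new part files (e.g. key=2) land under the path, refresh the cached metadata
    sqlContext.sql("REFRESH TABLE parquet_table")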

Re: Controlling number of executors on Mesos vs YARN

2015-08-10 Thread Haripriya Ayyalasomayajula
Hi Tim, Spark on YARN allows us to do it using the --num-executors and --executor-cores command-line arguments. I just got a chance to look at a similar Spark user list mail, but no answer yet. So does Mesos allow setting the number of executors and cores? Is there a default number it assumes? On Mon,
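
For reference, a hedged sketch of the equivalent configuration properties (names as of roughly Spark 1.4); coarse-grained Mesos has no per-executor count, only a total-cores cap:

    val conf = new org.apache.spark.SparkConf()
      .setAppName("executor-sizing")
      .set("spark.executor.instances", "10")  // same as --num-executors on YARN
      .set("spark.executor.cores", "4")       // same as --executor-cores on YARN
      .set("spark.mesos.coarse", "true")      // coarse-grained Mesos mode
      .set("spark.cores.max", "40")           // caps total cores for the app on Mesos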

Re: Wish for 1.4: upper bound on # tasks in Mesos

2015-08-10 Thread Haripriya Ayyalasomayajula
Hello all, As a quick follow-up to this: I have been using Spark on YARN till now and am currently exploring Mesos and Marathon. Using YARN we could tell the Spark job the number of executors as well as the number of cores; is there a way to do it on Mesos? I'm using Spark 1.4.1 on Mesos 0.23

Refresh table

2015-08-10 Thread Jerrick Hoang
Hi all, I'm a little confused about how refresh table (SPARK-5833) should work. So I did the following, val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double") df1.write.parquet("hdfs:///test_table/key=1") Then I created an external table by doing, CREATE EXTERNAL TABLE `tm

Error while output JavaDStream to disk and mongodb

2015-08-10 Thread Deepesh Maheshwari
Hi, I have successfully reduced my data and stored it in a JavaDStream. Now I want to save this data in MongoDB, for which I have used the BSONObject type. But when I try to save it, it throws an exception. I also tried to save it simply with *saveAsTextFile*, but got the same exception. Error Log : attache

Re: mllib kmeans produce 1 large and many extremely small clusters

2015-08-10 Thread sooraj
Hi, The issue is very likely to be in the data or the transformations you apply, rather than anything to do with the Spark Kmeans API as such. I'd start debugging by doing a bit of exploratory analysis of the TFIDF vectors. That is, for instance, plot the distribution (histogram) of the TFIDF valu
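
One concrete starting point for that exploration, as a hedged Scala sketch assuming tfidf is the RDD[Vector] produced by MLlib's TF-IDF transform:

    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.rdd.RDD

    // Distribution of the L2 norms of the TF-IDF vectors: one huge bucket plus a
    // long tail often explains a single giant k-means cluster.
    def normHistogram(tfidf: RDD[Vector], buckets: Int = 20): (Array[Double], Array[Long]) =
      tfidf.map(v => Vectors.norm(v, 2.0)).histogram(buckets)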

Re: How to programmatically create, submit and report on Spark jobs?

2015-08-10 Thread Ted Yu
For monitoring, please take a look at http://spark.apache.org/docs/latest/monitoring.html Especially REST API section. Cheers On Mon, Aug 10, 2015 at 8:33 AM, Ted Yu wrote: > I found SPARK-3733 which was marked dup of SPARK-4924 which went to 1.4.0 > > FYI > > On Mon, Aug 10, 2015 at 5:12 AM, m

Writing a DataFrame as compressed JSON

2015-08-10 Thread sim
DataFrameReader.json() can handle gzipped JSON lines files automatically, but there doesn't seem to be a way to get DataFrameWriter.json() to write compressed JSON lines files. Uncompressed JSON lines is very expensive from an I/O standpoint because field names are included in every record. Is there
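
One workaround sketch (not a DataFrameWriter option): go through an RDD of JSON strings and use the text-file codec support; df and the output path are hypothetical, API as of Spark ~1.4:

    import org.apache.hadoop.io.compress.GzipCodec

    // df.toJSON yields an RDD[String] of JSON lines; saveAsTextFile accepts a compression codec
    df.toJSON.saveAsTextFile("hdfs:///out/events_json_gz", classOf[GzipCodec])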

Re: Differences in loading data using spark datasource api and using jdbc

2015-08-10 Thread satish chandra j
Hi, As I understand it, JDBC is meant for moderate volumes of data, but the Datasource API is a better option if the data volume is larger. The Datasource API is not available in lower versions of Spark such as 1.2.0. Regards, Satish On Tue, Aug 11, 2015 at 8:53 AM, 李铖 wrote: > Hi,everyone. > > I have one q
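
For what it's worth, the JDBC datasource can also parallelize the load when given partitioning hints; a hedged sketch with hypothetical connection details, assuming a sqlContext (Spark 1.4 API):

    val jdbcDF = sqlContext.read.format("jdbc").options(Map(
      "url"             -> "jdbc:postgresql://dbhost:5432/mydb",
      "dbtable"         -> "public.events",
      "partitionColumn" -> "id",        // numeric column to split on
      "lowerBound"      -> "1",
      "upperBound"      -> "1000000",
      "numPartitions"   -> "8"          // number of parallel JDBC connections
    )).load()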

Re: Spark Cassandra Connector issue

2015-08-10 Thread satish chandra j
Hi All, I have tried the commands as mentioned below but it still errors: dse spark-submit --master spark://10.246.43.15:7077 --class HelloWorld --jars /home/missingmerch/postgresql-9.4-1201.jdbc41.jar,/home/missingmerch/dse.jar,/home/missingmerch/spark-cassandra-connector-java_2.10-1.1.1.jar /hom

Differences in loading data using spark datasource api and using jdbc

2015-08-10 Thread 李铖
Hi, everyone. I have a question about loading data using the Spark datasource API versus using JDBC: which way is more efficient?

Re: Accessing S3 files with s3n://

2015-08-10 Thread Akshat Aranya
Hi Jerry, Akhil, Thanks for your help. With s3n, the entire file is downloaded even while just creating the RDD with sqlContext.read.parquet(). It seems like even just opening and closing the InputStream causes the entire data to get fetched. As it turned out, I was able to use s3a and avoid th
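
A hedged sketch of the s3a switch (the credential keys are the standard Hadoop s3a configuration names; bucket and path are hypothetical):

    sc.hadoopConfiguration.set("fs.s3a.access.key", "...")
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "...")
    val parquetDF = sqlContext.read.parquet("s3a://my-bucket/path/to/table")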

Re: Why use spark.history.fs.logDirectory instead of spark.eventLog.dir

2015-08-10 Thread canan chen
Anyone know this? Thanks On Fri, Aug 7, 2015 at 4:20 PM, canan chen wrote: > Is there any reason that the history server uses another property for the event > log dir? Thanks >

Inquiry about contributing code

2015-08-10 Thread Hyukjin Kwon
Dear Sir / Madam, I plan to contribute some code for passing filters to a datasource during physical planning. In more detail, I understand that when we want to build up filter operations on data like Parquet (actually reading and filtering HDFS blocks up front rather than filtering in memory wit
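
For context, the stable hook a datasource can already implement for this is PrunedFilteredScan; a toy, hedged sketch (pruning is done in memory here purely for illustration):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
    import org.apache.spark.sql.types._

    class ToyRelation(@transient val sqlContext: SQLContext)
      extends BaseRelation with PrunedFilteredScan {

      override def schema: StructType =
        StructType(Seq(StructField("id", IntegerType), StructField("name", StringType)))

      // Spark passes the selected columns and the predicates it could translate
      // (EqualTo, GreaterThan, ...); a real source would push both into its scan.
      override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
        val idx = schema.fieldNames.zipWithIndex.toMap
        val data = Seq(Row(1, "a"), Row(2, "b"))
        val pruned = data.map(r => Row.fromSeq(requiredColumns.map(c => r.get(idx(c)))))
        sqlContext.sparkContext.parallelize(pruned)
      }
    }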

Re: Is there any external dependencies for lag() and lead() when using data frames?

2015-08-10 Thread Jerry
By the way, if Hive is present in the Spark install, does it show up in the text when you start the spark shell? Any commands I can run to check if it exists? I didn't set up the spark machine that I use, so I don't know what's present or absent. Thanks, Jerry On Mon, Aug 10, 2015 at 2:38 PM, Jer

Re: collect() works, take() returns "ImportError: No module named iter"

2015-08-10 Thread Jon Gregg
We did have 2.7 on the driver, 2.6 on the edge nodes and figured that was the issue, so we've tried many combinations since then with all three of 2.6.6, 2.7.5, and Anaconda's 2.7.10 on each node with different PATHs and PYTHONPATHs each time. Every combination has produced the same error. We cam

Re: Spark Kinesis Checkpointing/Processing Delay

2015-08-10 Thread Tathagata Das
You are correct. The earlier Kinesis receiver (as of Spark 1.4) was not saving checkpoints correctly and was in general not reliable (even with WAL enabled). We have improved this in Spark 1.5 with updated Kinesis receiver, that keeps track of the Kinesis sequence numbers as part of the Spark Strea

Re: Package Release Annoucement: Spark SQL on HBase "Astro"

2015-08-10 Thread Ted Yu
Yan / Bing: Mind taking a look at HBASE-14181 'Add Spark DataFrame DataSource to HBase-Spark Module' ? Thanks On Wed, Jul 22, 2015 at 4:53 PM, Bing Xiao (Bing) wrote: > We are happy to announce the availability of the Spark SQL on HBase 1.0.0

Spark Kinesis Checkpointing/Processing Delay

2015-08-10 Thread Phil Kallos
Hi! Sorry if this is a repost. I'm using Spark + Kinesis ASL to process and persist stream data to ElasticSearch. For the most part it works nicely. There is a subtle issue I'm running into about how failures are handled. For example's sake, let's say I am processing a Kinesis stream that produc

Re: Json parsing library for Spark Streaming?

2015-08-10 Thread pradyumnad
I use Play JSON; it's quite well known. If you would like to try it, below is the sbt dependency: "com.typesafe.play" % "play-json_2.10" % "2.2.1", -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Json-parsing-library-for-Spark-Streaming-tp24016p24204.html Sent

Re: Streaming of WordCount example

2015-08-10 Thread Tathagata Das
1. When you are running locally, make sure the "master" in the SparkConf reflects that and is not somehow set to "yarn-client" 2. You may not be getting any resources from YARN at all, so no executors, so no receiver running. That is why I asked the most basic question - Is it receiving data? That
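
For the local sanity check suggested here, a minimal Scala sketch (note local[2]: the socket receiver occupies one core, so at least two are needed):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setMaster("local[2]").setAppName("RecoverableNetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)  // receiver takes one core; processing needs the other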

Re: TFIDF Transformation

2015-08-10 Thread pradyumnad
If you want to convert the hash back to a word, the very idea defies the purpose of hashing. You may map the words with hashing, but that wouldn't be good. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/TFIDF-Transformation-tp24086p24203.html Sent from the Apach

Re: Streaming of WordCount example

2015-08-10 Thread Mohit Anchlia
I do see this message: 15/08/10 19:19:12 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources On Mon, Aug 10, 2015 at 4:15 PM, Mohit Anchlia wrote: > > I am using the same exact code: > > > http

Re: Twitter live Streaming

2015-08-10 Thread pradyumnad
The Streaming API, as the name suggests, gives out the live stream of tweets that are posted right then. If you would like to get old tweets, use the REST API from Twitter. Twitter4j is the Twitter library that I use and suggest for the task. -- View this message in context: http://apache-spark-user

Re: Streaming of WordCount example

2015-08-10 Thread Mohit Anchlia
I am using the same exact code: https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaRecoverableNetworkWordCount.java Submitting like this: yarn: /opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/bin/spark-submit --class org.sony.spark.stream

Re: can't start master node on a standalone environment

2015-08-10 Thread pradyumnad
It seems like the jars are missing. Can you post the full log ? Expand ..6 more -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/can-t-start-master-node-on-a-standalone-environment-tp24160p24201.html Sent from the Apache Spark User List mailing list archive

Random Forest and StringIndexer in pyspark ML Pipeline

2015-08-10 Thread pkphlam
Hi, If I understand the RandomForest model in the ML Pipeline implementation in the ml package correctly, I have to first run my outcome label variable through the StringIndexer, even if my labels are numeric. The StringIndexer then converts the labels into numeric indices based on frequency of th
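
A hedged Scala sketch of the pipeline being described (spark.ml, ~1.4 API; trainingDF with "label" and "features" columns is hypothetical):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.ml.feature.StringIndexer

    // StringIndexer maps label values (numeric or not) to indices ordered by frequency
    val labelIndexer = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel")
    val rf = new RandomForestClassifier().setLabelCol("indexedLabel").setFeaturesCol("features")
    val model = new Pipeline().setStages(Array(labelIndexer, rf)).fit(trainingDF)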

Re: Do I really need to build Spark for Hive/Thrift Server support?

2015-08-10 Thread roni
Hi All, Any explanation for this? As Reece said I can do operations with hive but - val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) -- gives error. I have already created spark ec2 cluster with the spark-ec2 script. How can I build it again? Thanks _Roni On Tue, Jul 28, 2015 at

Re: collect() works, take() returns "ImportError: No module named iter"

2015-08-10 Thread Ruslan Dautkhanov
There was a similar problem reported before on this list. Weird Python errors like this generally mean you have different versions of Python on the nodes of your cluster. Can you check that? From the error stack, you use 2.7.10 |Anaconda 2.3.0 while the OS/CDH version of Python is probably 2.6. --

Re: collect() works, take() returns "ImportError: No module named iter"

2015-08-10 Thread Davies Liu
Is it possible that you have Python 2.7 on the driver, but Python 2.6 on the workers? PySpark requires that you have the same minor version of Python in both driver and worker. In PySpark 1.4+, it will do this check before running any tasks. On Mon, Aug 10, 2015 at 2:53 PM, YaoPau wrote: > I'm runn

collect() works, take() returns "ImportError: No module named iter"

2015-08-10 Thread YaoPau
I'm running Spark 1.3 on CDH 5.4.4, and trying to set up Spark to run via iPython Notebook. I'm getting collect() to work just fine, but take() errors. (I'm having issues with collect() on other datasets ... but take() seems to break every time I run it.) My code is below. Any thoughts? >> sc

Re: avoid duplicate due to executor failure in spark stream

2015-08-10 Thread Cody Koeninger
http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers http://spark.apache.org/docs/latest/streaming-programming-guide.html#semantics-of-output-operations https://www.youtube.com/watch?v=fXnNEq1v3VA On Mon, Aug 10, 2015 at 4:32 PM, Shushant
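
For reference, a hedged sketch of creating the direct stream covered in the first link (API as of Spark 1.3; the broker list, topic, and an existing StreamingContext named ssc are assumptions):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))
    // Offsets are carried by the stream's RDDs themselves, which is what makes
    // exactly-once output possible when the writes are idempotent or transactional.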

Re: Is there any external dependencies for lag() and lead() when using data frames?

2015-08-10 Thread Jerry
Thanks... looks like I now hit that bug about HiveMetaStoreClient as I now get the message about being unable to instantiate it. On a side note, does anyone know where hive-site.xml is typically located? Thanks, Jerry On Mon, Aug 10, 2015 at 2:03 PM, Michael Armbrust wrote: > You will

avoid duplicate due to executor failure in spark stream

2015-08-10 Thread Shushant Arora
Hi, How can I avoid duplicate processing of Kafka messages in Spark Streaming 1.3 because of executor failure? 1. Can I somehow access accumulators of the failed task in the retry task, to skip those events which were already processed by the failed task on this partition? 2. Or I'll have to persist each ms

Re: stopping spark stream app

2015-08-10 Thread Shushant Arora
Thanks! On Tue, Aug 11, 2015 at 1:34 AM, Tathagata Das wrote: > 1. RPC can be done in many ways, and a web service is one of many ways. A > even more hacky version can be the app polling a file in a file system, if > the file exists start shutting down. > 2. No need to set a flag. When you get

When will window ....

2015-08-10 Thread Martin Senne
When will window functions be integrated into Spark (without HiveContext)? Sent with AquaMail for Android http://www.aqua-mail.com On 10 August 2015 at 23:04:22, Michael Armbrust wrote: You will need to use a HiveContext for window functions to work. On Mon, Aug 10, 2015 at 1:26 PM, Jerry

Optimal way to implement a small lookup table for identifiers in an RDD

2015-08-10 Thread Mike Trienis
Hi All, I have an RDD of case class objects. scala> case class Entity( | value: String, | identifier: String | ) defined class Entity scala> Entity("hello", "id1") res25: Entity = Entity(hello,id1) During a map operation, I'd like to return a new RDD that contains all of
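
A hedged sketch of the usual answer for a small lookup table, a broadcast variable (the lookup contents are hypothetical):

    case class Entity(value: String, identifier: String)

    val lookup = sc.broadcast(Map("id1" -> "category-A", "id2" -> "category-B"))
    val entities = sc.parallelize(Seq(Entity("hello", "id1"), Entity("world", "id2")))
    // The small map ships once per executor instead of once per record
    val resolved = entities.map(e => (e.value, lookup.value.getOrElse(e.identifier, "unknown")))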

Re: Is there any external dependencies for lag() and lead() when using data frames?

2015-08-10 Thread Michael Armbrust
You will need to use a HiveContext for window functions to work. On Mon, Aug 10, 2015 at 1:26 PM, Jerry wrote: > Hello, > > Using Apache Spark 1.4.1 I'm unable to use lag or lead when making queries > to a data frame and I'm trying to figure out if I just have a bad setup or > if this is a bug.
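
A hedged sketch of lag/lead through the DataFrame window API once a HiveContext is in place (1.4.x; the sample data and column names are hypothetical):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{lag, lead}

    val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
    import hiveCtx.implicits._

    val events = sc.parallelize(Seq(("a", 1, 10), ("a", 2, 20), ("a", 3, 30))).toDF("key", "t", "value")
    val w = Window.partitionBy("key").orderBy("t")
    events.select($"t", lag($"value", 1).over(w).as("prev"), lead($"value", 1).over(w).as("next")).show()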

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Cody Koeninger
No, it's not like a given KafkaRDD object contains an array of messages that gets serialized with the object. Its compute method generates an iterator of messages as needed, by connecting to kafka. I don't know what was so hefty in your checkpoint directory, because you deleted it. My checkpoint

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
Well, RDD"s also contain data, don't they? The question is, what can be so hefty in the checkpointing directory to cause Spark driver to run out of memory? It seems that it makes checkpointing expensive, in terms of I/O and memory consumption. Two network hops -- to driver, then to workers. Hef

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Cody Koeninger
The rdd is indeed defined by mostly just the offsets / topic partitions. On Mon, Aug 10, 2015 at 3:24 PM, Dmitry Goldenberg wrote: > "You need to keep a certain number of rdds around for checkpointing" -- > that seems like a hefty expense to pay in order to achieve fault > tolerance. Why does S

Is there any external dependencies for lag() and lead() when using data frames?

2015-08-10 Thread Jerry
Hello, Using Apache Spark 1.4.1 I'm unable to use lag or lead when making queries to a data frame and I'm trying to figure out if I just have a bad setup or if this is a bug. As for the exceptions I get: when using selectExpr() with a string as an argument, I get "NoSuchElementException: key not f

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
"You need to keep a certain number of rdds around for checkpointing" -- that seems like a hefty expense to pay in order to achieve fault tolerance. Why does Spark persist whole RDD's of data? Shouldn't it be sufficient to just persist the offsets, to know where to resume from? Thanks. On Mon, A

Re: stopping spark stream app

2015-08-10 Thread Tathagata Das
1. RPC can be done in many ways, and a web service is one of many ways. An even more hacky version can be the app polling a file in a file system; if the file exists, start shutting down. 2. No need to set a flag. When you get the signal from RPC, you can just call context.stop(stopGracefully = true)
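
A hedged sketch of the file-polling variant, assuming a StreamingContext named ssc and a hypothetical marker path:

    import org.apache.hadoop.fs.{FileSystem, Path}

    val marker = new Path("/tmp/stop-streaming-app")  // hypothetical marker file
    new Thread("shutdown-poller") {
      override def run(): Unit = {
        val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)
        while (!fs.exists(marker)) Thread.sleep(10000)
        // Finish in-flight batches, then stop the streaming and Spark contexts
        ssc.stop(stopSparkContext = true, stopGracefully = true)
      }
    }.start()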

Re: Streaming of WordCount example

2015-08-10 Thread Tathagata Das
Is it receiving any data? If so, then it must be listening. Alternatively, to test these theories, you can locally run a Spark standalone cluster (a one-node standalone cluster on your local machine) and submit your app in client mode to that, to see whether you are seeing the process listening on 999

Re: Spark inserting into parquet files with different schema

2015-08-10 Thread Simeon Simeonov
Michael, please, see http://apache-spark-user-list.1001560.n3.nabble.com/Schema-evolution-in-tables-tt23999.html The exception is java.lang.RuntimeException: Relation[ ... ] org.apache.spark.sql.parquet.ParquetRelation2@83a73a05 requires that the query in the SELECT clause of the INSERT INTO/O

Re: Spark inserting into parquet files with different schema

2015-08-10 Thread Michael Armbrust
What is the error you are getting? It would also be awesome if you could try with Spark 1.5 when the first preview comes out (hopefully early next week). On Mon, Aug 10, 2015 at 11:41 AM, Simeon Simeonov wrote: > Michael, is there an example anywhere that demonstrates how this works > with the

Re: Streaming of WordCount example

2015-08-10 Thread Mohit Anchlia
I've verified all the executors and I don't see a process listening on the port. However, the application seems to show as running in the YARN UI. On Mon, Aug 10, 2015 at 11:56 AM, Tathagata Das wrote: > In yarn-client mode, the driver is on the machine where you ran the > spark-submit. The execut

Java Streaming Context - File Stream use

2015-08-10 Thread Ashish Soni
Please help, as I am not sure what is incorrect with the below code; it gives me a compilation error in Eclipse. SparkConf sparkConf = new SparkConf().setMaster("local[4]").setAppName("JavaDirectKafkaWordCount"); JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Duratio

Re: stopping spark stream app

2015-08-10 Thread Shushant Arora
By RPC do you mean a web service exposed on the driver which listens and sets some flag, and the driver checks that flag at the end of each batch and, if set, gracefully stops the application? On Tue, Aug 11, 2015 at 12:43 AM, Tathagata Das wrote: > In general, it is a little risky to put long running stuff in

Fw: Your Application has been Received

2015-08-10 Thread Shing Hing Man
Bar123 On Monday, 10 August 2015, 20:20, Resourcing Team wrote: Dear Shing Hing, Thank you for applying to Barclays. We have received your application and are currently reviewing your details. Updates on your progress will be emailed to you and can be accessed through your profile

Re: Graceful shutdown for Spark Streaming

2015-08-10 Thread Tathagata Das
Note that this is true only from Spark 1.4 where the shutdown hooks were added. On Mon, Aug 10, 2015 at 12:12 PM, Michal Čizmazia wrote: > From logs, it seems that Spark Streaming does handle *kill -SIGINT* with > graceful shutdown. > > Please could you confirm? > > Thanks! > > On 30 July 2015 a

Re: stopping spark stream app

2015-08-10 Thread Tathagata Das
In general, it is a little risky to put long running stuff in a shutdown hook as it may delay shutdown of the process which may delay other things. That said, you could try it out. A better way to explicitly shutdown gracefully is to use an RPC to signal the driver process to start shutting down,

Re: Graceful shutdown for Spark Streaming

2015-08-10 Thread Michal Čizmazia
From logs, it seems that Spark Streaming does handle *kill -SIGINT* with graceful shutdown. Please could you confirm? Thanks! On 30 July 2015 at 08:19, anshu shukla wrote: > Yes I was doing same , if You mean that this is the correct way to do > Then I will verify it once more in my cas

Re: ERROR ReceiverTracker: Deregistered receiver for stream 0: Stopped by driver

2015-08-10 Thread Tathagata Das
I think this may be expected. When the streaming context is stopped without the SparkContext, then the receivers are stopped by the driver. The receiver sends back the message that it has been stopped. This is being (probably incorrectly) logged with ERROR level. On Sun, Aug 9, 2015 at 12:52 AM, S

Re: Streaming of WordCount example

2015-08-10 Thread Tathagata Das
In yarn-client mode, the driver is on the machine where you ran the spark-submit. The executors are running in the YARN cluster nodes, and the socket receiver listening on port is running in one of the executors. On Mon, Aug 10, 2015 at 11:43 AM, Mohit Anchlia wrote: > I am running as a yar

Re: stopping spark stream app

2015-08-10 Thread Shushant Arora
Any help on the best recommendation for gracefully shutting down a Spark Streaming application? I am running it on YARN and need a way to tell it externally (either the yarn application -kill command or some other way), but the current batch needs to be processed completely and the checkpoint saved before shutting do

Re: Streaming of WordCount example

2015-08-10 Thread Mohit Anchlia
I am running as a yarn-client which probably means that the program that submitted the job is where the listening is also occurring? I thought that the yarn is only used to negotiate resources in yarn-client master mode. On Mon, Aug 10, 2015 at 11:34 AM, Tathagata Das wrote: > If you are running

Re: Spark inserting into parquet files with different schema

2015-08-10 Thread Simeon Simeonov
Michael, is there an example anywhere that demonstrates how this works with the schema changing over time? Must the Hive tables be set up as external tables outside of saveAsTable? In my experience, in 1.4.1, writing to a table with SaveMode.Append fails if the schemas don't match. Thanks, Sim

Re: How to use custom Hadoop InputFormat in DataFrame?

2015-08-10 Thread Umesh Kacha
Hi Michael, thanks for the reply. I know that I can create a DataFrame using a JavaBean or StructType; I want to know how I can create a DataFrame from the above code, which uses a custom Hadoop format. On Tue, Aug 11, 2015 at 12:04 AM, Michael Armbrust wrote: > You can't create a DataFrame from an arbitrary ob

Re: Spark inserting into parquet files with different schema

2015-08-10 Thread Michael Armbrust
Older versions of Spark (i.e. when it was still called SchemaRDD instead of DataFrame) did not support merging different parquet schema. However, Spark 1.4+ should. On Sat, Aug 8, 2015 at 8:58 PM, sim wrote: > Adam, did you find a solution for this? > > > > -- > View this message in context: >
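
A hedged sketch of reading parquet files whose schemas evolved over time; the explicit mergeSchema read option is an assumption about the 1.5+ spelling, while 1.4 merges compatible schemas by default:

    val merged = sqlContext.read
      .option("mergeSchema", "true")   // explicit flag (assumed 1.5+ naming)
      .parquet("hdfs:///test_table")   // directory whose parts have differing schemas
    merged.printSchema()               // union of all columns seen across the files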

Re: Streaming of WordCount example

2015-08-10 Thread Tathagata Das
If you are running on a cluster, the listening is occurring on one of the executors, not in the driver. On Mon, Aug 10, 2015 at 10:29 AM, Mohit Anchlia wrote: > I am trying to run this program as a yarn-client. The job seems to be > submitting successfully however I don't see any process listeni

Re: How to use custom Hadoop InputFormat in DataFrame?

2015-08-10 Thread Michael Armbrust
You can't create a DataFrame from an arbitrary object since we don't know how to figure out the schema. You can either create a JavaBean or manually create a row + specify the schema
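
A hedged Scala sketch of the "Row plus explicit schema" route for a custom InputFormat; myFormatRdd and the record accessors getId/getName are hypothetical:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    // myFormatRdd comes from sc.hadoopFile(path, classOf[MyInputFormat], classOf[Void], classOf[MyRecordWritable]);
    // assume both accessors return strings for this sketch
    val rows = myFormatRdd.map { case (_, rec) => Row(rec.getId, rec.getName) }
    val schema = StructType(Seq(StructField("id", StringType), StructField("name", StringType)))
    val df = sqlContext.createDataFrame(rows, schema)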

Re: Pagination on big table, splitting joins

2015-08-10 Thread Michael Armbrust
> > I think to use *toLocalIterator* method and something like that, but I > have doubts about memory and parallelism and sure there is a better way to > do it. > It will still run all earlier parts of the job in parallel. Only the actual retrieving of the final partitions will be serial. This i
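
A hedged sketch of paging through a large result with toLocalIterator (bigResultDF is hypothetical):

    // All upstream stages still run in parallel; only the fetch of the final
    // partitions back to the driver happens one partition at a time.
    val pages = bigResultDF.rdd.toLocalIterator.grouped(10000)
    pages.foreach { page =>
      // handle one batch of Rows on the driver, e.g. write it out
    }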

How to use custom Hadoop InputFormat in DataFrame?

2015-08-10 Thread unk1102
Hi, I have my own custom Hadoop InputFormat which I want to use in a DataFrame. How do we do that? I know I can use sc.hadoopFile(..), but then how do I convert it into a DataFrame? JavaPairRDD myFormatAsPairRdd = jsc.hadoopFile("hdfs://tmp/data/myformat.xyz",MyInputFormat.class,Void.class,MyRecordWritab

Re: subscribe

2015-08-10 Thread Brandon White
https://www.youtube.com/watch?v=H07zYvkNYL8 On Mon, Aug 10, 2015 at 10:55 AM, Ted Yu wrote: > Please take a look at the first section of > https://spark.apache.org/community > > Cheers > > On Mon, Aug 10, 2015 at 10:54 AM, Phil Kallos > wrote: > >> please >> > >

Re: Kafka direct approach: blockInterval and topic partitions

2015-08-10 Thread Luca
Thank you! :) 2015-08-10 19:58 GMT+02:00 Cody Koeninger : > There's no long-running receiver pushing blocks of messages, so > blockInterval isn't relevant. > > Batch interval is what matters. > > On Mon, Aug 10, 2015 at 12:52 PM, allonsy wrote: > >> Hi everyone, >> >> I recently started using th

Re: Problems getting expected results from hbase_inputformat.py

2015-08-10 Thread Ted Yu
Eric: Other than HBaseConverters.scala, examples/src/main/python/hbase_inputformat.py was also updated. FYI On Mon, Aug 10, 2015 at 11:08 AM, Eric Bless wrote: > Thank you Gen, the changes to HBaseConverters.scala look to now be > returning all column qualifiers, as follows - > > (u'row1', {u'qu

Re: Problems getting expected results from hbase_inputformat.py

2015-08-10 Thread Eric Bless
Thank you Gen, the changes to HBaseConverters.scala look to now be returning all column qualifiers, as follows -  (u'row1', {u'qualifier': u'a', u'timestamp': u'1438716994027', u'value': u'value1', u'columnFamily': u'f1', u'type': u'Put', u'row': u'row1'}) (u'row1', {u'qualifier': u'b', u'timest

Re: Problem with take vs. takeSample in PySpark

2015-08-10 Thread Davies Liu
I tested this in master (1.5 release), it worked as expected (changed spark.driver.maxResultSize to 10m), >>> len(sc.range(10).map(lambda i: '*' * (1<<23) ).take(1)) 1 >>> len(sc.range(10).map(lambda i: '*' * (1<<24) ).take(1)) 15/08/10 10:45:55 ERROR TaskSetManager: Total size of serialized resul

Re: Kafka direct approach: blockInterval and topic partitions

2015-08-10 Thread Cody Koeninger
There's no long-running receiver pushing blocks of messages, so blockInterval isn't relevant. Batch interval is what matters. On Mon, Aug 10, 2015 at 12:52 PM, allonsy wrote: > Hi everyone, > > I recently started using the new Kafka direct approach. > > Now, as far as I understood, each Kafka p

Re: subscribe

2015-08-10 Thread Ted Yu
Please take a look at the first section of https://spark.apache.org/community Cheers On Mon, Aug 10, 2015 at 10:54 AM, Phil Kallos wrote: > please >

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Ted Yu
Looks like workaround is to reduce *window length.* *Cheers* On Mon, Aug 10, 2015 at 10:07 AM, Cody Koeninger wrote: > You need to keep a certain number of rdds around for checkpointing, based > on e.g. the window size. Those would all need to be loaded at once. > > On Mon, Aug 10, 2015 at 11:

subscribe

2015-08-10 Thread Phil Kallos
please

Kafka direct approach: blockInterval and topic partitions

2015-08-10 Thread allonsy
Hi everyone, I recently started using the new Kafka direct approach. Now, as far as I understood, each Kafka partition /is/ an RDD partition that will be processed by a single core. What I don't understand is the relation between those partitions and the blocks generated every blockInterval. For

Re: How to create DataFrame from a binary file?

2015-08-10 Thread Ted Yu
Umesh: Please take a look at the classes under: sql/core/src/main/scala/org/apache/spark/sql/parquet FYI On Mon, Aug 10, 2015 at 10:35 AM, Umesh Kacha wrote: > Hi Bo thanks much let me explain please see the following code > > JavaPairRDD pairRdd = > javaSparkContext.binaryFiles("/hdfs/path/to/

Re: How to create DataFrame from a binary file?

2015-08-10 Thread Umesh Kacha
Hi Bo thanks much let me explain please see the following code JavaPairRDD pairRdd = javaSparkContext.binaryFiles("/hdfs/path/to/binfile"); JavaRDD javardd = pairRdd.values(); DataFrame binDataFrame = sqlContext.createDataFrame(javaBinRdd, PortableDataStream.class); binDataFrame.show(); //shows

Streaming of WordCount example

2015-08-10 Thread Mohit Anchlia
I am trying to run this program as a yarn-client. The job seems to be submitting successfully however I don't see any process listening on this host on port https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaRecoverableNetworkWordCount.j

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Cody Koeninger
You need to keep a certain number of rdds around for checkpointing, based on e.g. the window size. Those would all need to be loaded at once. On Mon, Aug 10, 2015 at 11:49 AM, Dmitry Goldenberg < dgoldenberg...@gmail.com> wrote: > Would there be a way to chunk up/batch up the contents of the > c

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
Would there be a way to chunk up/batch up the contents of the checkpointing directories as they're being processed by Spark Streaming? Is it mandatory to load the whole thing in one go? On Mon, Aug 10, 2015 at 12:42 PM, Ted Yu wrote: > I wonder during recovery from a checkpoint whether we can e

Problem with take vs. takeSample in PySpark

2015-08-10 Thread David Montague
Hi all, I am getting some strange behavior with the RDD take function in PySpark while doing some interactive coding in an IPython notebook. I am running PySpark on Spark 1.2.0 in yarn-client mode if that is relevant. I am using sc.wholeTextFiles and pandas to load a collection of .csv files int

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Ted Yu
I wonder during recovery from a checkpoint whether we can estimate the size of the checkpoint and compare with Runtime.getRuntime().freeMemory(). If the size of checkpoint is much bigger than free memory, log warning, etc Cheers On Mon, Aug 10, 2015 at 9:34 AM, Dmitry Goldenberg wrote: > Thank

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
Thanks, Cody, will try that. Unfortunately due to a reinstall I don't have the original checkpointing directory :( Thanks for the clarification on spark.driver.memory, I'll keep testing (at 2g things seem OK for now). On Mon, Aug 10, 2015 at 12:10 PM, Cody Koeninger wrote: > That looks like it'

Re: ClosureCleaner does not work for java code

2015-08-10 Thread Sean Owen
The difference is really that Java and Scala work differently. In Java, your anonymous subclass of Ops defined in (a method of) AbstractTest captures a reference to it. That much is 'correct' in that it's how Java is supposed to work, and AbstractTest is indeed not serializable since you didn't dec

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Cody Koeninger
That looks like it's during recovery from a checkpoint, so it'd be driver memory not executor memory. How big is the checkpoint directory that you're trying to restore from? On Mon, Aug 10, 2015 at 10:57 AM, Dmitry Goldenberg < dgoldenberg...@gmail.com> wrote: > We're getting the below error. T

How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
We're getting the below error. Tried increasing spark.executor.memory e.g. from 1g to 2g but the below error still happens. Any recommendations? Something to do with specifying -Xmx in the submit job scripts? Thanks. Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit excee

Re: multiple dependency jars using pyspark

2015-08-10 Thread Jonathan Haddad
I figured out the issue - it had to do with the Cassandra jar I had compiled. I had tested a previous version. Using --jars (comma separated) and --driver-class-path (colon separated) is working. On Mon, Aug 10, 2015 at 1:08 AM ayan guha wrote: > Easiest way should be to add both jars in SPARK

Re: How to programmatically create, submit and report on Spark jobs?

2015-08-10 Thread Ted Yu
I found SPARK-3733 which was marked dup of SPARK-4924 which went to 1.4.0 FYI On Mon, Aug 10, 2015 at 5:12 AM, mark wrote: > Hi All > > I need to be able to create, submit and report on Spark jobs > programmatically in response to events arriving on a Kafka bus. I also need > end-users to be ab

ClosureCleaner does not work for java code

2015-08-10 Thread Hao Ren
Consider two code snippets as the following: // Java code: abstract class Ops implements Serializable{ public abstract Integer apply(Integer x); public void doSomething(JavaRDD rdd) { rdd.map(x -> x + apply(x)) .collect() .forEach(System.out::println); } } public class

Re: Questions about SparkSQL join on not equality conditions

2015-08-10 Thread gen tang
Hi, I am sorry to bother you again. When I do a join as follows: df = sqlContext.sql("select a.someItem, b.someItem from a full outer join b on condition1 *or* condition2") df.first() the program failed because the result size is bigger than spark.driver.maxResultSize. It is really strange, as one record is n

Re: Estimate size of Dataframe programatically

2015-08-10 Thread Ted Yu
From a quick glance at SparkStrategies.scala, when statistics.sizeInBytes of the LogicalPlan is <= autoBroadcastJoinThreshold, the plan's output would be used in a broadcast join as the 'build' relation. FYI
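
A hedged way to peek at that estimate from user code; queryExecution and plan statistics are developer/internal APIs that may differ between versions, and df/sqlContext are assumed:

    val estimatedBytes = df.queryExecution.analyzed.statistics.sizeInBytes  // BigInt
    // 10 MB is the documented default threshold
    val threshold = sqlContext.getConf("spark.sql.autoBroadcastJoinThreshold", (10 * 1024 * 1024).toString).toLong
    val wouldBroadcast = estimatedBytes <= threshold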

Re: Estimate size of Dataframe programatically

2015-08-10 Thread Srikanth
SizeEstimator.estimate(df) will not give the size of the dataframe, right? I think it will give the size of the df object. With an RDD, I sample() and collect() and sum the size of each row. If I do the same with a dataframe, it will no longer be the size when represented in columnar format. I'd also like to know how spark.sq

Spark Streaming dealing with broken files without dying

2015-08-10 Thread Mario Pastorelli
Hey Sparkers, I would like to use Spark Streaming in production to observe a directory and process files that are put inside it. The problem is that some of those files can be broken leading to a IOException from the input reader. This should be fine for the framework I think: the exception should

Re: Spark Maven Build

2015-08-10 Thread Benyi Wang
Never mind. Instead of setting the property in the profile cdh5.3.2 2.5.0-cdh5.3.2 ... I have to change the property hadoop.version from 2.2.0 to 2.5.0-cdh5.3.2 in spark-parent's pom.xml. Otherwise, Maven will resolve transitive dependencies using the default version 2.2.0.

spark vs flink low memory available

2015-08-10 Thread Pa Rö
Hi community, I have built a Spark and a Flink k-means application. My test case is a clustering of 1 million points on a 3-node cluster. When memory bottlenecks begin, Flink starts to spill to disk and works slowly, but it works. However, Spark loses executors if the memory is full and starts again (infinite loo

Re: Spark Cassandra Connector issue

2015-08-10 Thread Dean Wampler
I don't know if DSE changed spark-submit, but you have to use a comma-separated list of jars to --jars. It probably looked for HelloWorld in the second one, the dse.jar file. Do this: dse spark-submit --master spark://10.246.43.15:7077 --class HelloWorld --jars /home/missingmerch/postgresql-9.4-1

Re: Spark Cassandra Connector issue

2015-08-10 Thread satish chandra j
Hi, Thanks for the quick input. Now I am getting a class not found error. *Command:* dse spark-submit --master spark://10.246.43.15:7077 --class HelloWorld --jars ///home/missingmerch/postgresql-9.4-1201.jdbc41.jar ///home/missingmerch/dse.jar ///home/missingmerch/spark-cassandra-connector-java_2.10-1.1

Re: spark-kafka directAPI vs receivers based API

2015-08-10 Thread Cody Koeninger
For direct stream questions: https://github.com/koeninger/kafka-exactly-once Yes, it is used in production. For general spark streaming question: http://spark.apache.org/docs/latest/streaming-programming-guide.html On Mon, Aug 10, 2015 at 7:51 AM, Mohit Durgapal wrote: > Hi All, > > I just

Re: EC2 cluster doesn't work saveAsTextFile

2015-08-10 Thread Dean Wampler
So, just before running the job, if you run the HDFS command at a shell prompt: "hdfs dfs -ls hdfs://172.31.42.10:54310/./weblogReadResult", does it say the path doesn't exist? Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reil
