HiveContext throws org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

2015-07-07 Thread bdev
Just trying to get started with Spark and attempting to use HiveContext from spark-shell to interact with existing Hive tables on my CDH cluster, but I keep running into errors (please see below) when I do 'hiveContext.sql("show tables")'. Wanted to know which JARs need to be included to have this …
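
For reference, a minimal sketch of the failing call, assuming a Spark 1.x build with Hive support and a hive-site.xml pointing at the CDH metastore on the shell's classpath:

    // minimal sketch, assuming spark-shell was started with Hive support and hive-site.xml in conf/
    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)        // sc is provided by spark-shell
    hiveContext.sql("show tables").collect().foreach(println)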

Dataframe to parquet using hdfs or parquet block size

2016-04-06 Thread bdev
I need to save the DataFrame in Parquet format and need some input on choosing the appropriate block size to help efficiently parallelize/localize the data to the executors. Should I be using the Parquet block size or the HDFS block size, and what is the optimal block size to use on a 100-node cluster? …
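
Not from the thread, but as a hedged illustration of the two knobs in question: the Parquet row-group size ("parquet.block.size") and the HDFS block size ("dfs.blocksize") are set separately, and a common guideline is to keep one row group per HDFS block. The 128 MB figure, the DataFrame 'df', and the output path below are placeholders:

    // illustrative sketch (Spark 1.6-era API)
    val blockSize = 128 * 1024 * 1024                                // 128 MB, example value only
    sc.hadoopConfiguration.setInt("parquet.block.size", blockSize)   // Parquet row-group size
    sc.hadoopConfiguration.setLong("dfs.blocksize", blockSize)       // HDFS block size for new files
    df.write.parquet("hdfs:///data/output")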

Need clarification regd deploy-mode client

2016-04-08 Thread bdev
I'm running pyspark in client deploy mode on YARN with dynamic allocation: 'pyspark --master yarn --deploy-mode client --executor-memory 6g --executor-cores 4 --driver-memory 4g'. The node where I'm running pyspark has 4 GB of memory, but I keep running out of memory on this node. If using YARN, …

Re: Need clarification regd deploy-mode client

2016-04-08 Thread bdev
Thanks Mandar for the clarification.

How to estimate the size of dataframe using pyspark?

2016-04-09 Thread bdev
I keep running out of memory on the driver when I attempt to do df.show(). Can anyone let me know how to estimate the size of the dataframe? Thanks!

Re: How to estimate the size of dataframe using pyspark?

2016-04-09 Thread bdev
Thanks Mandar, I couldn't see anything under the 'Storage' section, but under Executors I noticed 3.1 GB: Executors (1) Memory: 0.0 B Used (3.1 GB Total)
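
One caveat worth noting: the Storage tab only lists data that has actually been cached, and the "3.1 GB Total" figure on the Executors page is the executor's available storage memory rather than the DataFrame's size. A hedged sketch of one way to get a concrete number (shown with the Scala API; the pyspark calls are analogous):

    // rough sketch: cache the DataFrame, materialize it, then read its size off the Storage tab
    import org.apache.spark.storage.StorageLevel

    df.persist(StorageLevel.MEMORY_AND_DISK)
    df.count()      // forces the whole DataFrame to be computed and cached
    // the "Size in Memory" / "Size on Disk" columns on the Storage tab now reflect the DataFrame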

Ways to partition the RDD

2014-08-13 Thread bdev
I've got ~500 tab-delimited log files, ~25 GB each, containing the page name, the userId who viewed the page, and a timestamp. I'm trying to build a basic Spark app to get unique visitors per page. I was able to achieve this using SparkSQL by registering the RDD of a case class and running a select with …
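
A hedged reconstruction of the approach described, using Spark 1.0-era SparkSQL (the column layout and path are assumptions):

    // sketch only: assumes tab-delimited lines of (page, userId, timestamp)
    case class PageView(page: String, userId: String, ts: String)

    val views = sc.textFile("hdfs:///logs/pageviews/*.tsv")
      .map(_.split("\t"))
      .map(f => PageView(f(0), f(1), f(2)))

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.createSchemaRDD          // Spark 1.0 implicit: RDD[PageView] -> SchemaRDD
    views.registerAsTable("views")             // renamed registerTempTable in later releases
    val uniques = sqlContext.sql("SELECT page, COUNT(DISTINCT userId) AS uniques FROM views GROUP BY page")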

Re: Ways to partition the RDD

2014-08-13 Thread bdev
Forgot to mention, I'm using Spark 1.0.0 and running against a 40-node yarn-cluster.

Re: Ways to partition the RDD

2014-08-14 Thread bdev
Thanks, I'll give that a try. I see the number of partitions requested is 8 (through HashPartitioner(8)). If I have a 40-node cluster, what's the recommended number of partitions?
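
As a hedged rule of thumb (not stated in the thread): aim for roughly 2-4 partitions per core available to the job rather than one per node. The cluster shape below is assumed purely for illustration:

    // illustrative only: 40 nodes x 8 cores each is an assumed cluster shape
    import org.apache.spark.HashPartitioner

    val totalCores = 40 * 8
    val pairs = sc.textFile("hdfs:///logs/pageviews/*.tsv")
      .map(_.split("\t"))
      .map(f => (f(0), f(1)))                                 // (page, userId)
    val partitioned = pairs.partitionBy(new HashPartitioner(totalCores * 2))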

Re: Ways to partition the RDD

2014-08-14 Thread bdev
Thanks Daniel for the detailed information. Since the RDD is already partitioned, there is no need to worry about repartitioning.

RE: Save an RDD to a SQL Database

2014-08-27 Thread bdev
I have a similar requirement to export data to MySQL. Just wanted to know what the best approach is, given the research you have done so far. I'm currently thinking of saving to HDFS and using Sqoop to handle the export. Is that the best approach, or is there another way to write to MySQL? Thanks!
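
Saving to HDFS and exporting with Sqoop works; a hedged sketch of the other common route, writing straight from the executors over JDBC (the connection details, table, and the RDD's shape below are placeholders):

    // sketch only: assumes an RDD[(String, String)] of (page, userId) and the MySQL JDBC driver on the classpath
    import java.sql.DriverManager

    rdd.foreachPartition { rows =>
      val conn = DriverManager.getConnection("jdbc:mysql://dbhost:3306/analytics", "user", "secret")
      val stmt = conn.prepareStatement("INSERT INTO pageviews (page, user_id) VALUES (?, ?)")
      rows.foreach { case (page, userId) =>
        stmt.setString(1, page)
        stmt.setString(2, userId)
        stmt.addBatch()
      }
      stmt.executeBatch()
      stmt.close()
      conn.close()
    }

Opening one connection per partition keeps the driver out of the data path, and batching the inserts keeps round trips to MySQL down.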

Kafka->HDFS to store as Parquet format

2014-10-05 Thread bdev
We are currently using Camus for our Kafka-to-HDFS pipeline, storing the data as SequenceFiles, but I understand Spark Streaming can be used to save as Parquet. From what I read about Parquet, the layout is optimized for queries against large files. Are there any options in Spark to specify the block size to help …

Re: Kafka->HDFS to store as Parquet format

2014-10-07 Thread bdev
After a bit of looking around, I found that saveAsNewAPIHadoopFile can be used to specify the ParquetOutputFormat. Has anyone used it to convert JSON to Parquet format? Any pointers are welcome, thanks!
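
A hedged alternative sketch from the Spark 1.1-era API: let the SQL layer infer a schema from the JSON records and write Parquet, instead of wiring ParquetOutputFormat by hand. Here 'messages' (a DStream[String] of JSON records) and the output path are placeholders:

    // sketch only: one Parquet directory per micro-batch
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    messages.foreachRDD { (rdd, time) =>
      if (rdd.take(1).nonEmpty) {
        val table = sqlContext.jsonRDD(rdd)          // infers a schema from the JSON records
        table.saveAsParquetFile(s"hdfs:///data/parquet/batch-${time.milliseconds}")
      }
    }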

How to save ReceiverInputDStream to Hadoop using saveAsNewAPIHadoopFile

2014-10-09 Thread bdev
I'm using KafkaUtils.createStream for the input stream to pull messages from Kafka, which returns a ReceiverInputDStream. I do not see saveAsNewAPIHadoopFile available on ReceiverInputDStream and obviously run into this error: saveAsNewAPIHadoopFile is not a member of org.apache.spark.stre…
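
The save methods live on the RDDs inside the stream, not on ReceiverInputDStream itself, so one hedged way around the error is to drop down via foreachRDD. The ZooKeeper quorum, group, topic, and output path below are placeholders; ssc is an existing StreamingContext:

    // sketch only: write each micro-batch with the new Hadoop API
    import org.apache.hadoop.io.{NullWritable, Text}
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
    import org.apache.spark.SparkContext._          // pair-RDD functions on pre-1.3 Spark
    import org.apache.spark.streaming.kafka.KafkaUtils

    val stream = KafkaUtils.createStream(ssc, "zkhost:2181", "hdfs-writer", Map("mytopic" -> 2))

    stream.map { case (_, value) => (NullWritable.get(), new Text(value)) }
      .foreachRDD { (rdd, time) =>
        rdd.saveAsNewAPIHadoopFile(
          s"hdfs:///data/raw/batch-${time.milliseconds}",
          classOf[NullWritable], classOf[Text],
          classOf[TextOutputFormat[NullWritable, Text]])
      }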

Error while running Streaming examples - no snappyjava in java.library.path

2014-10-19 Thread bdev
I built the latest Spark project and I'm running into these errors when attempting to run the streaming examples locally on a Mac. How do I fix them? java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1886) …
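
Not from the thread, but one commonly cited local workaround of that era was to side-step the Snappy native library entirely by switching Spark's block codec (the app name and batch interval below are illustrative):

    // hedged workaround sketch: use the pure-Java LZF codec instead of native Snappy for local runs
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("StreamingExample")
      .set("spark.io.compression.codec", "lzf")
    val ssc = new StreamingContext(conf, Seconds(5))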

Algebird using spark-shell

2014-10-29 Thread bdev
I'm running into this error when I attempt to launch spark-shell passing in the algebird-core jar: $ ./bin/spark-shell --jars algebird-core_2.9.2-0.1.11.jar scala> import com.twitter.algebird._ import com.twitter.algebird._ scala> import HyperLogLog._ import HyperLogLog._ scala> …
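
For what it's worth, the jar named above is built for Scala 2.9.2 while the Spark 1.x shell runs on Scala 2.10, a mismatch that commonly produces errors of this kind; a hedged HyperLogLog sketch against a matching build (the artifact version and sample data are illustrative):

    // launched with: ./bin/spark-shell --jars algebird-core_2.10-0.8.1.jar   (version is an example)
    import com.twitter.algebird._

    val hll = new HyperLogLogMonoid(12)                     // 12 bits of precision
    val userIds = sc.parallelize(Seq("u1", "u2", "u1", "u3"))
    val combined = userIds
      .map(id => hll.create(id.getBytes("UTF-8")))
      .reduce(hll.plus(_, _))
    println(combined.estimatedSize)                         // approximate number of distinct users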

Any patterns for multiplexing the streaming data

2014-11-06 Thread bdev
We are looking at consuming the Kafka stream using Spark Streaming, transforming it into various subsets (applying some transformation, de-normalizing some fields, etc.), and feeding it back into Kafka as different topics for downstream consumers. Wanted to know if there are any existing patterns for …
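
A hedged sketch of one common pattern, not a reference implementation: fan the input stream out into derived DStreams and publish each back to its own topic from foreachPartition. The broker list, topics, and transforms are placeholders; ssc is an existing StreamingContext:

    // sketch only: one producer per partition, one output topic per derived stream
    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import org.apache.spark.streaming.dstream.DStream
    import org.apache.spark.streaming.kafka.KafkaUtils

    def denormalize(record: String): String = record        // placeholder transform

    val raw = KafkaUtils.createStream(ssc, "zkhost:2181", "mux-group", Map("events" -> 4)).map(_._2)
    val enriched = raw.map(denormalize)
    val clicks   = raw.filter(_.contains("\"type\":\"click\""))

    def publish(stream: DStream[String], topic: String): Unit =
      stream.foreachRDD { rdd =>
        rdd.foreachPartition { records =>
          val props = new Properties()
          props.put("bootstrap.servers", "broker1:9092")
          props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
          props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
          val producer = new KafkaProducer[String, String](props)
          records.foreach(r => producer.send(new ProducerRecord[String, String](topic, r)))
          producer.close()
        }
      }

    publish(enriched, "events-enriched")
    publish(clicks, "events-clicks")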