Just trying to get started with Spark and attempting to use HiveContext from
spark-shell to interact with existing Hive tables on my CDH cluster, but I keep
running into errors (please see below) when I do 'hiveContext.sql("show
tables")'. Wanted to know which JARs need to be included to get this working.
I need to save the DataFrame in Parquet format and need some input on
choosing an appropriate block size to efficiently parallelize/localize
the data to the executors. Should I be tuning the Parquet block size or the HDFS
block size, and what is the optimal block size to use on a 100-node cluster?
Thanks!
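A sketch of the kind of tuning being asked about, assuming Spark 1.x; the sizes
and output path are illustrative placeholders, not recommendations. The usual
guidance is to keep the Parquet row group (parquet.block.size) no larger than the
HDFS block (dfs.blocksize) so a whole row group can be read from a single block:

// Illustrative only: align the Parquet row-group size with the HDFS block size.
sc.hadoopConfiguration.setInt("parquet.block.size", 256 * 1024 * 1024)
sc.hadoopConfiguration.setLong("dfs.blocksize", 256L * 1024 * 1024)

// Spark 1.4+ API; older releases spell this df.saveAsParquetFile(path)
df.write.parquet("hdfs:///tmp/example-output")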
I'm running pyspark in client deploy mode on YARN with dynamic allocation:
pyspark --master yarn --deploy-mode client --executor-memory 6g
--executor-cores 4 --driver-memory 4g
The node where I'm running pyspark has 4 GB of memory, but I keep running out of
memory on this node. If using yarn, does the driver still run on this node in
client deploy mode?
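Worth noting for the exchange below: in client deploy mode the driver runs on the
node where pyspark is launched, so --driver-memory 4g plus Python overhead will
not fit on a 4 GB node. For a non-interactive job, a cluster-mode submission keeps
the driver on the cluster instead (my_job.py is a placeholder, and this assumes a
Spark/YARN combination that supports Python apps in cluster mode):

spark-submit --master yarn --deploy-mode cluster --driver-memory 4g
--executor-memory 6g --executor-cores 4 my_job.py

An interactive pyspark shell has to stay in client mode, so there the options are
a smaller --driver-memory or a larger edge node.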
Thanks Mandar for the clarification.
I keep running out of memory on the driver when I attempt to do df.show().
Can anyone let me know how to estimate the size of the dataframe?
Thanks!
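One rough way to measure this, in line with the suggestion referenced below: cache
the DataFrame, force it to materialize, and read the cached size off the Spark web
UI. Sketch in Scala (the PySpark calls are analogous); df stands for the DataFrame
in question:

import org.apache.spark.storage.StorageLevel

// Persist, then force materialization with an action.
df.persist(StorageLevel.MEMORY_ONLY)
df.count()

// The Spark UI (http://<driver-host>:4040, Storage tab) now shows "Size in Memory" for df.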
Thanks Mandar, I couldn't see anything under the 'Storage' section, but under
'Executors' I see 3.1 GB:
Executors (1)
Memory: 0.0 B Used (3.1 GB Total)
I've got ~500 tab-delimited log files, 25 GB each, with the page name and the
userId of who viewed the page, along with a timestamp.
I'm trying to build a basic Spark app to get the unique visitors per page. I
was able to achieve this using Spark SQL by registering the RDD of a case
class and running a select with a count of distinct userIds grouped by page.
Forgot to mention, I'm using Spark 1.0.0 and running against a 40-node YARN
cluster.
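A sketch of the approach described above, using Spark 1.0-era Spark SQL
(registerAsTable; later releases rename it registerTempTable). The input path, the
tab layout of the columns, and the exact query are assumptions:

import org.apache.spark.sql.SQLContext

case class PageView(page: String, userId: String, ts: String)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit conversion RDD[PageView] -> SchemaRDD

val views = sc.textFile("hdfs:///logs/pageviews/*")
  .map(_.split("\t"))
  .map(f => PageView(f(0), f(1), f(2)))

views.registerAsTable("pageviews")

val uniques = sqlContext.sql(
  "SELECT page, COUNT(DISTINCT userId) AS uniqueVisitors FROM pageviews GROUP BY page")
uniques.collect().foreach(println)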
Thanks, will give that a try.
I see the number of partitions requested is 8 (through HashPartitioner(8)).
If I have a 40-node cluster, what's the recommended number of partitions?
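The usual rule of thumb from the Spark tuning guide is 2-3 tasks per CPU core in
the cluster, so the partition count follows from the total core count rather than
the node count alone. A sketch where the executor and core numbers are
placeholders for the actual cluster settings, and pairRdd stands for the key/value
RDD being partitioned:

import org.apache.spark.HashPartitioner

val executors = 40         // e.g. one executor per node
val coresPerExecutor = 4   // whatever --executor-cores is set to
val numPartitions = executors * coresPerExecutor * 2  // roughly 2-3 tasks per core

val partitioned = pairRdd.partitionBy(new HashPartitioner(numPartitions))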
Thanks Daniel for the detailed information. Since the RDD is already
partitioned, there is no need to worry about repartitioning.
I have a similar requirement to export data to MySQL. Just wanted to know
what the best approach is so far after the research you guys have done.
Currently thinking of saving to HDFS and using Sqoop to handle the export. Is that
the best approach, or is there another way to write to MySQL? Thanks!
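If you want to skip the HDFS + Sqoop hop, one common pattern is writing straight
from the executors over JDBC, one connection per partition (newer Spark releases
also offer a DataFrame JDBC writer). A sketch where the connection details, table,
and RDD shape are all placeholders, and the MySQL driver JAR is assumed to be on
the executor classpath:

import java.sql.DriverManager

// rdd is assumed to be an RDD[(String, Long)] of (page, uniqueVisitors) pairs.
rdd.foreachPartition { rows =>
  val conn = DriverManager.getConnection(
    "jdbc:mysql://db-host:3306/analytics", "user", "password")
  conn.setAutoCommit(false)
  val stmt = conn.prepareStatement("INSERT INTO page_stats (page, uniques) VALUES (?, ?)")
  rows.foreach { case (page, uniques) =>
    stmt.setString(1, page)
    stmt.setLong(2, uniques)
    stmt.addBatch()
  }
  stmt.executeBatch()
  conn.commit()
  stmt.close()
  conn.close()
}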
We are currently using Camus for our Kafka-to-HDFS pipeline, storing the data as
SequenceFiles, but I understand Spark Streaming can be used to save it as
Parquet. From what I read about Parquet, the layout is optimized for queries
against large files. Are there any options in Spark to specify the
block size to help with this?
After a bit of looking around, I found that saveAsNewAPIHadoopFile could be used
to specify ParquetOutputFormat. Has anyone used it to convert JSON to
Parquet format? Any pointers are welcome, thanks!
I'm using KafkaUtils.createStream for the input stream to pull messages from
kafka which seems to return a ReceiverInputDStream. I do not see
saveAsNewAPIHadoopFile available on ReceiverInputDStream and obviously run
into this error:
saveAsNewAPIHadoopFile is not a member of
org.apache.spark.streaming.dstream.ReceiverInputDStream
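saveAsNewAPIHadoopFile is defined on (pair) RDDs, not on DStreams, so the usual
workaround is to drop down to the per-batch RDD with foreachRDD and convert the
JSON there. A sketch assuming Spark 1.3-era APIs; the ZooKeeper quorum, topic,
batch interval, and output path are placeholders:

import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(60))
val sqlContext = new SQLContext(sc)

// createStream yields (key, message) pairs; _._2 below keeps just the JSON payload
val stream = KafkaUtils.createStream(ssc, "zk-host:2181", "json-to-parquet", Map("events" -> 1))

stream.map(_._2).foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    val out = s"hdfs:///data/events/parquet/batch-${System.currentTimeMillis}"
    // infer a schema from the JSON strings and write Parquet
    // (newer APIs: sqlContext.read.json / df.write.parquet)
    sqlContext.jsonRDD(rdd).saveAsParquetFile(out)
  }
}

ssc.start()
ssc.awaitTermination()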
I built the latest Spark from source and I'm running into these errors when
attempting to run the streaming examples locally on my Mac. How do I fix
these errors?
java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1886)
I'm running into this error when I attempt to launch spark-shell passing in
the algebird-core jar:
~~
$ ./bin/spark-shell --jars algebird-core_2.9.2-0.1.11.jar
scala> import com.twitter.algebird._
import com.twitter.algebird._
scala> import HyperLogLog._
import HyperLogLog._
scala>
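The snippet cuts off before the actual error, but one thing to check (an
assumption, since the error isn't shown here): algebird-core_2.9.2 is built for
Scala 2.9, while Spark shells of that era run on Scala 2.10, so a matching
algebird-core_2.10 artifact is usually required. For reference, once the imports
work, usage looks roughly like the Spark streaming HLL example (userIds is a
placeholder RDD[String]):

val hll = new HyperLogLogMonoid(12)                      // 12 bits -> roughly 1-2% error
val sketches = userIds.map(id => hll(id.getBytes("UTF-8")))
val merged = sketches.reduce((a, b) => hll.plus(a, b))   // monoid plus merges the sketches
println("approx unique users: " + merged.estimatedSize.toLong)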
We are looking at consuming the Kafka stream using Spark Streaming,
transforming it into various subsets (applying some transformations,
de-normalizing some fields, etc.), and feeding it back into Kafka as a different
topic for downstream consumers.
Wanted to know if there are any existing patterns for this.
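The common shape for this is: consume with Spark Streaming, transform, then
publish back to Kafka from foreachPartition so each partition gets its own
producer instance. A sketch assuming the standalone kafka-clients producer;
brokers, topics, batch interval, and the transformation itself are placeholders:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

def denormalize(json: String): String = json  // placeholder for the real transformation

val ssc = new StreamingContext(sc, Seconds(10))
val in = KafkaUtils.createStream(ssc, "zk-host:2181", "transformer", Map("raw-events" -> 2))

in.map(_._2).map(denormalize).foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val props = new Properties()
    props.put("bootstrap.servers", "broker-1:9092,broker-2:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    // one producer per partition per batch; pool or reuse producers if throughput demands it
    val producer = new KafkaProducer[String, String](props)
    records.foreach(r => producer.send(new ProducerRecord[String, String]("enriched-events", r)))
    producer.close()
  }
}

ssc.start()
ssc.awaitTermination()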