Hi
I want to store DataFrames as partitioned Hive tables. Is there a way to
do this via the saveAsTable call? The set of options does not seem to be
documented.
def saveAsTable(tableName: String, source: String, mode: SaveMode, options: Map[String, String]): Unit
(Scala-specific) Creates a table ...
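A workaround that should work through the HiveContext, in case the options map
cannot express partitioning, is a dynamic-partition INSERT. A rough sketch only;
the table name "events", the partition column "dt", and the other names are made up:

  import org.apache.spark.sql.hive.HiveContext

  // Assumes sc is the SparkContext and df is the DataFrame to save, with a "dt" column.
  val hiveContext = new HiveContext(sc)
  df.registerTempTable("staging")

  // Create the partitioned target table, then let Hive route rows into partitions.
  hiveContext.sql(
    """CREATE TABLE IF NOT EXISTS events (id STRING, value DOUBLE)
      |PARTITIONED BY (dt STRING)
      |STORED AS PARQUET""".stripMargin)

  hiveContext.sql("SET hive.exec.dynamic.partition=true")
  hiveContext.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

  hiveContext.sql(
    "INSERT OVERWRITE TABLE events PARTITION (dt) SELECT id, value, dt FROM staging")

A supported option on saveAsTable itself would obviously be preferable.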
Sean
DStream.saveAsTextFiles internally calls foreachRDD and saveAsTextFile for
each interval:
def saveAsTextFiles(prefix: String, suffix: String = "") {
  val saveFunc = (rdd: RDD[T], time: Time) => {
    val file = rddToFileName(prefix, suffix, time)
    rdd.saveAsTextFile(file)
  }
  this.foreachRDD(saveFunc)
}
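So if you need different naming or a different output format per interval, the
manual equivalent is just a foreachRDD. A minimal sketch, assuming a DStream
named stream and an output prefix "out" chosen only for illustration:

  // One output directory per batch interval, named by the batch time.
  stream.foreachRDD { (rdd, time) =>
    rdd.saveAsTextFile("out-" + time.milliseconds)
  }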
Hi
Shark supported both the HiveServer1 and HiveServer2 thrift interfaces
(using $ bin/shark -service sharkserver[1 or 2]).
Spark SQL seems to support only HiveServer2. I was wondering what is involved
in adding support for HiveServer1. Is this something straightforward that I
could embark on myself?
Hi Landon
I had a problem very similar to yours, where we had to process around 5
million relatively small files on NFS. After trying various options, we did
something similar to what Matei suggested:
1) take the original path, find the subdirectories under that path, and then
parallelize the list of subdirectories (roughly as sketched below)
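A simplified sketch of that step (the path, record format, and variable names
here are made up; the details will differ):

  import java.io.File

  // List the subdirectories once on the driver, then give each task one whole
  // directory to read locally, instead of one Spark task per tiny file.
  val rootPath = "/mnt/nfs/data"  // placeholder path
  val subDirs = new File(rootPath).listFiles().filter(_.isDirectory).map(_.getPath)

  val records = sc.parallelize(subDirs, subDirs.length).flatMap { dir =>
    new File(dir).listFiles().filter(_.isFile).map { f =>
      val src = scala.io.Source.fromFile(f)
      val contents = try src.mkString finally src.close()
      (f.getName, contents)
    }
  }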
Hi Aureliano
If you have managed to get a custom version of saveAsObjectFile() that handles
compression working, I would appreciate it if you could share the code. I have
come across the same issue, and it would save me from having to reinvent the
wheel.
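The shape I had in mind is roughly the following, an untested sketch that mirrors
what saveAsObjectFile does internally but adds a codec; the object and method
names are mine:

  import java.io.{ByteArrayOutputStream, ObjectOutputStream}
  import scala.reflect.ClassTag
  import org.apache.hadoop.io.{BytesWritable, NullWritable}
  import org.apache.hadoop.io.compress.GzipCodec
  import org.apache.spark.SparkContext._
  import org.apache.spark.rdd.RDD

  object CompressedObjectFile {

    // Plain Java serialization of one batch of records.
    private def serialize(obj: AnyRef): Array[Byte] = {
      val bos = new ByteArrayOutputStream()
      val oos = new ObjectOutputStream(bos)
      oos.writeObject(obj)
      oos.close()
      bos.toByteArray
    }

    // Same layout as RDD.saveAsObjectFile (batches of 10 records serialized into
    // a SequenceFile of (NullWritable, BytesWritable)), but written with gzip.
    def save[T: ClassTag](rdd: RDD[T], path: String): Unit = {
      rdd.mapPartitions(_.grouped(10).map(_.toArray))
        .map(batch => (NullWritable.get(), new BytesWritable(serialize(batch))))
        .saveAsSequenceFile(path, Some(classOf[GzipCodec]))
    }
  }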
Deenar
Jaonary
val loadedData: RDD[(String, (String, Array[Byte]))] =
  sc.objectFile("yourObjectFileName")
Deenar
Matei
It turns out that saveAsObjectFile(), saveAsSequenceFile() and
saveAsHadoopFile() currently do not pick up the Hadoop settings, as Aureliano
found out in this post:
http://apache-spark-user-list.1001560.n3.nabble.com/Turning-kryo-on-does-not-decrease-binary-output-tp212p249.html
Deenar
Aureliano
Apologies for hijacking this thread.
Matei
On the subject of processing lots (millions) of small input files on HDFS,
what are the best practices to follow in Spark? Currently my code looks
something like this. Without coalesce there is one task and one output file
per input file. But
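The pattern, in outline (an illustrative sketch rather than my exact code; the
paths, the partition count, and the transformation are placeholders):

  // Read many small files with a glob, then coalesce before writing so the
  // output is not one tiny file per input file.
  val input = sc.textFile("hdfs:///data/small-files/*")   // placeholder path
  val processed = input.map(_.trim)                       // placeholder transformation
  processed.coalesce(64).saveAsTextFile("hdfs:///data/output")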
Hi
Is there a way in Spark to run a function on each executor just once? I have
a couple of use cases.
a) I use an external library that is a singleton. It keeps some global state
and provides some functions to manipulate it (e.g. reclaim memory, etc.). I
want to check the global state of this
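To make the kind of thing I am after concrete, here is a rough once-per-JVM
guard (the names here are illustrative only):

  import java.util.concurrent.atomic.AtomicBoolean

  // Runs the given block at most once per executor JVM, however many tasks call it.
  object ExecutorOnce {
    private val done = new AtomicBoolean(false)

    def doThisOnce(f: => Unit): Unit = {
      if (done.compareAndSet(false, true)) f
    }
  }

  // Used from inside tasks, e.g.
  //   rdd.mapPartitions { iter => ExecutorOnce.doThisOnce(externalLib.init()); iter }
  // where externalLib.init() stands in for the real library call.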
Christopher
It is once per JVM. TaskNonce would meet my needs. I guess if I wanted it once
per thread, then a ThreadLocal would do the same.
But how do I invoke TaskNonce? What is the best way to generate an RDD that
ensures there is one element per executor?
Deenar
Hi Christopher
>>which you would invoke as TaskNonce.getSingleton().doThisOnce() from
within the map closure.
Say I have a cluster with 24 workers (one thread per worker,
SPARK_WORKER_CORES=1). My application would have 24 executors, each with its
own JVM.
The RDDs I process have millions of rows and
Christopher
Sorry, I might be missing the obvious, but how do I get my function called on
all executors used by the app? I don't want to use RDDs unless necessary.
Once I start my shell or app, how do I get
TaskNonce.getSingleton().doThisOnce() executed on each executor?
@dmpour
>>rdd.mapPartitio
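The closest I can think of is a best-effort trick: spread a small dummy RDD
across more partitions than there are task slots and call the nonce from each
partition. A sketch only; TaskNonce is the class Christopher described, and the
core count is an assumption:

  // Best effort only: oversubscribing partitions makes it very likely, but does
  // not strictly guarantee, that every executor runs at least one of them.
  val totalCores = 24  // assumption: 24 workers x SPARK_WORKER_CORES=1
  sc.parallelize(1 to totalCores * 4, totalCores * 4).foreachPartition { _ =>
    TaskNonce.getSingleton().doThisOnce()
  }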
Hi
I am calling a C++ library from Spark using JNI. Occasionally the C++
library causes the JVM to crash. The task terminates on the master, but the
driver does not return. I am not sure why the driver does not terminate. I
also notice that after such an occurrence, I lose some workers permanently
Hi
I am using Spark to distribute computationally intensive tasks across the
cluster. Currently I partition my RDD of tasks randomly. There is a large
variation in how long each of the jobs takes to complete, leading to most
partitions being processed quickly while a couple of partitions take forever
Yes
On a job I am currently running, 99% of the partitions finish within seconds
and a couple of partitions take around an hour to finish. I am pricing some
instruments, and complex instruments take far longer to price than plain
vanilla ones. If I could distribute these complex instruments evenly
I have equal-sized partitions now, but I want the RDD to be partitioned such
that the partitions are equally weighted by some attribute of each RDD
element (e.g. size or complexity).
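One rough way to approximate this (a sketch only, with illustrative names: it
sorts by weight and deals elements out round-robin so each partition receives
roughly the same total weight):

  import scala.reflect.ClassTag
  import org.apache.spark.Partitioner
  import org.apache.spark.SparkContext._
  import org.apache.spark.rdd.RDD

  // Pass-through partitioner: the key is already the target partition id.
  class ExplicitPartitioner(override val numPartitions: Int) extends Partitioner {
    def getPartition(key: Any): Int = key.asInstanceOf[Int]
  }

  // Sort by weight (heaviest first) and assign elements round-robin, so the
  // heavy elements are spread evenly across partitions.
  def partitionByWeight[T: ClassTag](rdd: RDD[T],
                                     weight: T => Double,
                                     numPartitions: Int): RDD[T] = {
    rdd.map(t => (weight(t), t))
      .sortByKey(ascending = false)
      .zipWithIndex()
      .map { case ((_, t), idx) => ((idx % numPartitions).toInt, t) }
      .partitionBy(new ExplicitPartitioner(numPartitions))
      .values
  }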
I have been looking at the RangePartitioner code and I have come up with
something like
EquallyWeightedPartitioner
This is my first implementation. There are a few rough edges, but when I run
this I get the following exception. The class extends Partitioner which in
turn extends Serializable. Any idea what I am doing wrong?
scala> res156.partitionBy(new EqualWeightPartitioner(1000, res156, weightFunction))
14/