Hi
I want to store DataFrames as partitioned Hive tables. Is there a way to
do this via the saveAsTable call? The set of options does not seem to be
documented.
def saveAsTable(tableName: String, source: String, mode: SaveMode, options: Map[String, String]): Unit
(Scala-specific) Creates a table ...
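A workaround that should work through the HiveContext, in case the options map
cannot express partitioning, is a dynamic-partition INSERT. A rough sketch only;
the table name "events", the partition column "dt", and the other names are made up:

  import org.apache.spark.sql.hive.HiveContext

  // Assumes sc is the SparkContext and df is the DataFrame to save, with a "dt" column.
  val hiveContext = new HiveContext(sc)
  df.registerTempTable("staging")

  // Create the partitioned target table, then let Hive route rows into partitions.
  hiveContext.sql(
    """CREATE TABLE IF NOT EXISTS events (id STRING, value DOUBLE)
      |PARTITIONED BY (dt STRING)
      |STORED AS PARQUET""".stripMargin)

  hiveContext.sql("SET hive.exec.dynamic.partition=true")
  hiveContext.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

  hiveContext.sql(
    "INSERT OVERWRITE TABLE events PARTITION (dt) SELECT id, value, dt FROM staging")

A supported option on saveAsTable itself would obviously be preferable.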
Sean
DStream.saveAsTextFiles internally calls foreachRDD and saveAsTextFile for
each interval:
def saveAsTextFiles(prefix: String, suffix: String = "") {
  val saveFunc = (rdd: RDD[T], time: Time) => {
    val file = rddToFileName(prefix, suffix, time)
    rdd.saveAsTextFile(file)
  }
  this.foreachRDD(saveFunc)
}
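So if you need different naming or a different output format per interval, the
manual equivalent is just a foreachRDD. A minimal sketch, assuming a DStream
named stream and an output prefix "out" chosen only for illustration:

  // One output directory per batch interval, named by the batch time.
  stream.foreachRDD { (rdd, time) =>
    rdd.saveAsTextFile("out-" + time.milliseconds)
  }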
Hi
Shark supported both the HiveServer1 and HiveServer2 thrift interfaces
(using $ bin/shark -service sharkserver[1 or 2]).
Spark SQL seems to support only HiveServer2. I was wondering what is involved
in adding support for HiveServer1. Is this something straightforward that I
could embark on myself?
Hi Landon
I had a problem very similar to yours, where we had to process around 5
million relatively small files on NFS. After trying various options, we did
something similar to what Matei suggested:
1) take the original path, find the subdirectories under that path, and then
parallelize the list of subdirectories (roughly as sketched below)
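A simplified sketch of that step (the path, record format, and variable names
here are made up; the details will differ):

  import java.io.File

  // List the subdirectories once on the driver, then give each task one whole
  // directory to read locally, instead of one Spark task per tiny file.
  val rootPath = "/mnt/nfs/data"  // placeholder path
  val subDirs = new File(rootPath).listFiles().filter(_.isDirectory).map(_.getPath)

  val records = sc.parallelize(subDirs, subDirs.length).flatMap { dir =>
    new File(dir).listFiles().filter(_.isFile).map { f =>
      val src = scala.io.Source.fromFile(f)
      val contents = try src.mkString finally src.close()
      (f.getName, contents)
    }
  }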
Hi Aureliano
If you have managed to get a custom version of saveAsObjectFile() that handles
compression working, I would appreciate it if you could share the code. I have
come across the same issue, and it would save me from having to reinvent the
wheel.
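The shape I had in mind is roughly the following, an untested sketch that mirrors
what saveAsObjectFile does internally but adds a codec; the object and method
names are mine:

  import java.io.{ByteArrayOutputStream, ObjectOutputStream}
  import scala.reflect.ClassTag
  import org.apache.hadoop.io.{BytesWritable, NullWritable}
  import org.apache.hadoop.io.compress.GzipCodec
  import org.apache.spark.SparkContext._
  import org.apache.spark.rdd.RDD

  object CompressedObjectFile {

    // Plain Java serialization of one batch of records.
    private def serialize(obj: AnyRef): Array[Byte] = {
      val bos = new ByteArrayOutputStream()
      val oos = new ObjectOutputStream(bos)
      oos.writeObject(obj)
      oos.close()
      bos.toByteArray
    }

    // Same layout as RDD.saveAsObjectFile (batches of 10 records serialized into
    // a SequenceFile of (NullWritable, BytesWritable)), but written with gzip.
    def save[T: ClassTag](rdd: RDD[T], path: String): Unit = {
      rdd.mapPartitions(_.grouped(10).map(_.toArray))
        .map(batch => (NullWritable.get(), new BytesWritable(serialize(batch))))
        .saveAsSequenceFile(path, Some(classOf[GzipCodec]))
    }
  }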
Deenar
Jaonary
val loadedData: RDD[(String, (String, Array[Byte]))] =
  sc.objectFile("yourObjectFileName")
Deenar
Matei
It turns out that saveAsObjectFile(), saveAsSequenceFile() and
saveAsHadoopFile() currently do not pick up the Hadoop settings, as Aureliano
found out in this post:
http://apache-spark-user-list.1001560.n3.nabble.com/Turning-kryo-on-does-not-decrease-binary-output-tp212p249.html
Deenar
Aureliano
Apologies for hijacking this thread.
Matei
On the subject of processing lots (millions) of small input files on HDFS,
what are the best practices to follow in Spark? Currently my code looks
something like this. Without coalesce there is one task and one output file
per input file. But
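The pattern, in outline (an illustrative sketch rather than my exact code; the
paths, the partition count, and the transformation are placeholders):

  // Read many small files with a glob, then coalesce before writing so the
  // output is not one tiny file per input file.
  val input = sc.textFile("hdfs:///data/small-files/*")   // placeholder path
  val processed = input.map(_.trim)                       // placeholder transformation
  processed.coalesce(64).saveAsTextFile("hdfs:///data/output")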
Hi
Is there a way in Spark to run a function on each executor just once? I have
a couple of use cases.
a) I use an external library that is a singleton. It keeps some global state
and provides some functions to manipulate it (e.g. reclaim memory, etc.). I
want to check the global state of this
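To make the kind of thing I am after concrete, here is a rough once-per-JVM
guard (the names here are illustrative only):

  import java.util.concurrent.atomic.AtomicBoolean

  // Runs the given block at most once per executor JVM, however many tasks call it.
  object ExecutorOnce {
    private val done = new AtomicBoolean(false)

    def doThisOnce(f: => Unit): Unit = {
      if (done.compareAndSet(false, true)) f
    }
  }

  // Used from inside tasks, e.g.
  //   rdd.mapPartitions { iter => ExecutorOnce.doThisOnce(externalLib.init()); iter }
  // where externalLib.init() stands in for the real library call.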
Christopher
It is once per JVM. TaskNonce would meet my needs. I guess if I wanted it once
per thread, then a ThreadLocal would do the same.
But how do I invoke TaskNonce? What is the best way to generate an RDD that
ensures there is one element per executor?
Deenar
Hi Christopher
>>which you would invoke as TaskNonce.getSingleton().doThisOnce() from
within the map closure.
Say I have a cluster with 24 workers (one thread per worker,
SPARK_WORKER_CORES=1). My application would have 24 executors, each with its
own JVM.
The RDDs I process have millions of rows and
Christopher
Sorry, I might be missing the obvious, but how do I get my function called on
all executors used by the app? I don't want to use RDDs unless necessary.
Once I start my shell or app, how do I get
TaskNonce.getSingleton().doThisOnce() executed on each executor?
@dmpour
>>rdd.mapPartitio
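The closest I can think of is a best-effort trick: spread a small dummy RDD
across more partitions than there are task slots and call the nonce from each
partition. A sketch only; TaskNonce is the class Christopher described, and the
core count is an assumption:

  // Best effort only: oversubscribing partitions makes it very likely, but does
  // not strictly guarantee, that every executor runs at least one of them.
  val totalCores = 24  // assumption: 24 workers x SPARK_WORKER_CORES=1
  sc.parallelize(1 to totalCores * 4, totalCores * 4).foreachPartition { _ =>
    TaskNonce.getSingleton().doThisOnce()
  }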
Hi
I am calling a C++ library from Spark using JNI. Occasionally the C++
library causes the JVM to crash. The task terminates on the master, but the
driver does not return. I am not sure why the driver does not terminate. I
also notice that after such an occurrence, I lose some workers permanently
Hi
I am using Spark to distribute computationally intensive tasks across the
cluster. Currently I partition my RDD of tasks randomly. There is a large
variation in how long each of the jobs takes to complete, leading to most
partitions being processed quickly while a couple of partitions take forever
Yes
On a job I am currently running, 99% of the partitions finish within seconds
and a couple of partitions take around an hour to finish. I am pricing some
instruments, and complex instruments take far longer to price than plain
vanilla ones. If I could distribute these complex instruments evenly
I have equal-sized partitions now, but I want the RDD to be partitioned such
that the partitions are equally weighted by some attribute of each RDD
element (e.g. size or complexity).
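One rough way to approximate this (a sketch only, with illustrative names: it
sorts by weight and deals elements out round-robin so each partition receives
roughly the same total weight):

  import scala.reflect.ClassTag
  import org.apache.spark.Partitioner
  import org.apache.spark.SparkContext._
  import org.apache.spark.rdd.RDD

  // Pass-through partitioner: the key is already the target partition id.
  class ExplicitPartitioner(override val numPartitions: Int) extends Partitioner {
    def getPartition(key: Any): Int = key.asInstanceOf[Int]
  }

  // Sort by weight (heaviest first) and assign elements round-robin, so the
  // heavy elements are spread evenly across partitions.
  def partitionByWeight[T: ClassTag](rdd: RDD[T],
                                     weight: T => Double,
                                     numPartitions: Int): RDD[T] = {
    rdd.map(t => (weight(t), t))
      .sortByKey(ascending = false)
      .zipWithIndex()
      .map { case ((_, t), idx) => ((idx % numPartitions).toInt, t) }
      .partitionBy(new ExplicitPartitioner(numPartitions))
      .values
  }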
I have been looking at the RangePartitioner code and I have come up with
something like
EquallyWeightedPartitioner
This is my first implementation. There are a few rough edges, but when I run
this I get the following exception. The class extends Partitioner which in
turn extends Serializable. Any idea what I am doing wrong?
scala> res156.partitionBy(new EqualWeightPartitioner(1000, res156, weightFunction))
14/