Re: Spark Python with SequenceFile containing numpy deserialized data in str form

2015-08-30 Thread Peter Aberline
Hi, I saw the posting about storing NumPy values in sequence files: http://mail-archives.us.apache.org/mod_mbox/spark-user/201506.mbox/%3cCAJQK-mg1PUCc_hkV=q3n-01ioq_pkwe1g-c39ximco3khqn...@mail.gmail.com%3e I’ve had a go at implementing this and opened a PR at https://github.com/apach

Re: Multiple DataFrames per Parquet file?

2015-05-10 Thread Peter Aberline
dd and
> then save it through your own checkpoint mechanism.
>
> If not, please share your use case.
> On 11 May 2015 00:38, "Peter Aberline" wrote:
>
>> Hi
>>
>> I have many thousands of small DataFrames that I would like to save to
>> the one Parquet fil

Multiple DataFrames per Parquet file?

2015-05-10 Thread Peter Aberline
Hi, I have many thousands of small DataFrames that I would like to save to the one Parquet file to avoid the HDFS 'small files' problem. My understanding is that there is a 1:1 relationship between DataFrames and Parquet files if a single partition is used. Is it possible to have multiple DataFram

Spark-submit ClassNotFoundException with JAR!

2014-09-08 Thread Peter Aberline
Hi, I'm having problems with a ClassNotFoundException using this simple example:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import java.net.URLClassLoader
import scala.util.Marshal
class ClassToRoundTrip(val id: Int) extends s