SparkSQL - Partitioned Parquet

2014-07-06 Thread Raffael Marty
Does SparkSQL support partitioned Parquet tables? How do I save to a partitioned Parquet file from within Python? table.saveAsParquetFile("table.parquet") This call doesn't seem to support a partition argument. Or does my SchemaRDD have to be set up in a specific way?
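
In Spark 1.0, saveAsParquetFile takes only an output path, so there is no partition argument to pass. A common workaround at the time was to write one Parquet directory per partition value yourself, using a Hive/Impala-style path layout. A minimal PySpark sketch is below; the column name, values, and paths are illustrative assumptions, and inferSchema is the Spark 1.0-era API.

    # Hypothetical workaround sketch (Spark 1.0-era PySpark SQL API); the
    # column name "event_date", the values, and the paths are illustrative.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="partitioned-parquet-sketch")
    sqlContext = SQLContext(sc)

    rows = sc.parallelize([
        {"event_date": "2014-07-05", "value": 1},
        {"event_date": "2014-07-06", "value": 2},
    ])

    # Write one Parquet directory per partition value (Hive-style layout),
    # since saveAsParquetFile itself has no notion of partitions.
    for d in ["2014-07-05", "2014-07-06"]:
        part = sqlContext.inferSchema(rows.filter(lambda r: r["event_date"] == d))
        part.saveAsParquetFile("table.parquet/event_date=%s" % d)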

Re: graphx Joining two VertexPartitions with different indexes is slow.

2014-07-06 Thread Ankur Dave
Well, the alternative is to do a deep equality check on the index arrays, which would be somewhat expensive since these are pretty large arrays (one element per vertex in the graph). But, in case the reference equality check fails, it actually might be a good idea to do the deep check before resort
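
To make the tradeoff concrete, here is a tiny, language-neutral illustration in Python (not GraphX's actual Scala code): try the cheap reference-identity test first, and only fall back to the O(n) element-wise comparison when it fails.

    # Illustration only, not GraphX's implementation: identity check first,
    # deep element-wise comparison only as a fallback.
    def indexes_match(a, b):
        if a is b:                 # cheap reference-equality check
            return True
        if len(a) != len(b):
            return False
        return all(x == y for x, y in zip(a, b))  # expensive deep check

    idx = list(range(1000000))
    print(indexes_match(idx, idx))         # fast path: same object
    print(indexes_match(idx, list(idx)))   # equal contents, but pays the full scan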

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-06 Thread Robert James
I can say from my experience that getting Spark to work with Hadoop 2 is not for the beginner; after solving one problem after another (dependencies, scripts, etc.), I went back to Hadoop 1. Spark's Maven build, EC2 scripts, and others all default to Hadoop 1 - not sure why, but given that, Hadoop 2 has too man

Data loading to Parquet using spark

2014-07-06 Thread Shaikh Riyaz
Hi, We are planning to use Spark to load data to Parquet, and this data will be queried by Impala for visualization through Tableau. Can we achieve this flow? How do we load data to Parquet from Spark? Will Impala be able to access the data loaded by Spark? I will greatly appreciate if someone
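
For the basic flow, something like the following PySpark sketch (Spark 1.0-era SQL API) writes Parquet to HDFS, which Impala can then map with CREATE EXTERNAL TABLE ... STORED AS PARQUET. The column names and HDFS paths are assumptions for illustration.

    # Sketch only (Spark 1.0-era PySpark SQL API); column names and HDFS
    # paths are illustrative assumptions.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="load-to-parquet-sketch")
    sqlContext = SQLContext(sc)

    # Parse raw text into records and let Spark infer the schema.
    lines = sc.textFile("hdfs:///raw/sales.csv")
    records = lines.map(lambda l: l.split(",")) \
                   .map(lambda f: {"item": f[0], "qty": int(f[1])})
    table = sqlContext.inferSchema(records)

    # Write Parquet somewhere Impala can reach; Impala can then expose it via
    # CREATE EXTERNAL TABLE ... STORED AS PARQUET LOCATION '<that path>'.
    table.saveAsParquetFile("hdfs:///warehouse/sales_parquet")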

Controlling amount of data sent to slaves

2014-07-06 Thread asylvest
I'm in the process of evaluating Spark to see if it's a fit for my CPU-intensive application. Many operations in my chain are highly parallelizable, but some require a minimum number of rows of an input image in order to operate. Is there a way to give Spark a minimum and/or maximum size to send
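
Spark has no direct "rows per task" knob, but you can influence the unit of work by choosing the number of partitions (e.g. minPartitions on textFile, or repartition/coalesce) and then processing whole partitions with mapPartitions. A rough sketch, with illustrative numbers:

    # Rough sketch: bound work-unit size indirectly via the partition count,
    # then operate on whole partitions at once. Numbers are illustrative.
    from pyspark import SparkContext

    sc = SparkContext(appName="partition-size-sketch")
    rows = sc.parallelize(range(100000))

    min_rows_per_task = 5000
    num_partitions = max(1, rows.count() // min_rows_per_task)
    chunked = rows.repartition(num_partitions)

    def process_block(row_iter):
        block = list(row_iter)     # all rows of one partition together
        yield ("rows_in_block", len(block))

    print(chunked.mapPartitions(process_block).collect())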

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-06 Thread Marco Shaw
That is confusing based on the context you provided. This might take more time than I can spare to try to understand. For sure, you need to add Spark to run it in/on the HDP 2.1 express VM. Cloudera's CDH 5 express VM includes Spark, but the service isn't running by default. I can't rememb

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-06 Thread Konstantin Kudryavtsev
Marco, Hortonworks provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you can try from http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf HDP 2.1 means YARN, yet at the same time they propose to install an RPM. On the other hand, http://spark.apache.org/ says " Integrated with

Re: reading compress lzo files

2014-07-06 Thread Andrew Ash
Hi Nick, The cluster I was working on in those linked messages was a private data center cluster, not on EC2. I'd imagine that the setup would be pretty similar, but I'm not familiar with the EC2 init scripts that Spark uses. Also, I upgraded that cluster to 1.0 recently and am continuing to use

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-06 Thread Marco Shaw
Can you provide links to the sections that are confusing? My understanding is that the HDP1 binaries do not need YARN, while the HDP2 binaries do. Now, you can also install the Hortonworks Spark RPM... For production, in my opinion, RPMs are better for manageability. > On Jul 6, 2014, at 5:39 PM, Konst

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-06 Thread Konstantin Kudryavtsev
Hello, thanks for your message... I'm confused: Hortonworks suggests installing the Spark RPM on each node, but the Spark main page says that YARN is enough and I don't need to install it... What's the difference? sent from my HTC On Jul 6, 2014 8:34 PM, "vs" wrote: > Konstantin, > > HWRK provides a Tech Prev

Re: graphx Joining two VertexPartitions with different indexes is slow.

2014-07-06 Thread Koert Kuipers
probably a dumb question, but why is reference equality used for the indexes? On Sun, Jul 6, 2014 at 12:43 AM, Ankur Dave wrote: > When joining two VertexRDDs with identical indexes, GraphX can use a fast > code path (a zip join without any hash lookups). However, the check for > identical inde

Re: reading compress lzo files

2014-07-06 Thread Sean Owen
Pardon, I was wrong about this. There is actually code distributed under com.hadoop, and that's where this class is. Oops. https://code.google.com/a/apache-extras.org/p/hadoop-gpl-compression/source/browse/trunk/src/java/com/hadoop/mapreduce/LzoTextInputFormat.java On Sun, Jul 6, 2014 at 6:37 AM,
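
For reference, reading LZO text through that class from PySpark looks roughly like the sketch below. It assumes the hadoop-lzo / hadoop-gpl-compression jar and native libraries are installed on every node, and a PySpark version that exposes newAPIHadoopFile; the path is illustrative.

    # Sketch only: assumes the hadoop-lzo jar and native libs are on every
    # node and that this PySpark release provides newAPIHadoopFile.
    from pyspark import SparkContext

    sc = SparkContext(appName="lzo-read-sketch")

    lzo_rdd = sc.newAPIHadoopFile(
        "hdfs:///data/events.lzo",                     # illustrative path
        "com.hadoop.mapreduce.LzoTextInputFormat",     # from hadoop-gpl-compression
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text")

    # Drop the byte-offset key and keep the line text.
    lines = lzo_rdd.map(lambda kv: kv[1])
    print(lines.take(5))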

Re: reading compress lzo files

2014-07-06 Thread Nicholas Chammas
I’ve been reading through several pages trying to figure out how to set up my spark-ec2 cluster to read LZO-compressed files from S3. - http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCA+-p3AGSPeNE5miQRFHC7-ZwNbicaXfh1-ZXdKJ=saw_mgr...@mail.gmail.com%3E - http://ma

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-06 Thread vs
Konstantin, HWRK provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you can try from http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf Let me know if you see issues with the tech preview. "spark PI example on HDP 2.0 I downloaded spark 1.0 pre-build from http://s

Re: Adding and subtracting workers on Spark EC2 cluster

2014-07-06 Thread Nicholas Chammas
On Sun, Jul 6, 2014 at 10:10 AM, Robert James wrote: If I've created a Spark EC2 cluster, how can I add or take away workers? There is a manual process by which this is possible, but I’m not sure of the procedure. There is also SPARK-2008, which

Re: reading compress lzo files

2014-07-06 Thread Nicholas Chammas
Ah, indeed it looks like I need to install this separately as it is not part of the core. Nick On Sun, Jul 6, 2014 at 2:22 AM, Gurvinder Singh wrote: > On 07/06/2014 05:19 AM, Nicholas Chammas wrote: > > O

Adding and subtracting workers on Spark EC2 cluster

2014-07-06 Thread Robert James
If I've created a Spark EC2 cluster, how can I add or take away workers? Also: If I use EC2 spot instances, what happens when Amazon removes them? Will my computation be saved in any way, or will I need to restart from scratch? Finally: The spark-ec2 scripts seem to use Hadoop 1. How can I confi

Re: window analysis with Spark and Spark streaming

2014-07-06 Thread alessandro finamore
On 5 July 2014 23:08, Mayur Rustagi [via Apache Spark User List] wrote: > Key idea is to simulate your app time as you enter data. So you can connect > spark streaming to a queue and insert data into it spaced by time. Easier said > than done :). I see. I'll also try to implement this solution so
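
One way to "simulate app time" along the lines Mayur suggests is to replay historical data through queueStream: pre-split the dataset into one RDD per simulated interval and let the streaming context drain the queue one batch at a time. A hedged PySpark sketch follows; it assumes a PySpark release that includes the streaming module, and the batch interval, window sizes, and data are illustrative.

    # Sketch: replay historical data as a stream via queueStream; the batch
    # interval, window parameters, and data are illustrative assumptions.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="replay-window-sketch")
    ssc = StreamingContext(sc, batchDuration=1)   # one batch per second

    # One RDD per simulated time step; the queue is drained one RDD per batch.
    queue = [sc.parallelize([(t, 1)]) for t in range(60)]
    stream = ssc.queueStream(queue, oneAtATime=True)

    # A 30-second window sliding every 10 seconds over the replayed data.
    stream.window(windowDuration=30, slideDuration=10).count().pprint()

    ssc.start()
    ssc.awaitTerminationOrTimeout(70)
    ssc.stop()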