Re: Convert SchemaRDD to RDD

2015-10-16 Thread Ted Yu
2 => val2 > //... cases up to 26 > } > } > > hence expecting an approach to convert SchemaRDD to RDD without using > Tuple or Case Class as we have restrictions in Scala 2.10 > > Regards > Satish Chandra >

Re: Convert SchemaRDD to RDD

2015-10-16 Thread satish chandra j
(that: Any): Boolean = that.isInstanceOf[MyRecord] def productArity: Int = 26 // example value: the number of fields def productElement(n: Int): Serializable = n match { case 1 => val1 case 2 => val2 //... cases up to 26 } } hence expecting an approach to convert
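
Assembled from the fragments quoted in this thread, a compilable sketch of the Product-based record, pared down to three fields for brevity (field names and types are illustrative); note that Scala's productElement is 0-based, so the cases should start at 0 rather than 1:

    // Sketch of a >22-field record without a case class: implement Product
    // by hand so Spark's reflection can still derive a schema from it.
    class MyRecord(val val1: String, val val2: Int, val val3: Double)
        extends Product with Serializable {
      def canEqual(that: Any): Boolean = that.isInstanceOf[MyRecord]
      def productArity: Int = 3               // the real record would say 26
      def productElement(n: Int): Any = n match {
        case 0 => val1
        case 1 => val2
        case 2 => val3
        case _ => throw new IndexOutOfBoundsException(n.toString)
      }
    }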

Re: Convert SchemaRDD to RDD

2015-10-16 Thread Ted Yu
Have you seen this thread ? http://search-hadoop.com/m/q3RTt9YBFr17u8j8&subj=Scala+Limitation+Case+Class+definition+with+more+than+22+arguments On Fri, Oct 16, 2015 at 7:41 AM, satish chandra j wrote: > Hi All, > To convert SchemaRDD to RDD below snipped is working if SQL sta

Convert SchemaRDD to RDD

2015-10-16 Thread satish chandra j
Hi All, To convert SchemaRDD to RDD the below snippet works if the SQL statement has fewer than 22 columns in a row, as per the tuple restriction rdd.map(row => row.toString) But if the SQL statement has more than 22 columns then the above snippet will error with "*object Tuple27 is not a member of
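
One route that sidesteps the restriction entirely, a sketch assuming a Spark 1.x SchemaRDD (table name illustrative): work on the Row values directly, so no Tuple or case class is involved and the 22-field limit never applies.

    // A Row in Spark 1.x behaves like a Seq[Any], so it can be rendered or
    // carried around without a tuple of matching arity.
    val wide  = sqlContext.sql("SELECT * FROM wide_table")  // > 22 columns
    val lines = wide.map(row => row.mkString(","))          // RDD[String]
    val seqs  = wide.map(row => row.toSeq)                  // RDD[Seq[Any]]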

Re: Converting a DStream to schemaRDD

2015-09-29 Thread Adrian Tanase
, September 29, 2015 at 5:09 PM To: Daniel Haviv, user Subject: RE: Converting a DStream to schemaRDD Something like: dstream.foreachRDD { rdd => val df = sqlContext.read.json(rdd) df.select(…) } https://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations
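
Expanded into a runnable form, the pattern quoted above looks roughly like this (a sketch assuming Spark 1.4+; the selected column name is a hypothetical placeholder):

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.dstream.DStream

    // Turn each micro-batch of JSON strings into a DataFrame and query it.
    def process(dstream: DStream[String], sqlContext: SQLContext): Unit = {
      dstream.foreachRDD { rdd =>
        if (!rdd.isEmpty()) {                  // skip empty micro-batches
          val df = sqlContext.read.json(rdd)   // per the reply above
          df.select("someField").show()        // hypothetical column name
        }
      }
    }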

RE: Converting a DStream to schemaRDD

2015-09-29 Thread Ewan Leith
ork it as if it were a standard RDD dataset. Ewan From: Daniel Haviv [mailto:daniel.ha...@veracity-group.com] Sent: 29 September 2015 15:03 To: user Subject: Converting a DStream to schemaRDD Hi, I have a DStream which is a stream of RDD[String]. How can I pass a DStream to sqlContext.jsonRDD

Converting a DStream to schemaRDD

2015-09-29 Thread Daniel Haviv
Hi, I have a DStream which is a stream of RDD[String]. How can I pass a DStream to sqlContext.jsonRDD and work with it as a DF ? Thank you. Daniel

Re: Nested DataFrame(SchemaRDD)

2015-06-24 Thread Richard Catlin
wrote: > >> I wrote a brief howto on building nested records in spark and storing >> them in parquet here: >> http://www.congiu.com/creating-nested-data-parquet-in-spark-sql/ >> >> 2015-06-23 16:12 GMT-07:00 Richard Catlin : >> >>> How do I create a DataFrame(SchemaRDD) with a nested array of Rows in a >>> column? Is there an example? Will this store as a nested parquet file? >>> >>> Thanks. >>> >>> Richard Catlin >>> >> >> >

Re: Nested DataFrame(SchemaRDD)

2015-06-23 Thread Michael Armbrust
015-06-23 16:12 GMT-07:00 Richard Catlin : > >> How do I create a DataFrame(SchemaRDD) with a nested array of Rows in a >> column? Is there an example? Will this store as a nested parquet file? >> >> Thanks. >> >> Richard Catlin >> > >

Re: Nested DataFrame(SchemaRDD)

2015-06-23 Thread Roberto Congiu
I wrote a brief howto on building nested records in spark and storing them in parquet here: http://www.congiu.com/creating-nested-data-parquet-in-spark-sql/ 2015-06-23 16:12 GMT-07:00 Richard Catlin : > How do I create a DataFrame(SchemaRDD) with a nested array of Rows in a > column? Is

RE: Nested DataFrame(SchemaRDD)

2015-06-23 Thread Richard Catlin
How do I create a DataFrame(SchemaRDD) with a nested array of Rows in a column? Is there an example? Will this store as a nested parquet file? Thanks. Richard Catlin
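
A minimal sketch of one answer, assuming Spark 1.3+ and illustrative field names: declare a column whose type is an array of structs, and parquet will store it as a nested schema.

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    // A column holding a nested array of Rows (structs).
    val inner = StructType(Seq(
      StructField("item", StringType),
      StructField("qty", IntegerType)))
    val schema = StructType(Seq(
      StructField("id", StringType),
      StructField("lines", ArrayType(inner))))

    val rows = sc.parallelize(Seq(
      Row("order-1", Seq(Row("apple", 3), Row("pear", 1)))))
    val df = sqlContext.createDataFrame(rows, schema)  // Spark 1.3+ API
    df.saveAsParquetFile("/tmp/nested.parquet")        // path illustrative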

Re: Format RDD/SchemaRDD contents to screen?

2015-05-29 Thread ayan guha
Depending on your spark version, you can convert the schemaRDD to a dataframe and then use .show() On 30 May 2015 10:33, "Minnow Noir" wrote: > I'm trying to debug query results inside spark-shell, but finding it > cumbersome to save to file and then use file system utils to
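
In code, the suggestion amounts to something like this (a sketch assuming Spark 1.3+; the table name is illustrative):

    // Sample and pretty-print without writing anything to disk.
    val df = sqlContext.table("myTable")
    df.show(30)                            // formatted first 30 rows
    df.take(30).foreach(println)           // plain Rows; works on any RDD too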

Format RDD/SchemaRDD contents to screen?

2015-05-29 Thread Minnow Noir
ay to present the contents of an RDD/SchemaRDD on the screen in a formatted way? For example, say I want to take() the first 30 lines/rows in an *RDD and present them in a readable way on the screen so that I can see what's missing or invalid. Obviously, I'm just trying to sample the result

scala.ScalaReflectionException when creating SchemaRDD

2015-05-25 Thread vkcelik
Hello I am trying to create a SchemaRDD from a RDD of case classes. Depending on an argument to the program, the program reads data of specified type and maps it to the correct case class. But this throws an exception. I am using Spark version 1.1.0 and Scala version 2.10.4 The exception can be

scala.ScalaReflectionException when creating SchemaRDD

2015-05-25 Thread vkcelik
Hello I am trying to create a SchemaRDD from a RDD of case classes. Depending on an argument to the program these case classes should be different. But this throws an exception. I am using Spark version 1.1.0 and Scala version 2.10.4 The exception can be reproduced by: val table = "t
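
A common trigger for this exception in Spark 1.1 is defining the case classes inside a method or spark-shell block, where the reflection-based schema inference cannot resolve them. A hedged sketch of the usual workaround, with the definitions hoisted to the top level and the per-type branching kept in ordinary code (file names and field shapes are illustrative):

    import sqlContext.createSchemaRDD   // implicit RDD[case class] -> SchemaRDD

    // Top-level case classes (assumed shapes); nesting them in a method is
    // a known trigger for scala.ScalaReflectionException.
    case class TypeA(id: Int, name: String)
    case class TypeB(id: Int, score: Double)

    def register(kind: String): Unit = kind match {
      case "a" => sc.textFile("a.csv").map(_.split(","))
                    .map(p => TypeA(p(0).toInt, p(1))).registerTempTable("t")
      case "b" => sc.textFile("b.csv").map(_.split(","))
                    .map(p => TypeB(p(0).toInt, p(1).toDouble)).registerTempTable("t")
    }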

Spark 1.2.1: How to convert SchemaRDD to CassandraRDD?

2015-04-27 Thread Tash Chainar
Hi all, following the import com.datastax.spark.connector.SelectableColumnRef; import com.datastax.spark.connector.japi.CassandraJavaUtil; import org.apache.spark.sql.SchemaRDD; import static com.datastax.spark.connector.util.JavaApiHelper.toScalaSeq; import scala.collection.Seq; SchemaRDD

Spark SQL: SchemaRDD, DataFrame. Multi-value, Nested attributes

2015-04-22 Thread Eugene Morozov
one long list of columns as I would be able to find some weird stuff by doing that. So my question is the following: 1. Does SchemaRDD support something like multi value attributes? It might look like and array of values that lives in just one column. Although it’s not clear how I’d aggregate

saving schemaRDD to cassandra

2015-03-27 Thread Hafiz Mujadid
Hi experts! I would like to know is there any way to store a schemaRDD to cassandra? If yes, then how to store it in an existing cassandra column family and in a new column family? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/saving-schemaRDD-to-cassandra
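
One route, sketched under the assumption that the DataStax spark-cassandra-connector is on the classpath (keyspace, table, and column names are illustrative): map the rows onto the columns of an existing column family, or let saveAsCassandraTable create a new one.

    import com.datastax.spark.connector._   // spark-cassandra-connector

    val schemaRDD = sqlContext.sql("SELECT id, name, score FROM src")
    schemaRDD
      .map(r => (r.getInt(0), r.getString(1), r.getDouble(2)))
      .saveToCassandra("my_keyspace", "my_table", SomeColumns("id", "name", "score"))

For a table that does not exist yet, the connector also offers saveAsCassandraTable, which creates the table from the RDD's element type before writing.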

Re: SchemaRDD/DataFrame result partitioned according to the underlying datasource partitions

2015-03-23 Thread Michael Armbrust
g our assumptions about partitioning. On Mon, Mar 23, 2015 at 10:22 AM, Stephen Boesch wrote: > > Is there a way to take advantage of the underlying datasource partitions > when generating a DataFrame/SchemaRDD via catalyst? It seems from the sql > module that the only options are Ran

SchemaRDD/DataFrame result partitioned according to the underlying datasource partitions

2015-03-23 Thread Stephen Boesch
Is there a way to take advantage of the underlying datasource partitions when generating a DataFrame/SchemaRDD via catalyst? It seems from the sql module that the only options are RangePartitioner and HashPartitioner - and further that those are selected automatically by the code. It was not

Re: Using regular rdd transforms on schemaRDD

2015-03-17 Thread kpeng1
Looks like if I use unionAll, this works. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Using-regular-rdd-transforms-on-schemaRDD-tp22105p22107.html Sent from the Apache Spark User List mailing list archive at Nabble.com
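
As a sketch of the resolution (Spark 1.x SchemaRDD assumed):

    import org.apache.spark.sql.SchemaRDD

    // Fold an array of SchemaRDDs with unionAll, which keeps the schema;
    // plain RDD.union would fall back to an untyped RDD[Row].
    def unionMany(rdds: Seq[SchemaRDD]): SchemaRDD = rdds.reduce(_ unionAll _)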

Using regular rdd transforms on schemaRDD

2015-03-17 Thread kpeng1
Hi All, I was wondering how rdd transformations work on schemaRDDs. Is there a way to force the rdd transform to keep the schemaRDD types, or do I need to recreate the schemaRDD by applying the applySchema method? Currently what I have is an array of SchemaRDDs and I just want to do a union

Re: Iterate over contents of schemaRDD loaded from parquet file to extract timestamp

2015-03-16 Thread Cheng Lian
owing type data from a parquet file, stored in a schemaRDD [7654321,2015-01-01 00:00:00.007,0.49,THU] Since, in spark version 1.1.0, parquet format doesn't support saving timestamp values, I have saved the timestamp data as string. Can you please tell me how to iterate over the data in this sc

Iterate over contents of schemaRDD loaded from parquet file to extract timestamp

2015-03-16 Thread anu
Spark Version - 1.1.0 Scala - 2.10.4 I have loaded following type data from a parquet file, stored in a schemaRDD [7654321,2015-01-01 00:00:00.007,0.49,THU] Since, in spark version 1.1.0, parquet format doesn't support saving timestamp values, I have saved the timestamp data as string. Ca
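
A sketch of one way to parse the string back while iterating, assuming the column layout of the example row above (the parquet path is illustrative and the getters may need adjusting to the actual column types):

    import java.sql.Timestamp

    val parsed = sqlContext.parquetFile("/path/to/file").map { row =>
      // column 1 holds the stringified timestamp, e.g. "2015-01-01 00:00:00.007"
      val ts = Timestamp.valueOf(row.getString(1))
      (row(0), ts, row(2), row(3))
    }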

Re: SchemaRDD: SQL Queries vs Language Integrated Queries

2015-03-11 Thread Tobias Pfeiffer
s far as I know, registerTempTable is just a Map[String, SchemaRDD] insertion, nothing that would be measurable. But there are no distributed/RDD operations involved, I think. Tobias

Re: SchemaRDD: SQL Queries vs Language Integrated Queries

2015-03-11 Thread Cesar Flores
transformer classes for feature extraction, and if I need to save the input and maybe output SchemaRDD of the transform function in every transformer, this may not be very efficient. Thanks On Tue, Mar 10, 2015 at 8:20 PM, Tobias Pfeiffer wrote: > Hi, > > On Tue, Mar 10, 2015 at 2:13 PM, Ces

Re: SchemaRDD: SQL Queries vs Language Integrated Queries

2015-03-10 Thread Tobias Pfeiffer
Hi, On Tue, Mar 10, 2015 at 2:13 PM, Cesar Flores wrote: > I am new to the SchemaRDD class, and I am trying to decide whether to use SQL > queries or Language Integrated Queries ( > https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD > ). > > C

Re: SchemaRDD: SQL Queries vs Language Integrated Queries

2015-03-10 Thread Reynold Xin
They should have the same performance, as they are compiled down to the same execution plan. Note that starting in Spark 1.3, SchemaRDD is renamed to DataFrame: https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html On Tue, Mar 10, 2015 at 2:13
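
To illustrate the equivalence, a sketch in Spark 1.2 syntax (table and column names are illustrative):

    import sqlContext._   // brings in the Symbol -> attribute DSL implicits

    // Both forms produce the same logical plan, and therefore the same
    // physical execution; `people` is a registered SchemaRDD.
    val viaSql = sqlContext.sql("SELECT name FROM people WHERE age >= 21")
    val viaDsl = people.where('age >= 21).select('name)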

SchemaRDD: SQL Queries vs Language Integrated Queries

2015-03-10 Thread Cesar Flores
I am new to the SchemaRDD class, and I am trying to decide whether to use SQL queries or Language Integrated Queries ( https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD ). Can someone tell me what is the main difference between the two approaches, besides using

Re: Construct model matrix from SchemaRDD automatically

2015-03-05 Thread Evan R. Sparks
Hi Wush, I'm CC'ing user@spark.apache.org (which is the new list) and BCC'ing u...@spark.incubator.apache.org. In Spark 1.3, schemaRDD is in fact being renamed to DataFrame (see: https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.h

Construct model matrix from SchemaRDD automatically

2015-03-05 Thread Wush Wu
Dear all, I am a new spark user from R. After exploring the schemaRDD, I notice that it is similar to data.frame. Is there a feature like `model.matrix` in R to convert schemaRDD to model matrix automatically according to the type without explicitly converting them one by one? Thanks, Wush

Spark Streaming and SchemaRDD usage

2015-03-04 Thread Haopu Wang
Hi, in the roadmap of Spark in 2015 (link: http://files.meetup.com/3138542/Spark%20in%202015%20Talk%20-%20Wendell.pptx), I saw SchemaRDD is designed to be the basis of BOTH Spark Streaming and Spark SQL. My question is: what's the typical usage of SchemaRDD in a Spark Streaming applic

Re: Spark SQL Converting RDD to SchemaRDD without hardcoding a case class in scala

2015-02-27 Thread Michael Armbrust
have seen it looks like spark sql is > the way to go and how I would go about this would be to load the csv > file > into an RDD and convert it into a schemaRDD by injecting in the schema via > a > case class. > > What I want to avoid is hard coding in the case class itself. I

Spark SQL Converting RDD to SchemaRDD without hardcoding a case class in scala

2015-02-27 Thread kpeng1
Hi All, I am currently trying to build out a spark job that would basically convert a csv file into parquet. From what I have seen it looks like spark sql is the way to go and how I would go about this would be to load the csv file into an RDD and convert it into a schemaRDD by injecting in
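
A sketch of the programmatic-schema alternative, so no case class needs to be hardcoded (file paths and delimiter are illustrative; the types import path shown is the Spark 1.3+ one, and pre-1.3 the last step would be applySchema instead of createDataFrame):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    // Derive the schema from the CSV header at runtime.
    val lines  = sc.textFile("/data/input.csv")
    val header = lines.first()
    val schema = StructType(header.split(",").map(StructField(_, StringType, nullable = true)))
    val rows   = lines.filter(_ != header).map(l => Row(l.split(",", -1): _*))

    val df = sqlContext.createDataFrame(rows, schema)
    df.saveAsParquetFile("/data/output.parquet")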

Re: Converting SchemaRDD/Dataframe to RDD[vector]

2015-02-26 Thread Xiangrui Meng
round and see others have asked similar questions. > > Given a schemaRDD I extract a result set that contains numbers, both Int and > Doubles. How do I construct a RDD[Vector]? In 1.2 I wrote the results to a > text file and then read them back in splitting them with some code I found in

Converting SchemaRDD/Dataframe to RDD[vector]

2015-02-26 Thread mobsniuk
I've been searching around and see others have asked similar questions. Given a schemaRDD I extract a result set that contains numbers, both Int and Doubles. How do I construct a RDD[Vector]? In 1.2 I wrote the results to a text file and then read them back in splitting them with some code I fou
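
A sketch of the direct conversion, which avoids the text-file round trip (the Int/Double handling follows the question; the string fallback is an extra assumption):

    import org.apache.spark.mllib.linalg.Vectors

    // Pull the numeric values out of each Row and build a dense vector.
    val vectors = schemaRDD.map { row =>
      val values = row.toSeq.map {
        case i: Int    => i.toDouble
        case d: Double => d
        case other     => other.toString.toDouble   // fallback, an assumption
      }
      Vectors.dense(values.toArray)
    }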

Re: [Spark SQL]: Convert SchemaRDD back to RDD

2015-02-23 Thread Michael Armbrust
> >> Hi Michael, >> >> I think that the feature (convert a SchemaRDD to a structured class RDD) >> is >> now available. But I didn't understand in the PR how exactly to do this. >> Can >> you give an example or doc links? >> >> Best regards >

Re: [Spark SQL]: Convert SchemaRDD back to RDD

2015-02-22 Thread Ted Yu
eb 22, 2015 at 11:51 AM, stephane.collot wrote: > Hi Michael, > > I think that the feature (convert a SchemaRDD to a structured class RDD) is > now available. But I didn't understand in the PR how exactly to do this. > Can > you give an example or doc links? > > B

Re: [Spark SQL]: Convert SchemaRDD back to RDD

2015-02-22 Thread stephane.collot
Hi Michael, I think that the feature (convert a SchemaRDD to a structured class RDD) is now available. But I didn't understand in the PR how exactly to do this. Can you give an example or doc links? Best regards -- View this message in context: http://apache-spark-user-list.10015
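
For reference, the conversion itself needs no dedicated feature; a sketch with illustrative field names and types:

    import org.apache.spark.rdd.RDD

    // Map each Row back into a structured class by position.
    case class Person(id: Int, name: String)
    val people: RDD[Person] = sqlContext
      .sql("SELECT id, name FROM people_table")
      .map(row => Person(row.getInt(0), row.getString(1)))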

how to get SchemaRDD SQL exceptions i.e. table not found exception

2015-02-13 Thread sachin Singh
Hi, can someone guide how to get a SQL exception trapped for a query executed using SchemaRDD, I mean suppose a table is not found. Thanks in advance, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-get-SchemaRDD-SQL-exceptions-i-e-table-not-found
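
A sketch of one way to trap it (table name illustrative): since analysis errors such as a missing table surface when the query is planned or executed, wrap both the sql() call and the action that forces it.

    import scala.util.{Failure, Success, Try}

    Try(sqlContext.sql("SELECT * FROM no_such_table").collect()) match {
      case Success(rows) => rows.foreach(println)
      case Failure(e)    => println(s"query failed: ${e.getMessage}")
    }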

Re: Spark SQL - Point lookup optimisation in SchemaRDD?

2015-02-11 Thread nitin
timisation-in-SchemaRDD-tp21555p21613.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

Spark SQL - Point lookup optimisation in SchemaRDD?

2015-02-09 Thread nitin
Hi All, I have a use case where I have cached my schemaRDD and I want to launch executors just on the partition which I know of (the prime use-case of PartitionPruningRDD). I tried something like the following: val partitionIdx = 2 val schemaRdd = hiveContext.table("myTable") //myTable is
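
A sketch of the PartitionPruningRDD route mentioned here, using the names from the question (partition index and table name per the quote):

    import org.apache.spark.rdd.PartitionPruningRDD

    // Run the action on just the one partition whose index is known.
    val schemaRdd = hiveContext.table("myTable")
    val partitionIdx = 2
    val pruned = PartitionPruningRDD.create(schemaRdd, idx => idx == partitionIdx)
    pruned.collect().foreach(println)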

Re: LeaseExpiredException while writing schemardd to hdfs

2015-02-05 Thread Petar Zecevic
Why don't you just map the rdd's rows to lines and then call saveAsTextFile()? On 3.2.2015. 11:15, Hafiz Mujadid wrote: I want to write a whole schemardd to a single file in hdfs but am facing the following exception org.apache.hadoop.ipc.Remot
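
Spelled out, the suggestion looks roughly like this (delimiter and output path are illustrative; funneling through one partition is only sensible for modest data sizes):

    // Render each Row as a delimited line, then write through a single
    // partition so hdfs ends up with a single part file.
    schemaRDD
      .map(_.mkString(","))
      .coalesce(1)
      .saveAsTextFile("/test/data/out")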

Re: Error in saving schemaRDD with Decimal as Parquet

2015-02-03 Thread Manoj Samel
Hi, Any thoughts ? Thanks, On Sun, Feb 1, 2015 at 12:26 PM, Manoj Samel wrote: > Spark 1.2 > > SchemaRDD has schema with decimal columns created like > > x1 = new StructField("a", DecimalType(14,4), true) > > x2 = new StructField("b", DecimalType(14,4)

LeaseExpiredException while writing schemardd to hdfs

2015-02-03 Thread Hafiz Mujadid
I want to write a whole schemardd to a single file in hdfs but am facing the following exception org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /test/data/data1.csv (inode 402042): File does not exist. Holder DFSClient_NONMAPREDUCE_-564238432_57

Re: Error in saving schemaRDD with Decimal as Parquet

2015-02-01 Thread Manoj Samel
I think I found the issue causing it. I was calling schemaRDD.coalesce(n).saveAsParquetFile to reduce the number of partitions in the parquet file - in which case the stack trace happens. If I coalesce the partitions before creating the schemaRDD then the schemaRDD.saveAsParquetFile call works for

Error in saving schemaRDD with Decimal as Parquet

2015-02-01 Thread Manoj Samel
Spark 1.2 SchemaRDD has schema with decimal columns created like x1 = new StructField("a", DecimalType(14,4), true) x2 = new StructField("b", DecimalType(14,4), true) Registering as SQL Temp table and doing SQL queries on these columns, including SUM etc. works fine, s
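
Restating the setup as a sketch, with the workaround from the follow-up applied (Spark 1.2 assumed, where these types live under org.apache.spark.sql; the row RDD, partition count, and path are illustrative):

    import org.apache.spark.sql._   // StructType, StructField, DecimalType in 1.2

    val schema = StructType(Seq(
      StructField("a", DecimalType(14, 4), true),
      StructField("b", DecimalType(14, 4), true)))

    // Per the follow-up above: coalesce the underlying row RDD before
    // building the schemaRDD, rather than coalescing the schemaRDD itself
    // just before saveAsParquetFile.
    val schemaRDD = sqlContext.applySchema(rowRDD.coalesce(8), schema)  // rowRDD assumed
    schemaRDD.saveAsParquetFile("/tmp/decimals.parquet")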

StackOverflowError with SchemaRDD

2015-01-28 Thread ankits
Hi, I am getting a stack overflow error when querying a schemardd comprised of parquet files. This is (part of) the stack trace: Caused by: java.lang.StackOverflowError at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce

Re: SparkSQL schemaRDD & MapPartitions calls - performance issues - columnar formats?

2015-01-20 Thread Cheng Lian
Nathan <nathan.mccar...@quantium.com.au>, Michael Armbrust <mich...@databricks.com> Cc: "user@spark.apache.org" <user@spark.apache.org> Subject: Re: SparkSQL schemaRDD & MapPartitions calls - performance issues -

Re: SparkSQL schemaRDD & MapPartitions calls - performance issues - columnar formats?

2015-01-16 Thread Michael Armbrust
> Cheers, > Nathan > > From: Cheng Lian > Date: Monday, 12 January 2015 1:21 am > To: Nathan , Michael Armbrust < > mich...@databricks.com> > Cc: "user@spark.apache.org" > > Subject: Re: SparkSQL schemaRDD & MapPartitions calls - performance >

Re: SparkSQL schemaRDD & MapPartitions calls - performance issues - columnar formats?

2015-01-15 Thread Nathan McCarthy
apache.org>> Subject: Re: SparkSQL schemaRDD & MapPartitions calls - performance issues - columnar formats? On 1/11/15 1:40 PM, Nathan McCarthy wrote: Thanks Cheng & Michael! Makes sense. Appreciate the tips! Idiomatic scala isn't performant. I’ll definitely start using while loo

Re: SparkSQL schemaRDD & MapPartitions calls - performance issues - columnar formats?

2015-01-11 Thread Cheng Lian
Nathan <nathan.mccar...@quantium.com.au>, "user@spark.apache.org" <user@spark.apache.org> Subject: Re: SparkSQL schemaRDD & MapPartitions calls - performance issues - columnar formats? The other thing to note here i

Re: SparkSQL schemaRDD & MapPartitions calls - performance issues - columnar formats?

2015-01-10 Thread Nathan McCarthy
t <mich...@databricks.com> Date: Saturday, 10 January 2015 3:41 am To: Cheng Lian <lian.cs@gmail.com> Cc: Nathan <nathan.mccar...@quantium.com.au>, "user@spark.apache.org" <user@spark.apache.org>

Re: SparkSQL schemaRDD & MapPartitions calls - performance issues - columnar formats?

2015-01-09 Thread Michael Armbrust
nwrapping" processes you mentioned below). > > > Now this takes around ~49 seconds… Even though test1 table is 100% > cached. The number of partitions remains the same… > > Now if I create a simple RDD of a case class HourSum(hour: Int, qty: > Double, sales: Double) >

Re: SparkSQL schemaRDD & MapPartitions calls - performance issues - columnar formats?

2015-01-09 Thread Cheng Lian
49 seconds… Even though test1 table is 100% cached. The number of partitions remains the same… Now if I create a simple RDD of a case class HourSum(hour: Int, qty: Double, sales: Double) Convert the SchemaRDD; val rdd = sqlC.sql("select * from test1").map{ r => HourSum(r.getInt(1), r.

Re: SparkSQL schemaRDD & MapPartitions calls - performance issues - columnar formats?

2015-01-08 Thread Nathan McCarthy
Any ideas? :) From: Nathan <nathan.mccar...@quantium.com.au> Date: Wednesday, 7 January 2015 2:53 pm To: "user@spark.apache.org" <user@spark.apache.org> Subject: SparkSQL schemaRDD & MapPartitions calls - performance

SparkSQL schemaRDD & MapPartitions calls - performance issues - columnar formats?

2015-01-06 Thread Nathan McCarthy
ithIndex.map(_.swap).iterator }.reduceByKey((a,b) => (a._1 + b._1, a._2 + b._2)).collect().foreach(println) Now this takes around ~49 seconds… Even though test1 table is 100% cached. The number of partitions remains the same… Now if I create a simple RDD of a case class HourSum(hour: Int, qty
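
Pieced together from the fragments in this thread, the mapPartitions variant looks roughly like the sketch below (case class and column positions taken from the quoted code; Spark 1.2-era, where reduceByKey needs the SparkContext._ implicits):

    import org.apache.spark.SparkContext._   // pair-RDD implicits (pre-1.3)

    case class HourSum(hour: Int, qty: Double, sales: Double)

    val perHour = sqlContext.sql("select * from test1")
      .map(r => HourSum(r.getInt(1), r.getDouble(2), r.getDouble(3)))
      .mapPartitions { it =>
        // pre-aggregate inside each partition with a mutable map
        val acc = scala.collection.mutable.Map.empty[Int, (Double, Double)]
        it.foreach { h =>
          val (q, s) = acc.getOrElse(h.hour, (0.0, 0.0))
          acc(h.hour) = (q + h.qty, s + h.sales)
        }
        acc.iterator
      }
      .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
    perHour.collect().foreach(println)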

Re: Add StructType column to SchemaRDD

2015-01-05 Thread Tobias Pfeiffer
Hi Michael, On Tue, Jan 6, 2015 at 3:43 PM, Michael Armbrust wrote: > Oh sorry, I'm rereading your email more carefully. Its only because you > have some setup code that you want to amortize? > Yes, exactly that. Concerning the docs, I'd be happy to contribute, but I don't really understand w

Re: Add StructType column to SchemaRDD

2015-01-05 Thread Michael Armbrust
QL doesn't really have a great support for partitions in general... > We do support for Hive TGFs though and we could possibly add better scala > syntax for this concept or something else. > > On Mon, Jan 5, 2015 at 9:52 PM, Tobias Pfeiffer wrote: > >> Hi, >> >>

Re: Add StructType column to SchemaRDD

2015-01-05 Thread Michael Armbrust
Pfeiffer wrote: > Hi, > > I have a SchemaRDD where I want to add a column with a value that is > computed from the rest of the row. As the computation involves a > network operation and requires setup code, I can't use > "SELECT *, myUDF(*) FROM rdd", > but I w

Add StructType column to SchemaRDD

2015-01-05 Thread Tobias Pfeiffer
Hi, I have a SchemaRDD where I want to add a column with a value that is computed from the rest of the row. As the computation involves a network operation and requires setup code, I can't use "SELECT *, myUDF(*) FROM rdd", but I wanted to use a combination of: - get schema of
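
One sketch of the combination being asked for, assuming Spark 1.3+ (expensiveSetup and lookup are hypothetical placeholders for the network client in the question; the table name comes from the quoted SQL): amortize the setup per partition with mapPartitions, append the computed value to each row, and re-apply an extended schema.

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    val input = sqlContext.table("rdd")
    val newSchema = StructType(input.schema.fields :+
      StructField("computed", StringType, nullable = true))

    val withCol = input.mapPartitions { rows =>
      val client = expensiveSetup()   // hypothetical: setup runs once per partition
      rows.map(row => Row.fromSeq(row.toSeq :+ lookup(client, row)))  // hypothetical lookup
    }
    val result = sqlContext.createDataFrame(withCol, newSchema)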

Re: SchemaRDD to RDD[String]

2014-12-30 Thread Yana
schemaRDD.first,list) and see if you get anything -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-to-RDD-String-tp20846p20910.html Sent from the Apache Spark User List mailing list a

Re: SchemaRDD to RDD[String]

2014-12-24 Thread Michael Armbrust
You might also try the following, which I think is equivalent: schemaRDD.map(_.mkString(",")) On Wed, Dec 24, 2014 at 8:12 PM, Tobias Pfeiffer wrote: > Hi, > > On Wed, Dec 24, 2014 at 3:18 PM, Hafiz Mujadid > wrote: >> >> I want to convert a schemaRDD into

Re: SchemaRDD to RDD[String]

2014-12-24 Thread Tobias Pfeiffer
Hi, On Wed, Dec 24, 2014 at 3:18 PM, Hafiz Mujadid wrote: > > I want to convert a schemaRDD into RDD of String. How can we do that? > > Currently I am doing like this which is not converting correctly no > exception but resultant strings are empty > > here is my code >

Re: SparkSQL: CREATE EXTERNAL TABLE with a SchemaRDD

2014-12-24 Thread Cheng Lian
...@gmail.com] *Sent:* Wednesday, December 24, 2014 4:26 AM *To:* user@spark.apache.org *Subject:* SparkSQL: CREATE EXTERNAL TABLE with a SchemaRDD Hi spark users, I'm trying to create external table using HiveContext after creating a schemaRDD and saving the RDD into a parquet file on hdfs. I

SchemaRDD to RDD[String]

2014-12-23 Thread Hafiz Mujadid
Hi dears! I want to convert a schemaRDD into an RDD of String. How can we do that? Currently I am doing it like this, which is not converting correctly: no exception, but the resultant strings are empty. Here is my code def SchemaRDDToRDD( schemaRDD : SchemaRDD ) : RDD[ String ] = { var
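
The helper rewritten with the mkString approach from the replies above (a sketch; in Spark 1.x a Row renders directly as a delimited string):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SchemaRDD

    def schemaRDDToRDD(schemaRDD: SchemaRDD): RDD[String] =
      schemaRDD.map(_.mkString(","))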

RE: SparkSQL: CREATE EXTERNAL TABLE with a SchemaRDD

2014-12-23 Thread Cheng, Hao
@spark.apache.org Subject: SparkSQL: CREATE EXTERNAL TABLE with a SchemaRDD Hi spark users, I'm trying to create external table using HiveContext after creating a schemaRDD and saving the RDD into a parquet file on hdfs. I would like to use the schema in the schemaRDD (rdd_table) when I creat

SparkSQL: CREATE EXTERNAL TABLE with a SchemaRDD

2014-12-23 Thread Jerry Lam
Hi spark users, I'm trying to create external table using HiveContext after creating a schemaRDD and saving the RDD into a parquet file on hdfs. I would like to use the schema in the schemaRDD (rdd_table) when I create the external table. For example: rdd_table.saveAsParquetFile("/
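
A hedged sketch of the two steps being combined here (Hive 0.13+ syntax; the path, table name, and column list are illustrative, and the columns must match the schema the schemaRDD writes):

    // Save the SchemaRDD as parquet, then point an external Hive table at it.
    rdd_table.saveAsParquetFile("/user/hive/warehouse/ext/my_table")
    hiveContext.sql("""
      CREATE EXTERNAL TABLE my_table (id INT, name STRING)
      STORED AS PARQUET
      LOCATION '/user/hive/warehouse/ext/my_table'
    """)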

Re: SchemaRDD to Hbase

2014-12-20 Thread Alex Kamil
I'm using JDBCRDD <https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.rdd.JdbcRDD> + Hbase JDBC driver <http://phoenix.apache.org/>+ schemaRDD <https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD> make sure to use

Re: SchemaRDD to Hbase

2014-12-20 Thread Subacini B
Hi , Can someone help me , Any pointers would help. Thanks Subacini On Fri, Dec 19, 2014 at 10:47 PM, Subacini B wrote: > Hi All, > > Is there any API that can be used directly to write schemaRDD to HBase?? > If not, what is the best way to write schemaRDD to HBase. > > Thanks > Subacini >

SchemaRDD to Hbase

2014-12-19 Thread Subacini B
Hi All, Is there any API that can be used directly to write schemaRDD to HBase?? If not, what is the best way to write schemaRDD to HBase. Thanks Subacini
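
There is no single-call API in Spark itself; a hedged sketch of the usual Hadoop-API route, assuming the HBase client jars are on the classpath (table name, column family, and row layout are illustrative):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Put
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapred.TableOutputFormat
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.hadoop.mapred.JobConf
    import org.apache.spark.SparkContext._   // pair-RDD implicits (pre-1.3)

    val jobConf = new JobConf(HBaseConfiguration.create())
    jobConf.setOutputFormat(classOf[TableOutputFormat])
    jobConf.set(TableOutputFormat.OUTPUT_TABLE, "my_table")

    // Map each Row to an HBase Put keyed by the first column.
    schemaRDD.map { row =>
      val put = new Put(Bytes.toBytes(row.getString(0)))
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("col1"), Bytes.toBytes(row.getString(1)))
      (new ImmutableBytesWritable, put)
    }.saveAsHadoopDataset(jobConf)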

Re: Adding a column to a SchemaRDD

2014-12-15 Thread Nathan Kronenfeld
ect(Star(Node), 'seven.getField("mod"), 'eleven.getField("mod")) > > You need to import org.apache.spark.sql.catalyst.analysis.Star in advance. > > #2 > > After you make the transform above, you do not need to make SchemaRDD > manually. > Because that jdata.select

Re: SchemaRDD partition on specific column values?

2014-12-15 Thread Nitin Goyal
nds up being, but it is certainly possible. > > On Thu, Dec 11, 2014 at 3:55 AM, nitin wrote: >> >> Can we take this as a performance improvement task in Spark-1.2.1? I can >> help >> contribute for this. >> >> >> >> >> -- >> View

Re: Adding a column to a SchemaRDD

2014-12-15 Thread Yanbo Liang
After you make the transform above, you do not need to make the SchemaRDD manually, because jdata.select() returns a SchemaRDD and you can operate on it directly. For example, the following code snippet will return a new SchemaRDD with a longer Row: val t1 = jdata.select(Star(Node), 'seven.getFie

Re: Adding a column to a SchemaRDD

2014-12-14 Thread Tobias Pfeiffer
over for how to do this if my added value is a > scala function, with no luck. > > Let's say I have a SchemaRDD with columns A, B, and C, and I want to add a > new column, D, calculated using Utility.process(b, c), and I want (of > course) to pass in the value B and C from each ro

Re: SchemaRDD partition on specific column values?

2014-12-14 Thread Michael Armbrust
nitin wrote: > > Can we take this as a performance improvement task in Spark-1.2.1? I can > help > contribute for this. > > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-partition-on-specific-column-val

Re: Adding a column to a SchemaRDD

2014-12-12 Thread Nathan Kronenfeld
(1) I understand about immutability, that's why I said I wanted a new SchemaRDD. (2) I specifically asked for a non-SQL solution that takes a SchemaRDD, and results in a new SchemaRDD with one new function. (3) The DSL stuff is a big clue, but I can't find adequate documentation for it

Re: Adding a column to a SchemaRDD

2014-12-12 Thread Yanbo Liang
ext(sc) import sqlContext._ val d1 = sc.parallelize(1 to 10).map { i => Person(i,i+1,i+2)} val d2 = d1.select('id, 'score, 'id + 'score) d2.foreach(println) 2014-12-12 14:11 GMT+08:00 Nathan Kronenfeld : > Hi, there. > > I'm trying to understand how to augment d

Adding a column to a SchemaRDD

2014-12-11 Thread Nathan Kronenfeld
Hi, there. I'm trying to understand how to augment data in a SchemaRDD. I can see how to do it if can express the added values in SQL - just run "SELECT *,valueCalculation AS newColumnName FROM table" I've been searching all over for how to do this if my added value is a sca

Re: SchemaRDD partition on specific column values?

2014-12-11 Thread nitin
Can we take this as a performance improvement task in Spark-1.2.1? I can help contribute for this. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-partition-on-specific-column-values-tp20350p20623.html Sent from the Apache Spark User List mailing

Re: Convert RDD[Map[String, Any]] to SchemaRDD

2014-12-08 Thread Jianshi Huang
've created a JIRA: https://issues.apache.org/jira/browse/SPARK-4782 >>> >>> Jianshi >>> >>> On Sun, Dec 7, 2014 at 2:32 PM, Jianshi Huang >>> wrote: >>> >>>> Hi, >>>> >>>> What's the best way to

Re: Convert RDD[Map[String, Any]] to SchemaRDD

2014-12-08 Thread Yin Huai
ps://issues.apache.org/jira/browse/SPARK-4782 >> >> Jianshi >> >> On Sun, Dec 7, 2014 at 2:32 PM, Jianshi Huang >> wrote: >> >>> Hi, >>> >>> What's the best way to convert RDD[Map[String, Any]] to a SchemaRDD? >>> >>> I

Re: Convert RDD[Map[String, Any]] to SchemaRDD

2014-12-07 Thread Jianshi Huang
a JIRA: https://issues.apache.org/jira/browse/SPARK-4782 > > Jianshi > > On Sun, Dec 7, 2014 at 2:32 PM, Jianshi Huang > wrote: > >> Hi, >> >> What's the best way to convert RDD[Map[String, Any]] to a SchemaRDD? >> >> I'm currently converting each Ma

Re: Convert RDD[Map[String, Any]] to SchemaRDD

2014-12-06 Thread Jianshi Huang
Hmm.. I've created a JIRA: https://issues.apache.org/jira/browse/SPARK-4782 Jianshi On Sun, Dec 7, 2014 at 2:32 PM, Jianshi Huang wrote: > Hi, > > What's the best way to convert RDD[Map[String, Any]] to a SchemaRDD? > > I'm currently converting ea

Convert RDD[Map[String, Any]] to SchemaRDD

2014-12-06 Thread Jianshi Huang
Hi, What's the best way to convert RDD[Map[String, Any]] to a SchemaRDD? I'm currently converting each Map to a JSON String and doing JsonRDD.inferSchema. How about adding inferSchema support to Map[String, Any] directly? It would be very useful. Thanks, -- Jianshi Huang LinkedI
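
While SPARK-4782 is open, a manual sketch that avoids the JSON round trip (Spark 1.3+ API; fixing the column order and inferring each type from a single sample map are simplifying assumptions):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    val mapRDD = sc.parallelize(Seq(
      Map[String, Any]("name" -> "a", "count" -> 1),
      Map[String, Any]("name" -> "b", "count" -> 2)))

    val sample = mapRDD.first()
    val cols   = sample.keys.toSeq                 // fix a column order
    def typeOf(v: Any): DataType = v match {       // naive inference, an assumption
      case _: Int    => IntegerType
      case _: Double => DoubleType
      case _         => StringType
    }
    val schema = StructType(cols.map(c => StructField(c, typeOf(sample(c)), true)))
    val rows   = mapRDD.map(m => Row(cols.map(m): _*))
    val df     = sqlContext.createDataFrame(rows, schema)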

Re: SchemaRDD partition on specific column values?

2014-12-05 Thread Michael Armbrust
> > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-partition-on-specific-column-values-tp20350p20424.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > --

Re: SchemaRDD partition on specific column values?

2014-12-04 Thread nitin
nge before JOIN step) and improve overall performance? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-partition-on-specific-column-values-tp20350p20424.html Sent from the Apache Spark User List mailing list a

SchemaRDD partition on specific column values?

2014-12-04 Thread nitin
oid the partitioning based on ID by preprocessing it (and then cache it). Thanks in Advance -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-partition-on-specific-column-values-tp20350.html Sent from the Apache Spark User List mailing list a

How to create a new SchemaRDD which is not based on original SparkPlan?

2014-12-03 Thread Tim Chou
Hi All, My question is about lazy running mode for SchemaRDD, I guess. I know lazy mode is good, however, I still have this demand. For example, here is the first SchemaRDD, named result.(select * from table where num>1 and num < 4): results: org.apache.spark.sql.SchemaRDD = SchemaRDD[

Re: SchemaRDD + SQL , loading projection columns

2014-12-03 Thread Vishnusaran Ramaswamy
gs >> where rid = 'dd4455ee' and module = 'query' ") >> >> when i run filteredLogs.collect.foreach(println) , i see all of the 16GB >> data loaded. >> >> How do I load only the columns used in filters first and then load the >> payload

Re: SchemaRDD + SQL , loading projection columns

2014-12-02 Thread Michael Armbrust
row matching the filter criteria? > > Let me know if this can be done in a different way. > > Thank you, > Vishnu. > > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-SQL-loading-projection-columns-tp20189

Re: Standard SQL tool access to SchemaRDD

2014-12-02 Thread Jim Carroll
Thanks! I'll give it a try. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Standard-SQL-tool-access-to-SchemaRDD-tp20197p20202.html Sent from the Apache Spark User List mailing list archive at Nabbl

Re: Standard SQL tool access to SchemaRDD

2014-12-02 Thread Michael Armbrust
Jim > > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Standard-SQL-tool-access-to-SchemaRDD-tp20197.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > ---

Standard SQL tool access to SchemaRDD

2014-12-02 Thread Jim Carroll
.nabble.com/Standard-SQL-tool-access-to-SchemaRDD-tp20197.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h

SchemaRDD + SQL , loading projection columns

2014-12-02 Thread Vishnusaran Ramaswamy
message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-SQL-loading-projection-columns-tp20189.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail

Re: Creating a SchemaRDD from an existing API

2014-12-01 Thread Michael Armbrust
> > On Sat, Nov 29, 2014 at 12:57 AM, Michael Armbrust > wrote: > >> You probably don't need to create a new kind of SchemaRDD. Instead I'd >> suggest taking a look at the data sources API that we are adding in Spark >> 1.2. There is not a ton of docu

Re: Creating a SchemaRDD from an existing API

2014-12-01 Thread Niranda Perera
Hi Michael, About this new data source API, what type of data sources would it support? Does it necessarily have to be an RDBMS? Cheers On Sat, Nov 29, 2014 at 12:57 AM, Michael Armbrust wrote: > You probably don't need to create a new kind of SchemaRDD. Instead I'd > suggest

Re: Creating a SchemaRDD from an existing API

2014-11-28 Thread Michael Armbrust
You probably don't need to create a new kind of SchemaRDD. Instead I'd suggest taking a look at the data sources API that we are adding in Spark 1.2. There is not a ton of documentation, but the test cases show how to implement the various interfaces <https://github.com/apache/spar

Creating a SchemaRDD from an existing API

2014-11-27 Thread Niranda Perera
Hi, I am evaluating Spark for an analytic component where we do batch processing of data using SQL. So, I am particularly interested in Spark SQL and in creating a SchemaRDD from an existing API [1]. This API exposes elements in a database as datasources. Using the methods allowed by this data
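
A minimal sketch of a relation built on the data sources API Michael points to (Spark 1.3+ trait names; the one-column range table is purely illustrative):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
    import org.apache.spark.sql.types._

    // Expose a fixed integer range as a one-column table.
    class OneColumnRelation(override val sqlContext: SQLContext)
        extends BaseRelation with TableScan {
      override def schema: StructType =
        StructType(Seq(StructField("value", IntegerType)))
      override def buildScan(): RDD[Row] =
        sqlContext.sparkContext.parallelize(1 to 100).map(Row(_))
    }

    // Entry point Spark SQL looks up when the package name is used in
    // CREATE TEMPORARY TABLE ... USING or sqlContext.load.
    class DefaultSource extends RelationProvider {
      override def createRelation(sqlContext: SQLContext,
                                  parameters: Map[String, String]): BaseRelation =
        new OneColumnRelation(sqlContext)
    }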

Re: SchemaRDD compute function

2014-11-26 Thread Michael Armbrust
takeOrdered, etc. On Wed, Nov 26, 2014 at 5:05 AM, Jörg Schad wrote: > Hi, > I have a short question regarding the compute() of a SchemaRDD. > For SchemaRDD the actual queryExecution seems to be triggered via > collect(), while the compute triggers only the compute() of the parent
