Spark SQL/DDFs for production

2015-07-21 Thread bipin
Hi, I want to ask about an issue I have faced while using Spark. I load dataframes from parquet files. Some dataframes' parquet files have lots of partitions, >10 million rows. Running a "where id = x" query on the dataframe scans all partitions. When saving to an rdd object/parquet there is a partition column. The men
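
A common mitigation, sketched below under the assumption of the Spark 1.4+ DataFrameWriter API (column and path names here are hypothetical): write the parquet data partitioned by the filter column, so that equality predicates prune directories instead of scanning every file.

    import org.apache.spark.sql.functions.col

    // Write parquet partitioned by a reasonably coarse column so that
    // equality filters on it read only the matching directory.
    val events = sqlContext.read.parquet("/data/events")
    events.write.partitionBy("region").parquet("/data/events_by_region")

    // This read touches only the region=apac directory.
    val hits = sqlContext.read.parquet("/data/events_by_region")
      .where(col("region") === "apac")

Note that partitioning directly by a >10-million-value id column would create far too many directories; a coarser derived column (e.g. a bucket of the id) works better.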

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-03 Thread bipin
I have a hunch I want to share: I feel that data is not being deallocated in memory (at least not like in 1.3). Once it goes in-memory it just stays there. Spark SQL works fine: the same query, when run on a new shell, won't throw that error, but when run on a shell which has been used for other queries

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-03 Thread bipin
I will second this. I very rarely used to get out-of-memory errors in 1.3; now I get these errors all the time. I feel that I could work in a 1.3 spark-shell for long periods of time without Spark throwing that error, whereas in 1.4 the shell needs to be restarted or gets killed frequently. -- Vie

Multiple Join Conditions in dataframe join

2015-07-03 Thread bipin
_Master.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign"), "left") When I do this I get the error: too many arguments for method apply. Thanks Bipin -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Mu
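
For reference, a multi-condition equality join is normally written as a single boolean Column expression (DataFrame.apply takes exactly one column name, hence the error above). A minimal sketch, with hypothetical DataFrame names standing in for the ones in the post:

    // Combine the equality conditions with && into one join expression.
    val joined = leads.join(master,
      leads("LeadSource")   === master("LeadSource") &&
      leads("Utm_Source")   === master("Utm_Source") &&
      leads("Utm_Medium")   === master("Utm_Medium") &&
      leads("Utm_Campaign") === master("Utm_Campaign"),
      "left")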

Re: BigDecimal problem in parquet file

2015-06-18 Thread Bipin Nag
wide tables. > > Cheng > > > On 6/15/15 5:48 AM, Bipin Nag wrote: > > Hi Davies, > > I have tried recent 1.4 and 1.5-snapshot to 1) open the parquet and save > it again or 2) apply schema to rdd and save dataframe as parquet, but now I > get this error (right in t

Re: BigDecimal problem in parquet file

2015-06-15 Thread Bipin Nag
bug. My error doesn't show up in newer versions, so this is the problem to fix now. Thanks On 13 June 2015 at 06:31, Davies Liu wrote: > Maybe it's related to a bug, which is fixed by > https://github.com/apache/spark/pull/6558 recently. > > On Fri, Jun 12, 2015 at 5:3

Re: BigDecimal problem in parquet file

2015-06-12 Thread Bipin Nag
have to change it properly. Thanks for helping out. Bipin On 12 June 2015 at 14:57, Cheng Lian wrote: > On 6/10/15 8:53 PM, Bipin Nag wrote: > > Hi Cheng, > > I am using Spark 1.3.1 binary available for Hadoop 2.6. I am loading an > existing parquet file, then repartitioni

Re: BigDecimal problem in parquet file

2015-06-10 Thread Bipin Nag
, 1, lastpk, 1, JdbcRDD.resultSetToObjectArray)
myRDD.saveAsObjectFile("rawdata/" + name)

For applying schema and saving the parquet:

val myschema = schemamap(name)
val myrdd = sc.objectFile[Array[Object]]("/home/bipin/rawdata/" + name).map(x => org.apache.spark.s

BigDecimal problem in parquet file

2015-06-09 Thread bipin
Hi, when I try to save my data frame as a parquet file I get the following error:

java.lang.ClassCastException: scala.runtime.BoxedUnit cannot be cast to org.apache.spark.sql.types.Decimal
  at org.apache.spark.sql.parquet.RowWriteSupport.writePrimitive(ParquetTableSupport.scala:220)
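
That cast failure usually means a field the schema declares as decimal actually holds () (Unit) in the Row. A sketch of building rows that carry explicit java.math.BigDecimal (or null) values under an explicit schema; the column names, table path, and precision/scale here are hypothetical:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    val schema = StructType(Seq(
      StructField("id", LongType, nullable = false),
      StructField("amount", DecimalType(18, 2), nullable = true)))

    // Make sure the decimal slot holds java.math.BigDecimal or null,
    // never Unit, before the schema is applied.
    val rawRdd = sc.objectFile[Array[Object]]("/home/bipin/rawdata/Customer")
    val rows = rawRdd.map { arr =>
      Row(arr(0).asInstanceOf[Long],
          Option(arr(1)).map(v => new java.math.BigDecimal(v.toString)).orNull)
    }
    sqlContext.createDataFrame(rows, schema).saveAsParquetFile("/tmp/fixed.parquet")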

Re: Error in using saveAsParquetFile

2015-06-08 Thread Bipin Nag
wrote: > I suspect that Bookings and Customerdetails both have a PolicyType field, > one is string and the other is an int. > > > Cheng > > > On 6/8/15 9:15 PM, Bipin Nag wrote: > > Hi Jeetendra, Cheng > > I am using following code for joining > > val

Re: Error in using saveAsParquetFile

2015-06-08 Thread Bipin Nag
e joined DataFrame whose PolicyType is string to an >> existing Parquet file whose PolicyType is int? The exception indicates that >> Parquet found a column with conflicting data types. >> >> Cheng >> >> >> On 6/8/15 5:29 PM, bipin wrote: >> >>

Error in using saveAsParquetFile

2015-06-08 Thread bipin
Hi, I get this error message when saving a table:

parquet.io.ParquetDecodingException: The requested schema is not compatible with the file schema. incompatible types: optional binary PolicyType (UTF8) != optional int32 PolicyType
  at parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompa
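
One way out is to cast the conflicting column to the type already stored in the file before writing. A minimal sketch (the int target comes from the error above; the DataFrame and the other column names are hypothetical):

    import org.apache.spark.sql.functions.col

    // Cast PolicyType from string to int so the written schema matches
    // the existing parquet file's int32 column; list the other columns as-is.
    val fixed = joined.select(
      col("CustomerId"),
      col("PolicyType").cast("int").as("PolicyType"))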

Create dataframe from saved objectfile RDD

2015-05-31 Thread bipin
"/home/bipin/rawdata/"+name) But I get java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to org.apache.spark.sql.Row. How do I work around this? Is there a better way? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Create-dataf
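
createDataFrame expects an RDD[Row], so each Array[Object] needs wrapping first. A minimal sketch, reusing name and the myschema StructType from the earlier posts:

    import org.apache.spark.sql.Row

    // Row.fromSeq turns any Seq of values into a Row, so the saved
    // Array[Object] elements can be wrapped directly.
    val rowRdd = sc.objectFile[Array[Object]]("/home/bipin/rawdata/" + name)
      .map(arr => Row.fromSeq(arr.toSeq))
    val df = sqlContext.createDataFrame(rowRdd, myschema)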

Re: How to group multiple row data?

2015-04-30 Thread Bipin Nag
1,2,3. On 29 April 2015 at 18:04, Manoj Awasthi wrote: > Sorry but I didn't fully understand the grouping. This line: > > >> The group must only take the closest previous trigger. The first one > hence shows alone. > > Can you please explain further? > > >

How to group multiple row data?

2015-04-29 Thread bipin
Hi, I have a ddf with schema (CustomerID, SupplierID, ProductID, Event, CreatedOn); the first three are Longs, Event can only be 1, 2 or 3, and CreatedOn is a timestamp. How can I make a group triplet/doublet/singlet out of them such that I can infer that a Customer registered an event from 1 to 2 and if p
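
A sketch of one approach to this kind of grouping, assuming Event is stored as an Int and each key's event list fits in memory: key by the ID triple, time-order each group, then walk the sorted list attaching each event to its closest preceding trigger.

    import java.sql.Timestamp

    // Key by (CustomerID, SupplierID, ProductID) and time-order each
    // group's (Event, CreatedOn) pairs; a further mapValues pass can
    // then pair each event with the closest preceding trigger.
    val grouped = ddf.rdd
      .map { r =>
        val key = (r.getLong(0), r.getLong(1), r.getLong(2))
        (key, (r.getAs[Int](3), r.getAs[Timestamp](4)))
      }
      .groupByKey()
      .mapValues(_.toSeq.sortBy(_._2.getTime))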

How to merge two dataframes with same schema

2015-04-22 Thread bipin
I have looked into the sqlContext documentation but there is nothing on how to merge two data-frames. How can I do this? Thanks Bipin -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-merge-two-dataframes-with-same-schema-tp22606.html Sent from the
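
For DataFrames that share a schema, the 1.3-era answer is unionAll; a minimal sketch:

    // Concatenates the rows of two same-schema DataFrames; unlike SQL's
    // UNION it does not deduplicate (chain .distinct() if that matters).
    val merged = df1.unionAll(df2)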

Spark REPL no progress when run in cluster

2015-04-21 Thread bipin
Hi all, I am facing an issue: whenever I run a job on my Mesos cluster, I cannot see any progress in my terminal. It shows: [Stage 0:> (0 + 0) / 204] I have set up the cluster on AWS EC2 manually. I first run the Mesos master and slaves, then run s

Re: Microsoft SQL jdbc support from spark sql

2015-04-16 Thread bipin
Looks like a good option. BTW v3.0 is around the corner. http://slick.typesafe.com/news/2015/04/02/slick-3.0.0-RC3-released.html Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Microsoft-SQL-jdbc-support-from-spark-sql-tp22399p22521.html Sent from the Apach

Re: Microsoft SQL jdbc support from spark sql

2015-04-16 Thread bipin
I am running the queries from spark-sql. I don't think it can communicate with the thrift server. Can you tell me how I should run the queries to make it work? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Microsoft-SQL-jdbc-support-from-spark-sql-tp22399p22516.ht

Re: Microsoft SQL jdbc support from spark sql

2015-04-16 Thread bipin
I was running the spark shell and sql with the --jars option containing the paths when I got my error. I am not sure what the correct way to add jars is. I tried placing the jar inside the directory you said, but I still get the error. I will give the code you posted a try. Thanks. -- View this message

Sqoop parquet file not working in Spark

2015-04-13 Thread bipin
Hi, I imported a table from an MS SQL server with Sqoop 1.4.5 in parquet format. But when I try to load it from the Spark shell, it throws an error like:

scala> val df1 = sqlContext.load("/home/bipin/Customer2")
scala.collection.parallel.CompositeThrowable: Multiple exceptions thrown duri

Re: Microsoft SQL jdbc support from spark sql

2015-04-06 Thread Bipin Nag
s some thoughts - credit to Cheng Lian for this - > about making the JDBC data source extensible for third party support > possibly via slick. > > > On Mon, Apr 6, 2015 at 10:41 PM bipin wrote: > >> Hi, I am trying to pull data from ms-sql server. I have tried using the

Microsoft SQL jdbc support from spark sql

2015-04-06 Thread bipin
Hi, I am trying to pull data from an MS SQL server. I have tried using spark.sql.jdbc:

CREATE TEMPORARY TABLE c
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:sqlserver://10.1.0.12:1433\;databaseName=dbname\;",
  dbtable "Customer"
);

But it shows: java.sql.SQLException: No suitable driver fou
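
A "No suitable driver" error usually means the JDBC driver class was never registered with DriverManager. A sketch of the equivalent programmatic load in the 1.3-era API, assuming Microsoft's sqljdbc jar is on the driver and executor classpaths (URL and table are taken from the post):

    // The "driver" option makes Spark load and register the driver class
    // before opening connections; semicolons need no escaping in a
    // Scala string, unlike in the spark-sql shell.
    val customers = sqlContext.load("jdbc", Map(
      "url"     -> "jdbc:sqlserver://10.1.0.12:1433;databaseName=dbname",
      "driver"  -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
      "dbtable" -> "Customer"))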