Re: spark 1.6.0 connect to hive metastore

2016-02-09 Thread Alexandr Dzhagriev
, because schema verification is a good thing I would assume? On Tue, Feb 9, 2016 at 3:25 PM, Alexandr Dzhagriev wrote: Hi Koert, As far as I can see you are using Derby: "Using direct SQL, underlying DB is DERBY"

Re: spark 1.6.0 connect to hive metastore

2016-02-09 Thread Alexandr Dzhagriev
Hi Koert, As far as I can see you are using Derby: "Using direct SQL, underlying DB is DERBY", not MySQL, which is used for the metastore. That means Spark couldn't find hive-site.xml on your classpath. Can you check that, please? Thanks, Alex. On Tue, Feb 9, 2016 at 8:58 PM, Koert Kuipers wrote
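The point above can be sketched concretely: Spark only talks to a MySQL-backed metastore when a hive-site.xml along these lines is on the driver's classpath (e.g. in $SPARK_HOME/conf); otherwise it silently falls back to embedded Derby. The hostname, database name, and credentials below are placeholders, not values from the thread:

```xml
<configuration>
  <!-- JDBC connection for the Hive metastore; when this file is missing,
       Spark falls back to an embedded Derby database. -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://metastore-host:3306/metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>secret</value>
  </property>
</configuration>
```

With this file in place, the "underlying DB is DERBY" log line should be replaced by one naming MySQL.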

spark-cassandra-connector BulkOutputWriter

2016-02-09 Thread Alexandr Dzhagriev
Hello all, I looked through the Cassandra Spark integration ( https://github.com/datastax/spark-cassandra-connector) and couldn't find any usages of the BulkOutputWriter ( http://www.datastax.com/dev/blog/bulk-loading), an awesome tool for creating local SSTables, which could later be uploaded to

Re: java.lang.ArrayIndexOutOfBoundsException when attempting broadcastjoin

2016-02-03 Thread Alexandr Dzhagriev
Hi Sebastian, Do you have any updates on the issue? I faced pretty much the same problem, and disabling Kryo plus raising spark.network.timeout up to 600s helped. For my job it takes about 5 minutes to broadcast the variable (~5GB in my case), but then it's fast. I mean, much faster than shuffling
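The two workarounds described can be passed straight to spark-submit. The property names (spark.serializer, spark.network.timeout) are real Spark settings; the class name and jar below are placeholders for illustration:

```shell
# Disable Kryo by falling back to Java serialization, and raise the
# network timeout from its 120s default so the ~5GB broadcast can finish.
spark-submit \
  --conf spark.serializer=org.apache.spark.serializer.JavaSerializer \
  --conf spark.network.timeout=600s \
  --class com.example.MyBroadcastJoinJob my-job.jar
```

The same properties can also be set on the SparkConf in code before the SparkContext is created.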

Re: Failed to 'collect_set' with dataset in spark 1.6

2016-02-01 Thread Alexandr Dzhagriev
On Mon, Feb 1, 2016 at 9:55 AM, Alexandr Dzhagriev wrote: Hi, That's another thing: the Record case class should be outside. I ran it via spark-submit. Thanks, Alex. On Mon, Feb 1, 2016 at 6:41 PM, Ted

Re: Failed to 'collect_set' with dataset in spark 1.6

2016-02-01 Thread Alexandr Dzhagriev
at org.apache.spark.sql.Dataset.(Dataset.scala:80) at org.apache.spark.sql.Dataset.(Dataset.scala:91) at org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:488) at org.apache.spark.sql.SQLImplicits.rddToDatasetHolder(SQLImplicits.scala:71) ... 53 elided On Mon, Feb 1, 2016 at 9:09 AM, Ale

Re: Failed to 'collect_set' with dataset in spark 1.6

2016-02-01 Thread Alexandr Dzhagriev
anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:130) Thanks, Alex. On Mon, Feb 1, 2016 at 6:03 PM, Alexandr Dzhagriev wrote: Hi Ted, That doesn't help either, as one method delegates to

Re: Failed to 'collect_set' with dataset in spark 1.6

2016-02-01 Thread Alexandr Dzhagriev
Have you tried: agg(collect_list($"b")) On Mon, Feb 1, 2016 at 8:50 AM, Alexandr Dzhagriev wrote: Hello, I'm trying to run the following example code: import org.apache.spark.sql.hive.HiveContext

Failed to 'collect_set' with dataset in spark 1.6

2016-02-01 Thread Alexandr Dzhagriev
Hello, I'm trying to run the following example code: import org.apache.spark.sql.hive.HiveContext import org.apache.spark.{SparkContext, SparkConf} import org.apache.spark.sql.functions._ case class RecordExample(a: Int, b: String) object ArrayExample { def main(args: Array[String]) { va
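The code the thread circles around, together with the fixes suggested later in it (define the case class at the top level, and use collect_list on a DataFrame rather than collect_set on a Dataset), can be sketched roughly as below. This is a reconstruction of the kind of program under discussion, not the exact one posted:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.functions._

// In Spark 1.6, the case class must live at the top level, not inside main,
// or Dataset encoder derivation fails at runtime.
case class RecordExample(a: Int, b: String)

object ArrayExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ArrayExample")
    val sc = new SparkContext(conf)
    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq(RecordExample(1, "x"), RecordExample(1, "y"))).toDF()

    // Aggregating on the DataFrame with collect_list worked where
    // collect_set on a Dataset raised the AnalysisException seen in the thread.
    df.groupBy($"a").agg(collect_list($"b")).show()
  }
}
```

Run with spark-submit against a Spark 1.6 build with Hive support; it is not runnable standalone.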

Re: Stream S3 server to Cassandra

2016-01-28 Thread Alexandr Dzhagriev
Hello Sateesh, I think you can use a file stream, e.g. streamingContext.fileStream[KeyClass, ValueClass, InputFormatClass](dataDirectory) to create a stream and then process the RDDs as you are doing now. Thanks, Alex. On Thu, Jan 28, 2016 at 10:56 AM, Sateesh Karuturi < sateesh.karutu...@gma
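The fileStream suggestion can be sketched as follows; the type parameters, batch interval, and the S3 path are illustrative assumptions, since the thread does not specify them:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object S3ToCassandra {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("S3ToCassandra")
    val ssc = new StreamingContext(conf, Seconds(30))

    // Watch a directory for newly arrived files (an s3n:// path works once
    // the S3 filesystem and credentials are configured in the Hadoop conf)
    // and turn each new file into an RDD of its lines.
    val stream =
      ssc.fileStream[LongWritable, Text, TextInputFormat]("s3n://my-bucket/incoming/")

    // Each micro-batch is an ordinary RDD, so the existing processing
    // logic can run here, e.g. a write via the spark-cassandra-connector.
    stream.map(_._2.toString).foreachRDD { rdd =>
      rdd.foreach(line => println(line)) // placeholder for the Cassandra write
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

As with any fileStream source, only files that appear in the directory after the stream starts are picked up.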