Setting log level to DEBUG while keeping httpclient.wire on WARN

2018-06-29 Thread Daniel Haviv
Hi, I'm trying to debug an issue with Spark so I've set the log level to DEBUG, but at the same time I'd like to avoid httpclient.wire's verbose output by setting it to WARN. I tried the following log4j.properties config but I'm still getting DEBUG output for httpclient.wire: log4j.rootCategory=DE
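For reference, a minimal log4j.properties sketch of what the thread is after (root at DEBUG, wire logger overridden), assuming the standard log4j 1.x logger-override syntax and that the library logs under one of the names below:

    log4j.rootCategory=DEBUG, console
    # Quiet wire-level HTTP logging while everything else stays at DEBUG
    log4j.logger.httpclient.wire=WARN
    log4j.logger.org.apache.commons.httpclient=WARN
    log4j.logger.org.apache.http.wire=WARN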

Thrift server not exposing temp tables (spark.sql.hive.thriftServer.singleSession=true)

2018-05-30 Thread Daniel Haviv
Hi, I would like to expose a DF through the Thrift server, but even though I enable spark.sql.hive.thriftServer.singleSession I still can't see the temp table. I'm using Spark 2.2.0: spark-shell --conf spark.sql.hive.thriftServer.singleSession=true import org.apache.spark.sql.hive.thriftserver
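A minimal sketch of the pattern that usually makes a temp view visible over JDBC: start the Thrift server from the same session that owns the view. The view name and data here are made up:

    // spark-shell --conf spark.sql.hive.thriftServer.singleSession=true
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    val df = spark.range(10).toDF("id")
    df.createOrReplaceTempView("my_temp")                  // temp view to expose
    HiveThriftServer2.startWithContext(spark.sqlContext)   // JDBC clients now share this session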

Writing a UDF that works with an Interval in PySpark

2017-12-11 Thread Daniel Haviv
Hi, I'm trying to write a variant of date_add that accepts an interval as a second parameter so that I could use the following syntax with SparkSQL: select date_add(cast('1970-01-01' as date), interval 1 day) but I'm getting the following error: ValueError: (ValueError(u'Could not parse datatype:

Re: UDF issues with spark

2017-12-10 Thread Daniel Haviv
Some code would help to debug the issue On Fri, 8 Dec 2017 at 21:54 Afshin, Bardia < bardia.afs...@changehealthcare.com> wrote: > Using pyspark cli on spark 2.1.1 I’m getting out of memory issues when > running the udf function on a recordset count of 10 with a mapping of the > same value (arbirt

Writing custom Structured Streaming receiver

2017-11-01 Thread Daniel Haviv
Hi, Is there a guide to writing a custom Structured Streaming receiver? Thank you. Daniel

Optimizing dataset joins

2017-05-18 Thread Daniel Haviv
Hi, With RDDs it was possible to define a partitioner for two RDDs and, given that the two RDDs have the same partitioner, a join operation would be performed local to the partition without shuffling. Can Dataset joins be optimized in the same way? Is it enough to repartition two datasets on the s

[Spark-SQL] Hive support is required to select over the following tables

2017-02-08 Thread Daniel Haviv
Hi, I'm using Spark 2.1.0 on Zeppelin. I can successfully create a table but when I try to select from it I fail: spark.sql("create table foo (name string)") res0: org.apache.spark.sql.DataFrame = [] spark.sql("select * from foo") org.apache.spark.sql.AnalysisException: Hive support is required
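If the session was built without Hive support, persistent tables fail exactly like this; a sketch of a session built with it (in Zeppelin the equivalent is usually an interpreter setting rather than code):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("with-hive")
      .enableHiveSupport()   // back persistent tables with the Hive metastore
      .getOrCreate()

    spark.sql("create table foo (name string)")
    spark.sql("select * from foo").show()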

Re: Not per-key state in spark streaming

2016-12-08 Thread Daniel Haviv
There's no need to extend Spark's API; look at mapWithState for examples. On Thu, Dec 8, 2016 at 4:49 AM, Anty Rao wrote: > > > On Wed, Dec 7, 2016 at 7:42 PM, Anty Rao wrote: > >> Hi >> I'm new to Spark. I'm doing some research to see if spark streaming can >> solve my problem. I don't want to

Re: Not per-key state in spark streaming

2016-12-07 Thread Daniel Haviv
Hi Anty, What you could do is keep only the existence of a key in the state and, when necessary, pull the data from a secondary state store like HDFS or HBase. Daniel On Wed, Dec 7, 2016 at 1:42 PM, Anty Rao wrote: > Hi > I'm new to Spark. I'm doing some research to see if spark streaming can > solve m
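A rough sketch of the suggestion: keep only a marker per key in Spark's state and pull the heavy payload from an external store when needed. fetchFromStore is a hypothetical helper and keyedDStream stands for the keyed input stream:

    import org.apache.spark.streaming.{State, StateSpec}

    // Hypothetical lookup against a secondary store such as HDFS or HBase
    def fetchFromStore(key: String): String = ???

    val spec = StateSpec.function(
      (key: String, value: Option[String], state: State[Boolean]) => {
        val payload =
          if (!state.exists()) fetchFromStore(key)   // first time the key is seen: pull the heavy data
          else value.getOrElse("")                   // afterwards the state only marks existence
        state.update(true)
        (key, payload)
      })

    // val stateStream = keyedDStream.mapWithState(spec)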

Re: How to load only the data of the last partition

2016-11-17 Thread Daniel Haviv
Hi Samy, If you're working with Hive you could create a partitioned table and update its partitions' locations to the latest version, so when you query it using Spark you'll always get the latest version. Daniel On Thu, Nov 17, 2016 at 9:05 PM, Samy Dindane wrote: > Hi, > > I have some data p

Using mapWithState without a checkpoint

2016-11-17 Thread Daniel Haviv
Hi, Is it possible to use mapWithState without checkpointing at all ? I'd rather have the whole application fail, restart and reload an initialState RDD than pay for checkpointing every 10 batches. Thank you, Daniel

mapWithState job slows down & exceeds yarn's memory limits

2016-11-14 Thread Daniel Haviv
Hi, I have a fairly simple stateful streaming job that suffers from high GC, and its executors are killed as they exceed the size of the requested container. My current executor-memory is 10G, the Spark overhead is 2G, and it's running with one core. At first the job begins running at a rate tha

Re: mapWithState with Datasets

2016-11-08 Thread Daniel Haviv
Scratch that, it's working fine. Thank you. On Tue, Nov 8, 2016 at 8:19 PM, Daniel Haviv < daniel.ha...@veracity-group.com> wrote: > Hi, > I should have used transform instead of map > > val x: DStream[(String, Record)] = > kafkaStream.transform(x=>

Re: mapWithState with Datasets

2016-11-08 Thread Daniel Haviv
On Tue, Nov 8, 2016 at 7:46 PM, Daniel Haviv < daniel.ha...@veracity-group.com> wrote: > Hi, > I'm trying to make a stateful stream of Tuple2[String, Dataset[Record]] : > > val kafkaStream = KafkaUtils.createDirectStream[String, String, > StringDecoder, StringDe

mapWithState with Datasets

2016-11-08 Thread Daniel Haviv
Hi, I'm trying to make a stateful stream of Tuple2[String, Dataset[Record]] : val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicSet) val stateStream: DStream[RDD[(String, Record)]] = kafkaStream.map(x=> { sqlContext.read.json(x._2

mapWithState with a big initial RDD gets OOM'ed

2016-11-07 Thread Daniel Haviv
Hi, I have a stateful streaming app where I pass a rather large initialState RDD at the beginning. No matter how many partitions I divide the stateful stream into, I keep failing with OOM or Java heap space errors. Is there a way to make it more resilient? How can I control its storage level? This is basica

mapWithState and DataFrames

2016-11-06 Thread Daniel Haviv
Hi, How can I utilize mapWithState and DataFrames? Right now I stream json messages from Kafka, update their state, output the updated state as json and compose a dataframe from it. It seems inefficient both in terms of processing and storage (a long string instead of a compact DF). Is there a way

Re: Stream compressed data from KafkaUtils.createDirectStream

2016-11-03 Thread Daniel Haviv
Hi Baki, It's enough for the producer to write the messages compressed. See here: https://cwiki.apache.org/confluence/display/KAFKA/Compression Thank you. Daniel > On 3 Nov 2016, at 21:27, Daniel Haviv wrote: > > Hi, > Kafka can compress/uncompress your messages for you s
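For reference, producer-side compression is a single config on the (new) Kafka producer; consumers, including Spark's direct stream, decompress transparently. The broker address and topic below are placeholders:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    val props = new Properties()
    props.put("bootstrap.servers", "broker:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("compression.type", "snappy")   // or gzip / lz4

    val producer = new KafkaProducer[String, String](props)
    producer.send(new ProducerRecord("events", "key", "value"))
    producer.close()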

Re: Stream compressed data from KafkaUtils.createDirectStream

2016-11-03 Thread Daniel Haviv
Hi, Kafka can compress/uncompress your messages for you seamlessly, adding compression on top of that will be redundant. Thank you. Daniel > On 3 Nov 2016, at 20:53, bhayat wrote: > > Hello, > > I really wonder that whether i can stream compressed data with using > KafkaUtils.createDirectStre

error: Unable to find encoder for type stored in a Dataset. when trying to map through a DataFrame

2016-11-02 Thread Daniel Haviv
Hi, I have the following scenario: scala> val df = spark.sql("select * from danieltest3") df: org.apache.spark.sql.DataFrame = [iid: string, activity: string ... 34 more fields] Now I'm trying to map through the rows I'm getting: scala> df.map(r=>r.toSeq) :32: error: Unable to find encoder for ty
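The underlying issue is that no encoder exists for Seq[Any]; two common workarounds, sketched (the separator and names are illustrative):

    import spark.implicits._

    // 1. Drop to the RDD API, where no encoder is needed
    val asSeqs = df.rdd.map(_.toSeq)

    // 2. Map each Row to a concrete, encodable type such as String
    val asStrings = df.map(_.mkString("\t"))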

partitionBy produces wrong number of tasks

2016-10-19 Thread Daniel Haviv
Hi, I have a case where I use partitionBy to write my DF using a calculated column, so it looks something like this: val df = spark.sql("select *, from_unixtime(ts, 'MMddH') partition_key from mytable") df.write.partitionBy("partition_key").orc("/partitioned_table") df is 8 partitions in s
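One common way to tame the task and small-file explosion here is to shuffle by the partition column before writing, so each output directory is produced by roughly one task; a sketch using the df from the post:

    import org.apache.spark.sql.functions.col

    df.repartition(col("partition_key"))
      .write
      .partitionBy("partition_key")
      .orc("/partitioned_table")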

Using DirectOutputCommitter with ORC

2016-07-25 Thread Daniel Haviv
Hi, How can the DirectOutputCommitter be utilized for writing ORC files? I tried setting it via: sc.getConf.set("spark.hadoop.mapred.output.committer.class","com.veracity-group.datastage.directorcwriter") But I can still see a _temporary directory being used when I save my dataframe as ORC. Tha

Spark (on Windows) not picking up HADOOP_CONF_DIR

2016-07-17 Thread Daniel Haviv
Hi, I'm running Spark using IntelliJ on Windows and even though I set HADOOP_CONF_DIR it does not affect the contents of sc.hadoopConfiguration. Anybody encountered it ? Thanks, Daniel

Re: Spark Streaming - Best Practices to handle multiple datapoints arriving at different time interval

2016-07-16 Thread Daniel Haviv
Or use mapWithState. Thank you. Daniel > On 16 Jul 2016, at 03:02, RK Aduri wrote: > > You can probably define sliding windows and set larger batch intervals. > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Best-Practices-to-h

Spark Streaming: Refreshing broadcast value after each batch

2016-07-12 Thread Daniel Haviv
Hi, I have a streaming application which uses a broadcast variable that I populate from a database. Every once in a while (or even every batch) I would like to update/replace the broadcast variable with the latest data from the database. The only way I found online to do this is this "hackish" way (
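For reference, the "hackish" pattern usually looks like the sketch below: rebuild the broadcast on the driver inside foreachRDD. loadFromDb and process are hypothetical helpers, and ssc/dstream stand for the app's StreamingContext and input stream:

    // Hypothetical helpers for the sketch
    def loadFromDb(): Map[String, String] = ???
    def process(line: String, lookup: Map[String, String]): Unit = ???

    var lookupBc = ssc.sparkContext.broadcast(loadFromDb())

    dstream.foreachRDD { rdd =>
      // foreachRDD's body runs on the driver, so it is safe to rebuild the broadcast here
      lookupBc.unpersist()
      lookupBc = rdd.sparkContext.broadcast(loadFromDb())
      val current = lookupBc               // capture a stable reference for the task closure
      rdd.foreach(line => process(line, current.value))
    }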

Confusion regarding sc.accumulableCollection(mutable.ArrayBuffer[String]()) type

2016-06-23 Thread Daniel Haviv
Hi, I want to use an accumulableCollection of type mutable.ArrayBuffer[String] to return invalid records detected during transformations, but I don't quite get its type: val errorAccumulator: Accumulable[ArrayBuffer[String], String] = sc.accumulableCollection(mutable.ArrayBuffer[String]()) W

Re: Switching broadcast mechanism from torrrent

2016-06-20 Thread Daniel Haviv
2016 at 7:10 AM, Takeshi Yamamuro > wrote: > >> How about using `transient` annotations? >> >> // maropu >> >> On Sun, Jun 19, 2016 at 10:51 PM, Daniel Haviv < >> daniel.ha...@veracity-group.com> wrote: >> >>> Hi, >>>

Re: Switching broadcast mechanism from torrrent

2016-06-19 Thread Daniel Haviv
find a root cause only from the stacktraces... > > // maropu > > > > > On Mon, Jun 6, 2016 at 2:14 AM, Daniel Haviv < > daniel.ha...@veracity-group.com> wrote: > >> Hi, >> I've set spark.broadcast.factory to >> org.apache.spark.broadcast.HttpBro

Re: HiveContext: Unable to load AWS credentials from any provider in the chain

2016-06-10 Thread Daniel Haviv
I'm using EC2 instances. Thank you. Daniel > On 9 Jun 2016, at 16:49, Gourav Sengupta wrote: > > Hi, > > are you using EC2 instances or local cluster behind firewall. > > > Regards, > Gourav Sengupta > >> On Wed, Jun 8, 2016 at 4:34 PM, Daniel

Re: HiveContext: Unable to load AWS credentials from any provider in the chain

2016-06-08 Thread Daniel Haviv
Hi, I've set these properties both in core-site.xml and hdfs-site.xml with no luck. Thank you. Daniel > On 9 Jun 2016, at 01:11, Steve Loughran wrote: > > >> On 8 Jun 2016, at 16:34, Daniel Haviv >> wrote: >> >> Hi, >> I'm trying to creat

HiveContext: Unable to load AWS credentials from any provider in the chain

2016-06-08 Thread Daniel Haviv
Hi, I'm trying to create a table on s3a but I keep hitting the following error: Exception in thread "main" org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:com.cloudera.com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain) I t
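For reference, the standard Hadoop s3a credential properties can be set straight on the Hadoop configuration; a sketch with placeholder values (on EC2 an instance profile can be used instead of static keys):

    sc.hadoopConfiguration.set("fs.s3a.access.key", "<ACCESS_KEY>")
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "<SECRET_KEY>")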

groupByKey returns an emptyRDD

2016-06-06 Thread Daniel Haviv
Hi, I wrapped the following code into a jar: val test = sc.parallelize(Seq(("daniel", "a"), ("daniel", "b"), ("test", "1)"))) val agg = test.groupByKey() agg.collect.foreach(r=>{println(r._1)}) The result of groupByKey is an empty RDD; when I try the same code using the spark-shell it's

Re: Switching broadcast mechanism from torrrent

2016-06-06 Thread Daniel Haviv
o switch broadcast > method. > > Can you describe the issues with torrent broadcast in more detail ? > > Which version of Spark are you using ? > > Thanks > > On Wed, Jun 1, 2016 at 7:48 AM, Daniel Haviv < > daniel.ha...@veracity-group.com> wrote: > >> H

Switching broadcast mechanism from torrrent

2016-06-01 Thread Daniel Haviv
Hi, Our application is failing due to issues with the torrent broadcast; is there a way to switch to another broadcast method? Thank you. Daniel

Re: "collecting" DStream data

2016-05-15 Thread Daniel Haviv
on. Also, you can’t update the > value of a broadcast variable, since it’s immutable. > > Thanks, > Silvio > > From: Daniel Haviv > Date: Sunday, May 15, 2016 at 6:23 AM > To: user > Subject: "collecting" DStream data > > Hi, > I have a DStream I'

"collecting" DStream data

2016-05-15 Thread Daniel Haviv
Hi, I have a DStream whose values I'd like to collect and broadcast. To do so I've created a mutable HashMap which I'm filling with foreachRDD, but when I check it, it remains empty. If I use ArrayBuffer it works as expected. This is my code: val arr = scala.collection.mutable.HashMap.empty[St

java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration cannot be cast to [B

2016-05-11 Thread Daniel Haviv
Hi, I'm running a very simple job (textFile->map->groupBy->count) with Spark 1.6.0 on EMR 4.3 (Hadoop 2.7.1) and hitting this exception when running on yarn-client but not in local mode: 16/05/11 10:29:26 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 1, ip-172-31-33-97.

Re: Is it a bug?

2016-05-09 Thread Daniel Haviv
How come first() calculates an updated value but collect() doesn't? On Sun, May 8, 2016 at 4:17 PM, Ted Yu wrote: > I don't think so. > RDD is immutable. > > > On May 8, 2016, at 2:14 AM, Sisyphuss wrote: > > >

Re: Save DataFrame to HBase

2016-04-27 Thread Daniel Haviv
CDH 5.5.2 installed which > comes with HBase 1.0.0 and Phoenix 4.5.2. Do you think this will work? > > Thanks, > Ben > >> On Apr 24, 2016, at 1:43 AM, Daniel Haviv >> wrote: >> >> Hi, >> I tried saving DF to HBase using a hive table with hba

Re: Save DataFrame to HBase

2016-04-24 Thread Daniel Haviv
Hi, I tried saving a DF to HBase using a Hive table with the HBase storage handler and HiveContext, but it failed due to a bug. I was able to persist the DF to HBase using Apache Phoenix, which was pretty simple. Thank you. Daniel > On 21 Apr 2016, at 16:52, Benjamin Kim wrote: > > Has anyone found
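The Phoenix write path alluded to is roughly the sketch below; the table name, ZooKeeper quorum, and option names follow the phoenix-spark connector's conventions and should be checked against the version in use:

    import org.apache.spark.sql.SaveMode

    df.write
      .format("org.apache.phoenix.spark")
      .mode(SaveMode.Overwrite)
      .option("table", "OUTPUT_TABLE")
      .option("zkUrl", "zkhost:2181")
      .save()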

Spark fileStream from a partitioned hive dir

2016-04-13 Thread Daniel Haviv
Hi, We have a Hive table which gets data written to it under two partition keys, day and hour. We would like to stream the incoming files, but since fileStream can only listen on one directory, we start a streaming job on the latest partition and every hour kill it and start a new one on the newer partition

Re: aggregateByKey on PairRDD

2016-03-30 Thread Daniel Haviv
Hi, Shouldn't groupByKey be avoided (https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html)? Thank you. Daniel On Wed, Mar 30, 2016 at 9:01 AM, Akhil Das wrote: > Isn't it what tempRDD.groupByKey does? > > Thanks > Best

Flume with Spark Streaming Sink

2016-03-20 Thread Daniel Haviv
Hi, I'm trying to use the Spark Sink with Flume but it seems I'm missing some of the dependencies. I'm running the following code: ./bin/spark-shell --master yarn --jars /home/impact/flumeStreaming/spark-streaming-flume_2.10-1.6.1.jar,/home/impact/flumeStreaming/flume-ng-core-1.6.0.jar,/home/impac

One task hangs and never finishes

2015-12-17 Thread Daniel Haviv
Hi, I have an application that runs a set of transformations and finishes with saveAsTextFile. Out of 80 tasks, all finish pretty fast except one that just hangs and outputs these messages to STDERR: 15/12/17 17:22:19 INFO collection.ExternalAppendOnlyMap: Thread 82 spilling in-memory map of 4.0 GB to d

Re: HiveContext ignores ("skip.header.line.count"="1")

2015-10-26 Thread Daniel Haviv
I will. Thank you. > On 27 Oct 2015, at 4:54, Felix Cheung wrote: > > Please open a JIRA? > > > Date: Mon, 26 Oct 2015 15:32:42 +0200 > Subject: HiveContext ignores ("skip.header.line.count"="1") > From: daniel.ha...@veracity-group.com > To: user@spark.apache.org > > Hi, > I have a csv tab

HiveContext ignores ("skip.header.line.count"="1")

2015-10-26 Thread Daniel Haviv
Hi, I have a csv table in Hive which is configured to skip the header row using TBLPROPERTIES("skip.header.line.count"="1"). When querying from Hive the header row is not included in the data, but when running the same query via HiveContext I get the header row. I made sure that HiveContext sees t

Generated ORC files cause NPE in Hive

2015-10-13 Thread Daniel Haviv
Hi, We are inserting streaming data into a hive orc table via a simple insert statement passed to HiveContext. When trying to read the files generated using Hive 1.2.1 we are getting NPE: at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:91) at or

Re: SQLContext within foreachRDD

2015-10-12 Thread Daniel Haviv
ks as expected. > > Check this out for code example: > https://github.com/databricks/reference-apps/blob/master/logs_analyzer/chapter1/scala/src/main/scala/com/databricks/apps/logs/chapter1/LogAnalyzerStreamingSQL.scala > > -adrian > > From: Daniel Haviv > Date: Monday, October

SQLContext within foreachRDD

2015-10-12 Thread Daniel Haviv
Hi, Since code inside foreachRDD runs at the driver, does that mean that if we use SQLContext inside foreachRDD the data is sent back to the driver and only then is the query executed, or is it executed at the executors? Thank you. Daniel

Re: Insert via HiveContext is slow

2015-10-09 Thread Daniel Haviv
in > multi-inserts. > > A workaround is create a temp cached table for the projection first, and > then do the multiple inserts base on the cached table. > > > > We are actually working on the POC of some similar cases, hopefully it > comes out soon. > >

Re: Insert via HiveContext is slow

2015-10-08 Thread Daniel Haviv
dsParamLine['prefecber'] prefecber, dsParamLine['postfecber'] postfecber, dsParamLine['sigqrxmer'] sigqrxmer, dsParamLine['sigqmicroreflection'] sigqmicroreflection, dsParamLin

Insert via HiveContext is slow

2015-10-08 Thread Daniel Haviv
Hi, I'm inserting into a partitioned ORC table using an insert sql statement passed via HiveContext. The performance I'm getting is pretty bad and I was wondering if there are ways to speed things up. Would saving the DF like this df.write().mode(SaveMode.Append).partitionBy("date").saveAsTable("Ta

Converting a DStream to schemaRDD

2015-09-29 Thread Daniel Haviv
Hi, I have a DStream which is a stream of RDD[String]. How can I pass a DStream to sqlContext.jsonRDD and work with it as a DF ? Thank you. Daniel
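The usual pattern is to convert each micro-batch inside foreachRDD; a sketch, assuming jsonStream is the DStream[String] from the post:

    jsonStream.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        val df = sqlContext.read.json(rdd)      // per-batch DataFrame
        df.registerTempTable("current_batch")   // or any other DataFrame/SQL operation
      }
    }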

Re: java.io.NotSerializableException: org.apache.avro.Schema$RecordSchema

2015-09-25 Thread Daniel Haviv
I tried, but I'm getting the same error (task not serializable) > On 25 Sep 2015, at 20:10, Ted Yu wrote: > > Is the Schema.parse() call expensive ? > > Can you call it in the closure ? > >> On Fri, Sep 25, 2015 at 10:06 AM, Daniel Haviv >&

java.io.NotSerializableException: org.apache.avro.Schema$RecordSchema

2015-09-25 Thread Daniel Haviv
Hi, I'm getting a NotSerializableException even though I'm creating all my objects from within the closure: import org.apache.avro.generic.GenericDatumReader import java.io.File import org.apache.avro._ val orig_schema = Schema.parse(new File("/home/wasabi/schema")) val READER = new Gen
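A common workaround is to build the schema and reader inside the task (for example once per partition) so that no Avro object needs to be serialized from the driver; a sketch, assuming rdd is the input RDD and the schema file is readable on every executor:

    import java.io.File
    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericDatumReader, GenericRecord}

    val decoded = rdd.mapPartitions { iter =>
      val schema = Schema.parse(new File("/home/wasabi/schema"))
      val reader = new GenericDatumReader[GenericRecord](schema)
      iter.map { record =>
        // ...decode `record` with `reader` here (decoding elided)
        record
      }
    }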

Reading avro data using KafkaUtils.createDirectStream

2015-09-24 Thread Daniel Haviv
Hi, I'm trying to use KafkaUtils.createDirectStream to read avro messages from Kafka but something is off with my type arguments: val avroStream = KafkaUtils.createDirectStream[AvroKey[GenericRecord], GenericRecord, NullWritable, AvroInputFormat[GenericRecord]](ssc, kafkaParams, topicSet) I'm get

Re: spark-avro takes a lot time to load thousands of files

2015-09-22 Thread Daniel Haviv
lmost any file system > > El martes, 22 de septiembre de 2015, Daniel Haviv > escribió: >> Hi, >> We are trying to load around 10k avro files (each file holds only one >> record) using spark-avro but it takes over 15 minutes to load. >> It seems that most of the wo

spark-avro takes a lot time to load thousands of files

2015-09-22 Thread Daniel Haviv
Hi, We are trying to load around 10k Avro files (each file holds only one record) using spark-avro, but it takes over 15 minutes to load. It seems that most of the work is being done at the driver, where a broadcast variable is created for each file. Any idea why it's behaving that way? Thank you.

sqlContext.read.avro broadcasting files from the driver

2015-09-21 Thread Daniel Haviv
Hi, I'm loading 1000 files using the spark-avro package: val df = sqlContext.read.avro(*"/incoming/"*) When I perform an action on this df, it seems like a broadcast is created for each file and sent to the workers (instead of the workers reading their data-local files): scala> df.

Re: Spark Thrift Server JDBC Drivers

2015-09-17 Thread Daniel Haviv
EMR, but it's worth a try. > > I've also used the ODBC driver from Hortonworks > <http://hortonworks.com/products/releases/hdp-2-2/#add_ons>. > > Regards, > Dan > > On Wed, Sep 16, 2015 at 8:34 AM, Daniel Haviv < > daniel.ha...@veracity-group.com> wrote: &

Spark Thrift Server JDBC Drivers

2015-09-16 Thread Daniel Haviv
Hi, Are there any free JDBC drivers for the Thrift server? The only ones I could find are Simba's, which require a license. Thanks, Daniel

Re: Parsing Avro from Kafka Message

2015-09-04 Thread Daniel Haviv
> NullWritable, AvroKeyInputFormat[GenericRecord]](..) > > val avroData = avroStream.map(x => x._1.datum().toString) > > > Thanks > Best Regards > >> On Thu, Sep 3, 2015 at 6:17 PM, Daniel Haviv >> wrote: >> Hi, >> I'm reading messages f

Parsing Avro from Kafka Message

2015-09-03 Thread Daniel Haviv
Hi, I'm reading messages from Kafka where the value is an avro file. I would like to parse the contents of the message and work with it as a DataFrame, like with the spark-avro package but instead of files, pass it a RDD. How can this be achieved ? Thank you. Daniel

Starting a service with Spark Executors

2015-08-09 Thread Daniel Haviv
Hi, I'd like to start a service with each Spark executor upon initialization and have the distributed code reference that service locally. What I'm trying to do is invoke torch7 computations without reloading the model for each row, by starting a Lua HTTP handler that will receive HTTP requests for e
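A common way to get a once-per-executor service without an explicit hook is a lazily initialized singleton object touched from the tasks; a sketch in which the command and the scoring call are placeholders:

    // Initialized lazily, at most once per executor JVM, the first time a task touches it
    object LocalTorchService {
      lazy val handler: Process =
        new ProcessBuilder("luajit", "handler.lua").start()   // placeholder command
    }

    // In the distributed code, touching the object starts the service on that executor:
    // rdd.mapPartitions { rows =>
    //   LocalTorchService.handler             // ensure the local HTTP handler is up
    //   rows.map(row => scoreOverHttp(row))   // hypothetical call to the local service
    // }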

Re: Starting Spark SQL thrift server from within a streaming app

2015-08-06 Thread Daniel Haviv
rtWithContext(sqlContext) > } > > Again, I'm not really clear what your use case is, but it does sound like > the first link above is what you may want. > > -Todd > > On Wed, Aug 5, 2015 at 1:57 PM, Daniel Haviv < > daniel.ha...@veracity-group.com> wrote: > >> Hi, >> Is it possible to start the Spark SQL thrift server from with a streaming >> app so the streamed data could be queried as it's goes in ? >> >> Thank you. >> Daniel >> > >

Starting Spark SQL thrift server from within a streaming app

2015-08-05 Thread Daniel Haviv
Hi, Is it possible to start the Spark SQL Thrift server from within a streaming app so the streamed data can be queried as it goes in? Thank you. Daniel

Re: Local Repartition

2015-07-20 Thread Daniel Haviv
parallelism so you need to be aware of that as you decide when to call > coalesce. > > Thanks, > Silvio > > From: Daniel Haviv > Date: Monday, July 20, 2015 at 4:59 PM > To: Doug Balog > Cc: user > Subject: Re: Local Repartition > > Thanks Doug, > coalesce mig

Re: Local Repartition

2015-07-20 Thread Daniel Haviv
num executors * 10, but I’m still > trying to figure out the > optimal number of partitions per executor. > To get the number of executors, > sc.getConf.getInt(“spark.executor.instances”,-1) > > > Cheers, > > Doug > > > On Jul 20, 2015, at 5:04 AM, Daniel Haviv

Local Repartition

2015-07-20 Thread Daniel Haviv
Hi, My data is constructed from a lot of small files which results in a lot of partitions per RDD. Is there some way to locally repartition the RDD without shuffling so that all of the partitions that reside on a specific node will become X partitions on the same node ? Thank you. Daniel
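For reference, the no-shuffle variant the replies point at is coalesce with shuffle disabled; a sketch with a placeholder target count:

    // Merge many small partitions into fewer without a shuffle; coalesce combines
    // partitions that are already co-located where it can.
    val fewer = rdd.coalesce(200, shuffle = false)

As the replies note, this also lowers parallelism for everything downstream, so where the coalesce is placed matters.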

DataFrame insertInto fails, saveAsTable works (Azure HDInsight)

2015-07-09 Thread Daniel Haviv
Hi, I'm running Spark 1.4 on Azure. DataFrame's insertInto fails, but saveAsTable works. It seems like some issue with accessing Azure's blob storage, but that doesn't explain why one type of write works and the other doesn't. This is the stack trace: Caused by: org.apache.hadoop.fs.azure.Azu

Re: thrift-server does not load jars files (Azure HDInsight)

2015-07-08 Thread Daniel Haviv
>> >>> On Thu, Jul 2, 2015 at 7:38 AM, Daniel Haviv < >>> daniel.ha...@veracity-group.com> wrote: >>> >>>> Hi, >>>> I'm trying to start the thrift-server and passing it azure's blob >>>> storage jars

Re: Starting Spark without automatically starting HiveContext

2015-07-03 Thread Daniel Haviv
workaround. > >> On 3 Jul 2015, at 08:33, Daniel Haviv >> wrote: >> >> Thanks >> I was looking for a less hack-ish way :) >> >> Daniel >> >>> On Fri, Jul 3, 2015 at 10:15 AM, Akhil Das >>> wrote: >>> With binary i

Re: Starting Spark without automatically starting HiveContext

2015-07-03 Thread Daniel Haviv
/blob/master/repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkILoop.scala#L1023> > which initializes the SQLContext. > > Thanks > Best Regards > > On Thu, Jul 2, 2015 at 6:11 PM, Daniel Haviv < > daniel.ha...@veracity-group.com> wrote: > >> Hi, >&

Re: thrift-server does not load jars files (Azure HDInsight)

2015-07-02 Thread Daniel Haviv
; > On Thu, Jul 2, 2015 at 7:38 AM, Daniel Haviv < > daniel.ha...@veracity-group.com> wrote: > >> Hi, >> I'm trying to start the thrift-server and passing it azure's blob storage >> jars but I'm failing on

thrift-server does not load jars files (Azure HDInsight)

2015-07-02 Thread Daniel Haviv
Hi, I'm trying to start the thrift-server and passing it azure's blob storage jars but I'm failing on : Caused by: java.io.IOException: No FileSystem for scheme: wasb at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584) at org.apache.hadoop.fs.FileSystem.creat

Starting Spark without automatically starting HiveContext

2015-07-02 Thread Daniel Haviv
Hi, I've downloaded the pre-built binaries for Hadoop 2.6, and whenever I start the spark-shell it always starts with a HiveContext. How can I prevent the HiveContext from being initialized automatically? Thanks, Daniel

Re: Using Spark on Azure Blob Storage

2015-06-25 Thread Daniel Haviv
step guide in how to setup and use Spark in > HDInsight. > > https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-spark-install/ > > Jacob > > From: Daniel Haviv [mailto:daniel.ha...@veracity-group.com] > Sent: Thursday, June 25, 2015 3

Re: Using Spark on Azure Blob Storage

2015-06-25 Thread Daniel Haviv
ell. The Power BI library is up > at http://spark-packages.org/package/granturing/spark-power-bi the Event Hubs > library should be up soon. > > Thanks, > Silvio > > From: Daniel Haviv > Date: Thursday, June 25, 2015 at 1:37 PM > To: "user@spark.apache.org" &g

Using Spark on Azure Blob Storage

2015-06-25 Thread Daniel Haviv
Hi, I'm trying to use spark over Azure's HDInsight but the spark-shell fails when starting: java.io.IOException: No FileSystem for scheme: wasb at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.

Saving RDDs as custom output format

2015-04-14 Thread Daniel Haviv
Hi, Is it possible to store RDDs using custom output formats, for example ORC? Thanks, Daniel

Ensuring data locality when opening files

2015-03-09 Thread Daniel Haviv
Hi, We wrote a Spark streaming app that receives file names on HDFS from Kafka and opens them using Hadoop's libraries. The problem with this method is that data locality is not utilized, because any worker might open any file regardless of where its blocks reside. I can't open the files using

using sparkContext from within a map function (from spark streaming app)

2015-03-08 Thread Daniel Haviv
Hi, We are designing a solution which pulls file paths from Kafka and for the current stage just counts the lines in each of these files. When running the code it fails on: Exception in thread "main" org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureClea

Re: Why Parquet Predicate Pushdown doesn't work?

2015-01-06 Thread Daniel Haviv
Quoting Michael: Predicate push down into the input format is turned off by default because there is a bug in the current parquet library that null pointers when there are full row groups that are null. https://issues.apache.org/jira/browse/SPARK-4258 You can turn it on if you want: http://spa
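The setting being referred to is Spark's Parquet filter-pushdown flag (spark.sql.parquet.filterPushdown); a sketch of turning it on:

    // Off by default in those versions because of SPARK-4258
    sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
    // or in spark-defaults.conf: spark.sql.parquet.filterPushdown=true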

Re: Unable to start Spark 1.3 after building:java.lang. NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2014-12-17 Thread Daniel Haviv
>> > On Wed Dec 17 2014 at 2:09:56 AM Kyle Lin wrote: >> >> >> >> >> >> I also got the same problem.. >> >> >> >> 2014-12-09 22:58 GMT+08:00 Daniel Haviv : >> >>> >> >>> Hi, >> >>

Re: Could not find the main class: org.apache.spark.deploy.SparkSubmit

2014-12-16 Thread Daniel Haviv
ents? > >> On 16 Dec 2014 22:46, "Daniel Haviv" wrote: >> I've added every jar in the lib dir to my classpath and still no luck: >> >> CLASSPATH=/tmp/spark/spark-branch-1.1/lib/datanucleus-api-jdo-3.2.1.jar:/tmp/spark/spark-branch-1.1/lib/datanucleus-core-

Re: Could not find the main class: org.apache.spark.deploy.SparkSubmit

2014-12-16 Thread Daniel Haviv
what are the contents? > On 16 Dec 2014 22:46, "Daniel Haviv" wrote: > >> I've added every jar in the lib dir to my classpath and still no luck: >> >> >> CLASSPATH=/tmp/spark/spark-branch-1.1/lib/datanucleus-api-jdo-3.2.1.jar:/tmp/spark/spark-branch-1.1/

Re: Could not find the main class: org.apache.spark.deploy.SparkSubmit

2014-12-16 Thread Daniel Haviv
ark-1/lib/datanucleus-core-3.2.2.jar:/home/akhld/mobi/localcluster/spark-1/lib/datanucleus-rdbms-3.2.1.jar:/home/akhld/mobi/localcluster/spark-1/lib/datanucleus-api-jdo-3.2.1.jar > > Thanks > Best Regards > > On Tue, Dec 16, 2014 at 10:33 PM, Daniel Haviv > wrote: >> &g

Re: Could not find the main class: org.apache.spark.deploy.SparkSubmit

2014-12-16 Thread Daniel Haviv
nks > Best Regards > > On Tue, Dec 16, 2014 at 9:33 PM, Daniel Haviv > wrote: >> >> That's the first thing I tried... still the same error: >> hdfs@ams-rsrv01:~$ export CLASSPATH=/tmp/spark/spark-branch-1.1/lib >> hdfs@ams-rsrv01:~$ cd /tmp/spark/spark-branch-1.1

Re: Could not find the main class: org.apache.spark.deploy.SparkSubmit

2014-12-16 Thread Daniel Haviv
Regards > > On Tue, Dec 16, 2014 at 5:04 PM, Daniel Haviv > wrote: >> >> Hi, >> I've built spark successfully with maven but when I try to run >> spark-shell I get the following errors: >> >> Spark assembly has been built wi

Could not find the main class: org.apache.spark.deploy.SparkSubmit

2014-12-16 Thread Daniel Haviv
Hi, I've built spark successfully with maven but when I try to run spark-shell I get the following errors: Spark assembly has been built with Hive, including Datanucleus jars on classpath Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/deploy/SparkSubmit Caused by: j

Unable to start Spark 1.3 after building:java.lang. NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2014-12-09 Thread Daniel Haviv
Hi, I've built spark 1.3 with hadoop 2.6 but when I startup the spark-shell I get the following exception: 14/12/09 06:54:24 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040 14/12/09 06:54:24 INFO util.Utils: Successfully started service 'SparkUI' on port 4040. 14/12/09 0

Re: RDD saveAsObjectFile write to local file and HDFS

2014-11-26 Thread Daniel Haviv
Prepend file:// to the path. Daniel > On 26 Nov 2014, at 20:15, firemonk9 wrote: > > When I am running spark locally, RDD saveAsObjectFile writes the file to > local file system (ex : path /data/temp.txt) > and > when I am running spark on YARN cluster, RDD saveAsObjectFile writes the > file

Re: Remapping columns from a schemaRDD

2014-11-26 Thread Daniel Haviv
ows. If you > wan to return a struct from a UDF you can do that with a case class. > > On Tue, Nov 25, 2014 at 10:25 AM, Daniel Haviv > wrote: > >> Thank you. >> >> How can I address more complex columns like maps and structs? >> >> Thanks again

Starting the thrift server

2014-11-26 Thread Daniel Haviv
Hi, I'm trying to start the thrift server but failing: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/tez/dag/api/SessionNotRunning at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:353) at org.apache.spark.sql.hive.HiveContext$$anonfu

Re: Remapping columns from a schemaRDD

2014-11-25 Thread Daniel Haviv
he > applySchema function. > > I'll note that, as part of the ML pipeline work, we have been considering > adding something like: > > def modifyColumn(columnName, function) > > Any comments anyone has on this interface would be appreciated! > > Michael > &

Remapping columns from a schemaRDD

2014-11-25 Thread Daniel Haviv
Hi, I'm selecting columns from a json file, transform some of them and would like to store the result as a parquet file but I'm failing. This is what I'm doing: val jsonFiles=sqlContext.jsonFile("/requests.loading") jsonFiles.registerTempTable("jRequests") val clean_jRequests=sqlContext.sql("sel

Re: Unable to use Kryo

2014-11-25 Thread Daniel Haviv
The problem was I didn't use the correct class name, it should be org.apache.spark.*serializer*.KryoSerializer On Mon, Nov 24, 2014 at 11:12 PM, Daniel Haviv wrote: > Hi, > I want to test Kryo serialization but when starting spark-shell I'm > hitting
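For completeness, a sketch of configuring the serializer with the fully qualified class name:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // spark-shell/spark-submit equivalent:
    //   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer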

Re: Merging Parquet Files

2014-11-25 Thread Daniel Haviv
to/parquet' > ) > > This will at least parallelize the retrieval of file status object, but > there is a lot more optimization that I hope to do. > > On Sat, Nov 22, 2014 at 1:53 PM, Daniel Haviv > wrote: > >> Hi, >> I'm ingesting a lot of small JS

Unable to use Kryo

2014-11-24 Thread Daniel Haviv
Hi, I want to test Kryo serialization but when starting spark-shell I'm hitting the following error: java.lang.ClassNotFoundException: org.apache.spark.KryoSerializer the kryo-2.21.jar is on the classpath so I'm not sure why it's not picking it up. Thanks for your help, Daniel

Converting a column to a map

2014-11-23 Thread Daniel Haviv
Hi, I have a column in my schemaRDD that is a map, but I'm unable to convert it to one. I've tried converting it to a Tuple2[String,String]: val converted = jsonFiles.map(line=> { line(10).asInstanceOf[Tuple2[String,String]]}) but I get a ClassCastException: 14/11/23 11:51:30 WARN scheduler.TaskSe
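For what it's worth, map columns come back as scala.collection.Map rather than Tuple2, so a cast along these lines is usually what's needed (column index as in the post):

    val converted = jsonFiles.map(line =>
      line(10).asInstanceOf[scala.collection.Map[String, String]])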
