Re: java.lang.UnsupportedOperationException: Cannot evaluate expression: fun_nm(input[0, string, true])

2016-08-16 Thread Sumit Khanna
This is just the stack trace, but where is it you are calling the UDF? Regards, Sumit On 16-Aug-2016 2:20 pm, "pseudo oduesp" wrote: > hi, > i create new columns with a udf, then i try to filter these columns : > i get this error, why ? > > : java.lang.UnsupportedOperationException: Cannot evaluate exp
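A minimal Scala sketch of the add-a-column-then-filter pattern under discussion; the UDF body and the column names raw_col / clean_col are hypothetical stand-ins, not the original poster's code:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// hypothetical UDF standing in for fun_nm: trim a string column
val funNm = udf((s: String) => if (s == null) "" else s.trim)

def addAndFilter(df: DataFrame): DataFrame = {
  // add the derived column first, then filter on it by name
  val withCol = df.withColumn("clean_col", funNm(col("raw_col")))
  withCol.filter("clean_col != ''")
}
```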

hdfs persist rollbacks when spark job is killed

2016-08-07 Thread Sumit Khanna
Hello, the use case is as follows: say I am inserting 200K rows via dataframe.write.format("parquet") etc. (a basic write-to-HDFS command), but for some reason my job got killed in the middle of the run, meaning let's say I was only able to insert 100K rows when
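Parquet writes through dataframe.write are not transactional, so a job killed mid-run can leave a partial output directory behind. A common guard (a sketch only; df, sc, and the paths are hypothetical) is to write to a staging directory and promote it only after the write finishes:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

val stagingPath = "/data/table/_staging"   // hypothetical paths
val finalPath   = "/data/table/current"

// write to staging first; if the job dies here, finalPath is untouched
df.write.mode("overwrite").parquet(stagingPath)

// promote the staging output only after the write has fully succeeded
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.delete(new Path(finalPath), true)
fs.rename(new Path(stagingPath), new Path(finalPath))
```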

silence the spark debug logs

2016-08-07 Thread Sumit Khanna
Hello, I don't want to print all the Spark logs, just a few, e.g. the execution plans etc. How do I silence the Spark debug output? Thanks, Sumit
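Two common options, sketched below: flip the level at runtime on the SparkContext, or set the root logger to WARN in conf/log4j.properties on the driver and executors. The runtime call is the quickest:

```scala
// assumes an existing SparkContext named sc; WARN drops INFO/DEBUG chatter
// while keeping warnings and errors
sc.setLogLevel("WARN")
```

Finer-grained filtering (e.g. keeping one specific logger at INFO while silencing the rest) is normally done per-logger in log4j.properties.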

Re: spark df schema to hive schema converter func

2016-08-06 Thread Sumit Khanna
wrt https://issues.apache.org/jira/browse/SPARK-5236. Also, how do I usually convert something of type DecimalType to an int / string / etc.? Thanks, On Sun, Aug 7, 2016 at 10:33 AM, Sumit Khanna wrote: > Hi, > > was wondering if we have something that takes as an argument a spark

spark df schema to hive schema converter func

2016-08-06 Thread Sumit Khanna
Hi, was wondering if we have something that takes as an argument a Spark df type, e.g. DecimalType(12,5), and converts it into the corresponding Hive schema type: double / decimal / string? Any ideas. Thanks,
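A sketch of the kind of converter being asked for: pattern-match the Spark SQL DataType and emit a Hive DDL type string. The mapping below covers only a handful of types and is an illustration, not an official conversion:

```scala
import org.apache.spark.sql.types._

def toHiveType(dt: DataType): String = dt match {
  case d: DecimalType => s"decimal(${d.precision},${d.scale})" // e.g. DecimalType(12,5) -> decimal(12,5)
  case IntegerType    => "int"
  case LongType       => "bigint"
  case DoubleType     => "double"
  case StringType     => "string"
  case BooleanType    => "boolean"
  case TimestampType  => "timestamp"
  case other          => other.simpleString // fall back to Spark's own type name
}
```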

Re: how to debug spark app?

2016-08-03 Thread Sumit Khanna
I am not really sure of the best practices on this, but I either consult localhost:4040/jobs/ etc. or, better, this: val customSparkListener: CustomSparkListener = new CustomSparkListener() sc.addSparkListener(customSparkListener) class CustomSparkListener extends SparkListener { override def o
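The preview cuts off mid-override; below is a self-contained sketch of the same idea, with illustrative callbacks and log lines rather than the original listener body:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerStageCompleted}

class CustomSparkListener extends SparkListener {
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    val info = stageCompleted.stageInfo
    println(s"stage ${info.stageId} completed: ${info.numTasks} tasks")
  }
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
    println(s"job ${jobEnd.jobId} ended: ${jobEnd.jobResult}")
  }
}

// registration, as in the snippet above:
// sc.addSparkListener(new CustomSparkListener())
```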

Re: multiple spark streaming contexts

2016-08-01 Thread Sumit Khanna
construction for splitting your messages by > names in foreachRDD: > > lines.foreachRDD((recrdd, time: Time) => { > >recrdd.foreachPartition(part => { > > part.foreach(item_row => { > > if (item_row("table_name") == "kismia.orde
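A cleaned-up sketch of the quoted construction, assuming lines is a DStream of Map[String, String] records carrying a table_name field; the table names are hypothetical:

```scala
import org.apache.spark.streaming.Time

lines.foreachRDD { (recrdd, time: Time) =>
  recrdd.foreachPartition { part =>
    part.foreach { itemRow =>
      // route each record by its table_name field
      itemRow.getOrElse("table_name", "") match {
        case "orders" => // handle order rows
        case "users"  => // handle user rows
        case _        => // ignore or send to a dead-letter sink
      }
    }
  }
}
```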

Re: multiple spark streaming contexts

2016-07-31 Thread Sumit Khanna
streaming contexts can all be handled well within a single jar. Guys, please reply. Awaiting your thoughts. Thanks, Sumit On Mon, Aug 1, 2016 at 12:24 AM, Sumit Khanna wrote: > Any ideas on this one, guys? > > I can do a sample run but can't be sure of imminent problems, if any. How > can I ensure

Re: multiple spark streaming contexts

2016-07-31 Thread Sumit Khanna
Any ideas on this one, guys? I can do a sample run but can't be sure of imminent problems, if any. How can I ensure a different batchDuration etc. per StreamingContext here? Thanks, On Sun, Jul 31, 2016 at 10:50 AM, Sumit Khanna wrote: > Hey, > > Was wondering if I could c

multiple spark streaming contexts

2016-07-30 Thread Sumit Khanna
Hey, was wondering if I could create multiple Spark streaming contexts in my application (e.g. instantiating a worker actor per topic, each with its own streaming context, its own batch duration, everything). What are the caveats, if any? What are the best practices? Have googled half-heartedly on the
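For reference, the Spark Streaming guide allows only one active StreamingContext per JVM, and a context has a single batch duration, so per-topic batch intervals don't fit a single-jar setup directly. A sketch of the commonly suggested alternative, one context with one Kafka direct stream per topic (broker, topic names, and interval are hypothetical; assumes the Spark 1.x spark-streaming-kafka artifact):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("multi-topic-single-context")
val ssc = new StreamingContext(conf, Seconds(30)) // one batch duration for the whole context

val kafkaParams = Map("metadata.broker.list" -> "broker-host:9092") // hypothetical broker

// one direct stream per topic, all sharing the same context and batch duration
val streams = Seq("topic_a", "topic_b").map { topic =>
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set(topic))
}
streams.foreach(_.foreachRDD(rdd => println(s"batch size: ${rdd.count()}")))

ssc.start()
ssc.awaitTermination()
```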

Re: how to save spark files as parquets efficiently

2016-07-29 Thread Sumit Khanna
the write output. > > > > That code below looks perfectly normal for writing a parquet file yes, > there shouldn’t be any tuning needed for “normal” performance. > > > > Thanks, > > Ewan > > > > *From:* Sumit Khanna [mailto:sumit.kha...@askme.in] >

Re: how to save spark files as parquets efficiently

2016-07-29 Thread Sumit Khanna
lete hard disk or Intel Celeron may > be? > > Regards, > Gourav Sengupta > > On Fri, Jul 29, 2016 at 7:27 AM, Sumit Khanna > wrote: > >> Hey, >> >> master=yarn >> mode=cluster >> >> spark.executor.memory=8g >> spark.rpc.netty.dispatcher

Re: how to save spark files as parquets efficiently

2016-07-29 Thread Sumit Khanna
Hey, so I believe this is the right format to save the file; the optimization is never in the write part but in the head / body of my execution plan, isn't it? Thanks, On Fri, Jul 29, 2016 at 11:57 AM, Sumit Khanna wrote: > Hey, > > master=yarn > mode=cluster > > spark.
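To confirm that the cost sits upstream of the write, the plan feeding the save can be printed first; a tiny sketch, assuming df is the dataframe from the thread:

```scala
// print the logical and physical plans driving the write
df.explain(true)
```

If the plan shows a long lineage or wide shuffles, caching or checkpointing an intermediate result before the write is usually where the tuning happens.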

Re: correct / efficient manner to upsert / update in hdfs (via spark / in general)

2016-07-29 Thread Sumit Khanna
> > My advice: Give HBase a shot. It gives UPSERT out of the box. If you want > history, just add a timestamp in the key (in reverse). Computation engines > easily support HBase. > > Best > Ayan > > On Fri, Jul 29, 2016 at 5:03 PM, Sumit Khanna > wrote: > >> Just

Re: correct / efficient manner to upsert / update in hdfs (via spark / in general)

2016-07-29 Thread Sumit Khanna
Just a note: I had the delta_df keys for the filter (as in a NOT-INTERSECTION udf) broadcast to all the worker nodes, which I think is an efficient enough move. Thanks, On Fri, Jul 29, 2016 at 12:19 PM, Sumit Khanna wrote: > Hey, > > the very first run : > > glossary : > >
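A sketch of that broadcast NOT-IN filter, with hypothetical names (pk as the key column, deltaDf / existingDf as the two dataframes): collect the delta keys on the driver, broadcast the set, and keep only the existing rows whose key is absent from it:

```scala
import org.apache.spark.sql.functions.{col, udf}

// collect the (assumed small) delta key set on the driver and broadcast it
val deltaKeys = deltaDf.select("pk").distinct().collect().map(_.getString(0)).toSet
val bKeys = sc.broadcast(deltaKeys)

// keep existing rows whose key is NOT in the delta (the "unchanged" half of the upsert)
val notInDelta = udf((k: String) => !bKeys.value.contains(k))
val unchanged = existingDf.filter(notInDelta(col("pk")))
```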

correct / efficient manner to upsert / update in hdfs (via spark / in general)

2016-07-28 Thread Sumit Khanna
Hey, the very first run : glossary : delta_df := the current run / execution's changes dataframe. def deduplicate : apply a windowing function and group by. def partitionDataframe(delta_df) : get the unique keys of that data frame and then return an array of data frames, each containing just that very same
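A sketch of the dedup-then-merge shape the glossary describes, under assumed column names (pk as the primary key, updated_at as the recency column) and assuming both dataframes share the same schema; it keeps the latest delta row per key and unions it with the untouched existing rows:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// deduplicate: keep only the most recent delta row per primary key
def deduplicate(deltaDf: DataFrame): DataFrame = {
  val w = Window.partitionBy("pk").orderBy(col("updated_at").desc)
  deltaDf.withColumn("rn", row_number().over(w)).filter(col("rn") === 1).drop("rn")
}

// upsert: existing rows whose key is absent from the delta, plus the deduped delta
def upsert(existingDf: DataFrame, deltaDf: DataFrame): DataFrame = {
  val delta = deduplicate(deltaDf)
  val deltaKeys = delta.select(col("pk").as("delta_pk"))
  val untouched = existingDf
    .join(deltaKeys, existingDf("pk") === deltaKeys("delta_pk"), "left_outer")
    .filter(col("delta_pk").isNull)
    .drop("delta_pk")
  untouched.unionAll(delta) // unionAll for the 1.x era; union in Spark 2.x
}
```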

how to save spark files as parquets efficiently

2016-07-28 Thread Sumit Khanna
Hey, master=yarn mode=cluster spark.executor.memory=8g spark.rpc.netty.dispatcher.numThreads=2 All the POC is on a single-node cluster. The biggest bottleneck: 1.8 hrs to save 500K records as a parquet file/dir, executing this command: df.write.format("parquet").mode("overwrite").save(hdf
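For reference, a sketch of the same write with the output partition count reduced first; as the rest of the thread concludes, the time usually goes into whatever computes df rather than into the parquet writer itself (the path and partition count here are hypothetical):

```scala
// 500K rows rarely need many output files; shrink the partition count before writing
df.coalesce(4)
  .write
  .format("parquet")
  .mode("overwrite")
  .save("hdfs:///tmp/poc/parquet_out") // hypothetical output path
```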