This is just the stack trace, but where is it you are calling the UDF?
Regards,
Sumit
On 16-Aug-2016 2:20 pm, "pseudo oduesp" wrote:
> Hi,
> I create new columns with a UDF; afterwards, when I try to filter on these columns,
> I get this error. Why?
>
> : java.lang.UnsupportedOperationException: Cannot evaluate exp
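A minimal sketch of the usual create-column-with-UDF-then-filter pattern, for reference (assumes spark-shell on Spark 2.x; the data and column names are made up):

import org.apache.spark.sql.functions.{col, udf}

// Made-up data, just to illustrate the add-column-then-filter pattern.
val df = Seq(("a", 1), ("b", 2), ("c", 3)).toDF("name", "value")

val doubled = udf((v: Int) => v * 2)

// Materialize the UDF result as a column first, then filter on that column.
val withCol  = df.withColumn("value_doubled", doubled(col("value")))
val filtered = withCol.filter(col("value_doubled") > 2)

filtered.show()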
Hello,
the use case is as follows :
say I am inserting 200K rows with dataframe.write.format("parquet") etc.
(like a basic write-to-HDFS command), but say for some rhyme or reason
my job got killed when the run was in the middle of it, meaning let's say I
was only able to insert 100K rows when
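One way to avoid being left with a half-written target directory (a sketch only; the paths are made up, and it assumes spark-shell with `sc` and a `df` already built): write to a staging directory and swap it in only once the write has finished. Spark/Hadoop also drops a _SUCCESS marker file into the output directory on successful completion, which downstream readers can check.

import org.apache.hadoop.fs.{FileSystem, Path}

val stagingDir = "/tmp/my_table__staging"   // made-up paths
val finalDir   = "/data/my_table"

// 1. Write everything to the staging location first.
df.write.format("parquet").mode("overwrite").save(stagingDir)

// 2. Only if the write above succeeded do we replace the final directory,
//    so a killed job never leaves a partially written final output behind.
val fs = FileSystem.get(sc.hadoopConfiguration)
val finalPath = new Path(finalDir)
if (fs.exists(finalPath)) fs.delete(finalPath, true)
fs.rename(new Path(stagingDir), finalPath)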
Hello,
I don't want to print all the Spark logs, just a few, e.g. just the
execution plans etc. How do I silence the Spark debug output?
Thanks,
Sumit
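A minimal way to do that (assumes spark-shell or driver code on Spark 1.4+; WARN is just an example level):

// Programmatically, from the driver / spark-shell:
sc.setLogLevel("WARN")      // or "ERROR" to silence even more

// Or, cluster-wide: copy conf/log4j.properties.template to conf/log4j.properties
// and change the root logger, e.g.
//   log4j.rootCategory=WARN, console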
w.r.t. https://issues.apache.org/jira/browse/SPARK-5236: how do I also,
in general, convert something of type DecimalType to int / string / etc.?
Thanks,
On Sun, Aug 7, 2016 at 10:33 AM, Sumit Khanna wrote:
> Hi,
>
> I was wondering if we have something that takes as an argument a Spark
Hi,
I was wondering if we have something that takes as an argument a Spark
df type, e.g. DecimalType(12,5), and converts it into the corresponding Hive
schema type: Double / Decimal / String?
Any ideas?
Thanks,
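Not aware of a single built-in helper, but a cast plus a small mapping function covers both questions; a sketch, assuming spark-shell, with a made-up "amount" column and a hypothetical toHiveType helper:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Casting a DecimalType column to int / string:
val casted = df
  .withColumn("amount_int", col("amount").cast(IntegerType))
  .withColumn("amount_str", col("amount").cast(StringType))

// Hypothetical helper mapping a Spark DataType to a Hive DDL type string.
def toHiveType(dt: DataType): String = dt match {
  case d: DecimalType => s"DECIMAL(${d.precision},${d.scale})"
  case DoubleType     => "DOUBLE"
  case IntegerType    => "INT"
  case LongType       => "BIGINT"
  case StringType     => "STRING"
  case other          => other.simpleString   // fallback, e.g. "array<int>"
}

toHiveType(DecimalType(12, 5))   // "DECIMAL(12,5)"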
I am not really sure of the best practices on this, but I either consult
localhost:4040/jobs/ etc.
or, better, this:
val customSparkListener: CustomSparkListener = new CustomSparkListener()
sc.addSparkListener(customSparkListener)
class CustomSparkListener extends SparkListener {
override def o
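For completeness, a minimal sketch of what such a listener can look like (the callbacks and printed fields are just examples; assumes spark-shell where `sc` is in scope):

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerStageCompleted}

class CustomSparkListener extends SparkListener {
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    val info = stageCompleted.stageInfo
    println(s"Stage ${info.stageId} (${info.name}) finished with ${info.numTasks} tasks")
  }

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
    println(s"Job ${jobEnd.jobId} ended: ${jobEnd.jobResult}")
  }
}

val customSparkListener = new CustomSparkListener()
sc.addSparkListener(customSparkListener)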
onstruction for splitting your messages by
> names in foreachRDD:
>
> lines.foreachRDD((recrdd, time: Time) => {
>   recrdd.foreachPartition(part => {
>     part.foreach(item_row => {
>       if (item_row("table_name") == "kismia.orde
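A hedged sketch of that construction, with made-up table names and assuming `lines` is a DStream of Map[String, String] records carrying a "table_name" field:

import org.apache.spark.streaming.Time

lines.foreachRDD { (recrdd, time: Time) =>
  recrdd.foreachPartition { part =>
    part.foreach { item_row =>
      item_row("table_name") match {
        case "db.orders" => println(s"orders row at $time: $item_row")  // route to the orders sink
        case "db.users"  => println(s"users row at $time: $item_row")   // route to the users sink
        case other       => println(s"ignoring row for unexpected table $other")
      }
    }
  }
}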
streaming contexts can all
be handled well within a single jar.
Guys, please reply.
Awaiting your response,
Thanks,
Sumit
On Mon, Aug 1, 2016 at 12:24 AM, Sumit Khanna wrote:
> Any ideas on this one guys ?
>
> I can do a sample run but can't be sure of imminent problems if any? How
> can I ensure
Any ideas on this one, guys?
I can do a sample run, but I can't be sure of imminent problems, if any. How
can I ensure a different batchDuration etc. in here, per StreamingContext?
Thanks,
On Sun, Jul 31, 2016 at 10:50 AM, Sumit Khanna
wrote:
> Hey,
>
> Was wondering if I could c
Hey,
I was wondering if I could create multiple Spark streaming contexts in my
application (e.g. instantiating a worker actor per topic, each with its own
streaming context and its own batch duration, everything).
What are the caveats, if any?
What are the best practices?
I have googled half-heartedly on the
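Not an authoritative answer, but for what it's worth: as far as I know only one StreamingContext can be active per JVM at a time, so "one context per topic inside one driver" usually ends up as either a single context whose batch interval is the finest one you need, or one driver/jar per interval. A minimal sketch of just the per-context batch duration part (topic and interval are made up; assumes spark-shell with `sc`):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Each StreamingContext is constructed with its own batch duration.
val ssc = new StreamingContext(sc, Seconds(10))   // e.g. 10s batches for "topic-a"

// ... set up the input DStreams and transformations for this topic here ...

ssc.start()
ssc.awaitTermination()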
the write output.
>
>
>
> That code below looks perfectly normal for writing a parquet file, yes;
> there shouldn’t be any tuning needed for “normal” performance.
>
>
>
> Thanks,
>
> Ewan
>
>
>
> *From:* Sumit Khanna [mailto:sumit.kha...@askme.in]
>
lete hard disk or Intel Celeron may
> be?
>
> Regards,
> Gourav Sengupta
>
> On Fri, Jul 29, 2016 at 7:27 AM, Sumit Khanna
> wrote:
>
>> Hey,
>>
>> master=yarn
>> mode=cluster
>>
>> spark.executor.memory=8g
>> spark.rpc.netty.dispatcher
Hey,
So I believe this is the right way to save the file; the optimization
is never in the write part but in the head / body of my execution plan,
isn't it?
Thanks,
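One quick way to look at that head / body before the write is to print the plans (sketch; `df` and the output path are assumptions):

// extended = true prints the parsed, analyzed, optimized and physical plans.
df.explain(true)

df.write.format("parquet").mode("overwrite").save("/tmp/out")   // made-up path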
On Fri, Jul 29, 2016 at 11:57 AM, Sumit Khanna
wrote:
> Hey,
>
> master=yarn
> mode=cluster
>
> spark.
>
> My advise: Give HBase a shot. It gives UPSERT out of box. If you want
> history, just add timestamp in the key (in reverse). Computation engines
> easily support HBase.
>
> Best
> Ayan
>
> On Fri, Jul 29, 2016 at 5:03 PM, Sumit Khanna
> wrote:
>
>> Just
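For what it's worth, a sketch of the reverse-timestamp key idea Ayan mentions (the table, column family and id are made up; assumes the standard HBase 1.x client API):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

// Reversed timestamp so the newest version of a key sorts first.
val id         = "order-12345"
val reversedTs = Long.MaxValue - System.currentTimeMillis()
val rowKey     = s"$id:$reversedTs"

val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = conn.getTable(TableName.valueOf("orders"))

val put = new Put(Bytes.toBytes(rowKey))
put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("payload"), Bytes.toBytes("...row contents..."))
table.put(put)   // HBase puts behave as upserts for a given row key

table.close()
conn.close()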
Just a note: I had the delta_df keys for the filter (as in a NOT INTERSECTION
UDF) broadcast to all the worker nodes, which I think is an efficient enough
move.
Thanks,
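Roughly what I mean, as a sketch (`delta_df`, `master_df` and the "id" key column are assumptions; assumes spark-shell with `sc`):

import org.apache.spark.sql.functions.{col, udf}

// Collect the delta keys on the driver and broadcast them to all workers.
val deltaKeys  = delta_df.select("id").rdd.map(_.getString(0)).collect().toSet
val deltaKeysB = sc.broadcast(deltaKeys)

// "NOT INTERSECTION": keep only master rows whose key is NOT in the delta.
val notInDelta    = udf((id: String) => !deltaKeysB.value.contains(id))
val unchangedRows = master_df.filter(notInDelta(col("id")))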
On Fri, Jul 29, 2016 at 12:19 PM, Sumit Khanna
wrote:
> Hey,
>
> the very first run :
>
> glossary :
>
>
Hey,
the very first run:
glossary:
delta_df := the current run / execution changes dataframe.
def deduplicate: apply a windowing function and group by.
def partitionDataframe(delta_df): get the unique keys of that data frame and
then return an array of data frames, each containing just that very same
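For what it's worth, a sketch of those two pieces (the column names "id" and "updated_at" are assumptions):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Keep only the latest row per key: window over the key, order by recency.
def deduplicate(delta_df: DataFrame): DataFrame = {
  val w = Window.partitionBy("id").orderBy(col("updated_at").desc)
  delta_df
    .withColumn("rn", row_number().over(w))
    .filter(col("rn") === 1)
    .drop("rn")
}

// One dataframe per distinct key value.
def partitionDataframe(delta_df: DataFrame, keyCol: String): Array[DataFrame] = {
  val keys = delta_df.select(keyCol).distinct().collect().map(_.get(0))
  keys.map(k => delta_df.filter(col(keyCol) === k))
}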
Hey,
master=yarn
mode=cluster
spark.executor.memory=8g
spark.rpc.netty.dispatcher.numThreads=2
All the POC is on a single-node cluster, the biggest bottleneck being:
1.8 hrs to save 500K records as a parquet file/dir executing this command:
df.write.format("parquet").mode("overwrite").save(hdf
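1.8 hrs for 500K rows usually points at the plan feeding the write rather than the write itself; a hedged first thing to try (the partition count and path are guesses): persist and materialize the dataframe once, check in the UI at :4040 which stage eats the time, and keep the number of output files sensible.

// Materialize the upstream work once so the write isn't paying for recomputation,
// and keep the number of output files sensible for a single-node setup.
val toWrite = df.repartition(8).persist()
toWrite.count()    // forces the lineage; inspect the slow stage in the UI

toWrite.write.format("parquet").mode("overwrite").save("hdfs:///tmp/out")   // made-up path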