Does Spark Streaming support streaming from a database table?

2015-07-13 Thread unk1102
Hi, I have done Kafka streaming through Spark Streaming. Now I have a use case where I would like to stream data from a database table. I see JdbcRDD exists, but that is not what I am looking for; I need continuous streaming, like a Spark Streaming job that runs continuously and listens for changes in a database table…
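
Spark Streaming has no built-in JDBC/CDC source, so one workaround is a custom Receiver that periodically polls the table. A minimal sketch, assuming a String-per-row stream; the poll interval and the fetchNewRows() helper are hypothetical:

import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.receiver.Receiver;

public class JdbcPollingReceiver extends Receiver<String> {
  public JdbcPollingReceiver() { super(StorageLevel.MEMORY_AND_DISK_2()); }

  @Override public void onStart() {
    new Thread(new Runnable() {
      public void run() {
        while (!isStopped()) {
          // poll the table via plain JDBC for rows newer than the last seen key/timestamp
          for (String row : fetchNewRows()) {
            store(row); // hands the record to Spark Streaming
          }
          try { Thread.sleep(5000); } catch (InterruptedException e) { return; }
        }
      }
    }).start();
  }

  @Override public void onStop() { }

  private java.util.List<String> fetchNewRows() {
    // hypothetical helper: JDBC SELECT ... WHERE id > lastSeenId
    return new java.util.ArrayList<String>();
  }
}

// usage: JavaDStream<String> rows = jssc.receiverStream(new JdbcPollingReceiver());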

How to maintain multiple JavaRDDs created within another method like javaDStream.foreachRDD()

2015-07-14 Thread unk1102
Hi, I use Spark Streaming where messages read from Kafka topics are stored in a JavaDStream; this RDD contains the actual data. After going through the documentation and other help, I found that we traverse a JavaDStream using foreachRDD: javaDStreamRdd.foreachRDD(new Function<JavaRDD<String>, Void>() { public Void call(Ja…
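
For reference, a minimal sketch of that pattern with the Spark 1.x Function<..., Void> signature (Spark 1.6+ also offers a VoidFunction overload); rather than keeping references to per-batch RDDs across batches, results are normally written out from inside the call:

javaDStreamRdd.foreachRDD(new Function<JavaRDD<String>, Void>() {
  @Override
  public Void call(JavaRDD<String> rdd) {
    // runs once per micro-batch; process or save this batch here
    long count = rdd.count();
    System.out.println("records in this batch: " + count);
    return null;
  }
});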

Re: Store DStreams into Hive using Hive Streaming

2015-07-17 Thread unk1102
Hi, I have a similar use case. Did you find a solution to this problem of loading DStreams into Hive using Spark Streaming? Please guide. Thanks.

How to avoid empty unavoidable group by keys in DataFrame?

2016-05-21 Thread unk1102
Hi, I have a Spark job which does a group by, and I can't avoid it because of my use case. I have a large dataset, around 1 TB, which I need to process/update in a DataFrame. My job shuffles huge amounts of data and slows down because of the shuffling and group by. One reason I see is that my data is skewed; some of my group by…

Does DataFrame have something like set hive.groupby.skewindata=true;

2016-05-21 Thread unk1102
Hi, I have a DataFrame with heavily skewed data in terms of TB, and I am doing a groupBy on 8 fields which unfortunately I can't avoid. I am looking to optimize this. I found that Hive has set hive.groupby.skewindata=true; I don't use Hive, I have a Spark DataFrame. Can we achieve the same in Spark? Please guide. Thanks…
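
There is no direct DataFrame switch equivalent to hive.groupby.skewindata, but the same two-phase idea can be hand-rolled by salting the keys and aggregating twice. A sketch assuming a re-aggregatable function such as sum on a hypothetical value column "val" (functions from org.apache.spark.sql.functions; the salt range of 20 is illustrative):

import static org.apache.spark.sql.functions.*;

DataFrame salted = df.withColumn("salt", expr("cast(rand() * 20 as int)"));
// phase 1: partial aggregation per (keys + salt) spreads each hot key over 20 tasks
DataFrame partial = salted
    .groupBy("col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8", "salt")
    .agg(sum("val").alias("partial_sum"));
// phase 2: final aggregation over the real keys combines the partial results
DataFrame result = partial
    .groupBy("col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8")
    .agg(sum("partial_sum").alias("total"));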

How to change Spark DataFrame groupby("col1",..,"coln") into reduceByKey()?

2016-05-22 Thread unk1102
Hi, I have a Spark job which does a group by, and I can't avoid it because of my use case. I have a large dataset, around 1 TB, which I need to process/update in a DataFrame. My job shuffles huge amounts of data and slows down because of the shuffling and group by. One reason I see is that my data is skewed; some of my group by…
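
If the aggregate is associative (e.g. a sum), one way to express the groupBy as reduceByKey is to drop to a pair RDD with a composite key, which combines values map-side before the shuffle. A rough sketch using Java 8 lambdas; the column positions and types are hypothetical (note that DataFrame aggregations already do partial aggregation internally, so it is worth measuring before switching):

import scala.Tuple2;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.sql.Row;

JavaPairRDD<String, Double> reduced = df.toJavaRDD()
    .mapToPair(row -> {
      // build a composite key from the group-by columns
      String key = row.getString(0) + "|" + row.getString(1) + "|" + row.getString(2);
      return new Tuple2<>(key, row.getDouble(3));
    })
    .reduceByKey((a, b) -> a + b); // partial aggregation happens before the shuffle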

Best practices to restart Spark jobs programmatically from driver itself

2016-07-20 Thread unk1102
Hi, I have multiple long-running Spark jobs which often hang because of a multi-tenant Hadoop cluster and resource scarcity. I am thinking of restarting the Spark job from within the driver itself. For example, if the Spark job does not write output files for, say, 30 minutes, then I want it to restart itself…

How to give name to Spark jobs shown in Spark UI

2016-07-23 Thread unk1102
Hi, I have multiple child Spark jobs running at a time. Is there any way to name these child Spark jobs so I can identify the slow-running ones? For e.g. xyz_saveAsTextFile(), abc_saveAsTextFile(), etc. Please guide. Thanks in advance.

Re: How to give name to Spark jobs shown in Spark UI

2016-07-27 Thread unk1102
Thanks Rahul, but I think you didn't read the question properly. I have one main Spark job which I name using the approach you described. As part of the main Spark job I create multiple threads which essentially become child Spark jobs, and those jobs have no direct way of being named. On Jul 27, 2016 11:17, "rahulkum…
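
The job group and description properties are thread-local, so each worker thread can label the jobs it submits before calling its action; a small sketch (group ids, descriptions, and paths are illustrative):

// inside one child thread, before triggering its action:
sc.setJobGroup("xyz-export", "xyz saveAsTextFile", false);
xyzRdd.saveAsTextFile("/out/xyz");

// inside another child thread:
sc.setJobGroup("abc-export", "abc saveAsTextFile", false);
abcRdd.saveAsTextFile("/out/abc");

The group id and description then show up against those jobs in the Spark UI.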

How to call mapPartitions on DataFrame?

2015-12-23 Thread unk1102
Hi, I have the following code where I use mapPartitions on an RDD, but then I need to convert it back into a DataFrame. Why do I need to convert the DataFrame into an RDD and back into a DataFrame just to call mapPartitions? Why can't I call it directly on the DataFrame? sourceFrame.toJavaRDD().mapPartitions(new FlatM…
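
In the 1.x Java API, DataFrame does not expose mapPartitions directly, so the round trip through JavaRDD<Row> is the usual route; the important part is reusing the original schema when rebuilding the DataFrame. A sketch (in Spark 1.x the FlatMapFunction returns an Iterable, in 2.x an Iterator):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;

JavaRDD<Row> mapped = sourceFrame.toJavaRDD().mapPartitions(
    new FlatMapFunction<Iterator<Row>, Row>() {
      @Override
      public Iterable<Row> call(Iterator<Row> rows) {
        List<Row> out = new ArrayList<Row>();
        while (rows.hasNext()) {
          Row r = rows.next();
          out.add(r); // transform the row here
        }
        return out;
      }
    });

// rebuild the DataFrame with the original schema
DataFrame resultFrame = hiveContext.createDataFrame(mapped, sourceFrame.schema());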

Spark DataFrame callUdf does not compile?

2015-12-28 Thread unk1102
Hi, I am trying to invoke a Hive UDF using dataframe.select(callUdf("percentile_approx", col("C1"), lit(0.25))) but it does not compile; however, the same call works in the Spark Scala console. I don't understand why…

How to find cause (waiting threads etc.) of hanging job for 7 hours?

2016-01-01 Thread unk1102
Hi, I have a Spark job which hangs for around 7 hours or more until it is killed by Autosys because of a timeout. The data is not huge; I am sure it gets stuck because of GC, but I can't find the source code which causes the GC pressure. I am reusing almost all variables, trying to minimize creating local objects, thou…

Re: How to find cause (waiting threads etc.) of hanging job for 7 hours?

2016-01-01 Thread unk1102
Sorry, please see the attached waiting-thread log.

Spark on Apache Ignite?

2016-01-05 Thread unk1102
Hi, has anybody tried and had success with Spark on Apache Ignite? It seems promising: https://ignite.apache.org/

coalesce(1).saveAsTextFile() takes forever?

2016-01-05 Thread unk1102
Hi, I am trying to save many partitions of a DataFrame into one CSV file and it takes forever for large datasets of around 5-6 GB. sourceFrame.coalesce(1).write().format("com.databricks.spark.csv").option("gzip").save("/path/hadoop") For small data the above code works well, but for large data it hangs f…
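
One alternative to coalesce(1) is to let Spark write its partitions in parallel and merge them afterwards on HDFS, e.g. with Hadoop's FileUtil.copyMerge (available in Hadoop 2.x); a sketch with placeholder paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

// write with the natural parallelism instead of coalesce(1)
sourceFrame.write().format("com.databricks.spark.csv").save("/tmp/csv_parts");

// then merge the part files into a single output file
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
FileUtil.copyMerge(fs, new Path("/tmp/csv_parts"),
                   fs, new Path("/output/final.csv"),
                   true,   // delete the source part files after merging
                   conf, null);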

What should be the ideal value (unit) for spark.memory.offHeap.size

2016-01-06 Thread unk1102
Hi, as part of the Spark 1.6 release, what should be the ideal value or unit for spark.memory.offHeap.size? I have set it as 5000; I assume it will be 5 GB, is that correct? Please guide.
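
For reference, spark.memory.offHeap.size is specified in bytes and only takes effect when spark.memory.offHeap.enabled is true, so 5000 would mean 5000 bytes; a 5 GB setting would look roughly like:

SparkConf conf = new SparkConf()
    .set("spark.memory.offHeap.enabled", "true")
    .set("spark.memory.offHeap.size", "5368709120"); // bytes, i.e. 5 GB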

Why has this job been running for an hour?

2016-01-06 Thread unk1102
Hi, I have one main Spark job which spawns multiple child Spark jobs. One of the child Spark jobs has been running for an hour and keeps hanging there. I have taken a snapshot, please see.

Do we need to enable Tungsten sort in Spark 1.6?

2016-01-08 Thread unk1102
Hi, I was using Spark 1.5 with Tungsten sort and now I am using Spark 1.6. I don't see any difference; I was expecting Spark 1.6 to be faster. Anyway, do we need to enable the Tungsten and unsafe options, or are they enabled by default? I see in the documentation that the default sort manager is "sort"; I thought it was…

How to optimize and make this code faster using coalesce(1) and mapPartitionsWithIndex

2016-01-12 Thread unk1102
Hi, I have the following code which I run as part of a thread that becomes a child job of my main Spark job. It takes hours to run for large data of around 1-2 GB because of coalesce(1); if the data is in MB/KB then it finishes faster. With larger dataset sizes it sometimes does not complete at all. Please guide…

Re: How to optimize and make this code faster using coalesce(1) and mapPartitionsWithIndex

2016-01-13 Thread unk1102
Hi, thanks for the reply. Actually I can't share details, as it is classified and pretty complex to understand; it is not a general problem I am trying to solve, it is related to dynamic SQL order of execution against a database. I need to use Spark because my other jobs, which don't use coalesce, use Spark. My source data is h…

Can we use localIterator when we need to process data in one partition?

2016-01-14 Thread unk1102
Hi, I have a special requirement where I need to process data in one partition at the end, after doing a lot of filtering, updating, etc. in a DataFrame. Currently, to process data in one partition I am using coalesce(1), which is killing performance and painfully slow; my job hangs for hours, even 5-6 hours, and I don't kno…
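
If the final step only needs to see the rows sequentially (rather than as a single Spark partition), toLocalIterator streams one partition at a time to the driver without a coalesce(1) shuffle; a sketch:

import java.util.Iterator;
import org.apache.spark.sql.Row;

// after all the filtering/updating on the DataFrame
Iterator<Row> it = filteredFrame.toJavaRDD().toLocalIterator();
while (it.hasNext()) {
  Row row = it.next();
  // sequential, driver-side processing of each row;
  // only one partition needs to fit in driver memory at a time
}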

Spark MLlib: ideal way to convert categorical features into a LabeledPoint RDD?

2016-02-01 Thread unk1102
Hi, I have a dataset which is completely categorical; it does not contain even one numerical column. Now I want to apply classification using Naive Bayes; I have to predict whether a given alert is actionable or not using YES/NO. I have the following example of my dataset: DayOfWeek(int), AlertType(S…

Spark Streaming with Druid?

2016-02-06 Thread unk1102
Hi, did anybody try Spark Streaming with Druid as a low-latency store? The combination seems powerful; is it worth trying both together? Please guide and share your experience. I am after creating the best low-latency streaming analytics.

jssc.textFileStream(directory): how to ensure it reads all incoming files entirely

2016-02-09 Thread unk1102
Hi, my actual use case is streaming text files from an HDFS directory and sending them to Kafka; please let me know if there is any existing solution for this. Anyway, I have the following code: // let's assume the directory contains one file a.txt and it has 100 lines JavaDStream<String> logData = jssc.textFileStream(dir…

Re: how to send JavaDStream RDD using foreachRDD using Java

2016-02-09 Thread unk1102
Hi Sachin, how did you write to Kafka from Spark? I can't find the methods sendString and sendDataAsString in KafkaUtils; can you please guide? KafkaUtil.sendString(p, topic, result.get(0)); KafkaUtils.sendDataAsString(MTP, topicName, result.get(0));
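
KafkaUtils only covers the consumer side; writing is normally done with the plain Kafka producer API inside foreachRDD/foreachPartition, one producer per partition. A rough sketch assuming a JavaDStream<String>, the kafka-clients library on the classpath, the VoidFunction overloads available in Spark 1.6+ (Java 8 lambdas), and placeholder broker/topic names:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

javaDStream.foreachRDD(rdd -> {
  rdd.foreachPartition(records -> {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker1:9092");
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    KafkaProducer<String, String> producer = new KafkaProducer<>(props);
    while (records.hasNext()) {
      producer.send(new ProducerRecord<>("myTopic", records.next()));
    }
    producer.close(); // one producer per partition, closed after the batch
  });
});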

Spark 1.6.0 running jobs in YARN shows negative number of tasks in executors

2016-02-25 Thread unk1102
Hi, I have a Spark job which I run on YARN, and sometimes it behaves in a weird manner: it shows a negative number of tasks in a few executors, and I keep losing executors. I also see that the number of executors is more than I requested. My job is highly tuned, not getting OOM or any other problem. It is just that YARN behaves in a w…

Best practices to call small spark jobs as part of REST api

2015-09-29 Thread unk1102
Hi, I would like to know any best practices for calling Spark jobs from a REST API. My Spark jobs return results as JSON, and that JSON can be used by a UI application. Should we even have a direct HDFS/Spark backend layer in the UI for on-demand queries? Please guide. Thanks much.

Hive ORC Malformed error while loading into Spark DataFrame

2015-09-29 Thread unk1102
Hi, I have a Spark job which creates Hive tables in ORC format with partitions. It works well; I can read the data back from the Hive table using the Hive console. But if I try to further process the ORC files generated by the Spark job by loading them into a DataFrame, then I get the following exception: Caused by: java.io.IOExc…

How to save DataFrame as a table in HBase?

2015-10-01 Thread unk1102
Hi, has anybody tried to save a DataFrame in HBase? I have processed data in a DataFrame which I need to store in HBase so that my web UI can access it from HBase. Please guide. Thanks in advance.

How to use registered Hive UDF in Spark DataFrame?

2015-10-02 Thread unk1102
Hi, I have registered my Hive UDF using the following code: hiveContext.udf().register("MyUDF", new UDF1<String, String>() { public String call(String o) throws Exception { // bla bla } }, DataTypes.StringType); Now I want to use the above MyUDF in a DataFrame. How do we use it? I know how to use it in SQL and i…
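
Once registered through hiveContext.udf().register, the UDF can be referenced from the DataFrame API with callUDF (named callUdf in some 1.x releases) from org.apache.spark.sql.functions; a sketch with illustrative column names:

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

DataFrame withUdf = df.withColumn("converted", callUDF("MyUDF", col("someStringCol")));
// or inside a select:
DataFrame selected = df.select(col("id"), callUDF("MyUDF", col("someStringCol")).alias("converted"));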

How to optimize group by query fired using hiveContext.sql?

2015-10-03 Thread unk1102
Hi, I have a couple of Spark jobs which use a group by query that is fired from hiveContext.sql(). Now I know group by is evil, but in my use case I can't avoid it; I have around 7-8 fields on which I need to group by. Also, I am using df1.except(df2), which also seems to be a heavy operation and does…

Can we use Spark Streaming to stream data from Hive table partitions?

2015-10-03 Thread unk1102
Hi, I have a couple of Spark jobs which read Hive table partition data and process it independently in different threads in the driver. The data to process is huge, in terms of TB, and my jobs are not scaling and are running slowly. So I am thinking of using Spark Streaming to process data as and when it is added to a Hive par…

ORC files created by Spark job can't be accessed using hive table

2015-10-06 Thread unk1102
Hi, I have a Spark job which creates ORC files in partitions using the following code: dataFrame.write().mode(SaveMode.Append).partitionBy("entity","date").format("orc").save("baseTable"); The above code successfully creates ORC files which are readable in a Spark DataFrame, but when I try to load the ORC f…

How to avoid Spark shuffle spill memory?

2015-10-06 Thread unk1102
Hi, I have a Spark job which runs for around 4 hours; it has a shared SparkContext and runs many child jobs. When I look at each job in the UI I see a shuffle spill of around 30 to 40 GB, and because of that executors often get lost for using physical memory beyond limits. How do I avoid shuffle spill…

How to increase Spark partitions for the DataFrame?

2015-10-08 Thread unk1102
Hi, I have the following code where I read ORC files from HDFS, and it loads a directory which contains 12 ORC files. Since the HDFS directory contains 12 files, it will create 12 partitions by default. This directory is huge, and when the ORC files get decompressed it becomes around 10 GB. How do I increase…
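
DataFrame.repartition lets you raise the parallelism explicitly after the load; a sketch (the partition count is illustrative and is usually sized to a few tasks per available core):

DataFrame orcFrame = hiveContext.read().format("orc").load("/hdfs/path/to/orc/dir");
// redistribute the 12 input splits into more, smaller partitions
DataFrame repartitioned = orcFrame.repartition(96);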

Why does dataframe.persist(StorageLevels.MEMORY_AND_DISK_SER) hang for a long time?

2015-10-08 Thread unk1102
Hi, as recommended I am caching my Spark job's DataFrame with dataframe.persist(StorageLevels.MEMORY_AND_DISK_SER), but what I see in the Spark job UI is that this persist stage runs for very long, showing 10 GB of shuffle read and 5 GB of shuffle write. It takes too long to finish, and because of that sometimes my Spa…

How to calculate percentile of a column of DataFrame?

2015-10-09 Thread unk1102
Hi, how do I calculate the percentile of a column in a DataFrame? I can't find any percentile_approx function in the Spark aggregation functions. For e.g. in Hive we have percentile_approx and we can use it in the following way: hiveContext.sql("select percentile_approx(mycol, 0.25) from myTable"); I can see…

How to tune unavoidable group by query?

2015-10-09 Thread unk1102
Hi, I have the following group by query which I tried both using the DataFrame API and hiveContext.sql(), but both shuffle huge amounts of data and are slow. I have around 8 fields passed in as group by fields: sourceFrame.select("blabla").groupBy("col1","col2","col3",..."col8").agg("bla bla"); OR hiveContext…

callUdf("percentile_approx",col("mycol"),lit(0.25)) does not compile spark 1.5.1 source but it does work in spark 1.5.1 bin

2015-10-18 Thread unk1102
Hi, starting a new thread following the old thread. It looks like the code for compiling callUdf("percentile_approx", col("mycol"), lit(0.25)) is not merged into the Spark 1.5.1 source, but I don't understand why this function call works in the Spark 1.5.1 spark-shell/bin. Please guide.

Spark can't write ORC files properly using 1.5.1 / Hadoop 2.6

2015-10-23 Thread unk1102
Hi, I am having a weird issue. I have a Spark job which has a bunch of hiveContext.sql() calls and creates ORC files as part of Hive tables with partitions, and it runs fine on 1.4.1 and Hadoop 2.4. Now I tried to move to Spark 1.5.1/Hadoop 2.6 and the Spark job does not work as expected: it does not create ORC files…

Spark 1.5.1 hadoop 2.4 does not clear hive staging files after job finishes

2015-10-26 Thread unk1102
Hi, I have a Spark job which creates Hive table partitions. I have switched to Spark 1.5.1, and Spark 1.5.1 creates many Hive staging files and doesn't delete them after the job finishes. Is it a bug, or do I need to disable something to prevent the Hive staging files from getting created, or at least de…

How to increase active job count to make spark job faster?

2015-10-27 Thread unk1102
Hi, I have a long-running Spark job which processes Hadoop ORC files and creates Hive partitions. Even though I have created an ExecutorService thread pool and use a pool of 15 threads, I see the active job count as always 1, which makes the job slow. How do I increase the active job count in the UI? I remember earlier it us…
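
Jobs submitted from separate threads can run concurrently if the scheduler and resources allow it; a common setup is the FAIR scheduler plus a per-thread pool, roughly (the pool name and threadId variable are illustrative):

SparkConf conf = new SparkConf().set("spark.scheduler.mode", "FAIR");
// ...build the SparkContext/HiveContext from this conf...

// inside each worker thread, before submitting its job:
sc.setLocalProperty("spark.scheduler.pool", "pool-" + threadId);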

Why does a Spark job get stuck waiting only for the last tasks to finish?

2015-12-03 Thread unk1102
Hi, I have a Spark job where I keep a queue of 12 Spark jobs to execute in parallel. Now I see a job is almost completed and only one task is pending, and because of the last task the job keeps on waiting, as I can see in the UI. Please see the attached snaps. Please help me resolve Spark jobs waiting on the last tas…

No support to save DataFrame in existing database table using DataFrameWriter.jdbc()

2015-12-06 Thread unk1102
Hi, I would like to store/save a DataFrame in a database table which has already been created, and I always want to insert into it without creating the table every time. Unfortunately the Spark API forces me to create the table every time. I have looked at the Spark source code: the following calls use the same method beneath, if you caref…
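
For reference, in later Spark versions the DataFrameWriter honours SaveMode.Append for JDBC and inserts into the existing table instead of recreating it; a sketch with placeholder connection details:

import java.util.Properties;
import org.apache.spark.sql.SaveMode;

Properties connProps = new Properties();
connProps.setProperty("user", "dbuser");
connProps.setProperty("password", "dbpass");

df.write()
  .mode(SaveMode.Append) // insert into the existing table, do not recreate it
  .jdbc("jdbc:postgresql://dbhost:5432/mydb", "my_existing_table", connProps);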

How to make this Spark 1.5.2 code fast and shuffle less data

2015-12-10 Thread unk1102
Hi, I have a Spark job which reads Hive ORC data, processes it, and generates a CSV file at the end. These ORC files are Hive partitions, and I have around 2000 partitions to process every day. The total size of these Hive partitions is around 800 GB in HDFS. I have the following method code which I call from a…

What is the correct syntax of using Spark streamingContext.fileStream()?

2015-07-20 Thread unk1102
Hi, I am trying to find the correct way to use the Spark Streaming API streamingContext.fileStream(String, Class, Class, Class). I tried to find an example but could not find one anywhere in the Spark documentation. I have to stream files in HDFS which are of a custom Hadoop format. JavaPairDStream input = stre…
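
For a custom Hadoop InputFormat, the three class parameters are the key class, the value class, and the InputFormat class (from the org.apache.hadoop.mapreduce API); a sketch using TextInputFormat as a stand-in for the custom format, with a placeholder directory:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;

JavaPairInputDStream<LongWritable, Text> input =
    jssc.fileStream("/hdfs/incoming/dir",
                    LongWritable.class, Text.class, TextInputFormat.class);
// for a custom format, substitute MyKey.class, MyValue.class, MyInputFormat.class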

SparkR sqlContext or sc not found in RStudio

2015-07-21 Thread unk1102
Hi, I could successfully install the SparkR package into my RStudio, but I could not execute anything against sc or sqlContext. I did the following: Sys.setenv(SPARK_HOME="/path/to/spark-1.4.1") .libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R","lib"), .libPaths())) library(SparkR) The above code installs…

Re: SparkR sqlContext or sc not found in RStudio

2015-07-21 Thread unk1102
Hi, thanks for the reply. I did download it from GitHub, built it, and it is working fine; I can use spark-submit etc. When I use it in RStudio I don't know why it says sqlContext not found. When I do the following: > sqlContext <- sparkRSQL.init(sc) Error: object sqlContext not found If I do the follo…

Spark DataFrame created from JavaRDD copies all columns data into first column

2015-07-22 Thread unk1102
Hi, I have a DataFrame which I need to convert into a JavaRDD and back to a DataFrame. I have the following code: DataFrame sourceFrame = hiveContext.read().format("orc").load("/path/to/orc/file"); // I do an order by on the above sourceFrame and then I convert it into a JavaRDD JavaRDD<Row> modifiedRDD = sourceFrame.t…
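
When going back from JavaRDD<Row> to DataFrame, passing the original schema to createDataFrame keeps each value in its own column; a sketch:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.StructType;

StructType schema = sourceFrame.schema();           // capture the original column layout
JavaRDD<Row> modifiedRDD = sourceFrame.toJavaRDD(); // ...apply row transformations here...
DataFrame rebuilt = hiveContext.createDataFrame(modifiedRDD, schema);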

Spark Streaming Kafka could not find leader offset for Set()

2015-07-29 Thread unk1102
Hi, I have Spark Streaming code which streams from a Kafka topic. It used to work fine but suddenly started throwing the following exception: Exception in thread "main" org.apache.spark.SparkException: org.apache.spark.SparkException: Couldn't find leader offsets for Set() at org.apache.spark.st…

How to control Spark Executors from getting Lost when using YARN client mode?

2015-07-30 Thread unk1102
Hi, I have one Spark job which runs fine locally with less data, but when I schedule it on YARN to execute I keep getting the following ERROR, and slowly all executors get removed from the UI and my job fails: 15/07/30 10:18:13 ERROR cluster.YarnScheduler: Lost executor 8 on myhost1.com: remote Rpc cl…
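
On YARN this pattern is often the executor container being killed for exceeding its memory allowance because of off-heap usage; raising the overhead allowance (and the network timeout) is the usual first step. A sketch with illustrative values, using the 1.x property names:

SparkConf conf = new SparkConf()
    .set("spark.yarn.executor.memoryOverhead", "2048") // MB added on top of executor memory
    .set("spark.network.timeout", "600s");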

How to create Spark DataFrame using custom Hadoop InputFormat?

2015-07-31 Thread unk1102
Hi, I have my own custom Hadoop InputFormat which I need to use in creating a DataFrame. I tried to do the following: JavaPairRDD<Void, MyRecordWritable> myFormatAsPairRdd = jsc.hadoopFile("hdfs://tmp/data/myformat.xyz", MyInputFormat.class, Void.class, MyRecordWritable.class); JavaRDD<MyRecordWritable> myformatRdd = myFormatAsPairRdd.valu…

Best practices to call hiveContext in DataFrame.foreach in executor program or how to have a for loop in driver program

2015-08-05 Thread unk1102
Hi, I have the following code which fires hiveContext.sql() most of the time. My task is: I want to create a few tables and insert values into them after processing, for all Hive table partitions. So I first fire "show partitions" and, using its output in a for loop, I call a few methods which create the table if it does not e…

How to create DataFrame from a binary file?

2015-08-08 Thread unk1102
Hi, how do we create a DataFrame from a binary file stored in HDFS? I was thinking of using: JavaPairRDD<String, PortableDataStream> pairRdd = javaSparkContext.binaryFiles("/hdfs/path/to/binfile"); JavaRDD<PortableDataStream> javardd = pairRdd.values(); I can see that PortableDataStream has a method called toArray which can convert it into a byte array. I w…

How to use custom Hadoop InputFormat in DataFrame?

2015-08-10 Thread unk1102
Hi, I have my own custom Hadoop InputFormat which I want to use with DataFrame. How do we do that? I know I can use sc.hadoopFile(..), but then how do I convert it into a DataFrame? JavaPairRDD<Void, MyRecordWritable> myFormatAsPairRdd = jsc.hadoopFile("hdfs://tmp/data/myformat.xyz", MyInputFormat.class, Void.class, MyRecordWritab…

Spark executor lost because of timeout even after setting quite a long timeout value of 1000 seconds

2015-08-16 Thread unk1102
Hi, I have written a Spark job which seems to work fine for almost an hour, and after that executors start getting lost because of timeouts. I see the following log statement: 15/08/16 12:26:46 WARN spark.HeartbeatReceiver: Removing executor 10 with no recent heartbeats: 1051638 ms exceeds timeou…

Example code to spawn multiple threads in driver program

2015-08-16 Thread unk1102
Hi, I have a Spark driver program which has one loop that iterates around 2000 times, and for those two thousand iterations it executes jobs in YARN. Since the loop will do the job serially, I want to introduce parallelism. If I create 2000 tasks/runnables/callables in my Spark driver program, will they get executed i…
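
The Spark scheduler is thread-safe, so actions submitted from a bounded thread pool in the driver do run as concurrent jobs; a minimal sketch, where partitionsToProcess and processPartition() are hypothetical placeholders for the per-iteration work:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

ExecutorService pool = Executors.newFixedThreadPool(10); // bound the concurrency, not 2000 threads
List<Future<?>> futures = new ArrayList<>();
for (final String partition : partitionsToProcess) {
  futures.add(pool.submit(new Runnable() {
    public void run() {
      processPartition(partition); // triggers Spark actions for this unit of work
    }
  }));
}
for (Future<?> f : futures) {
  try { f.get(); } catch (Exception e) { /* log/handle a failed child job */ }
}
pool.shutdown();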

Re: Apache Spark - Parallel Processing of messages from Kafka - Java

2015-08-17 Thread unk1102
val numStreams = 4 val kafkaStreams = (1 to numStreams).map { i => KafkaUtils.createStream(...) } In Java, in a for loop you would create four streams using KafkaUtils.createStream() so that each receiver runs in a different thread. For more information please visit http://spark.apache.org/doc…
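
A Java version of that pattern (receiver-based KafkaUtils.createStream from the spark-streaming-kafka 1.x module; the ZooKeeper quorum, group id, and topic map are placeholders) looks roughly like:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.kafka.KafkaUtils;

Map<String, Integer> topics = new HashMap<>();
topics.put("myTopic", 1); // topic -> threads per receiver

int numStreams = 4;
List<JavaPairDStream<String, String>> streams = new ArrayList<>();
for (int i = 0; i < numStreams; i++) {
  streams.add(KafkaUtils.createStream(jssc, "zk1:2181", "myGroup", topics));
}
// union the four receiver streams into one DStream for downstream processing
JavaPairDStream<String, String> unified =
    jssc.union(streams.get(0), streams.subList(1, streams.size()));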

Calling hiveContext.sql("insert into table xyz...") in multiple threads?

2015-08-17 Thread unk1102
Hi, I have around 2000 Hive source partitions to process, inserting data into the same table but different partitions. For e.g. I have the following query: hiveContext.sql("insert into table myTable partition(mypartition='somepartition') bla bla") If I call the above query in the Spark driver program it runs fine…

Spark executor lost because of GC overhead limit exceeded even though using 20 executors with 25 GB each

2015-08-18 Thread unk1102
Hi, this GC overhead limit error is driving me crazy. I have 20 executors using 25 GB each; I don't understand at all how it can throw a GC overhead error, and I also don't have that big a dataset. Once this GC error occurs in an executor it gets lost, and slowly other executors get lost because of IOException, R…

How to avoid executor timeouts on YARN Spark while dealing with large, skewed shuffle data?

2015-08-19 Thread unk1102
I have one Spark job which seems to run fine, but after an hour or so executors start getting lost because of timeouts, with something like the following error: cluster.YarnScheduler: Removing an executor 14 65 timeout exceeds 60 seconds. Because of the above error a couple of chained errors start…

spark.sql.shuffle.partitions=1 seems to be working fine but creates timeout for large skewed data

2015-08-19 Thread unk1102
Hi, I have a Spark job which deals with a large, skewed dataset. I have around 1000 Hive partitions to process across four different tables every day. So if I go with the default of 200 spark.sql.shuffle.partitions created by Spark, I end up with 4 * 1000 * 200 = 800,000 small files in HDFS, which wo…

Spark YARN executors are not launching when using +UseG1GC

2015-08-23 Thread unk1102
Hi, I am hitting an issue of long GC pauses in my Spark job, and because of it YARN is killing executors one by one and the Spark job becomes slower and slower. I came across this article where they mention using G1GC; I tried to use the same command but something seems wrong: https://databricks.com/…
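
For reference, the GC flags go into the executor JVM options, and heap size must not be set there (-Xmx in extraJavaOptions is rejected; use spark.executor.memory instead); roughly:

SparkConf conf = new SparkConf()
    .set("spark.executor.memory", "20g")
    .set("spark.executor.extraJavaOptions",
         "-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps");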

Spark executor OOM issue on YARN

2015-08-31 Thread unk1102
Hi, I have a Spark job whose executors hit an OOM issue after some time, and my job hangs because of it, followed by a couple of IOException, Rpc client disassociated, shuffle not found, etc. errors. I have tried almost everything and don't know how to solve this OOM issue; please guide, I am fed up now. Here is what I tr…

What should be the optimal value for spark.sql.shuffle.partitions?

2015-09-01 Thread unk1102
Hi, I am using Spark SQL, actually hiveContext.sql(), which uses group by queries, and I am running into OOM issues. So I am thinking of increasing the value of spark.sql.shuffle.partitions from the default of 200 to 1000, but it is not helping. Please correct me if I am wrong: these partitions will share the data shuffle loa…

Spark DataFrame saveAsTable with partitionBy creates no ORC file in HDFS

2015-09-02 Thread unk1102
Hi, I have a Spark DataFrame which I want to save as a Hive table with partitions. I tried the following two statements but they don't work; I don't see any ORC files in the HDFS directory, it is empty. I can see baseTable is there in the Hive console, but obviously it is empty because there are no files in HDFS. The fol…

Why is huge data shuffling in Spark when using union()/coalesce(1,false) on DataFrame?

2015-09-04 Thread unk1102
Hi, I have a Spark job which does some processing on ORC data and stores ORC data back using the DataFrameWriter save() API introduced in Spark 1.4.0. I have the following piece of code which uses heavy shuffle memory. How do I optimize the code below? Is there anything wrong with it? It is working fine a…

NPE while reading ORC file using Spark 1.4 API

2015-09-08 Thread unk1102
Hi, I read many ORC files in Spark and process them; those files are basically Hive partitions. Most of the time processing goes well, but for a few files I get the following exception and I don't know why. These files work fine in Hive using Hive queries. Please guide. Thanks in advance. DataFrame df =…

Spark rdd.mapPartitionsWithIndex() hits physical memory limit after huge data shuffle

2015-09-09 Thread unk1102
Hi, I have the following Spark code which involves huge data shuffling even though I am using mapPartitionsWithIndex() with shuffle = false. I have 2 TB of skewed data to process, after which I convert the RDD into a DataFrame and use it as a table in hiveContext.sql(). I am using 60 executors with 20 GB memory and 4 co…

How to enable Tungsten in Spark 1.5 for Spark SQL?

2015-09-10 Thread unk1102
Hi, Spark 1.5 looks promising. How do we enable Project Tungsten for Spark SQL, or is it enabled by default? Please guide.

How to create broadcast variable from Java String array?

2015-09-12 Thread unk1102
Hi, I have a Java String array which contains 45 strings, which is basically a schema: String[] fieldNames = {"col1","col2",...}; Currently I am storing the above array of Strings in a static field in the driver. My job is running slowly, so I am trying to refactor the code. I am using the String array in creating a DataFrame: DataFr…
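
Instead of a static driver field, the array can be shipped once per executor as a broadcast variable; a sketch:

import org.apache.spark.broadcast.Broadcast;

String[] fieldNames = {"col1", "col2" /* , ... 45 columns */};
final Broadcast<String[]> schemaBroadcast = jsc.broadcast(fieldNames);

// inside any function running on the executors:
String[] cols = schemaBroadcast.value();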

How to debug a YARN client OOM issue?

2015-09-12 Thread unk1102
Hi, my Spark job runs fine for an hour or so and then starts losing executors because of OOM. I want to debug this issue and find where the OOM is happening. I heard we can use VisualVM, JConsole, etc.; how do we use them with Spark? I am using YARN client mode to submit my Spark job. How do I set the JMX params? I mean, it should p…

Why my Spark job is slow and it throws OOM which leads YARN killing executors?

2015-09-12 Thread unk1102
Hi, I have the following Spark driver program/job which reads ORC files (i.e. Hive partitions as HDFS directories), processes them in a DataFrame, and uses them as a table in hiveContext.sql(). The job runs fine and gives correct results, but it hits the physical memory limit after an hour or so and YARN kills execut…

How to use a Hive UDF in a Spark DataFrame?

2015-09-13 Thread unk1102
Hi, I am using a UDF in a hiveContext.sql("") query; inside it uses group by, which forces a huge data shuffle read of around 30 GB. I am thinking of converting the above query to the DataFrame API so that I avoid using group by. How do we use a Hive UDF in a Spark DataFrame? Please guide. Thanks much.

Best way to merge final output part files created by Spark job

2015-09-13 Thread unk1102
Hi, I have a Spark job which creates around 500 part files inside each directory I process, and I have thousands of such directories. So I need to merge these 500 small part files. I am using spark.sql.shuffle.partitions of 500, and my final small files are ORC files. Is there a way to merge ORC…

Best practices for scheduling Spark jobs on "shared" YARN cluster using Autosys

2015-09-25 Thread unk1102
Hi, I have 5 Spark jobs which need to be run in parallel to speed up the process; they take around 6-8 hours together. I have 93 container nodes with 8 cores each and a memory capacity of around 2.8 TB. Now I run each job with around 30 executors with 2 cores and 20 GB each. Each of my jobs processes around 1…

Parquet error while saving in HDFS

2017-07-24 Thread unk1102
Hi, I am getting the following error and I am not sure why; it seems like a race condition, but I don't use any threads, just one thread which owns the SparkContext and writes to HDFS with one Parquet partition. I am using Scala 2.10 and Spark 1.5.1. Please guide. Thanks in advance. java.io.IOException: The file being…

org.apache.spark.SparkException: Task failed while writing rows

2018-02-28 Thread unk1102
Hi, I am getting the following exception when I try to write a DataFrame using the following code. Please guide. I am using Spark 2.2.0. df.write.format("parquet").mode(SaveMode.Append); org.apache.spark.SparkException: Task failed while writing rows at org.apache.spark.sql.execution.datasources.Fil…

Re: org.apache.spark.SparkException: Task failed while writing rows

2018-02-28 Thread unk1102
Hi, thanks for the reply. I only see NPE and "Task failed while writing rows" all over the place; I don't see any other errors except SparkException: job aborted, followed by the two exceptions I pasted earlier.

Re: org.apache.spark.SparkException: Task failed while writing rows

2018-02-28 Thread unk1102
Hi, thanks Vadim, you are right. I already looked at that line, 468; I don't see any code there, it is just a comment. Yes, I am sure I am using all spark-* jars built for Spark 2.2.0 and Scala 2.11. I am also unfortunately stuck with these errors and not sure how to solve them.

Re: org.apache.spark.SparkException: Task failed while writing rows

2018-02-28 Thread unk1102
Hi Vadim, thanks. I use the Hortonworks package. I don't think there are any segfaults, as the DataFrame I am trying to write is very small in size. Can it still cause a segfault?

Best practices for dealing with large no of PDF files

2018-04-23 Thread unk1102
Hi, I need guidance on dealing with a large number of PDF files when using Hadoop and Spark. Can I store them as binary files using sc.binaryFiles and then convert them to text using PDF parsers like Apache Tika or PDFBox, etc.? Or should I convert them into text using these parsers and store them as text files? But in doing so…
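
One way to wire this together is sc.binaryFiles plus the Tika facade inside a map, keeping the PDFs as binary in HDFS and extracting text on the fly; a rough sketch assuming the tika-parsers dependency is on the classpath and the paths are placeholders:

import java.io.InputStream;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.input.PortableDataStream;
import org.apache.tika.Tika;

JavaPairRDD<String, PortableDataStream> pdfs = jsc.binaryFiles("hdfs:///data/pdfs");
JavaPairRDD<String, String> texts = pdfs.mapValues(pds -> {
  Tika tika = new Tika();
  try (InputStream in = pds.open()) {
    return tika.parseToString(in); // extract plain text from the PDF bytes
  }
});
texts.saveAsTextFile("hdfs:///data/pdf_text");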

Re: Best practices for dealing with large no of PDF files

2018-04-23 Thread unk1102
Hi Nicolas, thanks much for the reply. Do you have any sample code somewhere? Do you just keep the PDF in Avro binary all the time? How often do you parse it into text using PDFBox? Is it on an on-demand basis, or do you always parse to text and keep the PDF as binary in Avro just as an interim state?

Re: Best practices for dealing with large no of PDF files

2018-04-23 Thread unk1102
Hi Nicolas, thanks much for the guidance, it was very useful information. If you can push that code to GitHub and share the URL it would be a great help. Looking forward to it. If you can find time to push it early it would be an even greater help, as I have to finish a POC on this use case ASAP.

Re: Best practices for dealing with large no of PDF files

2018-04-23 Thread unk1102
Thanks much Nicolas, really appreciate it.

Best practices to keep multiple version of schema in Spark

2018-04-30 Thread unk1102
Hi, I have a couple of datasets where the schema keeps changing, and I store them as Parquet files. I use the mergeSchema option while loading these Parquet files with different schemas into a DataFrame, and it all works fine. Now I have a requirement to maintain the differences between schemas over time, basically m…
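
For reference, the merged view can be built at read time and the combined schema inspected afterwards, which is one way to see how the schema has drifted between versions; a sketch with placeholder paths:

DataFrame merged = sqlContext.read()
    .option("mergeSchema", "true")
    .parquet("/data/events/v1", "/data/events/v2");

// the unified schema contains the superset of columns across versions
merged.printSchema();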

Spark horizontal scaling is not supported in which cluster mode?

2018-05-21 Thread unk1102
Hi, I came across one Spark question about which Spark cluster manager does not support horizontal scalability. The answer options were Mesos, YARN, Standalone, and local mode. I believe all cluster managers are horizontally scalable; please correct me if I am wrong. I think the answer is local mode. Is…

Is Spark DataFrame limit function action or transformation?

2018-05-31 Thread unk1102
Is the Spark DataFrame limit function an action or a transformation? It returns a DataFrame, so it should be a transformation, but it executes the entire DAG, so I think it is an action. The same goes for the persist function. Please guide. Thanks in advance.

Best practices on how to handle multiple Spark sessions

2018-09-16 Thread unk1102
Hi, I have an application which serves as an ETL job, and I have hundreds of such ETL jobs which run daily. As of now I have just one Spark session which is shared by all these jobs, and sometimes all of these jobs run at the same time, causing the Spark session to die, mostly due to memory issues. Is this a go…