Write to same hdfs dir from multiple spark jobs

2020-07-29 Thread Deepak Sharma
Hi Is there any design pattern around writing to the same hdfs directory from multiple spark jobs? -- Thanks Deepak www.bigdatabig.com

Re: Edge AI with Spark

2020-09-24 Thread Deepak Sharma
Near edge would work in this case. On-edge doesn't make much sense, especially if it's a distributed processing framework such as Spark. On Thu, Sep 24, 2020 at 3:12 PM Gourav Sengupta wrote: > hi, > > its better to use lighter frameworks over edge. Some of the edge devices I > work on run at over

Re: Profiling spark application

2022-01-19 Thread Deepak Sharma
You can take a look at jvm profiler that was open sourced by uber: https://github.com/uber-common/jvm-profiler On Thu, Jan 20, 2022 at 11:20 AM Prasad Bhalerao < prasadbhalerao1...@gmail.com> wrote: > Hi, > > It will require code changes and I am looking at some third party code , I > am lookin

Re: A Persisted Spark DataFrame is computed twice

2022-01-30 Thread Deepak Sharma
coalesce returns a new dataset. That will cause the recomputation. Thanks Deepak On Sun, 30 Jan 2022 at 14:06, Benjamin Du wrote: > I have some PySpark code like below. Basically, I persist a DataFrame > (which is time-consuming to compute) to disk, call the method > DataFrame.count to trigger
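A minimal sketch of the pattern under discussion (the original thread is PySpark; Scala shown here), with a hypothetical expensive aggregation and made-up paths:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.storage.StorageLevel

  val spark = SparkSession.builder().appName("persist-then-write").getOrCreate()
  // stand-in for the thread's time-consuming computation
  val expensive = spark.read.parquet("/data/in").groupBy("key").count()

  val cached = expensive.persist(StorageLevel.DISK_ONLY)
  cached.count()                                   // materialises the persisted data

  cached.write.parquet("/data/out_all")            // reuses the persisted plan
  // coalesce returns a new Dataset with a different plan; the thread reports that
  // writing through it triggered the expensive computation a second time
  cached.coalesce(1).write.parquet("/data/out_single")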

Re: spark as data warehouse?

2022-03-25 Thread Deepak Sharma
It can be used as a warehouse, but then you have to keep long-running Spark jobs. This is possible using cached DataFrames or Datasets. Thanks Deepak On Sat, 26 Mar 2022 at 5:56 AM, wrote: > In the past time we have been using hive for building the data > warehouse. > Do you think if spark ca

Re: Will it lead to OOM error?

2022-06-22 Thread Deepak Sharma
It will spill to disk if everything can’t be loaded in memory . On Wed, 22 Jun 2022 at 5:58 PM, Sid wrote: > I have a 150TB CSV file. > > I have a total of 100 TB RAM and 100TB disk. So If I do something like this > > spark.read.option("header","true").csv(filepath).show(false) > > Will it lead

Spark Issue with Istio in Distributed Mode

2022-09-02 Thread Deepak Sharma
Hi All, In one of our clusters, we enabled Istio where Spark is running in distributed mode. Spark works fine when we run it with Istio in standalone mode. In Spark distributed mode, we are seeing that every 1 hour or so the workers are getting disassociated from the master and then the master is not able t

Re: Spark Issue with Istio in Distributed Mode

2022-09-03 Thread Deepak Sharma
at 12:17 AM Deepak Sharma > wrote: > >> Hi All, >> In 1 of our cluster , we enabled Istio where spark is running in >> distributed mode. >> Spark works fine when we run it with Istio in standalone mode. >> In spark distributed mode , we are seeing that every 1 hou

Re: Spark Issue with Istio in Distributed Mode

2022-09-11 Thread Deepak Sharma
oy-v3-api-field-config-core-v3-httpprotocoloptions-idle-timeout > > > On Sat, Sep 3, 2022 at 4:23 AM Deepak Sharma > wrote: > >> Thank for the reply IIan . >> Can we set this in spark conf or does it need to goto istio / envoy conf? >> >> >> >> On S

Re: Online classes for spark topics

2023-03-08 Thread Deepak Sharma
I can prepare some topics and present as well , if we have a prioritised list of topics already . On Thu, 9 Mar 2023 at 11:42 AM, Denny Lee wrote: > We used to run Spark webinars on the Apache Spark LinkedIn group > but > honestly

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Deepak Sharma
+1 . I can contribute to it as well . On Tue, 19 Mar 2024 at 9:19 AM, Code Tutelage wrote: > +1 > > Thanks for proposing > > On Mon, Mar 18, 2024 at 9:25 AM Parsian, Mahmoud > wrote: > >> Good idea. Will be useful >> >> >> >> +1 >> >> >> >> >> >> >> >> *From: *ashok34...@yahoo.com.INVALID >> *

Re: Steps to Run Spark Scala job from Oozie on EC2 Hadoop cluster

2016-03-07 Thread Deepak Sharma
There is a Spark action defined for Oozie workflows, though I am not sure whether it supports only Java Spark jobs or Scala jobs as well. https://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html Thanks Deepak On Mon, Mar 7, 2016 at 2:44 PM, Divya Gehlot wrote: > Hi, > > Could somebody help me b

Re: Detecting application restart when running in supervised cluster mode

2016-04-05 Thread Deepak Sharma
Hi Rafael If you are using yarn as the engine , you can always use RM UI to see the application progress. Thanks Deepak On Tue, Apr 5, 2016 at 12:18 PM, Rafael Barreto wrote: > Hello, > > I have a driver deployed using `spark-submit` in supervised cluster mode. > Sometimes my application would

LinkedIn streams in Spark

2016-04-10 Thread Deepak Sharma
Hello All, I am looking for a use case where anyone has used Spark Streaming integration with LinkedIn. -- Thanks Deepak

Re: How to start HDFS on Spark Standalone

2016-04-18 Thread Deepak Sharma
Once you download Hadoop and format the namenode, you can use start-dfs.sh to start HDFS. Then use 'jps' to see if the datanode/namenode services are up and running. Thanks Deepak On Mon, Apr 18, 2016 at 5:18 PM, My List wrote: > Hi , > > I am a newbie on Spark.I wanted to know how to start and ve

Re: How to start HDFS on Spark Standalone

2016-04-18 Thread Deepak Sharma
have to build it? > 3) Is there a basic tutorial for Hadoop on windows for the basic needs of > Spark. > > Thanks in Advance ! > > On Mon, Apr 18, 2016 at 5:35 PM, Deepak Sharma > wrote: > >> Once you download hadoop and format the namenode , you can use >> st

Re: How to start HDFS on Spark Standalone

2016-04-18 Thread Deepak Sharma
e I am starting afresh, what would you advice? > > On Mon, Apr 18, 2016 at 5:45 PM, Deepak Sharma > wrote: > >> Binary for Spark means it's Spark built against Hadoop 2.6 >> It will not have any hadoop executables. >> You'll have to setup hadoop separately. >>

Re: How to start HDFS on Spark Standalone

2016-04-18 Thread Deepak Sharma
un big data stuff on windows. Have run in so much of issues that I could > just throw the laptop with windows out. > > Your view - Redhat, Ubuntu or Centos. > Does Redhat give a one year licence on purchase etc? > > Thanks > > On Mon, Apr 18, 2016 at 5:52 PM, Deepak Sharma >

Processing millions of messages in milliseconds -- Architecture guide required

2016-04-18 Thread Deepak Sharma
Hi all, I am looking for an architecture to ingest 10 million messages in micro batches of seconds. If anyone has worked on a similar kind of architecture, can you please point me to any documentation around the same, like what should be the architecture, which all components/big data ecosystem

Re: migration from Teradata to Spark SQL

2016-05-03 Thread Deepak Sharma
Hi Tapan I would suggest an architecture where you have separate storage and data-serving layers. Spark is still best for batch processing of data. So what I am suggesting here is that you can have your data stored as-is in some HDFS raw layer, run your ELT in Spark on this raw data, and further

Re: Spark structured streaming is Micro batch?

2016-05-06 Thread Deepak Sharma
With Structured Streaming, Spark would provide APIs over the Spark SQL engine. It's like once you have the structured stream and a DataFrame created out of it, you can do ad-hoc querying on the DF, which means you are actually querying the stream without having to store or transform. I have not used it

Re: Adhoc queries on Spark 2.0 with Structured Streaming

2016-05-06 Thread Deepak Sharma
Spark 2.0 is yet to come out for public release. I am waiting to get hands on it as well. Please do let me know if i can download source and build spark2.0 from github. Thanks Deepak On Fri, May 6, 2016 at 9:51 PM, Sunita Arvind wrote: > Hi All, > > We are evaluating a few real time streaming q

Re: Cluster Migration

2016-05-10 Thread Deepak Sharma
Hi Ajay You can look at the wholeTextFiles method, which gives an RDD[(String, String)], and then map each RDD to saveAsTextFile. This will serve the purpose. I don't think anything like distcp exists by default in Spark. Thanks Deepak On 10 May 2016 11:27 pm, "Ajay Chander" wrote: > Hi Everyone, > > we are pla
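A rough sketch of that idea, with hypothetical namenode addresses and paths (for large volumes distcp is still the usual tool):

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("cluster-copy"))
  // (path, fileContent) pairs for every file under the source directory
  val files = sc.wholeTextFiles("hdfs://source-nn:8020/data/in")
  files.map { case (_, content) => content }
       .saveAsTextFile("hdfs://dest-nn:8020/data/in_copied")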

Re: Cluster Migration

2016-05-10 Thread Deepak Sharma
mpression codec on it, save the rdd to another Hadoop cluster? > > Thank you, > Ajay > > On Tuesday, May 10, 2016, Deepak Sharma wrote: > >> Hi Ajay >> You can look at wholeTextFiles method of rdd[string,string] and then map >> each of rdd to saveAsTextFile

Re: Setting Spark Worker Memory

2016-05-11 Thread Deepak Sharma
Since you are registering workers from the same node, do you have enough cores and RAM (in this case >= 9 cores and >= 24 GB) on this node (11.14.224.24)? Thanks Deepak On Wed, May 11, 2016 at 9:08 PM, شجاع الرحمن بیگ wrote: > Hi All, > > I need to set same memory and core for each worker on sa

Re: Graceful shutdown of spark streaming on yarn

2016-05-11 Thread Deepak Sharma
Hi Rakesh Did you try setting spark.streaming.stopGracefullyOnShutdown to true in your Spark configuration instance? If not, try this and let us know if it helps. Thanks Deepak On Thu, May 12, 2016 at 11:42 AM, Rakesh H (Marketing Platform-BLR) < rakes...@flipkart.com> wrote: > Issue i a
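For reference, a minimal sketch of where that flag goes; the app name and batch interval are just examples:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf()
    .setAppName("graceful-shutdown-example")
    .set("spark.streaming.stopGracefullyOnShutdown", "true")
  val ssc = new StreamingContext(conf, Seconds(10))
  // ... define the DStreams here ...
  ssc.start()
  ssc.awaitTermination()   // on shutdown, in-flight batches are completed before stopping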

Re: Graceful shutdown of spark streaming on yarn

2016-05-11 Thread Deepak Sharma
er$: VALUE -> 205 > 16/05/12 10:18:29 INFO processors.StreamJobRunner$: VALUE -> 206 > > > > > > > On Thu, May 12, 2016 at 11:45 AM Deepak Sharma > wrote: > >> Hi Rakesh >> Did you tried setting *spark.streaming.stopGracefu

Re: Graceful shutdown of spark streaming on yarn

2016-05-11 Thread Deepak Sharma
(Marketing Platform-BLR) < rakes...@flipkart.com> wrote: > Yes, it seems to be the case. > In this case executors should have continued logging values till 300, but > they are shutdown as soon as i do "yarn kill .." > > On Thu, May 12, 2016 at 12:11 PM Deepak Sharma

Re: Graceful shutdown of spark streaming on yarn

2016-05-12 Thread Deepak Sharma
dead and it shuts down abruptly. >> Could this issue be related to yarn? I see correct behavior locally. I >> did "yarn kill " to kill the job. >> >> >> On Thu, May 12, 2016 at 12:28 PM Deepak Sharma >> wrote: >> >>> This is happenin

Debug spark core and streaming programs in scala

2016-05-15 Thread Deepak Sharma
Hi I have a Scala program consisting of Spark Core and Spark Streaming APIs. Is there any open source tool that I can use to debug the program for performance reasons? My primary interest is to find the blocks of code that would be executed on the driver and what would go to the executors. Is there JMX ext

How to map values read from text file to 2 different set of RDDs

2016-05-22 Thread Deepak Sharma
Hi I am reading a text file with 16 fields. All the placeholders for the values of this text file have been defined in, say, 2 different case classes: Case1 and Case2. How do I map values read from the text file, so my function in Scala should be able to return 2 different RDDs, with each RDD of t

How to map values read from test file to 2 different RDDs

2016-05-23 Thread Deepak Sharma
Hi I am reading a text file with 16 fields. All the placeholders for the values of this text file have been defined in, say, 2 different case classes: Case1 and Case2. How do I map values read from the text file, so my function in Scala should be able to return 2 different RDDs, with each RDD of t

Re: Query related to spark cluster

2016-05-29 Thread Deepak Sharma
Hi Saurabh You can have hadoop cluster running YARN as scheduler. Configure spark to run with the same YARN setup. Then you need R only on 1 node , and connect to the cluster using the SparkR. Thanks Deepak On Mon, May 30, 2016 at 12:12 PM, Jörn Franke wrote: > > Well if you require R then you

Re: Accessing s3a files from Spark

2016-05-31 Thread Deepak Sharma
Hi Mayuresh Instead of s3a , have you tried the https:// uri for the same s3 bucket? HTH Deepak On Tue, May 31, 2016 at 4:41 PM, Mayuresh Kunjir wrote: > > > On Tue, May 31, 2016 at 5:29 AM, Steve Loughran > wrote: > >> which s3 endpoint? >> >> > ​I have tried both s3.amazonaws.com and s3-exte

Re: Spark_Usecase

2016-06-07 Thread Deepak Sharma
I am not sure if Spark provides any support for incremental extracts inherently. But you can maintain a file, e.g. extractRange.conf in HDFS, read the end range from it, and update it with the new end range from the Spark job before it finishes, so the relevant new ranges are used next time. On Tue,

Re: SAS_TO_SPARK_SQL_(Could be a Bug?)

2016-06-13 Thread Deepak Sharma
Hi Ajay Looking at spark code , i can see you used hive context. Can you try using sql context instead of hive context there? Thanks Deepak On Mon, Jun 13, 2016 at 10:15 PM, Ajay Chander wrote: > Hi Mohit, > > Thanks for your time. Please find my response below. > > Did you try the same with a

Re: Why so many parquet file part when I store data in Alluxio or File?

2016-06-30 Thread Deepak Sharma
for what I already set in Alluxio 512MB per block. > > > On Jul 1, 2016, at 11:01 AM, Deepak Sharma wrote: > > Before writing, coalesce your rdd to 1. > It will create only 1 output file. > Multiple part files happen as all your executors will be writing their > partit

Re: One map per folder in spark or Hadoop

2016-07-07 Thread Deepak Sharma
You have to distribute the files in some distributed file system like hdfs. Or else copy the files to all executors local file system and make sure to mention the file scheme in the URI explicitly. Thanks Deepak On Thu, Jul 7, 2016 at 7:13 PM, Balachandar R.A. wrote: > Hi > > Thanks for the cod

Re: Is there a way to dynamic load files [ parquet or csv ] in the map function?

2016-07-08 Thread Deepak Sharma
Yes .You can do something like this : .map(x=>mapfunction(x)) Thanks Deepak On 9 Jul 2016 9:22 am, "charles li" wrote: > > hi, guys, is there a way to dynamic load files within the map function. > > i.e. > > Can I code as bellow: > > > ​ > > thanks a lot. > ​ > > > -- > *___* > ​

Re: RDD for loop vs foreach

2016-07-12 Thread Deepak Sharma
Hi Phil I guess for() is executed on the driver while foreach() will execute it in parallel on the executors. You can try both without collecting the RDD: foreach in this case would print on the executors and you would not see anything on the driver console. Thanks Deepak On Tue, Jul 12, 2016 at 9:28 PM, ph
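A tiny sketch of the driver-vs-executor distinction, assuming an existing SparkContext sc:

  val rdd = sc.parallelize(1 to 5)

  rdd.foreach(x => println(x))          // runs on the executors; output lands in the executor logs
  for (x <- rdd.collect()) println(x)   // collect() pulls the data back; prints on the driver console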

Re: Storm HDFS bolt equivalent in Spark Streaming.

2016-07-19 Thread Deepak Sharma
In Spark Streaming, you have to decide the duration of the micro batches to run. Once you get the micro batch, transform it as per your logic and then you can use saveAsTextFiles on your final DStream to write it to HDFS. Thanks Deepak On 20 Jul 2016 9:49 am, wrote: *Dell - Internal Use - Confidentia
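A hedged sketch of that flow, with a made-up socket source, transformation, and output prefix:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val ssc = new StreamingContext(new SparkConf().setAppName("stream-to-hdfs"), Seconds(30))
  val lines = ssc.socketTextStream("source-host", 9999)       // stand-in for your real source
  val cleaned = lines.filter(_.nonEmpty).map(_.toLowerCase)   // your transformation logic
  cleaned.saveAsTextFiles("hdfs:///data/stream/out", "txt")   // one output directory per micro batch
  ssc.start()
  ssc.awaitTermination()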

Re: Is it good choice to use DAO to store results generated by spark application?

2016-07-19 Thread Deepak Sharma
I am using DAO in spark application to write the final computation to Cassandra and it performs well. What kinds of issues you foresee using DAO for hbase ? Thanks Deepak On 19 Jul 2016 10:04 pm, "Yu Wei" wrote: > Hi guys, > > > I write spark application and want to store results generated by

Re: What are using Spark for

2016-08-02 Thread Deepak Sharma
Yes. I am using Spark for ETL and I am sure there are a lot of other companies that are using Spark for ETL. Thanks Deepak On 2 Aug 2016 11:40 pm, "Rohit L" wrote: > Does anyone use Spark for ETL? > > On Tue, Aug 2, 2016 at 1:24 PM, Sonal Goyal wrote: > >> Hi Rohit, >> >> You can check the powered

Re: Spark jobs failing due to java.lang.OutOfMemoryError: PermGen space

2016-08-04 Thread Deepak Sharma
I am facing the same issue with Spark 1.5.2. If the file being processed by Spark is 10-12 MB in size, it throws out of memory, but if the same file is within the 5 MB limit, it runs fine. I am using a Spark configuration with 7GB of memory and 3 cores for executors in the cluster of 8 ex

Re: Spark jobs failing due to java.lang.OutOfMemoryError: PermGen space

2016-08-04 Thread Deepak Sharma
lmost everything i could after searching online. > > Any help from the mailing list would be appreciated. > > On Thu, Aug 4, 2016 at 7:43 AM, Deepak Sharma > wrote: > >> I am facing the same issue with spark 1.5.2 >> If the file size that's being processed by spa

Long running tasks in stages

2016-08-06 Thread Deepak Sharma
I am doing a join over 1 dataframe and an empty dataframe. The first dataframe has almost 50k records. This operation never returns and runs indefinitely. Is there any solution to get around this? -- Thanks Deepak www.bigdatabig.com www.keosha.net

Re: Is Spark right for my use case?

2016-08-08 Thread Deepak Sharma
Hi Danellis For point 1, Spark Streaming is something to look at. For point 2, you can create a DAO from Cassandra on each stream processing. This may be a costly operation, but to do real-time processing of data, you have to live with it. Point 3 is covered in point 2 above. Since you are sta

Re: What are the configurations needs to connect spark and ms-sql server?

2016-08-08 Thread Deepak Sharma
Hi Devi Please make sure the jdbc jar is in the spark classpath. With spark-submit , you can use --jars option to specify the sql server jdbc jar. Thanks Deepak On Mon, Aug 8, 2016 at 1:14 PM, Devi P.V wrote: > Hi all, > > I am trying to write a spark dataframe into MS-Sql Server.I have tried >
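For illustration, a sketch of writing an existing DataFrame df to SQL Server once the driver jar is on the classpath (e.g. spark-submit --jars /path/to/sqljdbc.jar); the URL, credentials, and table name are placeholders:

  import java.util.Properties

  val props = new Properties()
  props.setProperty("user", "spark_user")        // placeholder credentials
  props.setProperty("password", "secret")
  props.setProperty("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")

  val url = "jdbc:sqlserver://dbhost:1433;databaseName=analytics"   // placeholder host/db
  df.write.mode("append").jdbc(url, "dbo.results", props)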

Best practises around spark-scala

2016-08-08 Thread Deepak Sharma
Hi All, Can anyone please give any documents that may be there around spark-scala best practises? -- Thanks Deepak www.bigdatabig.com www.keosha.net

Re: Best practises around spark-scala

2016-08-08 Thread Deepak Sharma
ing links are good as I am using same. > > http://spark.apache.org/docs/latest/tuning.html > > https://spark-summit.org/2014/testing-spark-best-practices/ > > Regards, > Vaquar khan > > On 8 Aug 2016 10:11, "Deepak Sharma" wrote: > >> Hi All, >> Ca

Re: Spark join and large temp files

2016-08-08 Thread Deepak Sharma
Register your dataframes as temp tables and then try the join on the temp tables. This should resolve your issue. Thanks Deepak On Mon, Aug 8, 2016 at 11:47 PM, Ashic Mahtab wrote: > Hello, > We have two parquet inputs of the following form: > > a: id:String, Name:String (1.5TB) > b: id:String,
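A small sketch of that suggestion for the two DataFrames a and b described in the thread, assuming an existing sqlContext (Spark 1.x API, hence registerTempTable):

  a.registerTempTable("a")
  b.registerTempTable("b")
  val joined = sqlContext.sql(
    "SELECT a.id, a.Name AS a_name, b.Name AS b_name FROM a JOIN b ON a.id = b.id")
  joined.write.parquet("/data/joined")   // hypothetical output path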

Re: SPARK SQL READING FROM HIVE

2016-08-08 Thread Deepak Sharma
Can you please post the code snippet and the error you are getting ? -Deepak On 9 Aug 2016 12:18 am, "manish jaiswal" wrote: > Hi, > > I am not able to read data from hive transactional table using sparksql. > (i don't want read via hive jdbc) > > > > Please help. >

Use cases around image/video processing in spark

2016-08-10 Thread Deepak Sharma
Hi Is anyone using, or does anyone know about, a GitHub repo that can help me get started with image and video processing using Spark? The images/videos will be stored in S3 and I am planning to use S3 with Spark. In this case, how will Spark achieve distributed processing? Any code base or references is real

Re: Apache Spark toDebugString producing different output for python and scala repl

2016-08-15 Thread DEEPAK SHARMA
Also, one of the Cloudera slides says that the default number of partitions is 2; however, it's 1 (looking at the output of toDebugString). Appreciate any help. Thanks Deepak Sharma

Re: Autoscaling of Spark YARN cluster

2015-12-14 Thread Deepak Sharma
An approach I can think of is using Ambari Metrics Service(AMS) Using these metrics , you can decide upon if the cluster is low in resources. If yes, call the Ambari management API to add the node to the cluster. Thanks Deepak On Mon, Dec 14, 2015 at 2:48 PM, cs user wrote: > Hi Mingyu, > > I'

Re: Yarn application ID for Spark job on Yarn

2015-12-18 Thread Deepak Sharma
I have never tried this, but there are YARN client APIs that you can use in your Spark program to get the application id. Here is the link to the YARN client Java doc: http://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/yarn/client/api/YarnClient.html getApplications() is the method for your

Re: sparkR ORC support.

2016-01-05 Thread Deepak Sharma
Hi Sandeep I am not sure if ORC can be read directly in R. But there can be a workaround: first create a Hive table on top of the ORC files and then access the Hive table in R. Thanks Deepak On Tue, Jan 5, 2016 at 4:57 PM, Sandeep Khurana wrote: > Hello > > I need to read an ORC files in hdfs in R using

Re: sparkR ORC support.

2016-01-05 Thread Deepak Sharma
tarted, Spark operations need to be > re-executed. > > > Not sure what is causing this? Any leads or ideas? I am using rstudio. > > > > On Tue, Jan 5, 2016 at 5:35 PM, Deepak Sharma > wrote: > >> Hi Sandeep >> I am not sure if ORC can be read directly

Re: Newbie question

2016-01-07 Thread Deepak Sharma
Yes, you can do it unless the method is marked static/final. Most of the methods in SparkContext are marked static, so you definitely can't override them; otherwise, overriding would usually work. Thanks Deepak On Fri, Jan 8, 2016 at 12:06 PM, yuliya Feldman wrote: > Hello, > > I am new to Spark and

Best practises

2015-10-30 Thread Deepak Sharma
Hi I am looking for any blog/doc on developer best practices when using Spark. I have already looked at the tuning guide on spark.apache.org. Please do let me know if anyone is aware of any such resource. Thanks Deepak

Spark RDD cache persistence

2015-11-05 Thread Deepak Sharma
Hi All I am confused about RDD persistence in cache. If I cache an RDD, is it going to stay in memory even after the Spark program that created it completes execution? If not, how can I guarantee that the RDD is persisted in cache even after the program finishes execution? Thanks Deepak

Re: Spark RDD cache persistence

2015-11-05 Thread Deepak Sharma
an" wrote: > The cache gets cleared out when the job finishes. I am not aware of a way > to keep the cache around between jobs. You could save it as an object file > to disk and load it as an object file on your next job for speed. > On Thu, Nov 5, 2015 at 6:17 PM Deepak Sharma &

Re: Hive on Spark orc file empty

2015-11-16 Thread Deepak Sharma
Sai, I am a bit confused here. How are you using write with results? I am using Spark 1.4.1 and when I use write, it complains about write not being a member of DataFrame. error: value write is not a member of org.apache.spark.sql.DataFrame Thanks Deepak On Mon, Nov 16, 2015 at 4:10 PM, 张炜 wrote: >

Any role for volunteering

2015-12-04 Thread Deepak Sharma
Hi All Sorry for spamming your inbox. I am really keen to work on a big data project full time(preferably remote from India) , if not I am open to volunteering as well. Please do let me know if there is any such opportunity available -- Thanks Deepak

Re: Spark 2.0 - Join statement compile error

2016-08-22 Thread Deepak Sharma
Hi Subhajit Try this in your join: val df = sales_demand.join(product_master, sales_demand.$"INVENTORY_ITEM_ID" === product_master.$"INVENTORY_ITEM_ID", "inner") On Tue, Aug 23, 2016 at 2:30 AM, Subhajit Purkayastha wrote: > All, > > > > I have the following dat

Re: Spark 2.0 - Join statement compile error

2016-08-22 Thread Deepak Sharma
On Tue, Aug 23, 2016 at 10:32 AM, Deepak Sharma wrote: > val df = sales_demand.join(product_master, sales_demand.$"INVENTORY_ITEM_ID" === product_master.$"INVENTORY_ITEM_ID", "inner") Ignore the last statement. It sho

Re: Controlling access to hive/db-tables while using SparkSQL

2016-08-30 Thread Deepak Sharma
Is it possible to execute any query using SQLContext even if the DB is secured using roles or tools such as Sentry? Thanks Deepak On Tue, Aug 30, 2016 at 7:52 PM, Rajani, Arpan wrote: > Hi All, > > In our YARN cluster, we have setup spark 1.6.1 , we plan to give access to > all the end users/de

Re: Calling udf in Spark

2016-09-08 Thread Deepak Sharma
No, it's not required for UDFs. It's required when you convert from an RDD to a DF. Thanks Deepak On 8 Sep 2016 2:25 pm, "Divya Gehlot" wrote: > Hi, > > Is it necessary to import sqlContext.implicits._ whenever define and > call UDF in Spark. > > > Thanks, > Divya > > >

Re: Assign values to existing column in SparkR

2016-09-09 Thread Deepak Sharma
DataFrames are immutable in nature, so I don't think you can directly assign or change values on the column. Thanks Deepak On Fri, Sep 9, 2016 at 10:59 PM, xingye wrote: > I have some questions about assign values to a spark dataframe. I want to > assign values to an existing column of a spar
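The usual way around this is to derive a new DataFrame with the column replaced. A Scala sketch (the thread itself is SparkR, where withColumn plays the same role), assuming df has a numeric column named score:

  import org.apache.spark.sql.functions._

  // clamp negative scores to zero without mutating df
  val updated = df.withColumn("score", when(col("score") < 0, lit(0)).otherwise(col("score")))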

Re: Ways to check Spark submit running

2016-09-13 Thread Deepak Sharma
Use yarn-client mode and you can see the logs in the console after you submit. On Tue, Sep 13, 2016 at 11:47 AM, Divya Gehlot wrote: > Hi, > > Some how for time being I am unable to view Spark Web UI and Hadoop Web > UI. > Looking for other ways ,I can check my job is running fine apart from keep >

Re: how to specify cores and executor to run spark jobs simultaneously

2016-09-13 Thread Deepak Sharma
I am not sure about EMR, but it seems multi-tenancy is not enabled in your case. Multi-tenancy means all the applications have to be submitted to different queues. Thanks Deepak On Wed, Sep 14, 2016 at 11:37 AM, Divya Gehlot wrote: > Hi, > > I am on EMR cluster and My cluster configuration is as b

Re: Error trying to connect to Hive from Spark (Yarn-Cluster Mode)

2016-09-16 Thread Deepak Sharma
Hi Anupama To me it looks like an issue with the SPN with which you are trying to connect to hive2, i.e. hive@hostname. Are you able to connect to hive from spark-shell? Try getting the ticket using any other user's keytab, not the Hadoop services keytab, and then try running the spark submit. Thanks

Re: Convert RDD to JSON Rdd and append more information

2016-09-20 Thread Deepak Sharma
Enrich the RDDs first with more information and then map them to some case class, if you are using Scala. You can then use the Play API (play.api.libs.json.Writes/play.api.libs.json.Json) classes to convert the mapped case class to JSON. Thanks Deepak On Tue, Sep 20, 2016 at 6:42 PM, sujeet jog wro
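A hedged sketch of that approach; the case class, fields, and paths are made up for illustration, and an existing SparkContext sc is assumed:

  import play.api.libs.json.{Json, Writes}

  case class Reading(sensorId: String, value: Double, site: String)
  implicit val readingWrites: Writes[Reading] = Json.writes[Reading]

  val rawRdd = sc.parallelize(Seq(("s1", 1.0), ("s2", 2.0)))               // stand-in for the original RDD
  val enriched = rawRdd.map { case (id, v) => Reading(id, v, "site-a") }   // enrich with extra info
  enriched.map(r => Json.toJson(r).toString)                               // one JSON string per record
          .saveAsTextFile("hdfs:///data/readings_json")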

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
What is the message inflow ? If it's really high , definitely spark will be of great use . Thanks Deepak On Sep 29, 2016 19:24, "Ali Akhtar" wrote: > I have a somewhat tricky use case, and I'm looking for ideas. > > I have 5-6 Kafka producers, reading various APIs, and writing their raw > data

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >>> >>> >>> >>> http://talebzadehmich.wordpress.com >>> >>> >>> *Disclaimer:* Use it at your own risk. Any and all responsibility for >>> any loss, damage or destru

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >>> >>> >>> >>> http://talebzadehmich.wordpress.com >>> >>> >>> *Disclaimer:* Use it at your own risk. Any and all responsibility for >

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
t;>> >>>> >>>> >>>> >>>> >>>> >>>> Dr Mich Talebzadeh >>>> >>>> >>>> >>>> LinkedIn * >>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcP

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
t; Is there an advantage to that vs directly consuming from Kafka? Nothing > is > > being done to the data except some light ETL and then storing it in > > Cassandra > > > > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma > > wrote: > >> > >> Its bet

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
u're doing >> analytics, I wouldn't want to give up the ability to easily do ad-hoc >> aggregations without a lot of forethought. If you're worried about >> scaling, there are several options for horizontally scaling Postgres >> in particular. One of the current

Re: Filtering in SparkR

2016-10-03 Thread Deepak Sharma
Hi Yogesh You can try registering these 2 DFs as temporary tables and then execute the sql query. df1.registerTempTable("df1") df2.registerTempTable("df2") val rs = sqlContext.sql("SELECT a.* FROM df1 a, df2 b where a.id != b.id") Thanks Deepak On Mon, Oct 3, 2016 at 12:38 PM, Yogesh Vyas wrote:

Re: Writing/Saving RDD to HDFS using saveAsTextFile

2016-10-07 Thread Deepak Sharma
Hi Mahendra Did you try mapping the X case class members further to a String object and then saving the RDD[String]? Thanks Deepak On Oct 7, 2016 23:04, "Mahendra Kutare" wrote: > Hi, > > I am facing issue with writing RDD[X] to HDFS file path. X is a simple > case class with variable time
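A minimal sketch of that mapping, with a simplified stand-in for the thread's case class X and an assumed SparkContext sc:

  case class X(time: Long, value: String)   // simplified stand-in

  val rddOfX = sc.parallelize(Seq(X(1L, "a"), X(2L, "b")))
  val asText = rddOfX.map(x => s"${x.time},${x.value}")   // one delimited line per record
  asText.saveAsTextFile("hdfs:///data/x_out")             // hypothetical path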

Re: Optimized way to use spark as db to hdfs etl

2016-11-05 Thread Deepak Sharma
Hi Rohit You can use accumulators and increment one on every record processed. At the end, you can get the value of the accumulator on the driver, which will give you the count. HTH Deepak On Nov 5, 2016 20:09, "Rohit Verma" wrote: > I am using spark to read from database and write in hdfs as parquet file.
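A sketch of the accumulator idea (Spark 2.x API; df, the output path, and the formatting are placeholders). Note that accumulator values are only reliable after the action completes and can over-count if tasks are retried:

  val processed = spark.sparkContext.longAccumulator("records-processed")

  val lines = df.rdd.map { row =>
    processed.add(1)           // incremented on the executors
    row.mkString(",")
  }
  lines.saveAsTextFile("hdfs:///warehouse/out")
  println(s"Records written: ${processed.value}")   // read on the driver, after the action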

Re: Possible DR solution

2016-11-11 Thread Deepak Sharma
This is a waste of money, I guess. On Nov 11, 2016 22:41, "Mich Talebzadeh" wrote: > starts at $4,000 per node per year all inclusive. > > With discount it can be halved but we are talking a node itself so if you > have 5 nodes in primary and 5 nodes in DR we are talking about $40K already. > > HTH

Re: Possible DR solution

2016-11-11 Thread Deepak Sharma
r any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > On 11 November 2016 at 17:11, Deepak Shar

Re: what is the optimized way to combine multiple dataframes into one dataframe ?

2016-11-16 Thread Deepak Sharma
Can you try caching the individual dataframes and then union them? It may save you time. Thanks Deepak On Wed, Nov 16, 2016 at 12:35 PM, Devi P.V wrote: > Hi all, > > I have 4 data frames with three columns, > > client_id,product_id,interest > > I want to combine these 4 dataframes into one dat
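A small sketch of that suggestion for the four DataFrames described in the thread (union is the Spark 2.x name; on 1.x use unionAll):

  val frames = Seq(df1, df2, df3, df4).map(_.cache())
  val combined = frames.reduce(_ union _)
  combined.count()   // materialises the caches and the combined frame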

Re: Spark 2.0.2 , using DStreams in Spark Streaming . How do I create SQLContext? Please help

2016-11-30 Thread Deepak Sharma
In Spark 2.0+, SparkSession was introduced, which you can use to query Hive as well. Just make sure you create the SparkSession with the enableHiveSupport() option. Thanks Deepak On Thu, Dec 1, 2016 at 12:27 PM, shyla deshpande wrote: > I am Spark 2.0.2 , using DStreams because I need Cassandra Sink.
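A minimal sketch, with a hypothetical app name and Hive table:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("hive-from-spark")
    .enableHiveSupport()
    .getOrCreate()

  spark.sql("SELECT count(*) FROM mydb.events").show()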

Re: How to convert a unix timestamp column into date format(yyyy-MM-dd) ?

2016-12-05 Thread Deepak Sharma
This is how you can do it in scala: scala> val ts1 = from_unixtime($"ts", "yyyy-MM-dd") ts1: org.apache.spark.sql.Column = fromunixtime(ts,yyyy-MM-dd) scala> val finaldf = df.withColumn("ts1",ts1) finaldf: org.apache.spark.sql.DataFrame = [client_id: string, ts: string, ts1: string] scala> finald

Re: How to convert a unix timestamp column into date format(yyyy-MM-dd) ?

2016-12-05 Thread Deepak Sharma
01| |3bc61951-0f49-43b...|1477983725292|2016-11-01| |688acc61-753f-4a3...|1479899459947|2016-11-23| |5ff1eb6c-14ec-471...|1479901374026|2016-11-23| ++-+--+ Thanks Deepak On Mon, Dec 5, 2016 at 1:46 PM, Deepak Sharma wrote: > This is how you can do it

Re: How to convert a unix timestamp column into date format(yyyy-MM-dd) ?

2016-12-05 Thread Deepak Sharma
on, Dec 5, 2016 at 1:49 PM, Deepak Sharma wrote: > This is the correct way to do it. The timestamp that you mentioned was not > correct: > > scala> val ts1 = from_unixtime($"ts"/1000, "yyyy-MM-dd") > ts1: org.apache.spark.sql.Column = fromunixtime((ts / 1000),yyyy-MM

foreachPartition's operation is taking long to finish

2016-12-17 Thread Deepak Sharma
Hi All, I am iterating over a data frame's partitions using df.foreachPartition. On each row iteration, I am initializing a DAO to insert the row into Cassandra. Each of these iterations takes almost a minute and a half to finish. In my workflow, this is part of an action and 100 partitions are bei

Re: foreachPartition's operation is taking long to finish

2016-12-17 Thread Deepak Sharma
There are 8 worker nodes in the cluster . Thanks Deepak On Dec 18, 2016 2:15 AM, "Holden Karau" wrote: > How many workers are in the cluster? > > On Sat, Dec 17, 2016 at 12:23 PM Deepak Sharma > wrote: > >> Hi All, >> I am iterating over data frame&

Re: foreachPartition's operation is taking long to finish

2016-12-17 Thread Deepak Sharma
On Sun, Dec 18, 2016 at 2:26 AM, vaquar khan wrote: > select * from indexInfo; > Hi Vaquar I do not see CF with the name indexInfo in any of the cassandra databases. Thank Deepak -- Thanks Deepak www.bigdatabig.com www.keosha.net

Re: How to deal with string column data for spark mlib?

2016-12-20 Thread Deepak Sharma
You can read the source in a data frame. Then iterate over all rows with map and use something like below: df.map(x=>x(0).toString().toDouble) Thanks Deepak On Tue, Dec 20, 2016 at 3:05 PM, big data wrote: > our source data are string-based data, like this: > col1 col2 col3 ... > aaa bbb

Re: Location for the additional jar files in Spark

2016-12-27 Thread Deepak Sharma
Hi Mich You can copy the jar to a shared location and use the --jars command line argument of spark-submit. Whoever needs access to this jar can refer to the shared path and access it using the --jars argument. Thanks Deepak On Tue, Dec 27, 2016 at 3:03 PM, Mich Talebzadeh wrote: > When one runs i

Re: Location for the additional jar files in Spark

2016-12-27 Thread Deepak Sharma
> from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > On 27 December 2016 at 09:52, Deepak Sharma wrote: > >> Hi Mich

Re: Location for the additional jar files in Spark

2016-12-27 Thread Deepak Sharma
in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > On 27 December 2016 at 10:30, Deepak Sharma wrote: > >> It works for me with spark 1.6 (--jars) >> Please try this: >> ADD_JARS="<>" spark-shell

Re: Location for the additional jar files in Spark

2016-12-27 Thread Deepak Sharma
.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > > > *Disclaimer:* Use it at your own risk. Any and all res

Re: need a hive generic udf which also works on spark sql

2017-01-17 Thread Deepak Sharma
On the sqlContext or hiveSqlContext, you can register the function as a UDF as below: hiveSqlContext.udf.register("func_name", func(_: String)) Thanks Deepak On Wed, Jan 18, 2017 at 8:45 AM, Sirisha Cheruvu wrote: > Hey > > Can yu send me the source code of hive java udf which worked in spark sql >

Re: need a hive generic udf which also works on spark sql

2017-01-17 Thread Deepak Sharma
Did you try this with spark-shell? Please try this. $spark-shell --jars /home/cloudera/Downloads/genudnvl2.jar On the spark shell: val hc = new org.apache.spark.sql.hive.HiveContext(sc) ; hc.sql("create temporary function nexr_nvl2 as 'com.nexr.platform.hive.udf.GenericUDFNVL2'"); hc.sql("selec

Re: Spark ANSI SQL Support

2017-01-17 Thread Deepak Sharma
From the Spark documentation page: Spark SQL can now run all 99 TPC-DS queries. On Jan 18, 2017 9:39 AM, "Rishabh Bhardwaj" wrote: > Hi All, > > Does Spark 2.0 Sql support full ANSI SQL query standards? > > Thanks, > Rishabh. >
