HBASE

2016-03-09 Thread Ashok Kumar
Hi Gurus, I am relatively new to Big Data and know a little about Spark and Hive. I was wondering whether I need to pick up skills on HBase as well. I am not sure how it works but know that it is a kind of columnar NoSQL database. I know it is good to know something new in the Big Data space. Just wondering if

Spark configuration with 5 nodes

2016-03-10 Thread Ashok Kumar
Hi, We intend to use 5 servers which will be utilized for building a Big Data Hadoop data warehouse system (not using any proprietary distribution like Hortonworks or Cloudera or others). All server configurations are 512GB RAM, 30TB storage and 16 cores, Ubuntu Linux servers. Hadoop will be installed

shuffle in spark

2016-03-14 Thread Ashok Kumar
Experts, please, I need to understand how shuffling works in Spark and which parameters influence it. I am sorry but my knowledge of shuffling is very limited. I need a practical use case if you can. regards
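
A minimal sketch of an operation that forces a shuffle, with the knobs that most influence it (spark-shell is assumed, so sc is in scope; the file path is hypothetical):

    // reduceByKey repartitions records by key, moving data across the network
    val counts = sc.textFile("input.txt")        // hypothetical path
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _, 200)                   // 200 = number of shuffle (reduce-side) partitions

    // settings that influence shuffle behaviour include spark.shuffle.compress,
    // spark.reducer.maxSizeInFlight, and spark.sql.shuffle.partitions for SQL joins/aggregations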

Setting up spark to run on two nodes

2016-03-19 Thread Ashok Kumar
Experts, please, your valued advice. I have Spark 1.5.2 set up as standalone for now and I have started the master as below: start-master.sh. I have also modified the conf/slaves file to have # A Spark Worker will be started on each of the machines listed below. localhost workerhost On the localhost I

reading csv file, operation on column or columns

2016-03-20 Thread Ashok Kumar
Gurus, I would like to read a CSV file into a DataFrame, but be able to rename a column, change a column type from String to Integer, or drop a column from further analysis before saving the data as a Parquet file? Thanks
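
A minimal sketch of all three steps, assuming Spark 1.x with the spark-csv package; the paths and column names are hypothetical:

    import org.apache.spark.sql.functions.col

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("/data/input.csv")                    // hypothetical path
    val cleaned = df
      .withColumnRenamed("old_name", "new_name")  // rename a column
      .withColumn("qty", col("qty").cast("int"))  // change type String -> Integer
      .drop("unwanted_col")                       // drop from further analysis
    cleaned.write.parquet("/data/output.parquet")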

calling individual columns from spark temporary table

2016-03-23 Thread Ashok Kumar
Gurus, If I register a temporary table as below: r.toDF res58: org.apache.spark.sql.DataFrame = [_1: string, _2: string, _3: double, _4: double, _5: double] r.toDF.registerTempTable("items") sql("select * from items") res60: org.apache.spark.sql.DataFrame = [_1: string, _2: string, _3: double, _4:

Re: calling individual columns from spark temporary table

2016-03-23 Thread Ashok Kumar
.(firstcolumn) in above when mapping if possible so columns will have labels On Thursday, 24 March 2016, 0:18, Michael Armbrust wrote: You probably need to use `backticks` to escape `_1` since I don't think that it's a valid SQL identifier. On Wed, Mar 23, 2016 at 5:10 PM, Ash

Re: calling individual columns from spark temporary table

2016-03-23 Thread Ashok Kumar
reason is that for maps, we have to actually materialize an object to pass to your function. However, if you stick to column expressions we can actually work directly on serialized data. On Wed, Mar 23, 2016 at 5:27 PM, Ashok Kumar wrote: thank you sir sql("select `_1` as firstcolumn fro
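
A minimal sketch of the two approaches from this thread: escape the positional name with backticks, or label the columns when converting to a DataFrame (the labels below are illustrative):

    // backticks turn _1 into a valid SQL identifier
    sql("select `_1` as firstcolumn from items").show()

    // or give the columns names up front when calling toDF
    val labelled = r.toDF("firstcolumn", "secondcolumn", "col3", "col4", "col5")
    labelled.registerTempTable("items2")
    sql("select firstcolumn from items2").show()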

Finding out the time a table was created

2016-03-25 Thread Ashok Kumar
Experts, I would like to know how to find out when a table was created in a Hive database using the Spark shell. Thanks

Re: Finding out the time a table was created

2016-03-25 Thread Ashok Kumar
umns:    []   ] [Sort Columns:  []   ] [Storage Desc Params:    ] [   serialization.format    1   ] HTH Dr Mich Talebzadeh LinkedIn   https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw  http://talebzade
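
A minimal sketch of the approach in the reply: Hive's DESCRIBE FORMATTED output includes a CreateTime field (a HiveContext and a hypothetical table name are assumed):

    // scan the formatted description for the CreateTime line
    sqlContext.sql("DESCRIBE FORMATTED mytable").collect().foreach(println)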

Re: Databricks fails to read the csv file with blank line at the file header

2016-03-28 Thread Ashok Kumar
Hello Mich, If you can accommodate, can you please share your approach to steps 1-3 above. Best regards On Sunday, 27 March 2016, 14:53, Mich Talebzadeh wrote: Pretty simple; as usual it is a combination of ETL and ELT. Basically csv files are loaded into a staging directory on the host, compresse

Re: Databricks fails to read the csv file with blank line at the file header

2016-03-28 Thread Ashok Kumar
"=== Checking that all files are moved to hdfs staging directory" hdfs dfs -ls ${DIR} exit 0 HTH Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 28 March 2016 at 22:24, Ashok Kumar wro

Spark and N-tier architecture

2016-03-29 Thread Ashok Kumar
Experts, One of the terms I hear within Big Data is N-tier architecture, used for availability, performance etc. I also hear that Spark, by means of its query engine and in-memory caching, fits into the middle tier (application layer), with HDFS and Hive possibly providing the data tier. Can someone

Re: Spark and N-tier architecture

2016-03-29 Thread Ashok Kumar
concept or metastore. Thus it relies on others to provide that service. HTH Dr Mich Talebzadeh LinkedIn   https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw  http://talebzadehmich.wordpress.com  On 29 March 2016 at 22:07, Ashok Kumar wrote: Experts, One of terms used

Spark process creating and writing to a Hive ORC table

2016-03-31 Thread Ashok Kumar
Hello, How feasible is it to use Spark to extract CSV files and create and write the content to an ORC table in a Hive database? Is a Parquet file the best (optimum) format to write to HDFS from a Spark app? Thanks
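
A minimal sketch of the CSV-to-ORC flow, assuming a HiveContext (required for Hive tables), the spark-csv package and hypothetical names:

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("/staging/input.csv")                  // hypothetical path
    // registers an ORC-backed table in the Hive metastore
    df.write.format("orc").saveAsTable("mydb.my_orc_table")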

Working out SQRT on a list

2016-04-02 Thread Ashok Kumar
Hi, I would like a simple sqrt operation on a list but I don't get the result scala> val l = List(1,5,786,25) l: List[Int] = List(1, 5, 786, 25) scala> l.map(x => x * x) res42: List[Int] = List(1, 25, 617796, 625) scala> l.map(x => x * x).sqrt <console>:28: error: value sqrt is not a member of List[Int]
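
A minimal sketch: sqrt is not a method on List[Int], so map math.sqrt over the elements instead:

    val l = List(1, 5, 786, 25)
    val roots = l.map(x => math.sqrt(x))  // Int widens to Double element-wise
    // List(1.0, 2.23606..., 28.03569..., 5.0)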

difference between simple streaming and windows streaming in spark

2016-04-07 Thread Ashok Kumar
Does simple streaming mean continuous streaming, and does windowed streaming mean a time window? val ssc = new StreamingContext(sparkConf, Seconds(10)) thanks
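
A minimal sketch contrasting the plain batch interval with a windowed view over the same stream (the source, and the window and slide durations, are illustrative; both must be multiples of the batch interval):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sparkConf, Seconds(10))  // "simple": one RDD per 10s batch
    val lines = ssc.socketTextStream("localhost", 9999)     // hypothetical source
    // "windowed": each RDD covers the last 60s of data, recomputed every 20s
    val windowed = lines.window(Seconds(60), Seconds(20))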

Copying all Hive tables from Prod to UAT

2016-04-08 Thread Ashok Kumar
Hi, Does anyone have suggestions on how to create and copy Hive and Spark tables from Production to UAT? One way would be to copy table data to external files, then move the external files to a local target directory and populate the tables in the target Hive with the data. Is there an easier way of doing so?

Spark GUI, Workers and Executors

2016-04-09 Thread Ashok Kumar
On the Spark GUI I can see the list of Workers. I always understood that workers are used by executors. What is the relationship between workers and executors please? Is it one to one? Thanks

Spark replacing Hadoop

2016-04-14 Thread Ashok Kumar
Hi, I hear some saying that Hadoop is getting old and out of date and will be replaced by Spark! Does this make sense and if so how accurate is it? Best

Re: Spark replacing Hadoop

2016-04-14 Thread Ashok Kumar
?   David   From: Ashok Kumar [mailto:ashok34...@yahoo.com.INVALID] Sent: Thursday, April 14, 2016 2:13 PM To: User Subject: Spark replacing Hadoop   Hi,   I hear that some saying that Hadoop is getting old and out of date and will be replaced by Spark!   Does this make sense and if so how

Invoking SparkR from Spark shell

2016-04-20 Thread Ashok Kumar
Hi, I have Spark 1.6.1 but I do not know how to invoke SparkR so I can use R with Spark. Is there a shell similar to spark-shell that supports R besides Scala please? Thanks

Re: Defining case class within main method throws "No TypeTag available for Accounts"

2016-04-25 Thread Ashok Kumar
Thanks Michael as I gathered for now it is a feature. On Monday, 25 April 2016, 18:36, Michael Armbrust wrote: When you define a class inside of a method, it implicitly has a pointer to the outer scope of the method.  Spark doesn't have access to this scope, so this makes it hard (imp

Re: Re: How big the spark stream window could be ?

2016-05-09 Thread Ashok Kumar
hi, so if i have 10gb of streaming data coming in, does it require 10gb of memory in each node? also in that case why do we need to use dstream.cache()? thanks On Monday, 9 May 2016, 9:58, Saisai Shao wrote: It depends on you to write the Spark application; normally if data is already on

Re: Re: How big the spark stream window could be ?

2016-05-09 Thread Ashok Kumar
understanding you don't need to call cache() again. On Mon, May 9, 2016 at 5:06 PM, Ashok Kumar wrote: hi, so if i have 10gb of streaming data coming in does it require 10gb of memory in each node? also in that case why do we need using dstream.cache() thanks On Monday, 9 May 2016,

Re: Re: How big the spark stream window could be ?

2016-05-09 Thread Ashok Kumar
2016, 10:49, Saisai Shao wrote: Pease see the inline comments. On Mon, May 9, 2016 at 5:31 PM, Ashok Kumar wrote: Thank you. So If I create spark streaming then - The streams will always need to be cached? It cannot be stored in persistent storage You don't need to cache

Re: My notes on Spark Performance & Tuning Guide

2016-05-12 Thread Ashok Kumar
Hi Dr Mich, I will be very keen to have a look at it and review if possible. Please forward me a copy Thanking you warmly On Thursday, 12 May 2016, 11:08, Mich Talebzadeh wrote: Hi Al,, Following the threads in spark forum, I decided to write up on configuration of Spark including all

Spark handling spill overs

2016-05-12 Thread Ashok Kumar
Hi, How can one avoid having Spark spill over after filling a node's memory? Thanks

Monitoring Spark application progress

2016-05-16 Thread Ashok Kumar
Hi, I would like to know the approach and tools, please, to get the full performance picture for a Spark app running through spark-shell and spark-submit - Through the Spark GUI at 4040? - Through OS utilities top, SAR - Through Java tools like jbuilder etc - Through integrating Spark with moni

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-23 Thread Ashok Kumar
Hi Dr Mich, This is very good news. I will be interested to know how Hive engages with Spark as an engine. What Spark processes are used to make this work?  Thanking you On Monday, 23 May 2016, 19:01, Mich Talebzadeh wrote: Have a look at this thread Dr Mich Talebzadeh LinkedIn   https

Using Java in Spark shell

2016-05-24 Thread Ashok Kumar
Hello, A newbie question: Is it possible to use Java code directly in the Spark shell without using Maven to build a jar file? How can I switch from Scala to Java in the Spark shell? Thanks

Does Spark support updates or deletes on underlying Hive tables

2016-05-30 Thread Ashok Kumar
Hi, I can do inserts from Spark on Hive tables. How about updates or deletes? They failed when I tried. Thanking

processing twitter data

2016-05-31 Thread Ashok Kumar
hi all, i know very little about the subject. we would like to get streaming data from twitter and facebook. so questions please may i - what format is data from twitter, is it JSON format - can i use spark and spark streaming for analyzing data - can data be fed in/streamed via kafka

Basic question on using one's own classes in the Scala app

2016-06-05 Thread Ashok Kumar
Hi all, Appreciate any advice on this. It is about Scala. I have created a very basic Utilities.scala that contains a test class and method. I intend to add my own classes and methods as I expand, and make references to these classes and methods in my other apps class getCheckpointDirectory {  def

Re: Basic question on using one's own classes in the Scala app

2016-06-05 Thread Ashok Kumar
org.apache.log4j.Logger import org.apache.log4j.Level import ? Thanks On Sunday, 5 June 2016, 15:21, Ted Yu wrote: At compilation time, you need to declare the dependence on getCheckpointDirectory. At runtime, you can use '--jars utilities-assembly-0.1-SNAPSHOT.jar' to pas

Re: Basic question on using one's own classes in the Scala app

2016-06-05 Thread Ashok Kumar
on the net e.g. http://www.scala-sbt.org/0.13/docs/Scala-Files-Example.html For #2, import . getCheckpointDirectory Cheers On Sun, Jun 5, 2016 at 8:36 AM, Ashok Kumar wrote: Thank you sir. At compile time can I do something similar to libraryDependencies += "org.apache.spark" %% "spar

Re: Basic question on using one's own classes in the Scala app

2016-06-05 Thread Ashok Kumar
cala/TwitterAnalyzer/build.sbt#L18-19)[warn]            +- scala:scala_2.10:1.0sbt.ResolveException: unresolved dependency: com.databricks#apps.twitter_classifier;1.0.0: not found Any ideas? regards, On Sunday, 5 June 2016, 22:22, Jacek Laskowski wrote: On Sun, Jun 5, 2016 at 9:01 PM

Fw: Basic question on using one's own classes in the Scala app

2016-06-06 Thread Ashok Kumar
Can anyone help me with this please? On Sunday, 5 June 2016, 11:06, Ashok Kumar wrote: Hi all, Appreciate any advice on this. It is about Scala. I have created a very basic Utilities.scala that contains a test class and method. I intend to add my own classes and methods as I expand and

Running Spark in Standalone or local modes

2016-06-11 Thread Ashok Kumar
Hi, What is the difference between running Spark in local mode or standalone mode? Are they the same? If they are not, which is best suited for non-prod work? I am also aware that one can run Spark in YARN mode as well. Thanks

Re: Running Spark in Standalone or local modes

2016-06-11 Thread Ashok Kumar
er of launching a standalone cluster either manually (by starting a master and workers by hand), or by using the launch scripts provided with the Spark package. You can find more on this here. HTH Tariq, Mohammad about.me/mti On Sat, Jun 11, 2016 at 11:38 PM, Ashok

Re: Running Spark in Standalone or local modes

2016-06-12 Thread Ashok Kumar
cluster managements. Local mode is actually a standalone mode in which everything runs on a single local machine instead of a remote cluster. That is my understanding. On Sat, Jun 11, 2016 at 12:40 PM, Ashok Kumar wrote: Thank you, grateful. I know I can start spark-shell by launching the

Running Spark in local mode

2016-06-19 Thread Ashok Kumar
Hi, I have been told Spark in local mode is simplest for testing. The Spark documentation covers little on local mode except the cores used in --master local[k]. Where are the driver program, executors and resources? Do I need to start worker threads, and how many apps can I use safely without exceeding

Re: Running Spark in local mode

2016-06-19 Thread Ashok Kumar
and one executor with `k` threads. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/local/LocalSchedulerBackend.scala#L94 // maropu On Sun, Jun 19, 2016 at 5:39 PM, Ashok Kumar wrote: Hi, I have been told Spark in local mode is simplest for testing

Re: Running Spark in local mode

2016-06-19 Thread Ashok Kumar
Thank you all sirs. Appreciated, Mich, your clarification. On Sunday, 19 June 2016, 19:31, Mich Talebzadeh wrote: Thanks Jonathan for your points. I am aware of the fact that yarn-client and yarn-cluster are both deprecated (they still work in 1.6.1), hence the new nomenclature. Bear in mind this

JDBC load into tempTable

2016-06-20 Thread Ashok Kumar
Hi, I have a SQL Server table with 500,000,000 rows with a primary key (unique clustered index) on the ID column. If I load it through JDBC into a DataFrame and register it via registerTempTable, will the data be in the order of ID in the tempTable? Thanks

latest version of Spark to work OK as Hive engine

2016-07-02 Thread Ashok Kumar
Hi, Looking at this presentation, Hive on Spark is Blazing Fast... Which latest version of Spark can run as an engine for Hive please? Thanks P.S. I am aware of Hive on TEZ but that is not what I am interested in here please. Warmest regards

ORC or parquet with Spark

2016-07-03 Thread Ashok Kumar
With Spark caching, which file format is best to use, Parquet or ORC? Obviously ORC can be used with Hive. My question is whether Spark can use the various file, stripe and rowset statistics stored in an ORC file. Otherwise to me both Parquet and ORC are simply files kept on HDFS. They do not offer any cachin

Spark as sql engine on S3

2016-07-07 Thread Ashok Kumar
Hello gurus, We are storing data externally on Amazon S3 What is the optimum or best way to use Spark as SQL engine to access data on S3? Any info/write up will be greatly appreciated. Regards

Re: Presentation in London: Running Spark on Hive or Hive on Spark

2016-07-07 Thread Ashok Kumar
Thanks. Will this presentation be recorded as well? Regards On Wednesday, 6 July 2016, 22:38, Mich Talebzadeh wrote: Dear forum members, I will be presenting on the topic of "Running Spark on Hive or Hive on Spark, your mileage varies" in Future of Data: London. Details Organized by: Horton

Re: Spark as sql engine on S3

2016-07-07 Thread Ashok Kumar
jdbc tool like squirrel On Fri, Jul 8, 2016 at 3:50 AM, Ashok Kumar wrote: Hello gurus, We are storing data externally on Amazon S3 What is the optimum or best way to use Spark as SQL engine to access data on S3? Any info/write up will be greatly appreciated. Regards -- Best Regards, Ayan

Re: Spark as sql engine on S3

2016-07-08 Thread Ashok Kumar
is that using Spark SQL will be much faster? regards On Friday, 8 July 2016, 6:30, ayan guha wrote: Yes, it can.  On Fri, Jul 8, 2016 at 3:03 PM, Ashok Kumar wrote: thanks so basically Spark Thrift Server runs on a port much like beeline that uses JDBC to connect to Hive? Can Spark thr

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Ashok Kumar
Hi Mich, Your recent presentation in London was on this topic, "Running Spark on Hive or Hive on Spark". Have you made any more interesting findings that you would like to bring up? If Hive is offering both Spark and Tez in addition to MR, what is stopping one from using Spark? I still don't get why TEZ + LLAP

Re: Fast database with writes per second and horizontal scaling

2016-07-11 Thread Ashok Kumar
Any expert advice warmly acknowledged. Thanking you On Monday, 11 July 2016, 17:24, Ashok Kumar wrote: Hi Gurus, Advice appreciated from Hive gurus. My colleague has been using Cassandra. However, he says it is too slow and not user friendly. MongoDB as a doc database is pretty neat

Re: Presentation in London: Running Spark on Hive or Hive on Spark

2016-07-19 Thread Ashok Kumar
Thanks Mich looking forward to it :) On Tuesday, 19 July 2016, 19:13, Mich Talebzadeh wrote: Hi all, This will be in London tomorrow Wednesday 20th July starting at 18:00 hour for refreshments and kick off at 18:30, 5 minutes walk from Canary Wharf Station, Jubilee Line  If you wish y

The main difference use case between orderBY and sort

2016-07-29 Thread Ashok Kumar
Hi, In Spark programming I can use df.filter(col("transactiontype") === "DEB").groupBy("transactiondate").agg(sum("debitamount").cast("Float").as("Total Debit Card")).orderBy("transactiondate").show(5) or df.filter(col("transactiontype") === "DEB").groupBy("transactiondate").agg(sum("debitamount"
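
In the DataFrame API orderBy is an alias for sort, so the two calls below produce the same plan; a minimal sketch reusing the df from the post:

    import org.apache.spark.sql.functions.{col, sum}

    val totals = df.filter(col("transactiontype") === "DEB")
      .groupBy("transactiondate")
      .agg(sum("debitamount").cast("Float").as("Total Debit Card"))
    totals.orderBy("transactiondate").show(5)
    totals.sort("transactiondate").show(5)  // identical result and plan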

Windows operation orderBy desc

2016-08-01 Thread Ashok Kumar
Hi, in the following Window spec I want orderBy("") to be displayed in descending order please: val W = Window.partitionBy("col1").orderBy("col2") If I do val W = Window.partitionBy("col1").orderBy("col2".desc) It throws an error: <console>:26: error: value desc is not a member of String How can I ac
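
A minimal sketch: desc is defined on Column, not on String, so wrap the name in col() or use the desc() function:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, desc}

    val W  = Window.partitionBy("col1").orderBy(col("col2").desc)
    val W2 = Window.partitionBy("col1").orderBy(desc("col2"))  // equivalent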

num-executors, executor-memory and executor-cores parameters

2016-08-04 Thread Ashok Kumar
Hi, I would like to know the exact definitions of these three parameters: num-executors, executor-memory, executor-cores for local, standalone and yarn modes. I have looked at the online docs but am not convinced I understand them correctly. Thanking you
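
A minimal sketch of the same three flags expressed as Spark properties (values illustrative). Note that spark.executor.instances applies on YARN; standalone mode instead caps total cores via spark.cores.max, and local[k] is a single JVM where k only bounds task threads:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.instances", "4")  // --num-executors (YARN)
      .set("spark.executor.memory", "8g")    // --executor-memory: heap per executor JVM
      .set("spark.executor.cores", "2")      // --executor-cores: concurrent tasks per executor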

parallel processing with JDBC

2016-08-14 Thread Ashok Kumar
Hi Gurus, I have a few large tables in an RDBMS (ours is Oracle). We want to access these tables through Spark JDBC. What is the quickest way of getting data into a Spark DataFrame, say with multiple connections from Spark? thanking you

Re: parallel processing with JDBC

2016-08-14 Thread Ashok Kumar
ll in no case be liable for any monetary damages arising from such loss, damage or destruction. On 14 August 2016 at 20:50, Ashok Kumar wrote: Hi Gurus, I have a few large tables in an RDBMS (ours is Oracle). We want to access these tables through Spark JDBC. What is the quickest way of getting data

Re: parallel processing with JDBC

2016-08-14 Thread Ashok Kumar
m Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising f
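
For the question at the top of this thread, a minimal sketch of a partitioned JDBC read (the URL, bounds and partition count are illustrative; the partition column should be a reasonably uniform numeric key such as ID):

    val df = sqlContext.read.format("jdbc").options(Map(
      "url"             -> "jdbc:oracle:thin:@//dbhost:1521/ORCL",  // hypothetical URL
      "dbtable"         -> "BIG_TABLE",
      "partitionColumn" -> "ID",
      "lowerBound"      -> "1",
      "upperBound"      -> "100000000",
      "numPartitions"   -> "16"  // up to 16 concurrent connections, one per partition
    )).load()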

Spark standalone or Yarn for resourcing

2016-08-17 Thread Ashok Kumar
Hi, For small to medium size clusters I think Spark standalone mode is a good choice. We are contemplating moving to YARN as our cluster grows. What are the pros and cons of using either, please? Which one offers the best Thanking you

How to run Scala file examples in spark 1.5.2

2016-02-15 Thread Ashok Kumar
Gurus, I am trying to run some examples given under the directory examples: spark/examples/src/main/scala/org/apache/spark/examples/ I am trying to run HdfsTest.scala. However, when I run HdfsTest.scala against the spark shell it comes back with an error. Spark context available as sc. SQL context available a

Use case for RDD and Data Frame

2016-02-16 Thread Ashok Kumar
Gurus, What are the main differences between a Resilient Distributed Dataset (RDD) and a Data Frame (DF)? Where can one use an RDD without transforming it to a DF? Regards and obliged

How to get the code for class in spark

2016-02-19 Thread Ashok Kumar
Hi, If I define a class in Scala like case class (col1: String, col2: Int, ...) and it is created, how would I be able to see its description at any time? Thanks

Re: How to get the code for class in spark

2016-02-19 Thread Ashok Kumar
Hi, class body thanks On Friday, 19 February 2016, 11:23, Ted Yu wrote: Can you clarify your question ? Did you mean the body of your class ? On Feb 19, 2016, at 4:43 AM, Ashok Kumar wrote: Hi, If I define a class in Scala like case class(col1: String, col2:Int,...) and it is

install databricks csv package for spark

2016-02-19 Thread Ashok Kumar
Hi, I downloaded the zipped csv libraries from databricks/spark-csv (spark-csv - CSV data source for Spark SQL and DataFrames, on github.com). Now I have a directory created called spark-csv-master

Re: install databricks csv package for spark

2016-02-19 Thread Ashok Kumar
http://spark.apache.org/docs/latest/submitting-applications.html useful. On Fri, Feb 19, 2016 at 7:26 AM, Ashok Kumar wrote: Hi, I downloaded the zipped csv libraries from databricks/spark-csv (spark-csv - CSV data source for Spark SQL and

Execution plan in spark

2016-02-24 Thread Ashok Kumar
Gurus, Is there anything like explain in Spark to see the execution plan in functional programming? warm regards

Re: Execution plan in spark

2016-02-24 Thread Ashok Kumar
'    )""")     table("partitionedParquet").explain(true) On Wed, Feb 24, 2016 at 1:16 AM, Ashok Kumar wrote: Gurus, Is there anything like explain in Spark to see the execution plan in functional programming? warm regards

Filter on a column having multiple values

2016-02-24 Thread Ashok Kumar
Hi, I would like to do the following: select count(*) from <table> where column1 in (1,5). I define scala> var t = HiveContext.table("table") This works: t.filter($"column1" === 1) How can I expand this to have column1 for both 1 and 5 please? thanks
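
A minimal sketch using Column.isin, which expresses the IN (1,5) predicate (a sqlContext/HiveContext is assumed):

    import sqlContext.implicits._

    val t = sqlContext.table("table")
    val filtered = t.filter($"column1".isin(1, 5))  // column1 = 1 OR column1 = 5
    println(filtered.count())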

select * from mytable where column1 in (select max(column1) from mytable)

2016-02-25 Thread Ashok Kumar
Hi, What is the equivalent of this in Spark please select * from mytable where column1 in (select max(column1) from mytable) Thanks

d.filter("id in max(id)")

2016-02-25 Thread Ashok Kumar
Hi, How can I make that work? val d = HiveContext.table("table") select * from table where ID = MAX(ID) from table Thanks
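
A minimal sketch covering this and the previous thread: compute the max first, then filter on it, since Spark 1.x SQL did not support that subquery form (a HiveContext named sqlContext and illustrative names are assumed):

    import sqlContext.implicits._
    import org.apache.spark.sql.functions.max

    val d = sqlContext.table("mytable")
    val maxVal = d.agg(max("id")).first().get(0)  // the maximum ID, fetched to the driver
    val rows   = d.filter($"id" === maxVal)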

Clarification on RDD

2016-02-26 Thread Ashok Kumar
Hi, The Spark doco says Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs, for example: val textFile = sc.textFile("README.md") My question is when

Ordering two dimensional arrays of (String, Int) in the order of second element

2016-02-27 Thread Ashok Kumar
Hello, I would like to be able to solve this using arrays. I have a two dimensional array of (String, Int) with 5 entries, say arr("A",20), arr("B",13), arr("C",18), arr("D",10), arr("E",19). I would like to write a small piece of code to order these by the highest Int column so I will have arr("A",20), arr("E

Re: Ordering two dimensional arrays of (String, Int) in the order of second element

2016-02-27 Thread Ashok Kumar
gisterTempTable("test") scala> val df = sql("SELECT struct(id, b, a) from test order by b")df: org.apache.spark.sql.DataFrame = [struct(id, b, a): struct] scala> df.show++|struct(id, b, a)|+----+|       [2,foo,a]||      [1,test,b]|+--

Re: Ordering two dimensional arrays of (String, Int) in the order of second element

2016-02-27 Thread Ashok Kumar
no particular reason. just wanted to know if there was another way as well. thanks On Saturday, 27 February 2016, 22:12, Yin Yang wrote: Is there particular reason you cannot use temporary table ? Thanks On Sat, Feb 27, 2016 at 10:59 AM, Ashok Kumar wrote: Thank you sir. Can one do
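
For the original question, a minimal sketch staying with plain Scala arrays (no temporary table): sortBy on the negated second element orders descending:

    val arr = Array(("A", 20), ("B", 13), ("C", 18), ("D", 10), ("E", 19))
    val ordered = arr.sortBy(-_._2)
    // Array((A,20), (E,19), (C,18), (B,13), (D,10))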

Recommendation for a good book on Spark, beginner to moderate knowledge

2016-02-28 Thread Ashok Kumar
Hi Gurus, I would appreciate it if you could recommend me a good book on Spark, or documentation, for beginner to moderate knowledge. I very much like to skill myself on transformation and action methods. FYI, I have already looked at examples on the net. However, some of them are not clear, at least to me. Warmest regards

Re: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-02-29 Thread Ashok Kumar
Thank you all for valuable advice. Much appreciated Best On Sunday, 28 February 2016, 21:48, Ashok Kumar wrote:   Hi Gurus, Appreciate if you recommend me a good book on Spark or documentation for beginner to moderate knowledge I very much like to skill myself on transformation and

Converting array to DF

2016-03-01 Thread Ashok Kumar
Hi, I have this: val weights = Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6)) weights.toDF("weights","value") I want to convert the Array to a DF but I get this: weights: Array[(String, Int)] = Array((a,3), (b,2), (c,5), (d,1), (e,9), (f,4), (g,6)) :33: error: value to

Re: Converting array to DF

2016-03-01 Thread Ashok Kumar
")).collect.foreach(println) Please, why did Array not work? On Tuesday, 1 March 2016, 8:51, Jeff Zhang wrote: Change Array to Seq and import sqlContext.implicits._ On Tue, Mar 1, 2016 at 4:38 PM, Ashok Kumar wrote: Hi, I have this val weights = Array(("a", 3), ("b"

Re: Converting array to DF

2016-03-01 Thread Ashok Kumar
On Tuesday, 1 March 2016, 20:52, Shixiong(Ryan) Zhu wrote: For Array, you need to call `toSeq` first. Scala can convert Array to ArrayOps automatically. However, it's not a `Seq` and you need to call `toSeq` explicitly. On Tue, Mar 1, 2016 at 1:02 AM, Ashok Kumar wrote: Thank y
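
A minimal sketch of the fix from this thread: call toSeq on the Array and import the sqlContext implicits before toDF:

    import sqlContext.implicits._

    val weights = Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6))
    val df = weights.toSeq.toDF("weights", "value")
    df.show()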

New to Spark

2015-12-01 Thread Ashok Kumar
Hi, I am new to Spark. I am trying to use spark-sql with SPARK CREATED and HIVE CREATED tables. I have successfully made Hive metastore to be used by Spark. In spark-sql I can see the DDL for Hive tables. However, when I do select count(1) from HIVE_TABLE it always returns zero rows. If I create

Re: Managed to make Hive run on Spark engine

2015-12-07 Thread Ashok Kumar
This is great news, sir. It shows perseverance pays at last. Can you inform us when the write-up is ready so I can set it up as well please? I know a bit about the advantages of having Hive use the Spark engine. However, the general question I have is when one should use Hive on Spark as opposed to

Design patterns involving Spark

2016-08-28 Thread Ashok Kumar
Hi, There are design patterns that use Spark extensively. I am new to this area so I would appreciate it if someone could explain where Spark fits in, especially within faster or streaming use cases. What are the best practices involving Spark? Is it always best to deploy it as the processing engine? For ex

Difference between Data set and Data Frame in Spark 2

2016-09-01 Thread Ashok Kumar
Hi, What are the practical differences between the new Dataset in Spark 2 and the existing DataFrame? Has Dataset replaced DataFrame, and what advantages does it have if I use a Dataset instead of a DataFrame? Thanks
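
A minimal sketch (Spark 2.x, assuming a SparkSession named spark): DataFrame is just an alias for Dataset[Row], while a typed Dataset checks fields at compile time:

    import org.apache.spark.sql.{DataFrame, Dataset}
    import spark.implicits._

    case class Person(name: String, age: Int)
    val ds: Dataset[Person] = Seq(Person("alice", 30)).toDS()  // typed rows
    val df: DataFrame       = ds.toDF()                        // untyped Dataset[Row]
    ds.map(_.age + 1)   // field access checked at compile time
    df.select("age")    // column resolved only at runtime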

Splitting columns from a text file

2016-09-05 Thread Ashok Kumar
Hi, I have a text file as below that I read in: 74,20160905-133143,98.11218069128827594148 75,20160905-133143,49.52776998815916807742 76,20160905-133143,56.08029957123980984556 77,20160905-133143,46.636895265444075228,20160905-133143,84.8822714116440218155179,20160905-133143,68.72408602520662115000

Re: Splitting columns from a text file

2016-09-05 Thread Ashok Kumar
sformations like map. sc.textFile("filename").map(x => x.split(",")) On 5 Sep 2016 6:19 pm, "Ashok Kumar" wrote: Hi, I have a text file as below that I read in 74,20160905-133143,98. 1121806912882759414875,20160905-133143,49. 5277699881591680774276,20160905-133

Re: Splitting columns from a text file

2016-09-05 Thread Ashok Kumar
RDD of arrays. What is your expected outcome of the 2nd map? On Mon, Sep 5, 2016 at 11:30 PM, Ashok Kumar wrote: Thank you sir. This is what I get scala> textFile.map(x => x.split(",")) res52: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[27] at map at :27 How can I

Re: Splitting columns from a text file

2016-09-05 Thread Ashok Kumar
t it to your desired data type and then use filter.  On Tue, Sep 6, 2016 at 12:14 AM, Ashok Kumar wrote: Hi,I want to filter them for values. This is what is in array 74,20160905-133143,98. 11218069128827594148 I want to filter anything > 50.0 in the third column Thanks On Monday, 5 Septe
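
A minimal sketch pulling the thread together: split each line, cast the third column, then filter (the path is hypothetical):

    val textFile = sc.textFile("filename")  // hypothetical path
    val rows = textFile.map(_.split(","))   // RDD[Array[String]]
    val over50 = rows.filter(cols => cols(2).toDouble > 50.0)
    over50.take(5).foreach(cols => println(cols.mkString(",")))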

Getting figures from spark streaming

2016-09-06 Thread Ashok Kumar
Hello Gurus, I am creating some figures and feeding them into Kafka and then Spark Streaming. It works OK but I have the following issue. For now, as a test, I send 5 prices in each batch interval. In the loop code this is what is happening: dstream.foreachRDD { rdd => val x = rdd.count i +

Re: Getting figures from spark streaming

2016-09-07 Thread Ashok Kumar
Any help on this warmly appreciated. On Tuesday, 6 September 2016, 21:31, Ashok Kumar wrote: Hello Gurus, I am creating some figures and feed them into Kafka and then spark streaming. It works OK but I have the following issue. For now as a test I sent 5 prices in each batch interval

dstream.foreachRDD iteration

2016-09-07 Thread Ashok Kumar
Hi, A bit confusing to me: how many layers are involved in DStream.foreachRDD? Do I need to loop over it more than once? I mean DStream.foreachRDD { rdd => } I am trying to get individual lines in the RDD. Thanks

Re: dstream.foreachRDD iteration

2016-09-07 Thread Ashok Kumar
mages arising from such loss, damage or destruction. On 7 September 2016 at 11:39, Ashok Kumar wrote: Hi, A bit confusing to me: how many layers are involved in DStream.foreachRDD? Do I need to loop over it more than once? I mean DStream.foreachRDD { rdd => } I am trying to get individual lines in the RDD. Thanks
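
A minimal sketch for the question in this thread, assuming a DStream[String]: one level of foreachRDD is enough, because the batch RDD's own operations already reach the individual lines:

    dstream.foreachRDD { rdd =>
      // rdd holds this batch's data; no second loop over the DStream is needed
      rdd.foreach { line =>
        println(line)  // executes on the executors, once per line
      }
    }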

Spark Interview questions

2016-09-14 Thread Ashok Kumar
Hi, As a learner, I would appreciate it if you have typical Spark interview questions for Spark/Scala junior roles that you could please forward to me. I will be very obliged.

Re: Design considerations for batch and speed layers

2016-09-30 Thread Ashok Kumar
Can one design a fast pipeline with Kafka, Spark Streaming and HBase, or something similar? On Friday, 30 September 2016, 17:17, Mich Talebzadeh wrote: I have designed this prototype for a risk business. Here I would like to discuss issues with the batch layer. Apologies about being lo

Pros and cons of using different persistence layers for Spark

2016-10-03 Thread Ashok Kumar
What are the pros and cons of using different persistence layers for Spark, such as S3, Cassandra, and HDFS? Thanks

Re: Happy Diwali to those forum members who celebrate this great festival

2016-10-30 Thread Ashok Kumar
You are very kind Sir On Sunday, 30 October 2016, 16:42, Devopam Mittra wrote: +1 Thanks and regards Devopam On 30 Oct 2016 9:37 pm, "Mich Talebzadeh" wrote: Enjoy the festive season. Regards, Dr Mich Talebzadeh LinkedIn  https://www.linkedin.com/ profile/view?id= AAEWh2gBxianrbJd

High Availability/DR options for Spark applications

2017-02-05 Thread Ashok Kumar
Hello, What are the practiced High Availability/DR options for a Spark cluster at the moment? I am especially interested in the case where YARN is used as the resource manager. Thanks

Re: High Availability/DR options for Spark applications

2017-02-05 Thread Ashok Kumar
llow me at https://twitter.com/jaceklaskowski On Sun, Feb 5, 2017 at 10:11 AM, Ashok Kumar wrote: > Hello, > > What are the practiced High Availability/DR operations for Spark cluster at > the moment. I am specially interested if YARN is
