Re: Re: how to run a bash shell in a distributed way in Spark

2015-05-24 Thread madhu phatak
Hi, You can use the pipe operator if you are running a shell or Perl script on some data. More information is on my blog. Regards, Madhukara Phatak http://datamantra.io/ On Mon, May 25, 2015 at 8:02 AM, wrote: > Thanks Akhil, > >your c
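A minimal sketch of the pipe approach, assuming `sc` is an existing SparkContext and the script is installed at the same path on every worker node (paths are illustrative):

    // each record of a partition is written to the script's stdin, one per line;
    // lines the script writes to stdout become the records of the resulting RDD
    val input = sc.textFile("hdfs:///data/input")
    val piped = input.pipe("/opt/scripts/process.sh")
    piped.saveAsTextFile("hdfs:///data/output")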

Talk on Deep dive into Spark Data source API

2015-06-30 Thread madhu phatak
Hi, Recently I gave a talk on how to create Spark data sources from scratch. A screencast is available on YouTube, along with slides and code. Please have a look if you are interested. http://blog.madhukaraphatak.com/anatomy-of-spark-datasource-api/ -- Regards, Madhukara Phatak http://dataman

Running mllib from R in Spark 1.4

2015-07-15 Thread madhu phatak
Hi, I have been playing with the SparkR API that was introduced in Spark 1.4. Can we use any MLlib functionality from R as of now? From the documentation it looks like we can only use SQL/DataFrame functionality at the moment. I know there is a separate SparkR project, but it does not seem

Re: Running mllib from R in Spark 1.4

2015-07-15 Thread madhu phatak
ra/browse/SPARK-6823 > > Best, > Burak > > On Wed, Jul 15, 2015 at 6:00 AM, madhu phatak > wrote: > >> Hi, >> I have been playing with Spark R API that is introduced in Spark 1.4 >> version. Can we use any mllib functionality from the R as of now?. From

Is Spark structured streaming micro-batch?

2016-05-06 Thread madhu phatak
Hi, As I was playing with the new structured streaming API, I noticed that Spark starts processing as and when the data appears. It no longer seems like micro-batch processing. Will Spark structured streaming be event-based processing? -- Regards, Madhukara Phatak http://datamantra.io/

Re: Is Spark structured streaming micro-batch?

2016-05-06 Thread madhu phatak
or transform. >> I have not used it yet but seems it will be like start streaming data >> from source as soon as you define it. >> >> Thanks >> Deepak >> >> >> On Fri, May 6, 2016 at 1:37 PM, madhu phatak >> wrote: >> >>> Hi, >>>

Talk on Deep dive into Spark DataFrame API

2015-08-05 Thread madhu phatak
Hi, Recently I gave a talk on a deep dive into the DataFrame API and the SQL Catalyst optimizer. A video is available on YouTube, along with slides and code. Please have a look if you are interested. http://blog.madhukaraphatak.com/anatomy-of-spark-dataframe-api/

Re: How to create SparkSession using SparkConf?

2017-04-28 Thread madhu phatak
SparkSession.builder.config() takes a SparkConf as a parameter, so you can use it to pass your SparkConf as-is. https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/SparkSession.Builder.html#config(org.apache.spark.SparkConf) On Fri, Apr 28, 2017 at 11:40 AM, Yanbo Liang wrote: > Streamin
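A minimal sketch of that usage (the app name and master are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    val conf = new SparkConf().setAppName("my-app").setMaster("local[*]")
    // builder.config(SparkConf) copies all entries from the existing conf
    val spark = SparkSession.builder().config(conf).getOrCreate()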

Time window on Processing Time

2017-08-28 Thread madhu phatak
Hi, As I am playing with structured streaming, I observed that the window function always requires a time column in the input data, which means it is based on event time. Is it possible to get the old Spark Streaming style window function based on processing time? I don't see any documentation on this. -- Regards, M

Re: Time window on Processing Time

2017-08-30 Thread madhu phatak
ark.sql.functions._ > > ds.withColumn("processingTime", current_timestamp()) > .groupBy(window("processingTime", "1 minute")) > .count() > > > On Mon, Aug 28, 2017 at 5:46 AM, madhu phatak > wrote: > >> Hi, >> As I am playing with structure
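A fuller sketch of the workaround quoted above, assuming `ds` is a streaming Dataset defined elsewhere:

    import org.apache.spark.sql.functions._

    val counts = ds
      .withColumn("processingTime", current_timestamp())    // stamp rows as they are processed
      .groupBy(window(col("processingTime"), "1 minute"))    // window over the synthetic column
      .count()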

Re: save a histogram to a file

2015-01-23 Thread madhu phatak
Hi, The histogram method returns plain Scala types, not an RDD, so you will not have saveAsTextFile. You can use the makeRDD method to make an RDD out of the data and save it with saveAsObjectFile: val hist = a.histogram(10) val histRDD = sc.makeRDD(Seq(hist)) histRDD.saveAsObjectFile(path) On Fri, Jan 23, 2015 at 5:37 AM, SK
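If a plain text file is preferred, one hedged alternative is to pair each bucket with its count and use saveAsTextFile, assuming `a` is an RDD[Double] and `path` is a writable location:

    // histogram(10) returns (Array[Double] of bucket boundaries, Array[Long] of counts)
    val (buckets, counts) = a.histogram(10)
    // zip pairs each bucket's lower boundary with its count (the extra upper boundary is dropped)
    val histRDD = sc.makeRDD(buckets.zip(counts))
    histRDD.saveAsTextFile(path)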

Re: Streaming: getting data from Cassandra based on input stream values

2015-01-23 Thread madhu phatak
okup some ids > in the table. So it doesn't make sense for me how I can put some values > from the streamRDD to the cassandra query (to "where" method). > > Greg > > > > On 1/23/15 1:11 AM, madhu phatak wrote: > > Hi, > Seems like you want to get user

why is generateJob a private API?

2015-03-15 Thread madhu phatak
Hi, I am trying to create a simple subclass of DStream. If I understand correctly, I should override compute for lazy operations and generateJob for actions. But when I try to override generateJob, it gives an error saying the method is private to the streaming package. Is my approach correct, or am I

Re: Need Advice about reading lots of text files

2015-03-15 Thread madhu phatak
Hi, Internally Spark uses the HDFS API to handle file data. Have a look at HAR files and the SequenceFile input format. There is more information in this Cloudera blog post. Regards, Madhukara Phatak http://datamantra.io/ On Sun, Mar 15, 2015 at 9:59 PM, Pat Fer
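As a sketch, if the many small files have already been packed into a SequenceFile of (filename, contents) pairs, it can be read back as a single RDD (the path is illustrative):

    import org.apache.hadoop.io.Text

    val packed = sc.sequenceFile("hdfs:///data/packed.seq", classOf[Text], classOf[Text])
      .map { case (name, contents) => (name.toString, contents.toString) }  // copy out of reused Writables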

MappedStream vs Transform API

2015-03-16 Thread madhu phatak
Hi, The current implementation of the map function in Spark Streaming looks as below: def map[U: ClassTag](mapFunc: T => U): DStream[U] = { new MappedDStream(this, context.sparkContext.clean(mapFunc)) } It creates an instance of MappedDStream, which is a subclass of DStream. The same function can
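For comparison, a sketch of the same behaviour expressed through the transform API, which works on each batch RDD (the helper name is hypothetical):

    import scala.reflect.ClassTag
    import org.apache.spark.streaming.dstream.DStream

    // map expressed as a per-batch RDD transformation instead of a dedicated DStream subclass
    def mapViaTransform[T: ClassTag, U: ClassTag](stream: DStream[T], f: T => U): DStream[U] =
      stream.transform(rdd => rdd.map(f))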

Re: MappedStream vs Transform API

2015-03-16 Thread madhu phatak
and easy to understand. > > > > Thanks > > Jerry > > > > *From:* madhu phatak [mailto:phatak@gmail.com] > *Sent:* Monday, March 16, 2015 4:32 PM > *To:* user@spark.apache.org > *Subject:* MappedStream vs Transform API > > > > Hi, > >

Re: MappedStream vs Transform API

2015-03-17 Thread madhu phatak
rn about DStreams. > > TD > > On Mon, Mar 16, 2015 at 3:37 AM, madhu phatak > wrote: > >> Hi, >> Thanks for the response. I understand that part. But I am asking why the >> internal implementation using a subclass when it can use an existing api? >>

Re: why is generateJob a private API?

2015-03-17 Thread madhu phatak
> > On Sun, Mar 15, 2015 at 11:14 PM, madhu phatak > wrote: > >> Hi, >> I am trying to create a simple subclass of DStream. If I understand >> correctly, I should override *compute *lazy operations and *generateJob* >> for actions. But when I try to override, g

Re: MappedStream vs Transform API

2015-03-17 Thread madhu phatak
nto same situation. But if it's not super essential then ok. > > If you are interested in contributing to Spark Streaming, i can point you > to a number of issues where your contributions will be more valuable. > Yes please. > > TD > > On Tue, Mar 17, 2015 at 1

Re: MappedStream vs Transform API

2015-03-17 Thread madhu phatak
nto same situation. But if it's not super essential then ok. > > If you are interested in contributing to Spark Streaming, i can point you > to a number of issues where your contributions will be more valuable. > That will be great. > > TD > > On Tue, Mar 17,

Anatomy of RDD: Deep dive into RDD data structure

2015-03-31 Thread madhu phatak
Hi, Recently I gave a talk on the RDD data structure which gives an in-depth understanding of Spark internals. You can watch it on YouTube. The slides are on SlideShare and the code is on GitHub

Re: HiveContext setConf seems not stable

2015-04-22 Thread madhu phatak
Hi, Calling getConf doesn't solve the issue. Even many Hive-specific queries are broken. It seems like no Hive configurations are getting passed properly. Regards, Madhukara Phatak http://datamantra.io/ On Wed, Apr 22, 2015 at 2:19 AM, Michael Armbrust wrote: > As a workaround, can you call getCo

Re: Hive table creation - possible bug in Spark 1.3?

2015-04-23 Thread madhu phatak
Hi, Hive table creation needs an extra step from 1.3 onwards. You can follow this template: df.registerTempTable(tableName) hc.sql(s"create table $tableName as select * from $tableName") This will save the table in Hive with the given tableName. Regards, Madhukara Phatak http://datamantra
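A concrete sketch of that template, with hypothetical names chosen to keep the temp table and the Hive table distinct (`hc` is a HiveContext, `df` a DataFrame):

    df.registerTempTable("my_data_tmp")
    // CREATE TABLE ... AS SELECT materialises the temporary table as a Hive-managed table
    hc.sql("create table my_data as select * from my_data_tmp")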

Re: Hive table creation - possible bug in Spark 1.3?

2015-04-23 Thread madhu phatak
Hi Michael, Here <https://issues.apache.org/jira/browse/SPARK-7084> is the jira issue and PR <https://github.com/apache/spark/pull/5654> for the same. Please have a look. Regards, Madhukara Phatak http://datamantra.io/ On Thu, Apr 23, 2015 at 1:22 PM, madhu phatak wrote: > Hi

Re: Is the Spark-1.3.1 support build with scala 2.8 ?

2015-04-23 Thread madhu phatak
Hi, AFAIK it's only built with Scala 2.10 and 2.11. You should integrate kafka_2.10.0-0.8.0 to make it work. Regards, Madhukara Phatak http://datamantra.io/ On Fri, Apr 24, 2015 at 9:22 AM, guoqing0...@yahoo.com.hk < guoqing0...@yahoo.com.hk> wrote: > Is the Spark-1.3.1 support build with scala 2.

Spark JDBC data source API issue with mysql

2015-04-27 Thread madhu phatak
Hi, I have been trying out the Spark data source API with JDBC. The following is the code to get a DataFrame: Try(hc.load("org.apache.spark.sql.jdbc", Map("url" -> dbUrl, "dbtable" -> s"($query)"))) By looking at the test cases, I found that the query has to be inside brackets, otherwise it's treated as a table
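A sketch of the bracketed form with a hypothetical URL and query (the "as t" alias is an assumption on my part, since MySQL requires derived tables to be aliased):

    val dbUrl = "jdbc:mysql://localhost:3306/test?user=root&password=secret"
    val query = "select id, name from users where id < 100"
    // wrapping the query in brackets stops it being interpreted as a plain table name
    val df = hc.load("org.apache.spark.sql.jdbc",
      Map("url" -> dbUrl, "dbtable" -> s"($query) as t"))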

Spark SQL on large number of columns

2015-05-19 Thread madhu phatak
Hi, I am trying to run a Spark SQL aggregation on a file with 26k columns. The number of rows is very small. I am running into an issue where Spark takes a huge amount of time to parse the SQL and create a logical plan. Even if I have just one row, it takes more than 1 hour just to get past the parsing. Any i

Re: Spark SQL on large number of columns

2015-05-19 Thread madhu phatak
, 2015 at 3:59 PM, ayan guha wrote: > can you kindly share your code? > > On Tue, May 19, 2015 at 8:04 PM, madhu phatak > wrote: > >> Hi, >> I am trying run spark sql aggregation on a file with 26k columns. No of >> rows is very small. I am running into issue that

Re: Spark SQL on large number of columns

2015-05-19 Thread madhu phatak
Hi, An additional piece of information: the table is backed by a CSV file which is read using spark-csv from Databricks. Regards, Madhukara Phatak http://datamantra.io/ On Tue, May 19, 2015 at 4:05 PM, madhu phatak wrote: > Hi, > I have fields from field_0 to field_26000. The query is sel

Re: Spark SQL on large number of columns

2015-05-19 Thread madhu phatak
> On Tue, May 19, 2015 at 8:04 PM, madhu phatak > wrote: > >> Hi, >> I am trying run spark sql aggregation on a file with 26k columns. No of >> rows is very small. I am running into issue that spark is taking huge >> amount of time to parse the sql and create a lo

Re: Spark SQL on large number of columns

2015-05-19 Thread madhu phatak
Hi, I tested calculating values for 300 columns. The analyser takes around 4 minutes to generate the plan. Is this normal? Regards, Madhukara Phatak http://datamantra.io/ On Tue, May 19, 2015 at 4:35 PM, madhu phatak wrote: > Hi, > I am using spark 1.3.1 > > > > > Regard

Re: Spark SQL on large number of columns

2015-05-19 Thread madhu phatak
e)), count(*) )* aggregateStats is UDF generating case class to hold the values. Regards, Madhukara Phatak http://datamantra.io/ On Tue, May 19, 2015 at 5:57 PM, madhu phatak wrote: > Hi, > Tested for calculating values for 300 columns. Analyser takes around 4 > minutes to generat

Re: Spark SQL on large number of columns

2015-05-19 Thread madhu phatak
gards, Madhukara Phatak http://datamantra.io/ On Tue, May 19, 2015 at 6:23 PM, madhu phatak wrote: > Hi, > Tested with HiveContext also. It also take similar amount of time. > > To make the things clear, the following is select clause for a given column > > > *aggregateStats(

Re: UNION two RDDs

2014-12-18 Thread madhu phatak
Hi, coalesce is an operation which changes the number of records in a partition. It will not touch the ordering of rows AFAIK. On Fri, Dec 19, 2014 at 2:22 AM, Jerry Lam wrote: > > Hi Spark users, > > I wonder if val resultRDD = RDDA.union(RDDB) will always have records in > RDDA before records in RDDB

Re: Can we specify driver running on a specific machine of the cluster on yarn-cluster mode?

2014-12-18 Thread madhu phatak
Hi, The driver runs on the machine from where you did the spark-submit. You cannot change that. On Thu, Dec 18, 2014 at 3:44 PM, LinQili wrote: > > Hi all, > On yarn-cluster mode, can we let the driver running on a specific machine > that we choose in cluster ? Or, even the machine not in the cl

Re: SchemaRDD.sample problem

2014-12-18 Thread madhu phatak
Hi, Can you clean up the code a little bit? It's hard to read what's going on. You can use Pastebin or Gist to share the code. On Wed, Dec 17, 2014 at 3:58 PM, Hao Ren wrote: > > Hi, > > I am using SparkSQL on the 1.2.1 branch. The problem comes from the following > 4-line code: > > *val t1: SchemaR

Re: When will Spark 1.2 be released?

2014-12-18 Thread madhu phatak
It’s on Maven Central already: http://search.maven.org/#browse%7C717101892 On Fri, Dec 19, 2014 at 11:17 AM, vboylin1...@gmail.com < vboylin1...@gmail.com> wrote: > > Hi, >Does anyone know when Spark 1.2 will be released? 1.2 has many great features > that we can't wait for now ,-) > > Sincerely > Lin wukan

Re: reading files recursively using spark

2014-12-19 Thread madhu phatak
Hi, You can use Hadoop's FileInputFormat API and Spark's newAPIHadoopFile to get recursion. For more on the topic, you can refer to http://stackoverflow.com/questions/8114579/using-fileinputformat-addinputpaths-to-recursively-add-hdfs-path On Fri, Dec 19, 2014 at 4:50 PM, Sean Owen wrote: > > How
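A sketch of one way to do this with the new Hadoop API, assuming the recursive-directory property name from Hadoop 2.x and an illustrative path:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // tell the input format to descend into subdirectories
    sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
    val lines = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/root")
      .map(_._2.toString)   // keep only the line text, dropping the byte offset key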

Re: broadcasting object issue

2014-12-22 Thread madhu phatak
Hi, Just ran your code on spark-shell. If you replace val bcA = sc.broadcast(a) with val bcA = sc.broadcast(new B().getA) it seems to work. Not sure why. On Tue, Dec 23, 2014 at 9:12 AM, Henry Hung wrote: > Hi All, > > > > I have a problem with broadcasting a serialize class object th

Re: Joins in Spark

2014-12-22 Thread madhu phatak
Hi, You can map your vertices RDD as follows: val pairVertices = verticesRDD.map(vertex => (vertex, null)) The above gives you a pair RDD. After the join, make sure that you remove the superfluous null values. On Tue, Dec 23, 2014 at 10:36 AM, Deep Pradhan wrote: > Hi, > I have two RDDs, vertices and edg
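A sketch of the full pattern with hypothetical RDDs, assuming edgesRDD is keyed by the same vertex type:

    val pairVertices = verticesRDD.map(v => (v, null))
    val joined = pairVertices.join(edgesRDD)
      .map { case (vertex, (_, edgeValue)) => (vertex, edgeValue) }  // drop the superfluous null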

Re: DAG info

2015-01-03 Thread madhu phatak
Hi, You can turn off these messages using log4j.properties. On Fri, Jan 2, 2015 at 1:51 PM, Robineast wrote: > Do you have some example code of what you are trying to do? > > Robin > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/DAG-info-tp20940p2
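As a sketch, the relevant lines in conf/log4j.properties (copied from the shipped log4j.properties.template and raised from INFO to WARN) would look like:

    # quiet the per-stage/DAG INFO output from Spark's internals
    log4j.rootCategory=WARN, console
    log4j.logger.org.apache.spark=WARN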