foldLeft()-like semantics: I have to combine only the previously combined result with the result from the current time tn.
Please advise,
Muthu
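In case it helps to make the fold-like semantics concrete, here is a minimal sketch using updateStateByKey over a textFileStream; the monitored directory, checkpoint path, and word-count logic below are only assumptions for illustration, not the actual job.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FoldLikeStreaming {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("fold-like"), Seconds(10))
    ssc.checkpoint("/tmp/fold-like-checkpoint") // state needs a checkpoint directory

    // hypothetical directory monitored by textFileStream
    val lines = ssc.textFileStream("hdfs:///data/incoming")
    val perBatch = lines.flatMap(_.split("\\s+")).map(word => (word, 1L))

    // fold-like: combine the previously accumulated value with the batch at time tn
    val runningTotals = perBatch.updateStateByKey[Long] { (batch: Seq[Long], prev: Option[Long]) =>
      Some(prev.getOrElse(0L) + batch.sum)
    }
    runningTotals.print()

    ssc.start()
    ssc.awaitTermination()
  }
}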
perform simple filters and sorts using Elasticsearch, and for more complex aggregates Spark DataFrames can come to the rescue :).
Please advise on other possible data stores I could use.
Thanks,
Thanks,
Muthu
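To illustrate the split I have in mind, a rough sketch using the elasticsearch-spark connector; the es.nodes address, index name, and column names are made up. Simple filters get pushed down to Elasticsearch, while the heavier aggregate runs in Spark.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("es-plus-spark").getOrCreate()

// read an Elasticsearch index through the elasticsearch-spark connector
val events = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "es-host:9200")   // made-up cluster address
  .load("events")                       // made-up index name

val perUser = events
  .filter(col("status") === "ok")
  .groupBy(col("user_id"))
  .agg(count(lit(1)).as("cnt"), avg(col("latency_ms")).as("avg_latency_ms"))

perUser.orderBy(col("cnt").desc).show()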
I may have a naive question on join / groupBy-agg. During the days of RDDs, whenever I wanted to perform (a) a groupBy-agg, I used to say reduceByKey (of PairRDDFunctions) with an optional partition strategy (which is the number of partitions or a Partitioner), and (b) a join (of PairRDDFunctions) and its variants
'spark.sql.shuffle.partitions'. In an ideal situation, we have a long-running application that uses the same SparkSession and runs one or more queries using FAIR mode.
Thanks,
Muthu
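To make the comparison concrete, a small sketch of the two styles I mean (sample data is made up): the RDD API takes the partition count directly on reduceByKey, while the Dataset API picks it up from spark.sql.shuffle.partitions.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("groupby-agg-styles").getOrCreate()
import spark.implicits._

// (a) RDD style: partitioning is passed directly to reduceByKey / join
val pairs = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val reduced = pairs.reduceByKey(_ + _, 8)   // 8 output partitions
println(reduced.getNumPartitions)

// (b) Dataset/DataFrame style: the shuffle width comes from spark.sql.shuffle.partitions
spark.conf.set("spark.sql.shuffle.partitions", "8")
val agg = pairs.toDF("key", "value").groupBy("key").sum("value")
agg.show()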
On Wed, Jul 19, 2017 at 6:03 AM, qihuagao [via Apache Spark User List] <
ml+s1001560n28879...@n3.nabble.com> wrote:
> also
extensive APIs available in the SparkListener interface that is available within every Spark application.
Please advise,
Muthu
I had a similar question in the past and worked around it by having my spark-submit application register with my master application in order to coordinate kill and/or progress of execution. This is a bit clunky, I suppose, in comparison to a REST-like API available in the Spark standalone cluster.
Are you getting the OutOfMemory on the driver or on the executor? A typical cause of OOM in Spark is too few tasks for a job.
build
Please advise,
Muthu
3 months ago  /bin/sh -c #(nop) ADD file:3a7bff4e139bcacc5…  69.2MB
(2) $ docker run --entrypoint "/usr/local/openjdk-8/bin/java" 3ef86250a35b '-version'
openjdk version "1.8.0_275"
OpenJDK Runtime Environment (build 1.8.0_275-b01)
OpenJDK 64-Bit Server
`, `other_useful_id`, `json_content`, `file_path`.
Assume that I already have the required HDFS URL libraries in my classpath.
Please advise,
Muthu
sparkContext. I did try to use the GCS Java API to read content, but ran into many JAR conflicts, as the HDFS wrapper and the JAR library use different dependencies.
Hope these findings help others as well.
Thanks,
Muthu
On Mon, 11 Jul 2022 at 14:11, Enrico Minack wrote:
> All you need to do
s make a difference regarding
I guess, in the question above, I do have to process row-wise, and an RDD may be more efficient?
Thanks,
Muthu
On Tue, 12 Jul 2022 at 14:55, ayan guha wrote:
> Another option is:
>
> 1. collect the dataframe with file path
> 2. create a list of paths
> 3. crea
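A sketch of the collect-the-paths approach quoted above, assuming a hypothetical catalog table with a `file_path` column; the paths are collected to the driver, turned into a list, and read in a single call.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("read-by-path").getOrCreate()

// made-up catalog with one HDFS path per row in `file_path`
val catalog = spark.read.parquet("hdfs:///data/file_catalog")

// 1. collect the paths to the driver, 2. build a list, 3. read them in one go
val paths: Seq[String] = catalog.select("file_path").distinct()
  .collect().map(_.getString(0)).toSeq

val contents = spark.read.text(paths: _*)   // one row per line of every file
contents.show(truncate = false)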
assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
+- input[0, org.apache.spark.sql.Row, true]
Let me know if you would like me to try to create a more simplified reproducer for this problem. Perhaps I should not be using Option[T] for nullable schema values?
Please advise,
Muthu
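For context, a minimal example of the Option[T]-based schema I mean; the Record case class and its fields are made up. Option fields surface as nullable columns in the encoder-derived schema.

import org.apache.spark.sql.SparkSession

case class Record(id: Long, name: Option[String], score: Option[Double])

val spark = SparkSession.builder().appName("nullable-options").getOrCreate()
import spark.implicits._

// Option[T] fields come out as nullable columns in the encoder-derived schema
val ds = Seq(Record(1L, Some("a"), None), Record(2L, None, Some(0.5))).toDS()
ds.printSchema()   // name and score show nullable = true; id shows nullable = false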
I don't know how to split and map the row elegantly. Hence I'm using it as an RDD.
Thanks,
Muthu
On Thu, Jul 28, 2016 at 10:47 PM, Dong Meng wrote:
> you can specify nullable in StructField
>
> On Thu, Jul 28, 2016 at 9:14 PM, Muthu Jayakumar
> wrote:
>
>> Hello there,
>>
rations on this job.
On a side note, I do understand that 200 Parquet part files for the above 2.2 GB seems like overkill for a 128 MB block size. Ideally it should be about 18 parts.
Please advise,
Muthu
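For what it's worth, a sketch of how the part-file count could be brought down (paths below are made up): coalesce before the write, or lower spark.sql.shuffle.partitions instead of the default 200.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("fewer-part-files").getOrCreate()

// made-up paths; the input is the ~2.2 GB result of the earlier aggregations
val df = spark.read.parquet("hdfs:///data/stage")

// collapse the 200 shuffle-sized partitions to ~18 output files before writing
df.coalesce(18).write.parquet("hdfs:///data/out")

// alternatively, lower the shuffle width for the whole job
spark.conf.set("spark.sql.shuffle.partitions", "18")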
spark.read.parquet(parquetFile).toJavaRDD.partitions.size()
res2: Int = 20
Can I suspect something with dynamic allocation, perhaps?
Please advise,
Muthu
On Sat, Aug 6, 2016 at 3:23 PM, Mich Talebzadeh
wrote:
> 720 cores Wow. That is a hell of cores Muthu :)
>
> Ok let us take a step back
>
>
Hello Hao Ren,
Doesn't the code...
val add = udf {
  (a: Int) => a + notSer.value
}
...mean a UDF function that is Int => Int?
Thanks,
Muthu
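For reference, a self-contained variant of that snippet, with the captured value replaced by a broadcast variable named offset (my own naming, just for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("udf-int-to-int").getOrCreate()
import spark.implicits._

// broadcast the captured value instead of closing over an outer object
val offset = spark.sparkContext.broadcast(10)

val add = udf { (a: Int) => a + offset.value }   // an Int => Int UDF

val df = Seq(1, 2, 3).toDF("a")
df.select(col("a"), add(col("a")).as("a_plus_offset")).show()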
On Sun, Aug 7, 2016 at 2:31 PM, Hao Ren wrote:
> I am playing with spark 2.0
> What I tried to test is:
>
> Create a UDF
f.getFloat(ParquetOutputFormat.MEMORY_POOL_RATIO,
MemoryManager.DEFAULT_MEMORY_POOL_RATIO);).
I wonder how this is provided through Apache Spark. Meaning, I see that 'TaskAttemptContext' seems to be the hint for providing this, but I am not able to find a way to provide this configuration.
Please advise,
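One way this might be provided is through the Hadoop configuration that Spark passes down to the Parquet output format; a sketch, assuming the underlying key is parquet.memory.pool.ratio and using a made-up output path.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-memory-pool")
  // spark.hadoop.* settings are copied into the Hadoop Configuration that the
  // Parquet output format reads from
  .config("spark.hadoop.parquet.memory.pool.ratio", "0.3")
  .getOrCreate()

// or set it on the live Hadoop configuration directly
spark.sparkContext.hadoopConfiguration.set("parquet.memory.pool.ratio", "0.3")

spark.range(1000).write.parquet("/tmp/mem-pool-test")   // made-up output path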
] method, how could I specify a custom encoder/translation to a case class (where I don't have the same column-name mapping or the same data-type mapping)?
Please advise,
Muthu
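One way to handle this is to rename and cast explicitly before calling as[T]; a sketch with a made-up Person case class and made-up source columns.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

case class Person(name: String, age: Int)

val spark = SparkSession.builder().appName("as-case-class").getOrCreate()
import spark.implicits._

// made-up source whose column names and types differ from the case class
val raw = spark.read.parquet("hdfs:///data/people_raw")

// rename and cast explicitly, then let the implicit Encoder take over via as[T]
val people = raw
  .select(col("full_name").as("name"), col("age_str").cast("int").as("age"))
  .as[Person]
people.show()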
and I think this may make it strongly typed?
Thank you for looking into my email.
Thanks,
Muthu
On Mon, Jan 11, 2016 at 3:08 PM, Michael Armbrust
wrote:
> Also, while extracting a value into Dataset using as[U] method, how could
>> I specify a custom encoder/translation to case class (w
>export SPARK_WORKER_MEMORY=4g
Maybe you could increase the max heap size on the worker? In case the OutOfMemory is on the driver, then you may want to set it explicitly for the driver.
Thanks,
On Tue, Jan 12, 2016 at 2:04 AM, Barak Yaish wrote:
> Hello,
>
> I've a 5 nodes cluster whic
Thanks Michael. Let me test it with a recent master code branch.
Also, for every mapping step, do I have to create a new case class? I cannot use a Tuple as I have ~130 columns to process. Earlier I had used a Seq[Any] (actually an Array[Any], to optimize serialization) but processed it using an RDD (
3 */ mutableRow.setNullAt(0);
/* 144 */ } else {
/* 145 */
/* 146 */ mutableRow.update(0, primitive1);
/* 147 */ }
/* 148 */
/* 149 */ return mutableRow;
/* 150 */ }
/* 151 */ }
/* 152 */
Thanks.
On Tue, Jan 12, 2016 at 11:35 AM, Muthu Jayakumar
wrote:
> Thanks Micheal. Le
DataFrame and udf. This may be more performant than doing an RDD transformation, as you'll transform only the column that needs to be changed.
Hope this helps.
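A minimal sketch of what I mean, with a made-up column name and paths:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("cast-one-column").getOrCreate()

// made-up paths and column name
val df = spark.read.parquet("hdfs:///data/large_file")
val casted = df.withColumn("amount", col("amount").cast("double"))
casted.write.parquet("hdfs:///data/large_file_casted")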
On Thu, Jan 21, 2016 at 6:17 AM, Eli Super wrote:
> Hi
>
> I have a large size parquet file .
>
> I need to cast the whole colu
Does increasing the number of partitions help? You could try something like 3 times what you currently have.
Another trick I used was to partition the problem into multiple DataFrames, run them sequentially, persist the results, and then run a union on the results.
Hope this helps.
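A rough sketch of that split-persist-union trick, assuming a hypothetical `bucket` column that splits the keys disjointly (paths and the 3x partition count are made up):

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("split-persist-union").getOrCreate()

// roughly 3x the previous partition count
val df = spark.read.parquet("hdfs:///data/big_input").repartition(600)

// slice the problem by a `bucket` column that splits the keys disjointly,
// persist each slice's result, then union the pieces back together
val slices: Seq[DataFrame] = (0 until 4).map { b =>
  df.filter(col("bucket") === b)
    .groupBy("key").count()
    .persist()
}
val combined = slices.reduce(_ union _)
combined.write.parquet("hdfs:///data/big_output")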
On Fri,
n Wireless 4G LTE smartphone
>
>
> ---- Original message
> From: Muthu Jayakumar
> Date: 01/22/2016 3:50 PM (GMT-05:00)
> To: Darren Govoni , "Sanders, Isaac B" <
> sande...@rose-hulman.edu>, Ted Yu
> Cc: user@spark.apache.org
> Subject: Re: 10hrs of Scheduler
You could try to use an Akka actor system with Apache Spark, if you are intending to use it in an online / interactive job execution scenario.
On Sat, Nov 14, 2015, 08:19 Sabarish Sasidharan <
sabarish.sasidha...@manthan.com> wrote:
> You are probably trying to access the spring context from the execut
at can use the default
serializer provided by Spark.
Hope this helps.
Thanks,
Muthu
On Sat, Nov 14, 2015 at 10:18 PM, Netai Biswas
wrote:
> Hi,
>
> Thanks for your response. I will give a try with akka also, if you have
> any sample code or useful link please do share with me. Anyway
true)
 |-- freq: array (nullable = true)
 |    |-- element: long (containsNull = true)
Is there a way to convert this attribute from true to false without running any mapping / UDF on that column?
Please advise,
Muthu
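One possible workaround is to rebuild the schema with the desired containsNull flag and re-apply it via createDataFrame, so no per-row mapping or UDF runs over the data; a sketch, assuming a hypothetical input path.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("flip-containsNull").getOrCreate()

val df = spark.read.parquet("hdfs:///data/histograms")   // made-up input path

// rebuild the schema with containsNull flipped on array columns, then re-apply it
// on the same rows; nothing touches the data itself
val fixedSchema = StructType(df.schema.map {
  case StructField(name, ArrayType(elementType, _), nullable, metadata) =>
    StructField(name, ArrayType(elementType, containsNull = false), nullable, metadata)
  case other => other
})
val fixed = spark.createDataFrame(df.rdd, fixedSchema)
fixed.printSchema()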
simple schema of "col1
thru col4" above. But the problem seem to exist only on that
"some_histogram" column which contains the mixed containsNull = true/false.
Let me know if this helps.
Thanks,
Muthu
On Wed, Oct 19, 2016 at 6:21 PM, Michael Armbrust
wrote:
> Nullabl
t.scala:161)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
at org.apache.spark.sql.Dataset$.apply(Dataset.scala:59)
at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:2594)
at org.apache.spark.sql.Dataset.union(Dataset.scala:1459)
Please advise,
Muthu
On Thu, Oct 20, 2016 at 1:46 AM, Michael Armbrust
wr
Thanks Cheng Lian for opening the JIRA. I found this with Spark 2.0.0.
Thanks,
Muthu
On Fri, Oct 21, 2016 at 3:30 PM, Cheng Lian wrote:
> Yea, confirmed. While analyzing unions, we treat StructTypes with
> different field nullabilities as incompatible types and throws this error.
>
Depending on your use case, 'df.withColumn("my_existing_or_new_col",
lit(0l))' could work?
On Fri, Nov 18, 2016 at 11:18 AM, Kristoffer Sjögren
wrote:
> Thanks for your answer. I have been searching the API for doing that
> but I could not find how to do it?
>
> Could you give me a code snippet?
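For reference, a runnable version of the withColumn + lit approach above (the sample data is made up):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().appName("constant-column").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("b", 2)).toDF("id", "value")

// adds the column if it is new, or overwrites it if it already exists
val withZero = df.withColumn("my_existing_or_new_col", lit(0L))
withZero.show()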
Adding to Lars Albertsson & Miguel Morales, I am hoping to see how well scalameta would branch down into support for macros that can do away with sizable DI problems, and, for the remainder, having a class type as args as Miguel Morales mentioned.
Thanks,
On Wed, Dec 28, 2016 at 6:41 PM, Miguel Morales
I guess this may help in your case:
https://spark.apache.org/docs/latest/sql-programming-guide.html#global-temporary-view
Thanks,
Muthu
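A tiny sketch of the global temporary view usage described at that link (table name and data are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("global-temp-view").getOrCreate()
import spark.implicits._

val people = Seq(("a", 30), ("b", 25)).toDF("name", "age")

// a global temporary view is visible across SparkSessions in the same application
people.createGlobalTempView("people")

// it lives in the reserved global_temp database
spark.newSession().sql("SELECT name FROM global_temp.people WHERE age > 26").show()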
On Fri, Jan 20, 2017 at 6:27 AM, ☼ R Nair (रविशंकर नायर) <
ravishankar.n...@gmail.com> wrote:
> Dear all,
>
> Here is a requirement I
).executedPlan.executeCollect().foreach
{
// scalastyle:off println
r => println(r.getString(0))
// scalastyle:on println
}
}
sessionState is not accessible if I were to write my own explain(log: LoggingAdapter).
Please advise,
Muthu
This worked. Thanks for the tip Michael.
Thanks,
Muthu
On Thu, Feb 16, 2017 at 12:41 PM, Michael Armbrust
wrote:
> The toString method of Dataset.queryExecution includes the various plans.
> I usually just log that directly.
>
> On Thu, Feb 16, 2017 at 8:26 AM, Muthu Jayaku
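For anyone finding this in the archive later, a sketch of the tip above with a made-up sample query; the plans go to a logger instead of stdout.

import org.apache.spark.sql.SparkSession
import org.slf4j.LoggerFactory

val spark = SparkSession.builder().appName("log-query-plans").getOrCreate()
import spark.implicits._

val log = LoggerFactory.getLogger("query-plans")

val df = Seq(1, 2, 3).toDF("n").filter("n > 1")

// queryExecution.toString carries the parsed/analyzed/optimized/physical plans
log.info(df.queryExecution.toString)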
-kill solution may be Spark to Kafka to Elasticsearch?
More thoughts are welcome, please.
Thanks,
Muthu
On Wed, Mar 15, 2017 at 4:53 AM, Richard Siebeling
wrote:
> maybe Apache Ignite does fit your requirements
>
> On 15 March 2017 at 08:44, vincent gromakowski <
> vincent.gromakow
our thoughts.
Thanks,
Muthu
On Wed, Mar 15, 2017 at 10:55 AM, vvshvv wrote:
> Hi muthu,
>
> I agree with Shiva, Cassandra also supports SASI indexes, which can
> partially replace Elasticsearch functionality.
>
> Regards,
> Uladzimir
>
>
>
> Sent from my Mi phone
&
for long-term storage and trend analysis with full-table-scan scenarios.
But I am thankful for the many ideas and perspectives on how this could be looked at.
Thanks,
Muthu
On Wed, Mar 15, 2017 at 7:25 PM, Shiva Ramagopal wrote:
> Hi,
>
> The choice of ES vs Cassandra should rea
) datastore
Please advise,
Muthu
them to read Parquet and respond back with results.
Hope this helps.
Thanks
Muthu
On Mon, Jun 5, 2017, 01:01 Sandeep Nemuri wrote:
> Well if you are using Hortonworks distribution there is Livy2 which is
> compatible with Spark2 and scala 2.11.
>
>
> https://docs.hortonworks.com/HDPDoc
on I use runs Spark in Spark Standalone mode with a 32-node cluster.
Hope this gives a better idea.
Thanks,
Muthu
On Sun, Jun 4, 2017 at 10:33 PM, kant kodali wrote:
> Hi Muthu,
>
> I am actually using Play framework for my Micro service which uses Akka
> but I still don't understand
I run spark-submit (https://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications) in client mode, which starts the micro-service. If you keep the event loop going, the Spark context will remain active.
Thanks,
Muthu
On Mon, Jun 5, 2017 at 2:44 PM, kant kodali
during a join is to set 'spark.sql.shuffle.partitions' to some desired number during spark-submit. I am trying to see if there is a way to provide this programmatically for every step of a groupBy-agg / join.
Please advise,
Muthu
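A sketch of two possible approaches (paths and the join key are made up). Note that the SQL conf is only read when a query actually executes, so it has to be set before each action; repartition gives explicit per-step control inside a single query.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("per-step-shuffle").getOrCreate()

// made-up inputs sharing a `key` column
val a = spark.read.parquet("hdfs:///data/a")
val b = spark.read.parquet("hdfs:///data/b")

// option 1: flip the runtime conf before each action that triggers a shuffle
spark.conf.set("spark.sql.shuffle.partitions", "64")
a.join(b, Seq("key")).write.parquet("hdfs:///data/joined")

spark.conf.set("spark.sql.shuffle.partitions", "512")
spark.read.parquet("hdfs:///data/joined").groupBy("key").count()
  .write.parquet("hdfs:///data/agg")

// option 2: size each step explicitly with repartition
val perStep = a.join(b, Seq("key")).repartition(512, col("key")).groupBy("key").count()
perStep.write.parquet("hdfs:///data/agg2")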
ocs/latest/api/scala/index.html#org.apache.spark.sql.Dataset).
This way I can change the numbers based on the data.
Thanks,
Muthu
On Wed, Jul 19, 2017 at 8:23 AM, ayan guha wrote:
> You can use spark.sql.shuffle.partitions to adjust amount of parallelism.
>
> On Wed, Jul 19, 2017 at 11:
version of Scala 2.11 becomes fully Java 9/10 compliant, it could work.
Hope this helps.
Thanks,
Muthu
On Sun, Apr 1, 2018 at 6:57 AM, kant kodali wrote:
> Hi All,
>
> Does anybody got Spark running on Java 10?
>
> Thanks!
>
>
>
It is supported, with some limitations around JSR 376 (JPMS) that can cause linker errors.
Thanks,
Muthu
On Sun, Apr 1, 2018 at 11:15 AM, kant kodali wrote:
> Hi Muthu,
>
> "On a side note, if some coming version of Scala 2.11 becomes full Java
> 9/10 compliant it could work.&qu
-locality.
Thanks
Muthu
I generally write to Parquet when I want to repeat the operation of reading the data and performing different operations on it every time. This saves DB time for me.
Thanks
Muthu
On Thu, Jul 19, 2018, 18:34 amin mohebbi
wrote:
> We do have two big tables each includes 5 billion of rows, so
A naive workaround may be to transform the json4s JValue to a String (using something like compact()) and process it as a String. Once you are done with the last action, you could write it back as a JValue (using something like parse()).
Thanks,
Muthu
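A rough sketch of that round trip, with made-up JSON and a trivial transformation; the JValue only exists inside the lambda, so the Dataset itself stays a Dataset[String].

import org.apache.spark.sql.SparkSession
import org.json4s._
import org.json4s.jackson.JsonMethods.{compact, parse, render}

val spark = SparkSession.builder().appName("json4s-roundtrip").getOrCreate()
import spark.implicits._

// made-up JSON payloads; JValue has no Spark Encoder, so the column stays a String
val raw = Seq("""{"a": 1}""", """{"a": 2}""").toDS()

val bumped = raw.map { s =>
  val jv: JValue = parse(s)
  val out = jv.transformField { case ("a", JInt(n)) => ("a", JInt(n + 1)) }
  compact(render(out))   // back to String before it leaves the lambda
}
bumped.show(truncate = false)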
On Wed, Sep 19, 2018 at 6:35 AM Arko Provo
The error means that you are missing commons-configuration-<version>.jar from the classpath of the driver/worker.
Thanks,
Muthu
On Sat, Sep 29, 2018 at 11:55 PM yuvraj singh <19yuvrajsing...@gmail.com>
wrote:
> Hi , i am getting this error please help me .
>
>
> 18/09/30 05
The error reads as though the Preconditions.checkArgument() method is being called with an incorrect parameter signature.
Could you check to see how many JARs (before the uber JAR) actually contain this method signature?
I smell a JAR version conflict or something similar.
Thanks
Muthu
On Thu, Dec 20, 2018, 02:40 Mich
Perhaps use of a generic StructType may work in your situation of being language-agnostic? Case classes are backed by implicits to provide type conversions into the columnar format.
My 2 cents.
Thanks,
Muthu
On Mon, Jan 7, 2019 at 4:13 AM yeikel valdes wrote:
>
>
> Forwarded Message ===
input data size + shuffle operation.
Please advise,
Muthu
> I am running a spark job with 20 cores but i did not understand why my
> application get 1-2 cores on couple of machines why not it just run on two
> nodes like node1=16 cores and node 2=4 cores . but cores are allocated like
> node1=2 node =1-node 14=1 like that.
I believe that's the intended
If you require higher precision, you may have to write a custom UDAF. In my case, I ended up storing the data as a key-value ordered list of histograms.
Thanks
Muthu
On Mon, Nov 11, 2019, 20:46 Patrick McCarthy
wrote:
> Depending on your tolerance for error you could also
I suspect the Spark job somehow has an incorrect (newer) version of json4s in the classpath. json4s 3.5.3 is the highest version that can be used.
Thanks,
Muthu
On Mon, Feb 17, 2020, 06:43 Mich Talebzadeh
wrote:
> Hi,
>
> Spark version 2.4.3
> Hbase 1.2.7
>
> Data is
older version, make sure all of them are older
than 3.2.11 at the least.
Hope it helps.
Thanks,
Muthu
On Mon, Feb 17, 2020 at 1:15 PM Mich Talebzadeh
wrote:
> Thanks Muthu,
>
>
> I am using the following jar files for now in local mode i.e.
> spark-shell_local
> --j
I am new to Spark. I am trying to do the following:
Netcat --> Flume --> Spark Streaming (process Flume data) --> HDFS.
My Flume config file has the following setup:
Source = netcat
Sink = avro sink
Spark Streaming code:
I am able to print data from Flume to the monitor. But I am struggling to create a file
g
Cc: u...@spark.incubator.apache.org
Subject: Re: writing Flume data to HDFS
What is the error you are getting when you say "I was trying to write the data to hdfs.. but it fails…"?
TD
On Thu, Jul 10, 2014 at 1:36 PM, Sundaram, Muthu X. <muthu.x.sundaram@sabre.com>
My intention is not to write data directly from Flume to HDFS. I have to collect messages from a queue using Flume and send them to Spark Streaming for additional processing. I will try what you have suggested.
Thanks,
Muthu
From: Tathagata Das [mailto:tathagata.das1...@gmail.com]
Sent: Monday
op? How do I
create this JavaRDD?
In the loop I am able to get every record and I am able to print them.
I appreciate any help here.
Thanks,
Muthu