foldLeft()-like semantics: I have to combine only the previously combined result with the result from the current time tn.
Please advise,
Muthu
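In case it helps to make the fold-like semantics concrete, here is a minimal sketch using updateStateByKey over a textFileStream; the monitored directory, checkpoint path, and word-count logic below are only assumptions for illustration, not the actual job.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FoldLikeStreaming {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("fold-like"), Seconds(10))
    ssc.checkpoint("/tmp/fold-like-checkpoint") // state needs a checkpoint directory

    // hypothetical directory monitored by textFileStream
    val lines = ssc.textFileStream("hdfs:///data/incoming")
    val perBatch = lines.flatMap(_.split("\\s+")).map(word => (word, 1L))

    // fold-like: combine the previously accumulated value with the batch at time tn
    val runningTotals = perBatch.updateStateByKey[Long] { (batch: Seq[Long], prev: Option[Long]) =>
      Some(prev.getOrElse(0L) + batch.sum)
    }
    runningTotals.print()

    ssc.start()
    ssc.awaitTermination()
  }
}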
perform simple filters and sorts using Elasticsearch, and for more complex aggregates Spark DataFrames can come to the rescue :).
Please advise on other possible data stores I could use.
Thanks,
Thanks,
Muthu
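To illustrate the split I have in mind, a rough sketch using the elasticsearch-spark connector; the es.nodes address, index name, and column names are made up. Simple filters get pushed down to Elasticsearch, while the heavier aggregate runs in Spark.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("es-plus-spark").getOrCreate()

// read an Elasticsearch index through the elasticsearch-spark connector
val events = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "es-host:9200")   // made-up cluster address
  .load("events")                       // made-up index name

val perUser = events
  .filter(col("status") === "ok")
  .groupBy(col("user_id"))
  .agg(count(lit(1)).as("cnt"), avg(col("latency_ms")).as("avg_latency_ms"))

perUser.orderBy(col("cnt").desc).show()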
I may have a naive question on join / groupBy-agg. During the days of RDDs, whenever I wanted to perform (a) a groupBy-agg, I used to say reduceByKey (of PairRDDFunctions) with an optional partition strategy (which is the number of partitions or a Partitioner), and (b) a join (of PairRDDFunctions) and its variants
'spark.sql.shuffle.partitions'. In an ideal situation, we have a long-running application that uses the same SparkSession and runs one or more queries using FAIR mode.
Thanks,
Muthu
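To make the comparison concrete, a small sketch of the two styles I mean (sample data is made up): the RDD API takes the partition count directly on reduceByKey, while the Dataset API picks it up from spark.sql.shuffle.partitions.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("groupby-agg-styles").getOrCreate()
import spark.implicits._

// (a) RDD style: partitioning is passed directly to reduceByKey / join
val pairs = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val reduced = pairs.reduceByKey(_ + _, 8)   // 8 output partitions
println(reduced.getNumPartitions)

// (b) Dataset/DataFrame style: the shuffle width comes from spark.sql.shuffle.partitions
spark.conf.set("spark.sql.shuffle.partitions", "8")
val agg = pairs.toDF("key", "value").groupBy("key").sum("value")
agg.show()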
On Wed, Jul 19, 2017 at 6:03 AM, qihuagao [via Apache Spark User List] <
ml+s1001560n28879...@n3.nabble.com> wrote:
> also
extensive APIs available in the SparkListener interface that is available within every Spark application.
Please advise,
Muthu
I had a similar question in the past and worked around it by having my spark-submit application register with my master application in order to coordinate kill and/or progress of execution. This is a bit clunky, I suppose, in comparison to a REST-like API available in the Spark standalone cluster.
Are you getting the OutOfMemory on the driver or on the executor? A typical cause of OOM in Spark is too few tasks for a job.
build
Please advise,
Muthu
3 months ago  /bin/sh -c #(nop) ADD file:3a7bff4e139bcacc5…  69.2MB
(2) $ docker run --entrypoint "/usr/local/openjdk-8/bin/java" 3ef86250a35b '-version'
openjdk version "1.8.0_275"
OpenJDK Runtime Environment (build 1.8.0_275-b01)
OpenJDK 64-Bit Server
`, `other_useful_id`, `json_content`, `file_path`.
Assume that I already have the required HDFS URL libraries in my classpath.
Please advise,
Muthu
sparkContext. I did try to use the GCS Java API to read content, but ran into many JAR conflicts, as the HDFS wrapper and the JAR library use different dependencies.
Hope these findings help others as well.
Thanks,
Muthu
On Mon, 11 Jul 2022 at 14:11, Enrico Minack wrote:
> All you need to do
s make a difference regarding
I guess, in the question above, I do have to process row-wise, and an RDD may be more efficient?
Thanks,
Muthu
On Tue, 12 Jul 2022 at 14:55, ayan guha wrote:
> Another option is:
>
> 1. collect the dataframe with file path
> 2. create a list of paths
> 3. crea
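A sketch of the collect-the-paths approach quoted above, assuming a hypothetical catalog table with a `file_path` column; the paths are collected to the driver, turned into a list, and read in a single call.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("read-by-path").getOrCreate()

// made-up catalog with one HDFS path per row in `file_path`
val catalog = spark.read.parquet("hdfs:///data/file_catalog")

// 1. collect the paths to the driver, 2. build a list, 3. read them in one go
val paths: Seq[String] = catalog.select("file_path").distinct()
  .collect().map(_.getString(0)).toSeq

val contents = spark.read.text(paths: _*)   // one row per line of every file
contents.show(truncate = false)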
assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
+- input[0, org.apache.spark.sql.Row, true]
Let me know if you would like me to try to create a more simplified reproducer for this problem. Perhaps I should not be using Option[T] for nullable schema values?
Please advise,
Muthu
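For context, a minimal example of the Option[T]-based schema I mean; the Record case class and its fields are made up. Option fields surface as nullable columns in the encoder-derived schema.

import org.apache.spark.sql.SparkSession

case class Record(id: Long, name: Option[String], score: Option[Double])

val spark = SparkSession.builder().appName("nullable-options").getOrCreate()
import spark.implicits._

// Option[T] fields come out as nullable columns in the encoder-derived schema
val ds = Seq(Record(1L, Some("a"), None), Record(2L, None, Some(0.5))).toDS()
ds.printSchema()   // name and score show nullable = true; id shows nullable = false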
I don't know how to split and map the row elegantly. Hence I'm using it as an RDD.
Thanks,
Muthu
On Thu, Jul 28, 2016 at 10:47 PM, Dong Meng wrote:
> you can specify nullable in StructField
>
> On Thu, Jul 28, 2016 at 9:14 PM, Muthu Jayakumar
> wrote:
>
>> Hello there,
>>
rations on this job.
On a side note, I do understand that 200 Parquet part files for the above 2.2 GB seems like overkill for a 128 MB block size. Ideally it should be about 18 parts.
Please advise,
Muthu
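For what it's worth, a sketch of how the part-file count could be brought down (paths below are made up): coalesce before the write, or lower spark.sql.shuffle.partitions instead of the default 200.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("fewer-part-files").getOrCreate()

// made-up paths; the input is the ~2.2 GB result of the earlier aggregations
val df = spark.read.parquet("hdfs:///data/stage")

// collapse the 200 shuffle-sized partitions to ~18 output files before writing
df.coalesce(18).write.parquet("hdfs:///data/out")

// alternatively, lower the shuffle width for the whole job
spark.conf.set("spark.sql.shuffle.partitions", "18")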
spark.read.parquet(parquetFile).toJavaRDD.partitions.size()
res2: Int = 20
Can I suspect something with dynamic allocation, perhaps?
Please advise,
Muthu
On Sat, Aug 6, 2016 at 3:23 PM, Mich Talebzadeh
wrote:
> 720 cores Wow. That is a hell of cores Muthu :)
>
> Ok let us take a step back
>
>
Hello Hao Ren,
Doesn't the code...
val add = udf {
  (a: Int) => a + notSer.value
}
...mean a UDF function that is Int => Int?
Thanks,
Muthu
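For reference, a self-contained variant of that snippet, with the captured value replaced by a broadcast variable named offset (my own naming, just for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("udf-int-to-int").getOrCreate()
import spark.implicits._

// broadcast the captured value instead of closing over an outer object
val offset = spark.sparkContext.broadcast(10)

val add = udf { (a: Int) => a + offset.value }   // an Int => Int UDF

val df = Seq(1, 2, 3).toDF("a")
df.select(col("a"), add(col("a")).as("a_plus_offset")).show()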
On Sun, Aug 7, 2016 at 2:31 PM, Hao Ren wrote:
> I am playing with spark 2.0
> What I tried to test is:
>
> Create a UDF
f.getFloat(ParquetOutputFormat.MEMORY_POOL_RATIO,
MemoryManager.DEFAULT_MEMORY_POOL_RATIO);).
I wonder how this is provided through Apache Spark. Meaning, I see that 'TaskAttemptContext' seems to be the hint for providing this, but I am not able to find a way to provide this configuration.
Please advise,
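One way this might be provided is through the Hadoop configuration that Spark passes down to the Parquet output format; a sketch, assuming the underlying key is parquet.memory.pool.ratio and using a made-up output path.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-memory-pool")
  // spark.hadoop.* settings are copied into the Hadoop Configuration that the
  // Parquet output format reads from
  .config("spark.hadoop.parquet.memory.pool.ratio", "0.3")
  .getOrCreate()

// or set it on the live Hadoop configuration directly
spark.sparkContext.hadoopConfiguration.set("parquet.memory.pool.ratio", "0.3")

spark.range(1000).write.parquet("/tmp/mem-pool-test")   // made-up output path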
] method, how could I specify a custom encoder/translation to a case class (where I don't have the same column-name mapping or the same data-type mapping)?
Please advise,
Muthu
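One way to handle this is to rename and cast explicitly before calling as[T]; a sketch with a made-up Person case class and made-up source columns.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

case class Person(name: String, age: Int)

val spark = SparkSession.builder().appName("as-case-class").getOrCreate()
import spark.implicits._

// made-up source whose column names and types differ from the case class
val raw = spark.read.parquet("hdfs:///data/people_raw")

// rename and cast explicitly, then let the implicit Encoder take over via as[T]
val people = raw
  .select(col("full_name").as("name"), col("age_str").cast("int").as("age"))
  .as[Person]
people.show()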
and I think this may make it strongly typed?
Thank you for looking into my email.
Thanks,
Muthu
On Mon, Jan 11, 2016 at 3:08 PM, Michael Armbrust
wrote:
> Also, while extracting a value into Dataset using as[U] method, how could
>> I specify a custom encoder/translation to case class (w
>export SPARK_WORKER_MEMORY=4g
Maybe you could increase the max heap size on the worker? In case the OutOfMemory is on the driver, then you may want to set it explicitly for the driver.
Thanks,
On Tue, Jan 12, 2016 at 2:04 AM, Barak Yaish wrote:
> Hello,
>
> I've a 5 nodes cluster whic
Thanks Michael. Let me test it with a recent master code branch.
Also, for every mapping step, do I have to create a new case class? I cannot use a Tuple as I have ~130 columns to process. Earlier I had used a Seq[Any] (actually an Array[Any], to optimize serialization) but processed it using an RDD (
3 */ mutableRow.setNullAt(0);
/* 144 */ } else {
/* 145 */
/* 146 */ mutableRow.update(0, primitive1);
/* 147 */ }
/* 148 */
/* 149 */ return mutableRow;
/* 150 */ }
/* 151 */ }
/* 152 */
Thanks.
On Tue, Jan 12, 2016 at 11:35 AM, Muthu Jayakumar
wrote:
> Thanks Micheal. Le
DataFrame and udf. This may be more performant than doing an RDD transformation, as you'll transform only the column that needs to be changed.
Hope this helps.
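A minimal sketch of what I mean, with a made-up column name and paths:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("cast-one-column").getOrCreate()

// made-up paths and column name
val df = spark.read.parquet("hdfs:///data/large_file")
val casted = df.withColumn("amount", col("amount").cast("double"))
casted.write.parquet("hdfs:///data/large_file_casted")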
On Thu, Jan 21, 2016 at 6:17 AM, Eli Super wrote:
> Hi
>
> I have a large size parquet file .
>
> I need to cast the whole colu
Does increasing the number of partitions help? You could try something like 3 times what you currently have.
Another trick I used was to partition the problem into multiple DataFrames, run them sequentially, persist the results, and then run a union on the results.
Hope this helps.
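A rough sketch of that split-persist-union trick, assuming a hypothetical `bucket` column that splits the keys disjointly (paths and the 3x partition count are made up):

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("split-persist-union").getOrCreate()

// roughly 3x the previous partition count
val df = spark.read.parquet("hdfs:///data/big_input").repartition(600)

// slice the problem by a `bucket` column that splits the keys disjointly,
// persist each slice's result, then union the pieces back together
val slices: Seq[DataFrame] = (0 until 4).map { b =>
  df.filter(col("bucket") === b)
    .groupBy("key").count()
    .persist()
}
val combined = slices.reduce(_ union _)
combined.write.parquet("hdfs:///data/big_output")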
On Fri,
n Wireless 4G LTE smartphone
>
>
> ---- Original message
> From: Muthu Jayakumar
> Date: 01/22/2016 3:50 PM (GMT-05:00)
> To: Darren Govoni , "Sanders, Isaac B" <
> sande...@rose-hulman.edu>, Ted Yu
> Cc: user@spark.apache.org
> Subject: Re: 10hrs of Scheduler
You could try to use an Akka actor system with Apache Spark, if you are intending to use it in an online / interactive job execution scenario.
On Sat, Nov 14, 2015, 08:19 Sabarish Sasidharan <
sabarish.sasidha...@manthan.com> wrote:
> You are probably trying to access the spring context from the execut
at can use the default
serializer provided by Spark.
Hope this helps.
Thanks,
Muthu
On Sat, Nov 14, 2015 at 10:18 PM, Netai Biswas
wrote:
> Hi,
>
> Thanks for your response. I will give a try with akka also, if you have
> any sample code or useful link please do share with me. Anyway
true)
 |-- freq: array (nullable = true)
 |    |-- element: long (containsNull = true)
Is there a way to convert this attribute from true to false without running any mapping / UDF on that column?
Please advise,
Muthu
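One possible workaround is to rebuild the schema with the desired containsNull flag and re-apply it via createDataFrame, so no per-row mapping or UDF runs over the data; a sketch, assuming a hypothetical input path.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("flip-containsNull").getOrCreate()

val df = spark.read.parquet("hdfs:///data/histograms")   // made-up input path

// rebuild the schema with containsNull flipped on array columns, then re-apply it
// on the same rows; nothing touches the data itself
val fixedSchema = StructType(df.schema.map {
  case StructField(name, ArrayType(elementType, _), nullable, metadata) =>
    StructField(name, ArrayType(elementType, containsNull = false), nullable, metadata)
  case other => other
})
val fixed = spark.createDataFrame(df.rdd, fixedSchema)
fixed.printSchema()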
simple schema of "col1
thru col4" above. But the problem seem to exist only on that
"some_histogram" column which contains the mixed containsNull = true/false.
Let me know if this helps.
Thanks,
Muthu
On Wed, Oct 19, 2016 at 6:21 PM, Michael Armbrust
wrote:
> Nullabl
t.scala:161)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
at org.apache.spark.sql.Dataset$.apply(Dataset.scala:59)
at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:2594)
at org.apache.spark.sql.Dataset.union(Dataset.scala:1459)
Please advise,
Muthu
On Thu, Oct 20, 2016 at 1:46 AM, Michael Armbrust
wr
Thanks Cheng Lian for opening the JIRA. I found this with Spark 2.0.0.
Thanks,
Muthu
On Fri, Oct 21, 2016 at 3:30 PM, Cheng Lian wrote:
> Yea, confirmed. While analyzing unions, we treat StructTypes with
> different field nullabilities as incompatible types and throws this error.
>
Depending on your use case, 'df.withColumn("my_existing_or_new_col",
lit(0l))' could work?
On Fri, Nov 18, 2016 at 11:18 AM, Kristoffer Sjögren
wrote:
> Thanks for your answer. I have been searching the API for doing that
> but I could not find how to do it?
>
> Could you give me a code snippet?
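For reference, a runnable version of the withColumn + lit approach above (the sample data is made up):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().appName("constant-column").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("b", 2)).toDF("id", "value")

// adds the column if it is new, or overwrites it if it already exists
val withZero = df.withColumn("my_existing_or_new_col", lit(0L))
withZero.show()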
Adding to Lars Albertsson & Miguel Morales, I am hoping to see how well scalameta would branch down into support for macros that can do away with sizable DI problems, and, for the remainder, having a class type as args as Miguel Morales mentioned.
Thanks,
On Wed, Dec 28, 2016 at 6:41 PM, Miguel Morales
I guess this may help in your case:
https://spark.apache.org/docs/latest/sql-programming-guide.html#global-temporary-view
Thanks,
Muthu
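A tiny sketch of the global temporary view usage described at that link (table name and data are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("global-temp-view").getOrCreate()
import spark.implicits._

val people = Seq(("a", 30), ("b", 25)).toDF("name", "age")

// a global temporary view is visible across SparkSessions in the same application
people.createGlobalTempView("people")

// it lives in the reserved global_temp database
spark.newSession().sql("SELECT name FROM global_temp.people WHERE age > 26").show()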
On Fri, Jan 20, 2017 at 6:27 AM, ☼ R Nair (रविशंकर नायर) <
ravishankar.n...@gmail.com> wrote:
> Dear all,
>
> Here is a requirement I
).executedPlan.executeCollect().foreach
{
// scalastyle:off println
r => println(r.getString(0))
// scalastyle:on println
}
}
sessionState is not accessible if I were to write my own explain(log: LoggingAdapter).
Please advise,
Muthu
This worked. Thanks for the tip Michael.
Thanks,
Muthu
On Thu, Feb 16, 2017 at 12:41 PM, Michael Armbrust
wrote:
> The toString method of Dataset.queryExecution includes the various plans.
> I usually just log that directly.
>
> On Thu, Feb 16, 2017 at 8:26 AM, Muthu Jayaku
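For anyone finding this in the archive later, a sketch of the tip above with a made-up sample query; the plans go to a logger instead of stdout.

import org.apache.spark.sql.SparkSession
import org.slf4j.LoggerFactory

val spark = SparkSession.builder().appName("log-query-plans").getOrCreate()
import spark.implicits._

val log = LoggerFactory.getLogger("query-plans")

val df = Seq(1, 2, 3).toDF("n").filter("n > 1")

// queryExecution.toString carries the parsed/analyzed/optimized/physical plans
log.info(df.queryExecution.toString)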
-kill solution may be Spark to Kafka to Elasticsearch?
More thoughts are welcome, please.
Thanks,
Muthu
On Wed, Mar 15, 2017 at 4:53 AM, Richard Siebeling
wrote:
> maybe Apache Ignite does fit your requirements
>
> On 15 March 2017 at 08:44, vincent gromakowski <
> vincent.gromakow
our thoughts.
Thanks,
Muthu
On Wed, Mar 15, 2017 at 10:55 AM, vvshvv wrote:
> Hi muthu,
>
> I agree with Shiva, Cassandra also supports SASI indexes, which can
> partially replace Elasticsearch functionality.
>
> Regards,
> Uladzimir
>
>
>
> Sent from my Mi phone
&
for long-term storage and trend analysis with full-table-scan scenarios.
But I am thankful for the many ideas and perspectives on how this could be looked at.
Thanks,
Muthu
On Wed, Mar 15, 2017 at 7:25 PM, Shiva Ramagopal wrote:
> Hi,
>
> The choice of ES vs Cassandra should rea
) datastore
Please advise,
Muthu
them to read Parquet and respond back with results.
Hope this helps.
Thanks
Muthu
On Mon, Jun 5, 2017, 01:01 Sandeep Nemuri wrote:
> Well if you are using Hortonworks distribution there is Livy2 which is
> compatible with Spark2 and scala 2.11.
>
>
> https://docs.hortonworks.com/HDPDoc
on I use runs Spark in Spark Standalone mode with a 32-node cluster.
Hope this gives a better idea.
Thanks,
Muthu
On Sun, Jun 4, 2017 at 10:33 PM, kant kodali wrote:
> Hi Muthu,
>
> I am actually using Play framework for my Micro service which uses Akka
> but I still don't understand
I run spark-submit (https://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications) in client mode, which starts the micro-service. If you keep the event loop going, the Spark context will remain active.
Thanks,
Muthu
On Mon, Jun 5, 2017 at 2:44 PM, kant kodali
during a join is to set 'spark.sql.shuffle.partitions' to some desired number during spark-submit. I am trying to see if there is a way to provide this programmatically for every step of a groupBy-agg / join.
Please advise,
Muthu
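A sketch of two possible approaches (paths and the join key are made up). Note that the SQL conf is only read when a query actually executes, so it has to be set before each action; repartition gives explicit per-step control inside a single query.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("per-step-shuffle").getOrCreate()

// made-up inputs sharing a `key` column
val a = spark.read.parquet("hdfs:///data/a")
val b = spark.read.parquet("hdfs:///data/b")

// option 1: flip the runtime conf before each action that triggers a shuffle
spark.conf.set("spark.sql.shuffle.partitions", "64")
a.join(b, Seq("key")).write.parquet("hdfs:///data/joined")

spark.conf.set("spark.sql.shuffle.partitions", "512")
spark.read.parquet("hdfs:///data/joined").groupBy("key").count()
  .write.parquet("hdfs:///data/agg")

// option 2: size each step explicitly with repartition
val perStep = a.join(b, Seq("key")).repartition(512, col("key")).groupBy("key").count()
perStep.write.parquet("hdfs:///data/agg2")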
ocs/latest/api/scala/index.html#org.apache.spark.sql.Dataset).
This way I can change the numbers based on the data.
Thanks,
Muthu
On Wed, Jul 19, 2017 at 8:23 AM, ayan guha wrote:
> You can use spark.sql.shuffle.partitions to adjust amount of parallelism.
>
> On Wed, Jul 19, 2017 at 11:
version of Scala 2.11 becomes fully Java 9/10 compliant, it could work.
Hope this helps.
Thanks,
Muthu
On Sun, Apr 1, 2018 at 6:57 AM, kant kodali wrote:
> Hi All,
>
> Does anybody got Spark running on Java 10?
>
> Thanks!
>
>
>
It is supported, with some limitations around JSR 376 (JPMS) that can cause linker errors.
Thanks,
Muthu
On Sun, Apr 1, 2018 at 11:15 AM, kant kodali wrote:
> Hi Muthu,
>
> "On a side note, if some coming version of Scala 2.11 becomes full Java
> 9/10 compliant it could work.&qu
-locality.
Thanks
Muthu
I generally write to Parquet when I want to repeat the operation of reading the data and performing different operations on it every time. This saves DB time for me.
Thanks
Muthu
On Thu, Jul 19, 2018, 18:34 amin mohebbi
wrote:
> We do have two big tables each includes 5 billion of rows, so
A naive workaround may be to transform the json4s JValue to a String (using something like compact()) and process it as a String. Once you are done with the last action, you could write it back as a JValue (using something like parse()).
Thanks,
Muthu
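A rough sketch of that round trip, with made-up JSON and a trivial transformation; the JValue only exists inside the lambda, so the Dataset itself stays a Dataset[String].

import org.apache.spark.sql.SparkSession
import org.json4s._
import org.json4s.jackson.JsonMethods.{compact, parse, render}

val spark = SparkSession.builder().appName("json4s-roundtrip").getOrCreate()
import spark.implicits._

// made-up JSON payloads; JValue has no Spark Encoder, so the column stays a String
val raw = Seq("""{"a": 1}""", """{"a": 2}""").toDS()

val bumped = raw.map { s =>
  val jv: JValue = parse(s)
  val out = jv.transformField { case ("a", JInt(n)) => ("a", JInt(n + 1)) }
  compact(render(out))   // back to String before it leaves the lambda
}
bumped.show(truncate = false)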
On Wed, Sep 19, 2018 at 6:35 AM Arko Provo
The error means that you are missing commons-configuration-<version>.jar from the classpath of the driver/worker.
Thanks,
Muthu
On Sat, Sep 29, 2018 at 11:55 PM yuvraj singh <19yuvrajsing...@gmail.com>
wrote:
> Hi , i am getting this error please help me .
>
>
> 18/09/30 05
The error reads as though the Preconditions.checkArgument() method is being called with an incorrect parameter signature.
Could you check to see how many JARs (before the uber JAR) actually contain this method signature?
I smell a JAR version conflict or something similar.
Thanks
Muthu
On Thu, Dec 20, 2018, 02:40 Mich
Perhaps use of a generic StructType may work in your situation of being language-agnostic? Case classes are backed by implicits to provide type conversions into the columnar format.
My 2 cents.
Thanks,
Muthu
On Mon, Jan 7, 2019 at 4:13 AM yeikel valdes wrote:
>
>
> Forwarded Message ===
input data size + shuffle operation.
Please advise,
Muthu
> I am running a spark job with 20 cores but i did not understand why my
> application get 1-2 cores on couple of machines why not it just run on two
> nodes like node1=16 cores and node 2=4 cores . but cores are allocated like
> node1=2 node =1-node 14=1 like that.
I believe that's the intended
If you require higher precision, you may have to write a custom UDAF. In my case, I ended up storing the data as a key-value ordered list of histograms.
Thanks
Muthu
On Mon, Nov 11, 2019, 20:46 Patrick McCarthy
wrote:
> Depending on your tolerance for error you could also
I suspect the Spark job somehow has an incorrect (newer) version of json4s in the classpath. json4s 3.5.3 is the highest version that can be used.
Thanks,
Muthu
On Mon, Feb 17, 2020, 06:43 Mich Talebzadeh
wrote:
> Hi,
>
> Spark version 2.4.3
> Hbase 1.2.7
>
> Data is
older version, make sure all of them are older
than 3.2.11 at the least.
Hope it helps.
Thanks,
Muthu
On Mon, Feb 17, 2020 at 1:15 PM Mich Talebzadeh
wrote:
> Thanks Muthu,
>
>
> I am using the following jar files for now in local mode i.e.
> spark-shell_local
> --j
I am new to Spark. I am trying to do the following:
Netcat --> Flume --> Spark Streaming (process Flume data) --> HDFS.
My Flume config file has the following setup:
Source = netcat
Sink = avro sink
Spark Streaming code:
I am able to print data from Flume to the monitor. But I am struggling to create a file
g
Cc: u...@spark.incubator.apache.org
Subject: Re: writing Flume data to HDFS
What is the error you are getting when you say "I was trying to write the data to hdfs.. but it fails…"?
TD
On Thu, Jul 10, 2014 at 1:36 PM, Sundaram, Muthu X. <muthu.x.sundaram@sabre.com>
My intention is not to write data directly from Flume to HDFS. I have to collect messages from a queue using Flume and send them to Spark Streaming for additional processing. I will try what you have suggested.
Thanks,
Muthu
From: Tathagata Das [mailto:tathagata.das1...@gmail.com]
Sent: Monday
op? How do I
create this JavaRDD?
In the loop I am able to get every record and I am able to print them.
I appreciate any help here.
Thanks,
Muthu