I just did a test: even for a single node (local deployment), Spark can
handle data whose size is much larger than the total memory.
My test VM (2 GB RAM, 2 cores):
$ free -m
              total   used   free   shared  buff/cache   available
Mem:           1992   1845
With autoscaling you can have any number of executors.
Thanks
On Fri, Apr 8, 2022, 08:27 Wes Peng wrote:
> I once had a 100+ GB file computed on 3 nodes, each node with only 24 GB of
> memory, and the job was done well. So from my experience the Spark cluster
> seems to work correctly
My bad, yes of course! Still, I don't like the .. select("count(myCol)") ..
part in my line. Is there any replacement for that?
On Fri, Apr 8, 2022 at 06:13, Sean Owen wrote:
> Just do an average then? Most of my point is that filtering to one group
> and then grouping is pointless.
>
> On
What if I do avg instead of count?
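For the avg case the same shape works; a hedged sketch, assuming a numeric column called myNumCol, which is not named in the thread:

import static org.apache.spark.sql.functions.avg;

// Average a (hypothetical) numeric column over the filtered rows; again there
// is only one group after the filter, so no groupBy is needed.
double average = dataset
        .filter(dataset.col("myCol").equalTo("myTargetVal"))
        .agg(avg(dataset.col("myNumCol")))
        .first()
        .getDouble(0);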
On Fri, Apr 8, 2022 at 05:32, Sean Owen wrote:
> Wait, why groupBy at all? After the filter, only rows with myCol equal to
> your target are left. There is only one group. Don't group, just count after
> the filter?
>
> On Thu, Apr 7, 2022, 10:27 PM sam smith
Wait, why groupBy at all? After the filter, only rows with myCol equal to
your target are left. There is only one group. Don't group, just count after
the filter?
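In code, that suggestion amounts to something like this (a minimal sketch reusing the filter from the original snippet further down; only the count() call is new):

// After the filter every remaining row already matches myTargetVal,
// so a plain count() replaces the whole groupBy/agg/select chain.
long result = dataset
        .filter(dataset.col("myCol").equalTo("myTargetVal"))
        .count();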
On Thu, Apr 7, 2022, 10:27 PM sam smith wrote:
> I want to aggregate a column by counting the number of rows having the
> value "myTarg
I want to aggregate a column by counting the number of rows having the
value "myTargetValue" and return the result
I am doing it like the following, in Java:
> long result = dataset
>     .filter(dataset.col("myCol").equalTo("myTargetVal"))
>     .groupBy(col("myCol"))
>     .agg(count(dataset.col("myCol")))
>     .select("count(myCol)")
I once had a 100+ GB file computed on 3 nodes, each node with only 24 GB of
memory, and the job was done well. So from my experience the Spark cluster
seems to work correctly for big files larger than memory, by spilling them
to disk.
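As an illustration of that behaviour, here is a minimal Java sketch (not from the thread; the app name and input path are made up) that processes a dataset which may be far larger than memory and explicitly asks for disk-only persistence:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.storage.StorageLevel;

public class LargerThanMemory {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("larger-than-memory")   // illustrative name
                .getOrCreate();

        // Hypothetical path; the file may be far larger than the available RAM.
        Dataset<Row> df = spark.read().parquet("/data/huge.parquet");

        // Keep persisted blocks on disk only rather than in memory.
        df.persist(StorageLevel.DISK_ONLY());

        System.out.println(df.count());
        spark.stop();
    }
}

Shuffle data is spilled to local disk in the same way, which is presumably how the 100+ GB job fits on 24 GB nodes.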
Thanks
rajat kumar wrote:
Tested this with exec
how many executors do you have?
rajat kumar wrote:
Tested this with executors of 5 cores and 17 GB memory each. Data volume is
really high, around 1 TB.
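For reference, executor sizing like the 5 cores / 17 GB mentioned here is normally set through the Spark configuration; a rough Java sketch (only the cores and memory values come from the thread, the overhead setting is an illustrative assumption):

import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

// Executor sizing as described in the thread; memoryOverhead is an assumed value.
SparkConf conf = new SparkConf()
        .set("spark.executor.cores", "5")
        .set("spark.executor.memory", "17g")
        .set("spark.executor.memoryOverhead", "2g");  // assumption, not from the thread

SparkSession spark = SparkSession.builder()
        .config(conf)
        .getOrCreate();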
Hi,
I'm running Spark 2.4.4. When I execute a simple query "select * from table
group by col", I found that the SparkListenerTaskEnd event in the event log
reports all negative time durations for aggregate time total:
{"ID":6,"Name":"aggregate time total (min, med,
max)","Update":"2","Value":"-46","Inte
Tested this with executors of 5 cores and 17 GB memory each. Data volume is
really high, around 1 TB.
Thanks
Rajat
On Thu, Apr 7, 2022, 23:43 rajat kumar wrote:
> Hello Users,
>
> I got the following error. I tried increasing executor memory and memory
> overhead, but that also did not help.
>
> ExecutorLost Failu
Hello Users,
I got the following error. I tried increasing executor memory and memory
overhead, but that also did not help.
ExecutorLost Failure(executor1 exited caused by one of the following tasks)
Reason: container from a bad node:
java.lang.OutOfMemoryError: enough memory for aggregation
Can someone
(Don't cross post please)
Generally you definitely want to compile and test vs what you're running on.
There shouldn't be many binary or source incompatibilities -- these are
avoided in a major release where possible. So it may need no code change.
But I would certainly recompile just on principle!
Hi Spark community,
I have a quick question. I am planning to migrate from Spark 3.0.1 to Spark
3.2.
Do I need to recompile my application with 3.2 dependencies, or will an
application compiled with 3.0.1 work fine on 3.2?
Regards
Pralabh kumar
Since your HBase is supported by the external vendor, I would ask them to
justify their choice of storage for HBase and for any suggestions they have
vis-à-vis S3 etc.
Spark has an efficient API to HBase, including remote HBase. I have used it in
the past for reading from HBase.
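For what it's worth, one long-standing way to read HBase from Spark is through the Hadoop InputFormat API; a minimal Java sketch follows (the ZooKeeper quorum and table name are placeholders, and newer HBase-Spark connectors also offer a DataFrame-level API):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Assumes an existing JavaSparkContext jsc; quorum and table name are placeholders.
Configuration hbaseConf = HBaseConfiguration.create();
hbaseConf.set("hbase.zookeeper.quorum", "zk-host:2181");
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table");

JavaPairRDD<ImmutableBytesWritable, Result> hbaseRdd =
        jsc.newAPIHadoopRDD(hbaseConf, TableInputFormat.class,
                ImmutableBytesWritable.class, Result.class);

System.out.println("rows read: " + hbaseRdd.count());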
HTH
What might be the biggest factor affecting running time here is that
Drill's query execution is not fault tolerant while Spark's is. The
philosophy is different; Drill's says "when you're doing interactive
analytics and a node dies, killing your query as it goes, just run the
query again."
O
Hi Wes,
Thanks for the report! I like it (mostly because it's short and concise).
Thank you.
I know nothing about Drill and am curious about the similar execution times
and this sentence in the report: "Spark is the second fastest, that should
be reasonable, since both Spark and Drill have almost
Thanks for pointing this out.
So currently the data is stored in HBase on ADLS. Question (sorry, I might be
ignorant): is it clear that Parquet on S3 would be faster to read from than
HBase on ADLS?
In general, I've found it hard, after my processing is done, if I have an
application that
"4. S3: I am not using it, but people in the thread started suggesting
potential solutions involving s3. It is an azure system, so hbase is stored
on adls. In fact the nature of my application (geospatial stuff) requires
me to use geomesa libs, which only allows directly writing from spark to
hbase
Ok. Your architect has decided to emulate everything on-prem in the cloud. You
are not really taking advantage of cloud offerings or scalability. For
example, how does your Hadoop cluster cater for the increased capacity?
Likewise your Spark nodes are pigeonholed with your Hadoop nodes. Old wine
I made a simple test of query time for several SQL engines, including
MySQL, Hive, Drill and Spark. The report:
https://cloudcache.net/data/query-time-mysql-hive-drill-spark.pdf
It may have no special meaning, just for fun. :)
regards.
"But it will be faster to use S3 (or GCS) through some network and it will
be faster than writing to the local SSD. I don't understand the point
here."
MinIO is an S3 mock, so you can run MinIO locally.
On Thu, Apr 7, 2022 at 09:27, Mich Talebzadeh wrote:
> Ok so that is your assumption. The whole thing
Thanks for the active discussion and for sharing your knowledge :-)
1. The cluster is a managed Hadoop cluster on Azure, in the cloud. It has HBase,
Spark, and HDFS shared.
2. HBase is on the cluster, so not standalone. It comes from an enterprise-level
template from a commercial vendor, so assuming this
Ok, so that is your assumption. The whole thing is based on-premise on JBOD
(including the Hadoop cluster, which has Spark binaries on each node) as I
understand. But it will be faster to use S3 (or GCS) through some network and
it will be faster than writing to the local SSD. I don't und
1. Where does S3 come into this
He is processing data one day at a time. So to dump each day to fast
storage he can use Parquet files and write them to S3.
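A small sketch of that idea (the date column, bucket, and path layout are illustrative assumptions, and it assumes the s3a filesystem is already configured):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import static org.apache.spark.sql.functions.col;

// Write one day's slice of an existing Dataset<Row> df to S3 as Parquet.
// Column name, bucket, and path are made up for illustration.
df.filter(col("event_date").equalTo("2022-04-06"))
        .write()
        .mode(SaveMode.Overwrite)
        .parquet("s3a://my-bucket/daily/event_date=2022-04-06/");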
On Wed, Apr 6, 2022 at 22:27, Mich Talebzadeh wrote:
>
> Your statement below:
>
>
> I believe I have found the issue: the job writes