Re: Spark connect: Table caching for global use?

2025-02-17 Thread Ángel
Well, that's not strictly correct. I had two different memory leaks on the driver side because of caching. Both of them in stream processes; one of them in Scala (I forgot to unpersist the cached dataframe) and the other one in PySpark (unpersisting cached dataframes wasn't enough

Re: Spark connect: Table caching for global use?

2025-02-17 Thread Subhasis Mukherjee
> I understood that caching a table pegged the RDD partitions into the memory of the executors holding the partition. Your understanding is correct. Nothing to worry about on the driver side while creating the temp view. On Sun, Feb 16, 2025, 10:47 PM Mich Talebzadeh wrote: > Ok let us look a

Re: Spark connect: Table caching for global use?

2025-02-16 Thread Mich Talebzadeh
Ok let us look at this - Temporary Views, Metadata is stored on the driver; *data remains distributed across executors.* - Caching/Persisting, *Data is stored in the executors' memory or disk. * - The statement *"created on driver memory"* refers to
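
A minimal sketch of the distinction described above, assuming an existing SparkSession named spark; the path and view name are illustrative:

    val df = spark.read.parquet("/tmp/events")        // data stays distributed on the executors

    // Temporary view: only the name-to-plan mapping lives in driver-side catalog metadata.
    df.createOrReplaceTempView("events")

    // Caching: partitions are materialized in executor memory/disk on the first action.
    spark.table("events").cache()
    spark.sql("select count(*) from events").show()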

Re: Spark connect: Table caching for global use?

2025-02-16 Thread Tim Robertson
Thanks Mich > created on driver memory That I hadn't anticipated. Are you sure? I understood that caching a table pegged the RDD partitions into the memory of the executors holding the partition. On Sun, Feb 16, 2025 at 11:17 AM Mich Talebzadeh wrote: > yep. created on driver me

Re: Spark connect: Table caching for global use?

2025-02-16 Thread Mich Talebzadeh
yep. created on driver memory. watch for OOM if the size becomes too large spark-submit --driver-memory 8G ... HTH Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile

Re: Spark connect: Table caching for global use?

2025-02-16 Thread Tim Robertson
Answering my own question. Global temp views get created in the global_temp database, so can be accessed thusly. Thanks Dataset s = spark.read().parquet("/tmp/svampeatlas/*"); s.createOrReplaceGlobalTempView("occurrence_svampe"); spark.catalog().cacheTable("global_temp.occurrence_svampe"); On S
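
A Scala sketch equivalent to the Java snippet above, using the same path and view name from the example:

    val s = spark.read.parquet("/tmp/svampeatlas/*")
    s.createOrReplaceGlobalTempView("occurrence_svampe")
    // Global temp views live under the reserved global_temp database.
    spark.catalog.cacheTable("global_temp.occurrence_svampe")
    spark.sql("select count(*) from global_temp.occurrence_svampe").show()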

Spark connect: Table caching for global use?

2025-02-16 Thread Tim Robertson
Hi folks Is it possible to cache a table for shared use across sessions with spark connect? I'd like to load a read only table once that many sessions will then query to improve performance. This is an example of the kind of thing that I have been trying, but have not succeeded with. SparkSess

[Spark/deeplyR] how come spark is caching tables read through jdbc connection from oracle, even when memory=false is chosen

2023-01-31 Thread Joris Billen
t this memory =false is ignored when reading through jdbc? 2) can someone confirm that there is a lot of automatic caching happening (and hence a lot of counts and a lot of actions)? Thanks for input!

Memory leak while caching in foreachBatch block

2022-08-10 Thread kineret M
Hi, We have a structured streaming application, and we face a memory leak while caching in the foreachBatch block. We do unpersist every iteration, and we also verify via "spark.sparkContext.getPersistentRDDs" that we don't have unnecessary cached data. We also noted in the pro
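
A minimal sketch of the cache/unpersist-per-batch pattern described above; the rate source, output path and checkpoint location are placeholders, not the original application's:

    import org.apache.spark.sql.DataFrame

    val query = spark.readStream
      .format("rate").load()
      .writeStream
      .option("checkpointLocation", "/tmp/chk")
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        val cached = batch.cache()                   // the micro-batch is reused below
        cached.count()
        cached.write.mode("append").parquet("/tmp/out")
        cached.unpersist()                           // release executor memory every iteration
        // spark.sparkContext.getPersistentRDDs can be inspected here for leftover cached data
        ()
      }
      .start()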

Re: Caching

2020-12-07 Thread Lalwani, Jayesh
because caching data frames introduces memory overheads, and it’s not going to prematurely do it. It will combine processing of various dataframes within a stage. However, in your case, you are doing aggregation which will create a new stage You can check the execution plan if you like From

Re: Caching

2020-12-07 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
Are you using the same csv twice? Sent from iPhone > On Dec 7, 2020, at 18:32, Amit Sharma wrote: > >  > Hi All, I am using caching in my code. I have a DF like > val DF1 = read csv. > val DF2 = DF1.groupBy().agg().select(.) > > Val DF3 = read csv .join(DF1)

Re: Caching

2020-12-07 Thread Amit Sharma
Jayesh, but during logical plan spark would be knowing to use the same DF twice so it will optimize the query. Thanks Amit On Mon, Dec 7, 2020 at 1:16 PM Lalwani, Jayesh wrote: > Since DF2 is dependent on DF1, and DF3 is dependent on both DF1 and DF2, > without caching, Spark will re

Re: Caching

2020-12-07 Thread Amit Sharma
Sean, you mean if a df is used more than once in a transformation then use cache. But frankly that is also not true, because in many places even if a df is used once, it gives the same result with caching and without caching. How do we decide whether we should use cache or not? Thanks Amit On Mon, Dec 7, 2020

Re: Caching

2020-12-07 Thread Lalwani, Jayesh
Since DF2 is dependent on DF1, and DF3 is dependent on both DF1 and DF2, without caching, Spark will read the CSV twice: Once to load it for DF1, and once to load it for DF2. When you add a cache on DF1 or DF2, it reads from CSV only once. You might want to look at doing a windowed query on
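
A sketch of the scenario discussed in this thread, with illustrative file paths and join key; caching DF1 keeps Spark from re-reading the CSV for each branch that depends on it:

    import org.apache.spark.sql.functions.sum

    val df1 = spark.read.option("header", "true").csv("/tmp/input.csv").cache()
    val df2 = df1.groupBy("key").agg(sum("value").as("total"))
    val df3 = spark.read.option("header", "true").csv("/tmp/other.csv")
      .join(df1, "key")
      .join(df2, "key")
    df3.write.mode("overwrite").parquet("/tmp/df3")   // a single action, but df1 feeds two branches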

Re: Caching

2020-12-07 Thread Sean Owen
the end action is one on DF3. In > that case action of DF1 should be just 1 or it depends how many times this > dataframe is used in transformation. > > I believe even if we use a dataframe multiple times for transformation , > use caching should be based on actions. In my case

Re: Caching

2020-12-07 Thread Amit Sharma
dataframe multiple times for transformation , use caching should be based on actions. In my case action is one save call on DF3. Please correct me if i am wrong. Thanks Amit On Mon, Dec 7, 2020 at 11:54 AM Theodoros Gkountouvas < theo.gkountou...@futurewei.com> wrote: > Hi Amit, > >

RE: Caching

2020-12-07 Thread Theodoros Gkountouvas
@spark.apache.org Subject: Caching Hi All, I am using caching in my code. I have a DF like val DF1 = read csv. val DF2 = DF1.groupBy().agg().select(.) Val DF3 = read csv .join(DF1).join(DF2) DF3 .save. If I do not cache DF2 or Df1 it is taking longer time . But i am doing 1 action only why

Caching

2020-12-07 Thread Amit Sharma
Hi All, I am using caching in my code. I have a DF like val DF1 = read csv. val DF2 = DF1.groupBy().agg().select(.) Val DF3 = read csv .join(DF1).join(DF2) DF3 .save. If I do not cache DF2 or Df1 it is taking longer time . But i am doing 1 action only why do I need to cache. Thanks

Spark Union Breaks Caching Behaviour

2020-04-07 Thread Yi Huang
Dear Community, I am a beginner of using Spark. I am confused by the comment of the following method. def union(other: Dataset[T]): Dataset[T] = withSetOperator { // This breaks caching, but it's usually ok because it addresses a very specific use case: // using union to union many fil

Re: Caching tables in spark

2019-08-28 Thread Tzahi File
t; Take a look at this article >> >> >> >> >> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-caching.html >> >> >> >> *From:* Tzahi File >> *Sent:* Wednesday, August 28, 2019 5:18 AM >> *To:* user >> *Subject:* Caching t

Re: Caching tables in spark

2019-08-28 Thread Subash Prabakar
; > > *From:* Tzahi File > *Sent:* Wednesday, August 28, 2019 5:18 AM > *To:* user > *Subject:* Caching tables in spark > > > > Hi, > > > > Looking for your knowledge with some question. > > I have 2 different processes that read from the same raw data t

RE: Caching tables in spark

2019-08-28 Thread email
Take a look at this article https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-caching.html From: Tzahi File Sent: Wednesday, August 28, 2019 5:18 AM To: user Subject: Caching tables in spark Hi, Looking for your knowledge with some question. I have 2

Caching tables in spark

2019-08-28 Thread Tzahi File
Hi, Looking for your knowledge with some question. I have 2 different processes that read from the same raw data table (around 1.5 TB). Is there a way to read this data once and cache it somehow and to use this data in both processes? Thanks -- Tzahi File Data Engineer [image: ironSource]

Re: Questions about caching

2019-01-01 Thread Gourav Sengupta
g to exploit some of Spark's caching behavior > to speed up the interactive computation portion of the analysis, > probably by writing a thin convenience wrapper. I have a couple > questions I've been unable to find definitive answers to, which would > help me design this w

Re: Questions about caching

2018-12-24 Thread Bin Fan
Hi Andrew, Since you mentioned the alternative solution with Alluxio <http://alluxio.org>, here is a more comprehensive tutorial on caching Spark dataframes on Alluxio: https://www.alluxio.com/blog/effective-spark-dataframes-with-alluxio Namely, caching your dataframe is simply r

Re: Questions about caching

2018-12-18 Thread Reza Safi
cs objects). We have a basic version > working, but I'm looking to exploit some of Spark's caching behavior > to speed up the interactive computation portion of the analysis, > probably by writing a thin convenience wrapper. I have a couple > questions I've been unable to f

Questions about caching

2018-12-11 Thread Andrew Melo
basic version working, but I'm looking to exploit some of Spark's caching behavior to speed up the interactive computation portion of the analysis, probably by writing a thin convenience wrapper. I have a couple questions I've been unable to find definitive answers to, which would help me

Re: Caching small Rdd's take really long time and Spark seems frozen

2018-08-24 Thread Sonal Goyal
ust did it's to execute with local[X] and this problem > doesn't happen. Communication problems? > > 2018-08-23 22:43 GMT+02:00 Guillermo Ortiz : > >> it's a complex DAG before the point I cache the RDD, they are some joins, >> filter and maps before caching d

Re: Caching small Rdd's take really long time and Spark seems frozen

2018-08-24 Thread Guillermo Ortiz
Another test I just did is to execute with local[X], and this problem doesn't happen. Communication problems? 2018-08-23 22:43 GMT+02:00 Guillermo Ortiz : > it's a complex DAG before the point I cache the RDD, they are some joins, > filter and maps before caching data, but

Re: Caching small Rdd's take really long time and Spark seems frozen

2018-08-23 Thread Guillermo Ortiz
It's a complex DAG before the point where I cache the RDD; there are some joins, filters and maps before caching the data, but most of the time it takes almost no time to do it. I could understand it if it took the same time every time to process or cache the data. Besides, it seems rando

Re: Caching small Rdd's take really long time and Spark seems frozen

2018-08-23 Thread Sonal Goyal
How are these small RDDs created? Could the blockage be in their compute creation instead of their caching? Thanks, Sonal Nube Technologies <http://www.nubetech.co> <http://in.linkedin.com/in/sonalgoyal> On Thu, Aug 23, 2018 at 6:38 PM, Guillermo Ortiz wrote: > I use spark wi

Caching small Rdd's take really long time and Spark seems frozen

2018-08-23 Thread Guillermo Ortiz
I use spark caching with the persist method. I have several RDDs that I cache, but some of them are pretty small (about 300 kbytes). Most of the time it works well and the whole job usually lasts 1s, but sometimes it takes about 40s to store 300 kbytes to cache. If I go to the SparkUI->Cache, I can

[Spark MLib]: RDD caching behavior of KMeans

2018-07-10 Thread mkhan37
Hi All, I was varying the storage levels of RDD caching in the KMeans program implemented using the MLib library and got some very confusing and interesting results. The base code of the application is from a Benchmark suite named SparkBench <https://github.com/CODAIT/spark-bench> . I c

Caching when you perfom one action and have a dataframe used more than once.

2018-06-28 Thread mxmn
Hi, Let's say I have the following code (it's an example) df_a = spark.read.json() df_b = df_a.sample(False, 0.5, 10) df_c = df_a.sample(False, 0.5, 10) df_d = df_b.union(df_c) df_d.count() Do we have to cache df_a as it is used by df_b and df_c, or spark will notice that df_a is used twice in t
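
A Scala sketch of the example above (the JSON path is illustrative); caching df_a avoids recomputing it once per sample branch, since Spark does not do this automatically:

    val dfA = spark.read.json("/tmp/data.json")
    dfA.cache()                                  // without this, each sample re-scans the source
    val dfB = dfA.sample(withReplacement = false, fraction = 0.5, seed = 10)
    val dfC = dfA.sample(withReplacement = false, fraction = 0.5, seed = 10)
    val dfD = dfB.union(dfC)
    println(dfD.count())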

Dataset Caching and Unpersisting

2018-05-02 Thread Daniele Foroni
Hi all, I am having troubles with caching and unpersisting a dataset. I have a cycle that at each iteration filters my dataset. I realized that caching every x steps (e.g., 50 steps) gives good performance. However, after a certain number of caching operations, it seems that the memory used for

Caching dataframes and overwrite

2017-11-21 Thread Michael Artz
I have been interested in finding out why I am getting strange behavior when running a certain spark job. The job will error out if I place an action (A .show(1) method) either right after caching the DataFrame or right before writing the dataframe back to hdfs. There is a very similar post to

Re: Multiple transformations without recalculating or caching

2017-11-19 Thread Phillip Henry
>> On Fri, 17 Nov 2017, 10:03 Fernando Pereira, >> wrote: >> >>> Dear Spark users >>> >>> Is it possible to take the output of a transformation (RDD/Dataframe) >>> and feed it to two independent transformations without recalculating the >>

Re: Multiple transformations without recalculating or caching

2017-11-17 Thread Fernando Pereira
hat and then read it again and get > your stats? > > On Fri, 17 Nov 2017, 10:03 Fernando Pereira, wrote: > >> Dear Spark users >> >> Is it possible to take the output of a transformation (RDD/Dataframe) and >> feed it to two independent transformations without rec

Re: Multiple transformations without recalculating or caching

2017-11-17 Thread Sebastian Piu
transformation (RDD/Dataframe) and > feed it to two independent transformations without recalculating the first > transformation and without caching the whole dataset? > > Consider the case of a very large dataset (1+TB) which suffered several > transformations and now we want

Multiple transformations without recalculating or caching

2017-11-17 Thread Fernando Pereira
Dear Spark users Is it possible to take the output of a transformation (RDD/Dataframe) and feed it to two independent transformations without recalculating the first transformation and without caching the whole dataset? Consider the case of a very large dataset (1+TB) which suffered several

Re: Unexpected caching behavior

2017-10-26 Thread pnpritchard
I'm not sure why the example code didn't come through, so I'll try again: val x = spark.range(100) val y = x.map(_.toString) println(x.storageLevel) //StorageLevel(1 replicas) println(y.storageLevel) //StorageLevel(1 replicas) x.cache().foreachPartition(_ => ()) y.cache().foreachPartition(_ => (

Re: Unexpected caching behavior

2017-10-26 Thread pnpritchard
Not sure why the example code didn't come through, but here I'll try again: val x = spark.range(100) val y = x.map(_.toString) println(x.storageLevel) //StorageLevel(1 replicas) println(y.storageLevel) //StorageLevel(1 replicas) x.cache().foreachPartition(_ => ()) y.cache().foreachPartition(_ =>

Unexpected caching behavior

2017-10-26 Thread pnpritchard
I've noticed that when unpersisting an "upstream" Dataset, then the "downstream" Dataset is also unpersisted. I did not expect this behavior, and I've noticed that RDDs do not have this behavior. Below I've pasted a simple reproducible case. There are two datasets, x and y, where y is created by a
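
A runnable sketch that completes the truncated snippet above, using count() to materialize each cache:

    import spark.implicits._

    val x = spark.range(100)
    val y = x.map(_.toString)
    x.cache().count()         // materialize x's cache
    y.cache().count()         // materialize y's cache
    println(x.storageLevel)   // memory-cached
    println(y.storageLevel)
    x.unpersist()             // the reported behaviour: y's cached data is dropped as well
    println(y.storageLevel)   // StorageLevel(1 replicas) again, i.e. no longer cached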

Re: Questions regarding Jobs, Stages and Caching

2017-05-25 Thread Ram Navan
nd count() - both actions took 8 minutes each >> in my scenario. I'd expect only one of the action should incur the expenses >> of reading 19949 files from s3. Am I missing anything? >> >> Thank you! >> >> Ram >> >> >> On Thu, May 25,

Re: Questions regarding Jobs, Stages and Caching

2017-05-25 Thread Nicholas Hakobian
) - both actions took 8 minutes each > in my scenario. I'd expect only one of the action should incur the expenses > of reading 19949 files from s3. Am I missing anything? > > Thank you! > > Ram > > > On Thu, May 25, 2017 at 1:34 AM, Steffen Schmitz < > steffens

Re: Questions regarding Jobs, Stages and Caching

2017-05-25 Thread Steffen Schmitz
only one of the action should incur the expenses of reading 19949 files from s3. Am I missing anything? Thank you! Ram On Thu, May 25, 2017 at 1:34 AM, Steffen Schmitz mailto:steffenschm...@hotmail.de>> wrote: Hi Ram, Regarding your caching question: The data frame is evaluated lazy. That

Re: Questions regarding Jobs, Stages and Caching

2017-05-25 Thread Ram Navan
en Schmitz wrote: > Hi Ram, > > Regarding your caching question: > The data frame is evaluated lazy. That means it isn’t cached directly on > invoking of .cache(), but on calling the first action on it (in your case > count). > Then it is loaded into memory and the rows are count

Re: Questions regarding Jobs, Stages and Caching

2017-05-25 Thread Steffen Schmitz
Hi Ram, Regarding your caching question: The data frame is evaluated lazy. That means it isn’t cached directly on invoking of .cache(), but on calling the first action on it (in your case count). Then it is loaded into memory and the rows are counted, not on the call of .cache(). On the
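
A minimal sketch of the lazy behaviour described above, assuming a SparkSession named spark:

    val df = spark.range(1000000L).toDF("id")
    df.cache()    // only marks the plan for caching; nothing is materialized yet
    df.count()    // first action: partitions are computed and stored in executor memory
    df.count()    // second action: served from the cache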

Questions regarding Jobs, Stages and Caching

2017-05-24 Thread ramnavan
it to create 1 job with 19949 tasks. I’d like to understand why there are three jobs instead of just one and why reading json files calls for map operation. Caching and Count(): Once spark reads 19949 json files into a dataframe (let’s call it files_df), I am calling th

Re: odd caching behavior or accounting

2017-02-09 Thread Hbf
I'm seeing the same behavior in Spark 2.0.1. Does anybody have an explanation? Thanks! Kaspar bmiller1 wrote > Hi All, > > I've recently noticed some caching behavior which I did not understand > and may or may not have indicated a bug. In short, the web UI seemed &

Issue with caching

2017-01-27 Thread Anil Langote
Hi All, I am trying to cache a large dataset with storage level memory and serialization with Kryo enabled. When I run my spark job multiple times I get different performance; at times, while caching the dataset, spark hangs and takes forever. What is wrong? The best time I got is 20 mins and some times

Re: Dataframe caching

2017-01-20 Thread रविशंकर नायर
Thanks, Will look into this. Best regards, Ravion -- Forwarded message -- From: "Muthu Jayakumar" Date: Jan 20, 2017 10:56 AM Subject: Re: Dataframe caching To: "☼ R Nair (रविशंकर नायर)" Cc: "user@spark.apache.org" I guess, this may help in your c

Re: Dataframe caching

2017-01-20 Thread Muthu Jayakumar
I guess, this may help in your case? https://spark.apache.org/docs/latest/sql-programming-guide.html#global-temporary-view Thanks, Muthu On Fri, Jan 20, 2017 at 6:27 AM, ☼ R Nair (रविशंकर नायर) < ravishankar.n...@gmail.com> wrote: > Dear all, > > Here is a requirement I am thinking of implement

Dataframe caching

2017-01-20 Thread रविशंकर नायर
Dear all, Here is a requirement I am thinking of implementing in Spark core. Please let me know if this is possible, and kindly provide your thoughts. A user executes a query to fetch 1 million records from , let's say a database. We let the user store this as a dataframe, partitioned across the

RE: AVRO File size when caching in-memory

2016-11-16 Thread Shreya Agarwal
.@microsoft.com>; user@spark.apache.org<mailto:user@spark.apache.org> Subject: Re: AVRO File size when caching in-memory It's something like the schema shown below (with several additional levels/sublevels) root |-- sentAt: long (nullable = true) |-- sharing: string (nullable

Re: AVRO File size when caching in-memory

2016-11-16 Thread Prithish
reted by spark? > A compression logic of the spark caching depends on column types. > > // maropu > > > On Wed, Nov 16, 2016 at 5:26 PM, Prithish wrote: > >> Thanks for your response. >> >> I did some more tests and I am seeing that when I have a flatter >

Re: AVRO File size when caching in-memory

2016-11-16 Thread Takeshi Yamamuro
Hi, What's the schema interpreted by spark? A compression logic of the spark caching depends on column types. // maropu On Wed, Nov 16, 2016 at 5:26 PM, Prithish wrote: > Thanks for your response. > > I did some more tests and I am seeing that when I have a flatter structure

Re: AVRO File size when caching in-memory

2016-11-16 Thread Prithish
nfo to share. > > > > Regards, > > Shreya > > > > Sent from my Windows 10 phone > > > > *From: *Prithish > *Sent: *Tuesday, November 15, 2016 11:04 PM > *To: *Shreya Agarwal > *Subject: *Re: AVRO File size when caching in-memory > > > I d

RE: AVRO File size when caching in-memory

2016-11-15 Thread Shreya Agarwal
gards, Shreya Sent from my Windows 10 phone From: Prithish<mailto:prith...@gmail.com> Sent: Tuesday, November 15, 2016 11:04 PM To: Shreya Agarwal<mailto:shrey...@microsoft.com> Subject: Re: AVRO File size when caching in-memory I did another test and noting my observations here. These w

Re: AVRO File size when caching in-memory

2016-11-15 Thread Prithish
Anyone? On Tue, Nov 15, 2016 at 10:45 AM, Prithish wrote: > I am using 2.0.1 and databricks avro library 3.0.1. I am running this on > the latest AWS EMR release. > > On Mon, Nov 14, 2016 at 3:06 PM, Jörn Franke wrote: > >> spark version? Are you using tungsten? >> >> > On 14 Nov 2016, at 10:05

Re: AVRO File size when caching in-memory

2016-11-14 Thread Prithish
I am using 2.0.1 and databricks avro library 3.0.1. I am running this on the latest AWS EMR release. On Mon, Nov 14, 2016 at 3:06 PM, Jörn Franke wrote: > spark version? Are you using tungsten? > > > On 14 Nov 2016, at 10:05, Prithish wrote: > > > > Can someone please explain why this happens?

Re: AVRO File size when caching in-memory

2016-11-14 Thread Jörn Franke
spark version? Are you using tungsten? > On 14 Nov 2016, at 10:05, Prithish wrote: > > Can someone please explain why this happens? > > When I read a 600kb AVRO file and cache this in memory (using cacheTable), it > shows up as 11mb (storage tab in Spark UI). I have tried this with different

AVRO File size when caching in-memory

2016-11-14 Thread Prithish
Can someone please explain why this happens? When I read a 600kb AVRO file and cache this in memory (using cacheTable), it shows up as 11mb (storage tab in Spark UI). I have tried this with different file sizes, and the size in-memory is always proportionate. I thought Spark compresses when using
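
A sketch of the flow being measured above; the Avro source name matches the databricks library mentioned elsewhere in the thread, and the path and config value are illustrative:

    val df = spark.read.format("com.databricks.spark.avro").load("/tmp/data.avro")
    df.createOrReplaceTempView("avro_tbl")
    spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "true")  // columnar compression of the cache
    spark.catalog.cacheTable("avro_tbl")
    spark.table("avro_tbl").count()   // materializes the cache; in-memory size appears on the Storage tab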

Re: Caching broadcasted DataFrames?

2016-08-25 Thread Takeshi Yamamuro
Hi, you need to cache df1 to prevent re-computation (including disk reads) because spark re-broadcasts data every sql execution. // maropu On Fri, Aug 26, 2016 at 2:07 AM, Jestin Ma wrote: > I have a DataFrame d1 that I would like to join with two separate > DataFrames. > Since d1 is small eno
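
A sketch of the advice above (paths and join key are illustrative): cache the small frame once so each broadcast join reads it from the cache instead of recomputing it from source:

    import org.apache.spark.sql.functions.broadcast

    val d1 = spark.read.parquet("/tmp/small").cache()
    val big1 = spark.read.parquet("/tmp/big1")
    val big2 = spark.read.parquet("/tmp/big2")
    val joined1 = big1.join(broadcast(d1), "key")
    val joined2 = big2.join(broadcast(d1), "key")
    joined1.count()
    joined2.count()   // d1 is re-broadcast here, but served from the cache rather than from disk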

Caching broadcasted DataFrames?

2016-08-25 Thread Jestin Ma
I have a DataFrame d1 that I would like to join with two separate DataFrames. Since d1 is small enough, I broadcast it. What I understand about cache vs broadcast is that cache leads to each executor storing the partitions it's assigned in memory (cluster-wide in-memory). Broadcast leads to each no

Re: Question About OFF_HEAP Caching

2016-07-18 Thread Bin Fan
.alluxio.org/docs/master/en/Running-Spark-on-Alluxio.html > > Hope that helps, > Gene > > On Mon, Jul 18, 2016 at 12:11 AM, condor join > wrote: > >> Hi All, >> >> I have some questions about OFF_HEAP Caching. In Spark 1.X when we use >> *rdd.persist

Re: Question About OFF_HEAP Caching

2016-07-18 Thread Gene Pang
how to run Spark with Alluxio: http://www.alluxio.org/docs/master/en/Running-Spark-on-Alluxio.html Hope that helps, Gene On Mon, Jul 18, 2016 at 12:11 AM, condor join wrote: > Hi All, > > I have some questions about OFF_HEAP Caching. In Spark 1.X when we use > *rdd.persist(StorageLe

Question About OFF_HEAP Caching

2016-07-18 Thread condor join
Hi All, I have some questions about OFF_HEAP Caching. In Spark 1.X when we use rdd.persist(StorageLevel.OFF_HEAP),that means rdd caching in Tachyon(Alluxio). However,in Spark 2.X,we can directly use OFF_HEAP For Caching (https://issues.apache.org/jira/browse/SPARK-13992?jql=project%20%3D
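
A sketch of the Spark 2.x style of off-heap caching referred to above; the off-heap settings are illustrative and must be enabled before the job starts:

    // spark-submit --conf spark.memory.offHeap.enabled=true --conf spark.memory.offHeap.size=2g ...
    import org.apache.spark.storage.StorageLevel

    val rdd = spark.sparkContext.parallelize(1 to 1000000)
    rdd.persist(StorageLevel.OFF_HEAP)
    rdd.count()   // materializes the RDD into off-heap memory managed by Spark itself, not Tachyon/Alluxio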

Re: Spark 2.0 Preview After caching query didn't work and can't kill job.

2016-06-21 Thread Gene Pang
Hi, It looks like this is not related to Alluxio. Have you tried running the same job with different storage? Maybe you could increase the Spark JVM heap size to see if that helps your issue? Hope that helps, Gene On Wed, Jun 15, 2016 at 8:52 PM, Chanh Le wrote: > Hi everyone, > I added more

Re: Spark 2.0 Preview After caching query didn't work and can't kill job.

2016-06-15 Thread Chanh Le
Hi everyone, I added more logs for my use case: When I cached all my data (500 mil records) and counted, I received this. 16/06/16 10:09:25 ERROR TaskSetManager: Total size of serialized results of 27 tasks (1876.7 MB) is bigger than spark.driver.maxResultSize (1024.0 MB) >>> that's weird because I just

Re: Spark 2.0 Preview After caching query didn't work and can't kill job.

2016-06-15 Thread Chanh Le
Hi Gene, I am using Alluxio 1.1.0 with the Spark 2.0 Preview version. I load from alluxio, then cache and query a 2nd time, and Spark gets stuck. > On Jun 15, 2016, at 8:42 PM, Gene Pang wrote: > > Hi, > > Which version of Alluxio are you using? > > Thanks, > Gene > > On Tue, Jun 14, 2016 at 3:45 AM,

Re: Spark 2.0 Preview After caching query didn't work and can't kill job.

2016-06-15 Thread Gene Pang
Hi, Which version of Alluxio are you using? Thanks, Gene On Tue, Jun 14, 2016 at 3:45 AM, Chanh Le wrote: > I am testing Spark 2.0 > I load data from alluxio and cached then I query but the first query is ok > because it kick off cache action. But after that I run the query again and > it’s st

Spark 2.0 Preview After caching query didn't work and can't kill job.

2016-06-14 Thread Chanh Le
I am testing Spark 2.0. I load data from alluxio and cache it, then I query. The first query is ok because it kicks off the cache action, but after that I run the query again and it’s stuck. I ran in a 5-node cluster in spark-shell. Has anyone had this issue?

Caching table partition after join

2016-06-05 Thread Zalzberg, Idan (Agoda)
Hi, I have a complicated scenario where I can't seem to explain to spark how to handle the query in the best way. I am using spark from the thrift server so only SQL. To explain the scenario, let's assume: Table A: Key : String Value : String Table B: Key: String Value2: String Part : String [

Spark Streaming - Is window() caching DStreams?

2016-05-27 Thread Marco1982
le.com/Spark-Streaming-Is-window-caching-DStreams-tp27041.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: RDDs caching in typical machine learning use cases

2016-04-04 Thread Eugene Morozov
ll! Jean Morozov On Sun, Apr 3, 2016 at 11:34 AM, Sergey wrote: > Hi Spark ML experts! > > Do you use RDDs caching somewhere together with ML lib to speed up > calculation? > I mean typical machine learning use cases. > Train-test split, train, evaluate, apply model. > > Sergey. >

RDDs caching in typical machine learning use cases

2016-04-03 Thread Sergey
Hi Spark ML experts! Do you use RDDs caching somewhere together with ML lib to speed up calculation? I mean typical machine learning use cases. Train-test split, train, evaluate, apply model. Sergey.

Re: Spark SQL - java.lang.StackOverflowError after caching table

2016-03-24 Thread Mohamed Nadjib MAMI
t;).show() /** fails */ On 24.03.2016 13:40, Ted Yu wrote: Can you pastebin the stack trace ? If you can show snippet of your code, that would help give us more clue. Thanks On Mar 24, 2016, at 2:43 AM, Mohamed Nadjib MAMI <mailto:m...@iai.uni-bonn.de> wro

Re: Spark SQL - java.lang.StackOverflowError after caching table

2016-03-24 Thread Ted Yu
e stack trace ? > > If you can show snippet of your code, that would help give us more clue. > > Thanks > > > On Mar 24, 2016, at 2:43 AM, Mohamed Nadjib MAMI > wrote: > > Hi all, > I'm running SQL queries (sqlContext.sql()) on Parquet tables and facing a

Re: Spark SQL - java.lang.StackOverflowError after caching table

2016-03-24 Thread Mohamed Nadjib MAMI
? If you can show snippet of your code, that would help give us more clue. Thanks On Mar 24, 2016, at 2:43 AM, Mohamed Nadjib MAMI wrote: Hi all, I'm running SQL queries (sqlContext.sql()) on Parquet tables and facing a problem with table caching (sqlContext.cacheTable()), using spark-she

Re: Spark SQL - java.lang.StackOverflowError after caching table

2016-03-24 Thread Ted Yu
Can you pastebin the stack trace ? If you can show snippet of your code, that would help give us more clue. Thanks > On Mar 24, 2016, at 2:43 AM, Mohamed Nadjib MAMI wrote: > > Hi all, > I'm running SQL queries (sqlContext.sql()) on Parquet tables and facing a > proble

Spark SQL - java.lang.StackOverflowError after caching table

2016-03-24 Thread Mohamed Nadjib MAMI
Hi all, I'm running SQL queries (sqlContext.sql()) on Parquet tables and facing a problem with table caching (sqlContext.cacheTable()), using spark-shell of Spark 1.5.1. After I run the sqlContext.cacheTable(table), the sqlContext.sql(query) takes longer the first time (well, for the

streaming application redundant dag stage execution/performance/caching

2016-02-16 Thread krishna ramachandran
We have a streaming application containing approximately 12 stages every batch, running in streaming mode (4 sec batches). Each stage persists output to cassandra the pipeline stages stage 1 ---> receive Stream A --> map --> filter -> (union with another stream B) --> map --> groupbykey --> trans

Re: caching ratings with ALS implicit

2016-02-15 Thread Sean Owen
It will need its intermediate RDDs to be cached, and it will do that internally. See the setIntermediateRDDStorageLevel method. Skim the API docs too. On Mon, Feb 15, 2016 at 9:21 PM, Roberto Pagliari wrote: > Something not clear from the documentation is whether the ratings RDD needs > to be cac

caching ratings with ALS implicit

2016-02-15 Thread Roberto Pagliari
Something not clear from the documentation is whether the ratings RDD needs to be cached before calling ALS trainImplicit. Would there be any performance gain?

Re: ALS rating caching

2016-02-09 Thread Roberto Pagliari
Hi Nick, From which version does that apply? I'm using 1.5.2 Thank you, From: Nick Pentreath mailto:nick.pentre...@gmail.com>> Date: Tuesday, 9 February 2016 07:02 To: "user@spark.apache.org<mailto:user@spark.apache.org>" mailto:user@spark.apache.org>> Sub

Re: ALS rating caching

2016-02-08 Thread Nick Pentreath
In the "new" ALS intermediate RDDs (including the ratings input RDD after transforming to block-partitioned ratings) is cached using intermediateRDDStorageLevel, and you can select the final RDD storage level (for user and item factors) using finalRDDStorageLevel. The old MLLIB API now calls the n

ALS rating caching

2016-02-08 Thread Roberto Pagliari
When using ALS from mllib, would it be better/recommended to cache the ratings RDD? I'm asking because when predicting products for users (for example) it is recommended to cache product/user matrices. Thank you,

Question on RDD caching

2016-02-04 Thread Vishnu Viswanath
Hello, When we call cache() or persist(MEMORY_ONLY), how does the request flow to the nodes? I am assuming this will happen: 1. Driver knows which all nodes hold the partition for the given rdd (where is this info stored?) 2. It sends a cache request to the node's executor 3. The executor will s

Re: Spark Caching Kafka Metadata

2016-02-01 Thread Benjamin Han
9 AM, Cody Koeninger wrote: > The kafka direct stream doesn't do any explicit caching. I haven't looked > through the underlying simple consumer code in the kafka project in detail, > but I doubt it does either. > > Honestly, I'd recommend not using auto created topics

Re: Spark Caching Kafka Metadata

2016-01-29 Thread Cody Koeninger
The kafka direct stream doesn't do any explicit caching. I haven't looked through the underlying simple consumer code in the kafka project in detail, but I doubt it does either. Honestly, I'd recommend not using auto created topics (it makes it too easy to pollute your topics

Spark Caching Kafka Metadata

2016-01-28 Thread asdf zxcv
able = true, but this still doesn't work consistently. This is a bit frustrating to debug as well since the topic is successfully created about 50% of the time, other times I get message "Does the topic exist?". My thinking is that Spark may be caching the list of extant kafka topics, ig

Caching in Spark

2016-01-22 Thread Sourabh Chandak
. Ideally what I would expect is that the 2nd job skips all the previous transformations prior to the cached RDD and start running from there, instead what is happening is that the entire stage in which caching was done is getting re-executed till the caching transformation and then the job continues

RE: Question in rdd caching in memory using persist

2016-01-07 Thread seemanto.barua
msure...@gmail.com' Cc: 'user@spark.apache.org' Subject: Re: Question in rdd caching in memory using persist I have a standalone cluster. spark version is 1.3.1 From: Prem Sure [mailto:premsure...@gmail.com] Sent: Thursday, January 07, 2016 12:32 PM To: Barua, Seemanto (US) Cc: spa

Re: Question in rdd caching in memory using persist

2016-01-07 Thread Prem Sure
r listed as one > of the ‘executors’ participating in holding the partitions of the rdd in > memory, the memory usage shown against the driver is 0. This I see in the > storage tab of the spark ui. > > Why is the driver shown on the ui ? Will it ever hold rdd partitions when &g

Re: How 'select name,age from TBL_STUDENT where age = 37' is optimized when caching it

2015-11-16 Thread Xiao Li
Your dataframe is cached. Thus, your plan is stored as an InMemoryRelation. You can read the logics in CacheManager.scala. Good luck, Xiao Li 2015-11-16 6:35 GMT-08:00 Todd : > Hi, > > When I cache the dataframe and run the query, > > val df = sqlContext.sql("select name,age from TBL_STUD

How 'select name,age from TBL_STUDENT where age = 37' is optimized when caching it

2015-11-16 Thread Todd
Hi, When I cache the dataframe and run the query, val df = sqlContext.sql("select name,age from TBL_STUDENT where age = 37") df.cache() df.show println(df.queryExecution) I got the following execution plan, from the optimized logical plan, I can see the whole analyzed logical

Re: Caching causes later actions to get stuck

2015-11-01 Thread Sampo Niskanen
documents from a collection > to an RDD and caching that (though according to the error message it > doesn't fully fit). The first action on it always succeeds, but latter > actions fail. I just upgraded from Spark 0.9.x to 1.5.1, and didn't have > that problem w

Caching causes later actions to get stuck

2015-10-30 Thread Sampo Niskanen
lection to an RDD and caching that (though according to the error message it doesn't fully fit). The first action on it always succeeds, but latter actions fail. I just upgraded from Spark 0.9.x to 1.5.1, and didn't have that problem with the older version. The output I get: scala

Distributed caching of a file in SPark Streaming

2015-10-21 Thread swetha
? SparkContext.addFile() SparkFiles.get(fileName) Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Distributed-caching-of-a-file-in-SPark-Streaming-tp25157.html Sent from the Apache Spark User List mailing list archive at Nabble.com
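
A sketch of the SparkContext.addFile / SparkFiles.get pattern asked about above (the file path is illustrative):

    import org.apache.spark.SparkFiles
    import scala.io.Source

    spark.sparkContext.addFile("hdfs:///config/lookup.txt")   // shipped to every executor once
    val filtered = spark.sparkContext.parallelize(1 to 10).mapPartitions { iter =>
      // Inside a task, resolve the locally cached copy by file name only.
      val lookup = Source.fromFile(SparkFiles.get("lookup.txt")).getLines().toSet
      iter.filter(i => lookup.contains(i.toString))
    }
    println(filtered.count())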
