Re: Caching

2020-12-07 Thread Lalwani, Jayesh
* Jayesh, but during the logical plan Spark would know that the same DF is used twice, so it will optimize the query. No. That would mean that Spark would need to cache DF1. Spark won’t cache dataframes unless you ask it to, even if it knows that the same dataframe is being used twice. This is be

Re: Caching

2020-12-07 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
Are you using the same csv twice? Sent from my iPhone > On Dec 7, 2020, at 18:32, Amit Sharma wrote: > >  > Hi All, I am using caching in my code. I have a DF like > val DF1 = read csv. > val DF2 = DF1.groupBy().agg().select(.) > > Val DF3 = read csv .join(DF1).join(DF2) > DF3 .save.

Re: Caching

2020-12-07 Thread Amit Sharma
Jayesh, but during the logical plan Spark would know that the same DF is used twice, so it will optimize the query. Thanks Amit On Mon, Dec 7, 2020 at 1:16 PM Lalwani, Jayesh wrote: > Since DF2 is dependent on DF1, and DF3 is dependent on both DF1 and DF2, > without caching, Spark will read the CS

Re: Caching

2020-12-07 Thread Amit Sharma
Sean, you mean that if a DF is used more than once in transformations then we should use cache. But frankly that is not always what I see either, because in many places, with caching and without caching, it gives the same result. How do we decide whether we should use cache or not? Thanks Amit On Mon, Dec 7, 2020 at

Re: Caching

2020-12-07 Thread Lalwani, Jayesh
Since DF2 is dependent on DF1, and DF3 is dependent on both DF1 and DF2, without caching, Spark will read the CSV twice: Once to load it for DF1, and once to load it for DF2. When you add a cache on DF1 or DF2, it reads from CSV only once. You might want to look at doing a windowed query on D
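
A minimal sketch of that fix, assuming a Spark 2.x SparkSession named spark; the paths and column names are illustrative, not from the thread:

  import org.apache.spark.sql.functions._

  val df1 = spark.read.option("header", "true").csv("/path/to/input.csv")   // hypothetical path
  df1.cache()                                            // cached, so the CSV behind df1 is scanned only once

  val df2 = df1.groupBy("key").agg(sum("value").as("total"))                // hypothetical columns

  val df3 = spark.read.option("header", "true").csv("/path/to/other.csv")
    .join(df1, "key")
    .join(df2, "key")

  df3.write.parquet("/path/to/output")                   // the single action that triggers everything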

Re: Caching

2020-12-07 Thread Sean Owen
No, it's not true that one action means every DF is evaluated once. This is a good counterexample. On Mon, Dec 7, 2020 at 11:47 AM Amit Sharma wrote: > Thanks for the information. I am using spark 2.3.3 There are few more > questions > > 1. Yes I am using DF1 two times but at the end action is

Re: Caching

2020-12-07 Thread Amit Sharma
Thanks for the information. I am using Spark 2.3.3. There are a few more questions. 1. Yes, I am using DF1 two times, but in the end there is one action, on DF3. In that case, should DF1 be evaluated just once, or does it depend on how many times this dataframe is used in transformations? I believe even if we use a dataf

RE: Caching

2020-12-07 Thread Theodoros Gkountouvas
Hi Amit, One action might use the same DataFrame more than once. You can look at your LogicalPlan by executing DF3.explain (arguments differ depending on the version of Spark you are using) and see how many times you need to compute DF2 or DF1. Given the information you have provided, I suspect
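
As a quick sketch of that check (DF1 and DF3 as in the thread):

  DF3.explain(true)   // true also prints the parsed/analyzed/optimized logical plans, not just the physical plan

  // After DF1.cache(), the plan should show an InMemoryTableScan / InMemoryRelation for DF1
  // instead of a second scan of the CSV source.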

Re: Caching tables in spark

2019-08-28 Thread Tzahi File
I mean two separate spark jobs On Wed, Aug 28, 2019 at 2:25 PM Subash Prabakar wrote: > When you mean by process is it two separate spark jobs? Or two stages > within same spark code? > > Thanks > Subash > > On Wed, 28 Aug 2019 at 19:06, wrote: > >> Take a look at this article >> >> >> >> >>

Re: Caching tables in spark

2019-08-28 Thread Subash Prabakar
When you say process, do you mean two separate spark jobs, or two stages within the same spark code? Thanks Subash On Wed, 28 Aug 2019 at 19:06, wrote: > Take a look at this article > > > > > https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-caching.html > > > > *From:* Tzahi File >

RE: Caching tables in spark

2019-08-28 Thread email
Take a look at this article https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-caching.html From: Tzahi File Sent: Wednesday, August 28, 2019 5:18 AM To: user Subject: Caching tables in spark Hi, I'm looking for your input on a question. I have 2 differe

Re: Caching small Rdd's take really long time and Spark seems frozen

2018-08-24 Thread Sonal Goyal
Without knowing too much about your application, it would be hard to say. Maybe it is working faster in local mode as there is no shuffling etc.? The Spark UI would be your best bet for finding out which stage is slowing things down. On Fri 24 Aug, 2018, 3:26 PM Guillermo Ortiz, wrote: > Another test I just di

Re: Caching small Rdd's take really long time and Spark seems frozen

2018-08-24 Thread Guillermo Ortiz
Another test I just did is to execute with local[X], and this problem doesn't happen. Communication problems? 2018-08-23 22:43 GMT+02:00 Guillermo Ortiz : > it's a complex DAG before the point I cache the RDD, they are some joins, > filter and maps before caching data, but most of the times it

Re: Caching small Rdd's take really long time and Spark seems frozen

2018-08-23 Thread Guillermo Ortiz
It's a complex DAG before the point where I cache the RDD; there are some joins, filters and maps before caching the data, but most of the time it takes almost no time to do it. I could understand it if it always took the same time to process or cache the data. Besides, it seems random and they a

Re: Caching small Rdd's take really long time and Spark seems frozen

2018-08-23 Thread Sonal Goyal
How are these small RDDs created? Could the blockage be in computing them rather than in caching them? Thanks, Sonal Nube Technologies On Thu, Aug 23, 2018 at 6:38 PM, Guillermo Ortiz wrote: > I use spark with caching with
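
One way to separate the two costs, as a rough sketch (smallRdd and the timing helper are illustrative, not from the thread):

  def timed[T](label: String)(body: => T): T = {
    val start = System.nanoTime()
    val result = body
    println(s"$label took ${(System.nanoTime() - start) / 1e9} s")
    result
  }

  smallRdd.cache()
  timed("first count (upstream DAG + caching)") { smallRdd.count() }   // pays the compute cost of the lineage
  timed("second count (cache only)") { smallRdd.count() }              // should be fast if caching itself is healthy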

Re: Caching broadcasted DataFrames?

2016-08-25 Thread Takeshi Yamamuro
Hi, you need to cache df1 to prevent re-computation (including disk reads) because Spark re-broadcasts the data on every SQL execution. // maropu On Fri, Aug 26, 2016 at 2:07 AM, Jestin Ma wrote: > I have a DataFrame d1 that I would like to join with two separate > DataFrames. > Since d1 is small eno
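
A small sketch of that suggestion; the two other DataFrames (d2, d3) and the join key are placeholders:

  import org.apache.spark.sql.functions.broadcast

  d1.cache()                                   // so d1 is not recomputed each time it is re-broadcast

  val joined1 = d2.join(broadcast(d1), "id")   // hypothetical join key
  val joined2 = d3.join(broadcast(d1), "id")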

Re: caching ratigs with ALS implicit

2016-02-15 Thread Sean Owen
It will need its intermediate RDDs to be cached, and it will do that internally. See the setIntermediateRDDStorageLevel method. Skim the API docs too. On Mon, Feb 15, 2016 at 9:21 PM, Roberto Pagliari wrote: > Something not clear from the documentation is whether the ratings RDD needs > to be cac
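
For reference, a sketch with the RDD-based ALS; the rank, iteration count, and the ratings RDD are placeholders:

  import org.apache.spark.mllib.recommendation.{ALS, Rating}
  import org.apache.spark.storage.StorageLevel

  val als = new ALS()
    .setImplicitPrefs(true)
    .setRank(10)
    .setIterations(10)
    .setIntermediateRDDStorageLevel(StorageLevel.MEMORY_AND_DISK)   // how the internal RDDs are persisted

  val model = als.run(ratings)   // ratings: RDD[Rating], built elsewhere; ALS handles its own caching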

Re: Caching causes later actions to get stuck

2015-11-01 Thread Sampo Niskanen
Hi, Any ideas what's going wrong or how to fix it? Do I have to downgrade to 0.9.x to be able to use Spark? Best regards, *Sampo Niskanen* *Lead developer / Wellmo* sampo.niska...@wellmo.com +358 40 820 5291 On Fri, Oct 30, 2015 at 4:57 PM, Sampo Niskanen wrote: > Hi, > > I'm

Re: caching DataFrames

2015-09-23 Thread Zhang, Jingyu
Thanks Hemant, I will generate a total report (dfA) with many columns from log data. After report (A) is done, I will generate many detail reports (dfA1-dfAi) based on subsets of the total report (dfA); those detail reports use aggregate and window functions, according to different rules. Ho

Re: caching DataFrames

2015-09-23 Thread Hemant Bhanawat
hit send button too early... However, why would you want to cache a dataFrame that is a subset of an already cached dataFrame? If dfA is cached, and dfA1 is created by applying some transformation on dfA, actions on dfA1 will use the cache of dfA. val dfA1 = dfA.filter($"_1" > 50) // this will run
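
Filled out as a small sketch, assuming a Spark 1.x sqlContext; the threshold is as in the message:

  import sqlContext.implicits._        // for the $"..." column syntax

  dfA.cache()
  dfA.count()                          // materializes the cache

  val dfA1 = dfA.filter($"_1" > 50)    // dfA1 is not cached itself...
  dfA1.count()                         // ...but this scan reads dfA's in-memory blocks (see dfA1.explain)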

Re: caching DataFrames

2015-09-23 Thread Hemant Bhanawat
Two dataframes do not share cache storage in Spark. Hence it's immaterial how the two dataFrames are related to each other. Both of them are going to consume memory based on the data that they have. So for your A1 and B1 you would need extra memory that would be equivalent to half the memory of A

Re: Caching intermediate results in Spark ML pipeline?

2015-09-18 Thread Jingchu Liu
Thanks buddy I'll try it out in my project. Best, Lewis 2015-09-16 13:29 GMT+08:00 Feynman Liang : > If you're doing hyperparameter grid search, consider using > ml.tuning.CrossValidator which does cache the dataset >

Re: Caching intermediate results in Spark ML pipeline?

2015-09-15 Thread Feynman Liang
If you're doing hyperparameter grid search, consider using ml.tuning.CrossValidator, which does cache the dataset. Otherwise, perhaps you can elaborate more on your particular use
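
A rough sketch of that route; the pipeline, parameter grid, evaluator, and training DataFrame are placeholders:

  import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
  import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

  val paramGrid = new ParamGridBuilder()
    .addGrid(hashingTF.numFeatures, Array(1 << 10, 1 << 14))   // hypothetical stage and values
    .build()

  val cv = new CrossValidator()
    .setEstimator(pipeline)                        // the pipeline being tuned
    .setEvaluator(new BinaryClassificationEvaluator())
    .setEstimatorParamMaps(paramGrid)
    .setNumFolds(3)

  val cvModel = cv.fit(training)                   // the training DataFrame is reused across folds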

Re: Caching intermediate results in Spark ML pipeline?

2015-09-15 Thread Jingchu Liu
Yeah I understand on the low-level we should do as you said. But since ML pipeline is a high-level API, it is pretty natural to expect the ability to recognize overlapping parameters between successive runs. (Actually, this happens A LOT when we have lots of hyper-params to search for.) I can also

Re: Caching intermediate results in Spark ML pipeline?

2015-09-15 Thread Feynman Liang
Nope, and that's intentional. There is no guarantee that rawData did not change between intermediate calls to searchRun, so reusing a cached data1 would be incorrect. If you want data1 to be cached between multiple runs, you have a few options: * cache it first and pass it in as an argument to sea

Re: Caching intermediate results in Spark ML pipeline?

2015-09-15 Thread Jingchu Liu
Hey Feynman, I doubt DF persistence will work in my case. Let's use the following code: == def searchRun( params = [param1, param2] ) data1 = hashing1.transform(rawData, param1) data1.cache() data2 = hashing2.transform(data1, param2)

Re: Caching intermediate results in Spark ML pipeline?

2015-09-14 Thread Feynman Liang
You can persist the transformed Dataframes, for example val data : DF = ... val hashedData = hashingTF.transform(data) hashedData.cache() // to cache DataFrame in memory Future usages of hashedData will read from the in-memory cache now. You can also persist to disk, eg: hashedData.write.parquet(FileP
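
Pieced together as a sketch; the column names and the parquet path are illustrative:

  import org.apache.spark.ml.feature.HashingTF

  val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")

  val hashedData = hashingTF.transform(data)   // data: DataFrame with a "words" column
  hashedData.cache()                           // later stages read the in-memory copy

  hashedData.write.parquet("/tmp/hashedData.parquet")   // or persist to disk for reuse across jobs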

Re: Caching intermediate results in Spark ML pipeline?

2015-09-14 Thread Jingchu Liu
Hey Feynman, Thanks for your response, but I'm afraid "model save/load" is not exactly the feature I'm looking for. What I need to cache and reuse are the intermediate outputs of transformations, not the transformers themselves. Do you know of any related dev activities or plans? Best, Lewis 2015-09-1

Re: Caching intermediate results in Spark ML pipeline?

2015-09-14 Thread Feynman Liang
Lewis, Many pipeline stages implement save/load methods, which can be used if you instantiate and call the underlying pipeline stages' `transform` methods individually (instead of using the Pipeline.setStages API). See the associated JIRAs. Pipeline p

Re: Caching in spark

2015-07-12 Thread Akhil Das
There was a discussion on that earlier; let me re-post it for you. For the following code: val *df* = sqlContext.parquetFile(path) *df* remains columnar (actually it just reads from the columnar Parquet file on disk). For the following code: val *cdf* = df.cache() *cdf* is

Re: Caching in spark

2015-07-12 Thread Ruslan Dautkhanov
Hi Akhil, I'm curious whether RDDs are stored internally in a columnar format as well, or whether it is only when an RDD is cached in a SQL context that it is converted to columnar format. What about data frames? Thanks! -- Ruslan Dautkhanov On Fri, Jul 10, 2015 at 2:07 AM, Akhil Das wrote: > > https://s

Re: Caching in spark

2015-07-10 Thread Akhil Das
https://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory Thanks Best Regards On Fri, Jul 10, 2015 at 10:05 AM, vinod kumar wrote: > Hi Guys, > > Can any one please share me how to use caching feature of spark via spark > sql queries? > > -Vinod >
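
For the SQL side of the question, a minimal sketch; the table name is illustrative and is assumed to be registered with the SQLContext:

  sqlContext.cacheTable("people")          // in-memory columnar cache for a registered table
  sqlContext.sql("SELECT count(*) FROM people").show()

  // The same thing can be done from SQL directly:
  sqlContext.sql("CACHE TABLE people")
  sqlContext.sql("UNCACHE TABLE people")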

Re: Caching parquet table (with GZIP) on Spark 1.3.1

2015-06-07 Thread Cheng Lian
Is it possible that some Parquet files of this data set have a different schema than the others? Especially the ones reported in the exception messages. One way to confirm this is to use [parquet-tools] [1] to inspect these files: $ parquet-schema Cheng [1]: https://github.com/apache/parquet

Re: Caching and Actions

2015-04-09 Thread Sameer Farooqui
Your point #1 is a bit misleading. >> (1) The mappers are not executed in parallel when processing independently the same RDD. To clarify, I'd say: In one stage of execution, when pipelining occurs, mappers are not executed in parallel when processing independently the same RDD partition. On Thu

Re: Caching and Actions

2015-04-09 Thread spark_user_2015
That was helpful! The conclusion: (1) The mappers are not executed in parallel when processing independently the same RDD. (2) The best way seems to be (if enough memory is available and an action is applied to d1 and d2 later on) val d1 = data.map((x,y,z) => (x,y)).cache val d2 = d1
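
Written out as a small sketch; the second transformation is hypothetical, since the message is truncated here:

  val d1 = data.map { case (x, y, z) => (x, y) }.cache()   // data: RDD[(Int, Int, Int)]
  val d2 = d1.filter { case (_, y) => y > 0 }              // placeholder follow-up transformation

  d1.count()   // first action: computes and caches d1
  d2.count()   // second action: reads d1 from the cache instead of recomputing the map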

Re: Caching and Actions

2015-04-09 Thread Sameer Farooqui
Hi there, You should be selective about which RDDs you cache and which you don't. A good candidate RDD for caching is one that you reuse multiple times. Commonly the reuse is for iterative machine learning algorithms that need to take multiple passes over the data. If you try to cache a really la

Re: Caching and Actions

2015-04-09 Thread Bojan Kostic
You can use toDebugString to see all the steps in the job. Best Bojan -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Caching-and-Actions-tp22418p22433.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --
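
For example (someRdd stands for any RDD in the job):

  println(someRdd.toDebugString)   // prints the lineage; persisted RDDs also show storage info such as CachedPartitions and sizes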

Re: Caching issue with msg: RDD block could not be dropped from memory as it does not exist

2014-07-25 Thread Rico
I figured out the issue. In fact, I had not realized before that when loaded into memory, the data is deserialized. As a result, what seems to be a 21Gb dataset occupies 77Gb in memory. Details about this are clearly explained in the guide on serialization and memory tuning
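
If keeping the in-memory footprint closer to the on-disk size matters, a serialized storage level is one option (a sketch; the RDD name is a placeholder):

  import org.apache.spark.storage.StorageLevel

  dataset.persist(StorageLevel.MEMORY_ONLY_SER)   // stores serialized bytes: smaller footprint, extra CPU to deserialize on access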

Re: Caching issue with msg: RDD block could not be dropped from memory as it does not exist

2014-07-22 Thread rindra
Hello Andrew, Thank you very much for your great tips. Your solution worked perfectly. In fact, I was not aware that the right option for local mode is --driver.memory 1g Cheers, Rindra On Mon, Jul 21, 2014 at 11:23 AM, Andrew Or-2 [via Apache Spark User List] < ml-node+s1001560n10336...@n3.n

Re: Caching issue with msg: RDD block could not be dropped from memory as it does not exist

2014-07-21 Thread Andrew Or
Hi Rindra, Depending on what you're doing with your groupBy, you may end up inflating your data quite a bit. Even if your machine has 16G, by default spark-shell only uses 512M, and the amount used for storing blocks is only 60% of that (spark.storage.memoryFraction), so this space becomes ~300M.

Re: Caching in graphX

2014-05-13 Thread ankurdave
Unfortunately it's very difficult to get uncaching right with GraphX due to the complicated internal dependency structure that it creates. It's necessary to know exactly what operations you're doing on the graph in order to unpersist correctly (i.e., in a way that avoids recomputation). I have a p