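Use cache() or persist(). A DataFrame is evaluated lazily, so without caching every action re-runs the JDBC query against the live table. Once cached, the DataFrame is materialized when the first action is called and is then reused from memory on each subsequent use.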
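
A minimal sketch of what I mean, reusing the options from your snippet. The URL, table name and credentials are your placeholders (I wrote the URL in the usual jdbc:postgresql://host/database form), the object name and app name are made up for illustration, and I am assuming the PostgreSQL JDBC driver is on the classpath:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical wrapper object, only here to make the sketch self-contained.
object JdbcSnapshotSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("jdbc-snapshot-sketch") // illustrative name
      .getOrCreate()

    // Lazy: nothing is read from Postgres yet.
    val df = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://host/database") // placeholder
      .option("dbtable", "tablename")                   // placeholder
      .option("user", "username")                       // placeholder
      .option("password", "password")                   // placeholder
      .load()

    // Mark the DataFrame for caching; still lazy at this point.
    val snapshot = df.persist(StorageLevel.MEMORY_AND_DISK)

    // First action: runs the JDBC query and fills the cache.
    val first = snapshot.count()

    // Later actions are served from the cached data, not from Postgres.
    val second = snapshot.count()

    println(s"first=$first second=$second")

    // Release the cached data once it is no longer needed.
    snapshot.unpersist()
    spark.stop()
  }
}
```

One caveat: the cache is best effort. If a cached partition is lost (for example, an executor dies), Spark recomputes it from the table, so new rows could appear. If you need a strict snapshot, writing the DataFrame out once (e.g. to Parquet) and reading it back is the safer route.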

On 1 May 2017 at 4:51 PM, "Saulo Ricci" <infsau...@gmail.com> wrote:

> Hi,
>
> I have the following code that reads a table into an Apache Spark
> DataFrame:
>
> val df = spark.read.format("jdbc")
>   .option("url","jdbc:postgresql:host/database")
>   .option("dbtable","tablename").option("user","username")
>   .option("password", "password")
>   .load()
>
> When I first invoke df.count() I get a smaller number than the next time
> I invoke the same count method.
>
> Why does this happen?
>
> Doesn't Spark load a snapshot of my table into a DataFrame on my Spark
> cluster when I first read that table?
>
> My table on Postgres keeps being fed, and it seems my DataFrame is
> reflecting this behavior.
>
> How can I load just a static snapshot of my table into Spark's
> DataFrame as of the time the `read` method was invoked?
>
> Any help is appreciated,
>
> --
> Saulo