From Stackoverflow:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

sc = SparkContext(conf=SparkConf())
spark = SparkSession(sc)  # need a SparkSession (not just a SparkContext) to call createDataFrame

schema = StructType([
    StructField("column1", StringType(), True),
    StructField("column2", StringType(), True),
])

# Build an empty DataFrame with the desired schema from an empty RDD.
empty = spark.createDataFrame(sc.emptyRDD(), schema)

# addOndata is the DataFrame of rows to append (named in the original post).
# Note: unionAll is deprecated since Spark 2.0; union is the current name.
empty = empty.unionAll(addOndata)
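Since the original question asks for appending rows dynamically in a for loop, here is a minimal sketch of that pattern, reusing spark, schema and empty from above; the loop bound and row values are made up for illustration. Each union adds a step to the query plan, so for many iterations it is usually cheaper to collect the rows in a local list first and convert once (see option (b) in Shmuel's reply below).

# Illustrative only: append one row per iteration via union.
for i in range(3):
    row_df = spark.createDataFrame([("value%d" % i, "other%d" % i)], schema)
    empty = empty.union(row_df)

empty.show()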
Best,
Ravion

On Sun, Jul 8, 2018 at 10:44 AM Shmuel Blitz <shmuel.bl...@similarweb.com> wrote:

> Hi Dimitris,
>
> Could you explain your use case in a bit more detail?
>
> What you are asking for, if I understand you correctly, is not the
> advised way to go about it.
>
> If you're running analytics and expect their output to be a DataFrame
> with the specified columns, then you should compose your queries in
> such a way that they result in a DataFrame.
>
> If you're preparing data to be analyzed (i.e. getting the input ready
> for manipulation), then I expect you to be doing one of the following:
> a. Read in the data using one of Spark's provided input APIs (e.g.
> reading a Parquet file directly into a DataFrame)
> b. Read/prepare your data as a standard collection in your language
> (Python, in your case, but the same in Scala/Java/etc.), and then use
> Spark's API to parallelize the data and/or convert it into a DataFrame
> (a sketch of this option appears at the end of this thread).
>
> One way or another, you want to use the Spark API for work that should
> be distributed to workers (heavy load, large amounts of data), and use
> your native language API, which is usually much more powerful, to run
> bootstrapping and lightweight preparation.
>
> Regards,
> Shmuel
>
> On Sat, Jun 30, 2018 at 6:51 PM Apostolos N. Papadopoulos <
> papad...@csd.auth.gr> wrote:
>
>> Hi Dimitri,
>>
>> you can do the following:
>>
>> 1. create an initial DataFrame from an empty CSV
>>
>> 2. use "union" to insert new rows
>>
>> Do not forget that Spark cannot replace a DBMS. Spark is mainly used
>> for analytics.
>>
>> If you need select/insert/delete/update capabilities, perhaps you
>> should look at a DBMS.
>>
>> Another alternative, in case you need "append only" semantics, is to
>> use streaming or structured streaming.
>>
>> regards,
>> Apostolos
>>
>> On 30/06/2018 05:46 PM, dimitris plakas wrote:
>> > I am new to PySpark and want to initialize a new empty DataFrame
>> > with sqlContext with two columns ("Column1", "Column2"), and I want
>> > to append rows dynamically in a for loop.
>> > Is there any way to achieve this?
>> >
>> > Thank you in advance.
>>
>> --
>> Apostolos N. Papadopoulos, Associate Professor
>> Department of Informatics
>> Aristotle University of Thessaloniki
>> Thessaloniki, GREECE
>> tel: +0030312310991918
>> email: papad...@csd.auth.gr
>> twitter: @papadopoulos_ap
>> web: http://delab.csd.auth.gr/~apostol
>
> --
> Shmuel Blitz
> Big Data Developer
> Email: shmuel.bl...@similarweb.com
> www.similarweb.com
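To make option (b) from Shmuel's reply concrete, here is a minimal sketch, assuming the spark, sc and schema defined at the top of the thread; the rows themselves are illustrative.

# Prepare the data as a plain Python collection, then hand it to Spark.
data = [("a1", "b1"), ("a2", "b2")]

# Create the DataFrame directly from the local collection...
df = spark.createDataFrame(data, schema)

# ...or parallelize it into an RDD first and then convert.
rdd = sc.parallelize(data)
df_from_rdd = spark.createDataFrame(rdd, schema)

df.show()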