I have a Parquet file with millions of records and hundreds of fields that I will be extracting from a cluster with more resources. I need to take that data, derive a set of tables from only some of the fields, and import them using a smaller cluster.
The smaller cluster cannot load the entire Parquet file in memory, but it can load the derived tables. If I am reading a Parquet file and I only select a few fields, how much computing power do I need compared to reading all the columns? Is it different? Do I need more or less computing power depending on the number of columns I select, or does it depend more on the raw source itself and the number of columns it contains?

One suggestion I received from a colleague was to derive the tables using the larger cluster and just import them into the smaller cluster, but I was wondering if that's really necessary, considering that after the import I won't be using the dumps anymore.

I hope my question makes sense. Thanks for your help!
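For reference, this is roughly the read pattern I had in mind (a PySpark sketch; the path and column names are just placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("derive-tables").getOrCreate()

    # Read the Parquet file; since Parquet is columnar, Spark should only
    # need to scan the columns that the query actually references.
    df = spark.read.parquet("/data/big_extract.parquet")  # placeholder path

    # Keep only the handful of fields needed for the derived table
    derived = df.select("field_a", "field_b", "field_c")  # placeholder columns

    # Write the much smaller derived table for the other cluster to load
    derived.write.mode("overwrite").parquet("/data/derived_table")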