I have a Parquet file with millions of records and hundreds of fields that I
will be extracting from a cluster with more resources. I need to take that
data, derive a set of tables from only some of the fields, and import them
using a smaller cluster.

The smaller cluster cannot load the entire Parquet file into memory, but it
can load the derived tables.

If I am reading a Parquet file and I only select a few fields, how much
computing power do I need compared to reading all the columns? Is it
different? Do I need more or less computing power depending on the number of
columns I select, or does it depend more on the raw source itself and the
total number of columns it contains?
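For context, this is roughly what I mean by selecting only a few fields (a
minimal PySpark sketch; the path and column names are just placeholders,
not my real schema):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("column-pruning-example").getOrCreate()

    # Read the wide Parquet extract but keep only a handful of fields.
    # Parquet is columnar, so only the selected columns should be scanned.
    df = (
        spark.read.parquet("/data/big_extract.parquet")   # placeholder path
        .select("customer_id", "event_date", "amount")    # placeholder columns
    )

    df.show(5)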

One suggestion I received from a colleague was to derive the tables using the
larger cluster and just import them into the smaller cluster, but I was
wondering whether that's really necessary, considering that after the import
I won't be using the dumps anymore.
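For reference, the colleague's suggestion would look roughly like the sketch
below (again with placeholder paths, columns, and aggregation; my actual
derivation is different):

    # On the larger cluster: derive the narrow table and write it out.
    derived = (
        spark.read.parquet("/data/big_extract.parquet")
        .select("customer_id", "event_date", "amount")
        .groupBy("customer_id")
        .agg({"amount": "sum"})
    )
    derived.write.mode("overwrite").parquet("/data/derived/customer_totals")

    # On the smaller cluster: only the small derived table is read.
    small = spark.read.parquet("/data/derived/customer_totals")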

I hope my question makes sense. 

Thanks for your help!






