Seems like an interesting problem to solve! If I have understood it correctly, you have 10,114 files, each with the structure:
rowid, colA
r1, a
r2, b
r3, c
.......5 million rows

If you union them, you will have

rowid, colA, colB
r1, a, null
r2, b, null
r3, c, null
r1, null, d
r2, null, e
r3, null, f

Will a window partition by rowid and max on column values not work?

Cheers,
Sonal
https://github.com/zinggAI/zingg


On Thu, Apr 21, 2022 at 6:50 AM Sean Owen <sro...@gmail.com> wrote:

> Oh, Spark directly supports upserts (with the right data destination), and
> yeah, you could do this as 10,000+ updates to a table without any pivoting,
> etc. It'd still end up being 10K+ single joins along the way, but individual
> steps are simpler. It might actually be pretty efficient I/O-wise, as
> columnar formats would not rewrite any other data on a write like this.
>
> On Wed, Apr 20, 2022 at 8:09 PM Andrew Davidson <aedav...@ucsc.edu> wrote:
>
>> Hi Sean
>>
>> My "insert" solution is a hack that might work, given we can easily spin up
>> a single VM with a crazy amount of memory. I would prefer to see a
>> distributed solution. It is just a matter of time before someone wants to
>> create an even bigger table using cbind.
>>
>> I understand you probably already know a lot about traditional RDBMSs.
>> Much of my post is background for others.
>>
>> I used to do some classic relational database work before tools like
>> Hadoop, Spark, and NoSQL became available.
>>
>> The standard operations on a single table in a relational database are:
>>
>> Insert a "row". This is similar to Spark union. Typically primary keys in
>> RDBMS tables are indexed to enable quick lookup, so insert is probably not
>> 1-for-1 with union. The row may not simply be appended to the end of the
>> table.
>>
>> Update a "row"
>> Delete a "row"
>> Select "rows where"
>>
>> RDBMS servers enable row- and table-level locking. Data must always be in
>> a consistent state. You must commit or abort your changes for them to
>> persist and to release locks on the data. Locks are required because you
>> have a single resource and many users requesting service simultaneously.
>> This is very different from Spark.
>>
>> Storage and memory used to be really expensive, so often people tried to
>> create "1st normal form" schemas, i.e. no duplicate data, to reduce
>> hardware cost. 1st normal form designs require you to use joins to get the
>> data table you want. Joins are expensive. Often designs duplicated some
>> data to improve performance by minimizing the number of joins required.
>> Duplicate data makes maintaining consistency harder. There are other
>> advantages to normalized data design and, as we are all aware in the big
>> data world, lots of disadvantages. The DBMS ran on a single big machine.
>> Join was not implemented as distributed map/reduce.
>>
>> So my idea is to use a traditional RDBMS server: my final table will have
>> 5 million rows and 10,114 columns.
>>
>> 1. Read the column vector from each of the 10,114 data files.
>> 2. Insert the column vector as a row in the table.
>>    1. I read a file that has a single element on each line. All I need
>>       to do is replace \n with ,
>> 3. Now I have a table with 10,115 rows and 5 million columns.
>> 4. The row id (primary key) is the original file name.
>> 5. The columns are the row ids in the original column vectors.
>> 6. Now all I need to do is pivot this single table to get what I want.
>>    This is the only join or map/reduce-like operation.
>> 7. A table with 5 million rows and 10,114 columns.
>>
>> My final table is about 220 GB.
>> I know at Google I have a quota for up to 2 mega-mem machines. Each one
>> has something like 1.4 TB of memory.
>>
>> Kind regards
>>
>> Andy
>>
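
For reference, a minimal PySpark sketch of the union-then-collapse approach suggested at the top of this thread: union the per-file single-column frames by name, then take max() per rowid so each rowid collapses to one row. The file paths, the (rowid, value) schema, and the use of groupBy/max in place of an explicit window are illustrative assumptions, not code from this thread.

# Illustrative only: paths, schema, and column names below are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cbind-via-union").getOrCreate()

# In the real case there would be 10,114 files; three stand in for them here.
paths = ["/data/sample_0.csv", "/data/sample_1.csv", "/data/sample_2.csv"]

# Read each file as (rowid, value) and rename its value column so every file
# contributes a distinct column after the union.
frames = []
for i, path in enumerate(paths):
    df = (spark.read.option("header", True).csv(path)
          .select("rowid", F.col("value").alias(f"col_{i}")))
    frames.append(df)

# unionByName with allowMissingColumns=True (Spark 3.1+) fills the columns a
# file does not provide with null, giving the long/sparse shape shown above.
unioned = frames[0]
for df in frames[1:]:
    unioned = unioned.unionByName(df, allowMissingColumns=True)

# Collapse to one row per rowid; max() ignores nulls, so it keeps the single
# non-null value each file contributed. This is the aggregate equivalent of a
# window partitioned by rowid with max taken over each column.
value_cols = [c for c in unioned.columns if c != "rowid"]
wide = unioned.groupBy("rowid").agg(*[F.max(c).alias(c) for c in value_cols])

wide.show()

With 10,114 input files the agg() expression list has to be built programmatically as above, and the groupBy is the single wide shuffle the whole approach hinges on.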