Seems like an interesting problem to solve!

If I have understood it correctly, you have 10,114 files, each with the
structure

rowid, colA
r1, a
r2, b
r3, c
.......5 million rows

If you union them, you will have
rowid, colA, colB
r1, a, null
r2, b, null
r3, c, null
r1, null, d
r2, null, e
r3, null, f

Will a window partition by rowid and max on column values not work?
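
Something like this rough, untested PySpark sketch (the file paths and column
names are placeholders, and I am assuming Spark 3.1+ for unionByName with
allowMissingColumns):

from functools import reduce
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder paths; each input file is (rowid, colA) as in the example above.
paths = ["file_0001.csv", "file_0002.csv"]  # ... up to 10,114 files
dfs = [
    spark.read.csv(p, header=True).withColumnRenamed("colA", f"col_{i}")
    for i, p in enumerate(paths)
]

# Union by name, letting Spark fill the columns a file lacks with null.
stacked = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), dfs)

# Collapse the nulls: one output row per rowid; max() ignores nulls.
value_cols = [c for c in stacked.columns if c != "rowid"]
wide = stacked.groupBy("rowid").agg(*[F.max(c).alias(c) for c in value_cols])

The same collapse can also be written with a window partitioned by rowid; the
groupBy with max is just the aggregate form of the same idea.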

Cheers,
Sonal
https://github.com/zinggAI/zingg



On Thu, Apr 21, 2022 at 6:50 AM Sean Owen <sro...@gmail.com> wrote:

> Oh, Spark directly supports upserts (with the right data destination), and
> yeah, you could do this as 10,000+ updates to a table without any pivoting,
> etc. It'd still end up being 10K+ single joins along the way, but the
> individual steps are simpler. It might actually be pretty efficient I/O-wise,
> as columnar formats would not rewrite any other data on a write like this.
>
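
A rough sketch of that upsert pattern, assuming a Delta Lake destination (one
example of a destination that supports MERGE with schema evolution, not
something specified in the thread; the table path and column names below are
placeholders):

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Allow each merge to add that file's new column to the target schema.
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

target = DeltaTable.forPath(spark, "/tmp/wide_table")      # placeholder path
updates = (spark.read.csv("file_0001.csv", header=True)    # (rowid, colA)
           .withColumnRenamed("colA", "col_0001"))         # unique name per file

(target.alias("t")
    .merge(updates.alias("u"), "t.rowid = u.rowid")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

Run once per input file, this is roughly the 10,000+ single updates described
above.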
> On Wed, Apr 20, 2022 at 8:09 PM Andrew Davidson <aedav...@ucsc.edu> wrote:
>
>> Hi Sean
>>
>>
>>
>> My “insert” solution is a hack that might work given we can easily spin up
>> a single VM with a crazy amount of memory. I would prefer to see a
>> distributed solution. It is just a matter of time before someone wants to
>> create an even bigger table using cbind.
>>
>>
>>
>> I understand you probably already know a lot about traditional RDBMSs.
>> Much of my post is background for others.
>>
>>
>>
>> I used to do some classic relational database work before tools like
>> Hadoop, Spark, and NoSQL became available.
>>
>>
>>
>> The standard operations on a single table in a relational database are:
>>
>>
>>
>> Insert “row”. This is similar to Spark union. Typically primary keys in
>> RDBMS tables are indexed to enable quick lookup, so insert is probably not
>> 1-for-1 with union. The row may not simply be appended to the end of the
>> table.
>>
>>
>>
>> Update a “row”
>>
>> Delete a “row”
>>
>> Select “rows where”
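
A tiny sqlite3 sketch of those four operations, purely as background (the
table and column names are made up):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id TEXT PRIMARY KEY, colA TEXT)")
con.execute("INSERT INTO t VALUES ('r1', 'a')")                    # insert a row
con.execute("UPDATE t SET colA = 'x' WHERE id = 'r1'")             # update a row
rows = con.execute("SELECT * FROM t WHERE colA = 'x'").fetchall()  # select rows where
con.execute("DELETE FROM t WHERE id = 'r1'")                       # delete a row
con.commit()                                     # commit so the changes persist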
>>
>>
>>
>> RDBMS servers enable row- and table-level locking. Data must always be in
>> a consistent state. You must commit or abort your changes for them to
>> persist and to release locks on the data. Locks are required because you
>> have a single resource and many users requesting service simultaneously.
>> This is very different from Spark.
>>
>>
>>
>> Storage and memory used to be really expensive, so people often tried to
>> create “1st normal form” schemas, i.e. no duplicate data, to reduce
>> hardware cost. 1st normal form designs require you to use joins to get the
>> data table you want. Joins are expensive. Often a design duplicated some
>> data to improve performance by minimizing the number of joins required.
>> Duplicate data makes maintaining consistency harder. There are other
>> advantages to normalized data design and, as we are all aware in the big
>> data world, lots of disadvantages. The DBMS ran on a single big machine,
>> and join was not implemented as distributed map/reduce.
>>
>>
>>
>> So my idea is to use a traditional RDBMS server: my final table will have
>> 5 million rows and 10,114 columns.
>>
>>    1. Read the column vector from each of the 10,114 data files
>>    2. Insert the column vector as a row in the table
>>       1. I read a file that has a single element on each line. All I
>>       need to do is replace \n with , (a rough sketch follows this list)
>>    3. Now I have a table with 10,115 rows and 5 million columns
>>    4. The row id (primary key) is the original file name
>>    5. The columns are the row ids in the original column vectors
>>    6. Now all I need to do is pivot this single table to get what I
>>    want. This is the only join or map/reduce-like operation
>>    7. A table with 5 million rows and 10,114 columns
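
A rough Python sketch of steps 1 and 2 (the directory and file names below
are placeholders):

import csv
import pathlib

src_dir = pathlib.Path("column_vectors")     # placeholder: the 10,114 input files
out_path = pathlib.Path("staging_rows.csv")  # one output line per input file

with out_path.open("w", newline="") as out:
    writer = csv.writer(out)
    for f in sorted(src_dir.glob("*.txt")):
        values = f.read_text().split()        # one element per line in the source file
        writer.writerow([f.name] + values)    # row id (primary key) = original file name

Each output line is the file's newlines replaced with commas, keyed by the
file name; the pivot back to 5 million rows x 10,114 columns (step 6) still
happens in the database.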
>>
>>
>>
>>
>>
>> My final table is about 220 GB. I know at Google I have quota for up to 2
>> mega-mem machines. Each one has something like 1.4 TB of memory.
>>
>>
>>
>> Kind regards
>>
>>
>>
>> Andy
>>
>>
>>
>>
