On your second point: that's going to be a bottleneck for all the programs which fetch the data from that folder and then have to add extra filters to the DF. I want to finish that off there itself.
And that merge logic is weak when one table is huge and the other is very small (which is the case here): it literally gulps memory and time. And business won't allow Hive or anything else to be used AT ALL, since we may shift to EMR, where Hive may have compatibility issues (need to check).

On 20-Jul-2016 1:27 AM, "Jörn Franke" <jornfra...@gmail.com> wrote:

Well, as far as I know there is an update statement planned for Spark, but I am not sure for which release. You could alternatively use Hive + ORC. Another alternative would be to add the deltas in a separate file and, when accessing the table, filter out the duplicate entries. From time to time you could have a merge process that creates one file out of all the deltas.

On 19 Jul 2016, at 21:27, Aakash Basu <raj2coo...@gmail.com> wrote:

Hi all,

I'm trying to pull a full table from Oracle, which is huge, with some 10 million records; this will be the initial load to HDFS. Then I will do delta loads every day into the same folder in HDFS.

Now, my query here is:

DAY 0 - I did the initial load (full dump).
DAY 1 - I'll load only that day's data, which has, suppose, 10 records (5 old ones with some column values altered and 5 new ones).

Here, my question is: how will I push this file to HDFS through Spark code? If I do append, it will create duplicates (which I don't want). If I keep separate files and, while using them in another program, give the path of the folder which contains all the files, then in this case also the registerTempTable will have duplicates for those 5 old rows.

What is the BEST logic to be applied here? I tried to resolve this by doing a search in that file for matching records and loading the new ones after deleting the old, but this will be time-consuming for such a huge number of records, right?

Please help!

Thanks,
Aakash.
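For reference, a minimal Scala sketch of the approach Jörn suggests (full dump as a base, daily deltas in a separate folder, duplicates filtered out on read, and an occasional compaction pass) might look roughly like the following. The paths, the Parquet format, the primary-key column "id", and the load-timestamp column "load_ts" are all assumptions for illustration, not details from the thread.

    // Sketch only: keep the newest version of each key across base + deltas.
    // The Parquet paths, the "id" key and the "load_ts" column are assumed.
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    object DeltaDedup {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("delta-dedup"))
        val sqlContext = new SQLContext(sc)

        // Day 0 full dump lives under /data/table/base;
        // daily deltas land under /data/table/delta
        val base   = sqlContext.read.parquet("/data/table/base")
        val deltas = sqlContext.read.parquet("/data/table/delta")

        // Union everything, then keep only the latest record per key.
        val latestFirst = Window.partitionBy("id").orderBy(col("load_ts").desc)
        val deduped = base.unionAll(deltas)
          .withColumn("rn", row_number().over(latestFirst))
          .where(col("rn") === 1)
          .drop("rn")

        // Downstream programs can query this table without re-applying filters.
        deduped.registerTempTable("my_table")

        // Periodic merge/compaction: write the deduplicated result to a new
        // base folder (not the one being read), then swap folders and clear
        // the delta directory.
        deduped.write.mode("overwrite").parquet("/data/table/base_compacted")
      }
    }

Since the daily delta is tiny compared to the base, another option along the same lines is to broadcast the delta's keys and drop the superseded rows from the base with a join before appending the delta, which avoids shuffling the large table.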