I'm working on an ETL that requires me to import a continuous stream of CSVs
into a Hadoop/Hive cluster. For now, let's assume the CSVs all need to end up
in the same database.table, but newer CSVs might introduce additional columns
(so I want the script to ALTER the table and add those columns as it
encounters them).



e.g.



csv1.csv:
a,b
1,2
2,4

csv2.csv:
a,b,c
3,8,0
4,10,2



What is the best way to write such an ETL into Hive? Should I use hive with -f
to run scripts like the following?


upsert.hql:

CREATE TABLE IF NOT EXISTS mydbname.testtable (a INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- ignore the error if column b already exists
SET hive.cli.errors.ignore=true;
ALTER TABLE mydbname.testtable ADD COLUMNS (b STRING);
SET hive.cli.errors.ignore=false;

LOAD DATA LOCAL INPATH '/home/pathtodata/testdata.csv'
  INTO TABLE mydbname.testtable;
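
If I go that route, I suppose I would parameterize the script with a hivevar so
the same file can be reused for every incoming CSV. A rough sketch of what I
mean (csv_path is just a placeholder name I made up):

-- upsert.hql, parameterized; invoke it once per file, e.g.:
--   hive --hivevar csv_path=/home/pathtodata/testdata.csv -f upsert.hql
CREATE TABLE IF NOT EXISTS mydbname.testtable (a INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

SET hive.cli.errors.ignore=true;
ALTER TABLE mydbname.testtable ADD COLUMNS (b STRING);
SET hive.cli.errors.ignore=false;

-- the path comes from the command line instead of being hard-coded
LOAD DATA LOCAL INPATH '${hivevar:csv_path}' INTO TABLE mydbname.testtable;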



(One disadvantage is that when LOAD DATA brings in a string that is not valid
for an INT column, the value ends up as NULL and I do not get notified.)
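
I suppose I could partially catch that by running a check after each load and
counting NULLs in the typed columns. A rough sketch (a NULL could of course
also come from a genuinely empty field, so this only flags loads worth a
closer look):

-- a non-zero count means some values in column a could not be parsed (or were empty)
SELECT COUNT(*) AS suspect_rows
FROM mydbname.testtable
WHERE a IS NULL;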

Should I do it from Beeline instead?

Should I write a Pig script?

Should I write a Java program?

Should I use a tool like https://github.com/enahwe/Csv2Hive?

What's the recommended approach here?
