I'm working on an ETL that requires me to import a continuous stream of CSVs into a Hadoop/Hive cluster. For now, let's assume the CSVs need to end up in the same database.table. However, newer CSVs might introduce additional columns, so I want the script to alter the table and add those columns as it encounters them.
For example:

csv1.csv:

```
a,b
1,2
2,4
```

csv2.csv:

```
a,b,c
3,8,0
4,10,2
```

What is the best way to write such an ETL into Hive? Should I use `hive -f` to run scripts like the following?

upsert.hql:

```sql
CREATE TABLE IF NOT EXISTS mydbname.testtable (a INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

SET hive.cli.errors.ignore=true;
ALTER TABLE mydbname.testtable ADD COLUMNS (b STRING);
SET hive.cli.errors.ignore=false;

LOAD DATA LOCAL INPATH '/home/pathtodata/testdata.csv' INTO TABLE mydbname.testtable;
```

(A disadvantage is that when LOAD DATA encounters a string value that is invalid for an integer column, NULL is inserted silently and I don't get notified.)

Should I do it from Beeline? Should I write a Pig script? Should I write a Java program? Should I use a tool like https://github.com/enahwe/Csv2Hive?

What's the recommended approach here?
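One workaround I'm considering for the silent-NULL problem is to load each CSV into an all-STRING staging table first and validate it with a query before copying into the typed table. A rough sketch of what I mean (the staging table name and path are just placeholders):

```sql
-- Staging table: every column is STRING, so nothing is coerced to NULL on read.
-- skip.header.line.count drops the CSV header row (a,b) if the files keep it.
CREATE TABLE IF NOT EXISTS mydbname.testtable_staging (a STRING, b STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES ("skip.header.line.count"="1");

LOAD DATA LOCAL INPATH '/home/pathtodata/testdata.csv'
INTO TABLE mydbname.testtable_staging;

-- Count rows where 'a' is present but does not parse as an integer;
-- a non-zero count would be my signal to reject or fix the file.
SELECT COUNT(*) AS bad_rows
FROM mydbname.testtable_staging
WHERE a IS NOT NULL AND CAST(a AS INT) IS NULL;

-- Only if bad_rows is 0: copy the validated data into the typed table.
INSERT INTO TABLE mydbname.testtable
SELECT CAST(a AS INT), b
FROM mydbname.testtable_staging;
```

Does that kind of staging/validation step make sense here, or is there a more standard tool that handles both the validation and the schema evolution?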