I second this. I've generally found anything else to be disappointing when working with data which is at all funky.
On Wed, Jan 13, 2016 at 8:13 PM, Alexander Pivovarov <apivova...@gmail.com> wrote: > Time to use Spark and Spark-Sql in addition to Hive? > It's probably going to happen sooner or later anyway. > > I sent you Spark solution yesterday. (you just need to write > unbzip2AndCsvToListOfArrays(file: String): List[Array[String]] function > using BZip2CompressorInputStream and Super CSV API) > you can download spark, open spark-shell and run/debug the program on a > single computer > > and then run it on cluster if needed (e.g. Amazon EMR can spin up Spark > cluster in 7 min) > > On Wed, Jan 13, 2016 at 4:13 PM, Gerber, Bryan W <bryan.ger...@pnnl.gov> > wrote: > >> 1. hdfs dfs -copyFromLocal /incoming/files/*.bz2 hdfs:// >> host.name/data/stg/table/ >> >> 2. CREATE EXTERNAL TABLE stg_<table> (cols…) ROW FORMAT serde >> 'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS TEXTFILE LOCATION >> ‘/data/stg/table/’ >> >> 3. CREATE TABLE <table> (cols…) STORE AS ORC tblproperties >> ("orc.compress"="ZLIB"); >> >> 4. INSERT INTO TABLE <table> SELECT cols, udf1(cola), >> udf2(colb),functions(),etc. FROM ext_<table> >> >> 5. Delete files from hdfs://host.name/data/stg/table/ >> >> >> >> This has been working quite well, until our newest data contains fields >> with embedded newlines. >> >> >> >> We are now looking into options further up the pipeline to see if we can >> condition the data earlier in the process. >> >> >> >> *From:* Mich Talebzadeh [mailto:m...@peridale.co.uk] >> *Sent:* Wednesday, January 13, 2016 10:34 AM >> >> *To:* user@hive.apache.org >> *Subject:* RE: Loading data containing newlines >> >> >> >> Thanks Brian. >> >> >> >> Just to clarify do you use something like below? >> >> >> >> 1. hdfs dfs -copyFromLocal /var/tmp/t.bcp hdfs:// >> rhes564.hedat.net:9000/misc/t.bcp >> >> 2. CREATE EXTERNAL TABLE <TABLE> name (col1 INT, col2 string, …) >> COMMENT 'load from bcp file'ROW FORMAT DELIMITED FIELDS TERMINATED BY >> ',' STORED AS ORC >> >> >> >> Cheers, >> >> >> >> >> >> Dr Mich Talebzadeh >> >> >> >> LinkedIn * >> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >> >> >> >> *Sybase ASE 15 Gold Medal Award 2008* >> >> A Winning Strategy: Running the most Critical Financial Data on ASE 15 >> >> >> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf >> >> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE >> 15", ISBN 978-0-9563693-0-7*. >> >> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN >> 978-0-9759693-0-4* >> >> *Publications due shortly:* >> >> *Complex Event Processing in Heterogeneous Environments*, ISBN: >> 978-0-9563693-3-8 >> >> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume >> one out shortly >> >> >> >> http://talebzadehmich.wordpress.com >> >> >> >> NOTE: The information in this email is proprietary and confidential. This >> message is for the designated recipient only, if you are not the intended >> recipient, you should destroy it immediately. Any information in this >> message shall not be understood as given or endorsed by Peridale Technology >> Ltd, its subsidiaries or their employees, unless expressly so stated. It is >> the responsibility of the recipient to ensure that this email is virus >> free, therefore neither Peridale Ltd, its subsidiaries nor their employees >> accept any responsibility. >> >> >> >> *From:* Gerber, Bryan W [mailto:bryan.ger...@pnnl.gov] >> *Sent:* 13 January 2016 18:12 >> *To:* user@hive.apache.org >> *Subject:* RE: Loading data containing newlines >> >> >> >> We are pushing the compressed text files into HDFS directory for Hive >> EXTERNAL table, then using an INSERT on the table using ORC storage. We are >> letting Hive handle the ORC file creation process. >> >> >> >> *From:* Mich Talebzadeh [mailto:m...@peridale.co.uk <m...@peridale.co.uk>] >> >> *Sent:* Tuesday, January 12, 2016 4:41 PM >> *To:* user@hive.apache.org >> *Subject:* RE: Loading data containing newlines >> >> >> >> Hi Bryan, >> >> >> >> As a matter of interest are you loading text files into local directories >> in encrypted format at all and then push it into HDFS/Hive as ORC? >> >> >> >> Thanks >> >> >> >> >> >> Dr Mich Talebzadeh >> >> >> >> LinkedIn * >> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >> >> >> >> *Sybase ASE 15 Gold Medal Award 2008* >> >> A Winning Strategy: Running the most Critical Financial Data on ASE 15 >> >> >> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf >> >> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE >> 15", ISBN 978-0-9563693-0-7*. >> >> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN >> 978-0-9759693-0-4* >> >> *Publications due shortly:* >> >> *Complex Event Processing in Heterogeneous Environments*, ISBN: >> 978-0-9563693-3-8 >> >> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume >> one out shortly >> >> >> >> http://talebzadehmich.wordpress.com >> >> >> >> NOTE: The information in this email is proprietary and confidential. This >> message is for the designated recipient only, if you are not the intended >> recipient, you should destroy it immediately. Any information in this >> message shall not be understood as given or endorsed by Peridale Technology >> Ltd, its subsidiaries or their employees, unless expressly so stated. It is >> the responsibility of the recipient to ensure that this email is virus >> free, therefore neither Peridale Ltd, its subsidiaries nor their employees >> accept any responsibility. >> >> >> >> *From:* Gerber, Bryan W [mailto:bryan.ger...@pnnl.gov >> <bryan.ger...@pnnl.gov>] >> *Sent:* 12 January 2016 17:41 >> *To:* user@hive.apache.org >> *Subject:* Loading data containing newlines >> >> >> >> We are attempting to load CSV text files (compressed to bz2) containing >> newlines in fields using EXTERNAL tables and INSERT/SELECT into ORC format >> tables. Data volume is ~1TB/day, we are really trying to avoid unpacking >> them to condition the data. >> >> >> >> A few days of research has us ready to implement custom input/output >> formats to handle the ingest. Any other suggestions that may be less >> effort with low impact to load times? >> >> >> >> Thanks, >> >> Bryan G. >> > > -- Want to work at Handy? Check out our culture deck and open roles <http://www.handy.com/careers> Latest news <http://www.handy.com/press> at Handy Handy just raised $50m <http://venturebeat.com/2015/11/02/on-demand-home-service-handy-raises-50m-in-round-led-by-fidelity/> led by Fidelity