You can open a file as an RDD of lines and map whatever custom tokenisation function you want over it; alternatively, you can partition down to a reasonable size and use mapPartitions to run the standard Python csv parser over each partition.
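As a rough sketch of the second option (not a tested ingest job): the function below is the kind of thing you could hand to mapPartitions. The name parse_partition, the SparkContext variable sc, and the HDFS path in the comment are all made up for illustration. It only gives correct rows when every multi-line record sits entirely inside one partition; a record that straddles a partition boundary would still be mis-parsed.

```python
import csv


def parse_partition(lines):
    """Parse an iterator of raw text lines into lists of CSV fields.

    csv.reader pulls further lines from the iterator whenever a quoted
    field contains a newline, so multi-line records are reassembled.
    """
    # Line-based readers strip the terminator; csv.reader needs it back
    # in order to keep the newline inside a quoted field.
    return csv.reader(line + "\n" for line in lines)


# Hypothetical Spark usage (assumes a SparkContext named sc; path is invented):
#   rows = sc.textFile("hdfs:///data/stg/table/").mapPartitions(parse_partition)

# Local check without Spark: one record has a newline inside a quoted field.
sample = ["id,comment", '1,"first line', 'second line"', "2,plain"]
rows = list(parse_partition(iter(sample)))
print(rows)
```

The same function works on any iterator of strings, so you can debug it on a local file before going near a cluster.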
In general, the advantage of Spark is that you can do anything you like rather than being limited to a specific set of primitives.

On Fri, Jan 15, 2016 at 4:42 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
> Hi Marcin,
>
> Can you be specific in what way Spark is better suited for this operation
> compared to Hive?
>
> Dr Mich Talebzadeh
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> *Sybase ASE 15 Gold Medal Award 2008*
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
> Author of the books *"A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
> *Publications due shortly:*
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4,
> volume one out shortly
>
> http://talebzadehmich.wordpress.com
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Technology Ltd, its subsidiaries nor their
> employees accept any responsibility.
>
> *From:* Marcin Tustin [mailto:mtus...@handybook.com]
> *Sent:* 15 January 2016 21:39
> *To:* user@hive.apache.org
> *Subject:* Re: Loading data containing newlines
>
> I second this.
> I've generally found anything else to be disappointing when
> working with data which is at all funky.
>
> On Wed, Jan 13, 2016 at 8:13 PM, Alexander Pivovarov <apivova...@gmail.com>
> wrote:
>
> Time to use Spark and Spark-Sql in addition to Hive?
> It's probably going to happen sooner or later anyway.
>
> I sent you Spark solution yesterday. (you just need to write
> unbzip2AndCsvToListOfArrays(file: String): List[Array[String]] function
> using BZip2CompressorInputStream and Super CSV API)
>
> you can download spark, open spark-shell and run/debug the program on a
> single computer
>
> and then run it on cluster if needed (e.g. Amazon EMR can spin up Spark
> cluster in 7 min)
>
> On Wed, Jan 13, 2016 at 4:13 PM, Gerber, Bryan W <bryan.ger...@pnnl.gov>
> wrote:
>
> 1. hdfs dfs -copyFromLocal /incoming/files/*.bz2 hdfs://host.name/data/stg/table/
> 2. CREATE EXTERNAL TABLE stg_<table> (cols…) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS TEXTFILE LOCATION '/data/stg/table/'
> 3. CREATE TABLE <table> (cols…) STORED AS ORC tblproperties ("orc.compress"="ZLIB");
> 4. INSERT INTO TABLE <table> SELECT cols, udf1(cola), udf2(colb),functions(),etc. FROM stg_<table>
> 5. Delete files from hdfs://host.name/data/stg/table/
>
> This has been working quite well, until our newest data contains fields
> with embedded newlines.
>
> We are now looking into options further up the pipeline to see if we can
> condition the data earlier in the process.
>
> *From:* Mich Talebzadeh [mailto:m...@peridale.co.uk]
> *Sent:* Wednesday, January 13, 2016 10:34 AM
> *To:* user@hive.apache.org
> *Subject:* RE: Loading data containing newlines
>
> Thanks Bryan.
>
> Just to clarify do you use something like below?
>
> 1. hdfs dfs -copyFromLocal /var/tmp/t.bcp hdfs://rhes564.hedat.net:9000/misc/t.bcp
> 2. CREATE EXTERNAL TABLE <TABLE> name (col1 INT, col2 string, …) COMMENT 'load from bcp file' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC
>
> Cheers,
>
> Dr Mich Talebzadeh
>
> *From:* Gerber, Bryan W [mailto:bryan.ger...@pnnl.gov]
> *Sent:* 13 January 2016 18:12
> *To:* user@hive.apache.org
> *Subject:* RE: Loading data containing newlines
>
> We are pushing the compressed text files into HDFS directory for Hive
> EXTERNAL table, then using an INSERT on the table using ORC storage.
> We are letting Hive handle the ORC file creation process.
>
> *From:* Mich Talebzadeh [mailto:m...@peridale.co.uk]
> *Sent:* Tuesday, January 12, 2016 4:41 PM
> *To:* user@hive.apache.org
> *Subject:* RE: Loading data containing newlines
>
> Hi Bryan,
>
> As a matter of interest are you loading text files into local directories
> in encrypted format at all and then pushing them into HDFS/Hive as ORC?
>
> Thanks
>
> Dr Mich Talebzadeh
>
> *From:* Gerber, Bryan W [mailto:bryan.ger...@pnnl.gov]
> *Sent:* 12 January 2016 17:41
> *To:* user@hive.apache.org
> *Subject:* Loading data containing newlines
>
> We are attempting to load CSV text files (compressed to bz2) containing
> newlines in fields using EXTERNAL tables and INSERT/SELECT into ORC format
> tables. Data volume is ~1TB/day, so we are really trying to avoid unpacking
> them to condition the data.
>
> A few days of research has us ready to implement custom input/output
> formats to handle the ingest. Any other suggestions that may be less
> effort with low impact to load times?
>
> Thanks,
> Bryan G.
>
> Want to work at Handy? Check out our culture deck and open roles
> <http://www.handy.com/careers>
> Latest news <http://www.handy.com/press> at Handy
> Handy just raised $50m
> <http://venturebeat.com/2015/11/02/on-demand-home-service-handy-raises-50m-in-round-led-by-fidelity/>
> led by Fidelity