You can open a file as an RDD of lines and map whatever custom
tokenisation function you want over it. Alternatively, you can partition
down to a reasonable size and use mapPartitions to map the standard Python
csv parser over the partitions.
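
A rough sketch of the second option in PySpark, untested against your data.
The names parse_partition and load_rows, and the path and num_partitions
arguments, are made up for illustration; sc is assumed to be an existing
SparkContext. One caveat: a quoted field that straddles a partition boundary
will not be parsed correctly, so this only helps if records stay within one
partition.

```python
import csv


def parse_partition(lines):
    # textFile() hands over lines without their terminators, so "\n" is
    # re-appended here. csv.reader then keeps a newline that falls inside a
    # quoted field as part of that field instead of ending the record.
    for row in csv.reader(line + "\n" for line in lines):
        yield row


def load_rows(sc, path, num_partitions):
    # sc: an existing SparkContext (assumed); path and num_partitions are
    # placeholders. minPartitions only hints at how finely the input is split.
    lines = sc.textFile(path, minPartitions=num_partitions)
    return lines.mapPartitions(parse_partition)
```

parse_partition is plain Python, so you can try it locally on a list of
strings before involving Spark at all.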

In general, the advantage of Spark is that you can run arbitrary code over
your data rather than being limited to a specific set of primitives.

On Fri, Jan 15, 2016 at 4:42 PM, Mich Talebzadeh <m...@peridale.co.uk>
wrote:

> Hi Marcin,
>
>
>
> Can you be specific in what way Spark is better suited for this operation
> compared to Hive?
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Technology Ltd, its subsidiaries nor their
> employees accept any responsibility.
>
>
>
> *From:* Marcin Tustin [mailto:mtus...@handybook.com]
> *Sent:* 15 January 2016 21:39
> *To:* user@hive.apache.org
> *Subject:* Re: Loading data containing newlines
>
>
>
> I second this. I've generally found anything else to be disappointing when
> working with data which is at all funky.
>
>
>
> On Wed, Jan 13, 2016 at 8:13 PM, Alexander Pivovarov <apivova...@gmail.com>
> wrote:
>
> Time to use Spark and Spark-Sql in addition to Hive?
>
> It's probably going to happen sooner or later anyway.
>
>
>
> I sent you a Spark solution yesterday. You just need to write an
> unbzip2AndCsvToListOfArrays(file: String): List[Array[String]] function
> using BZip2CompressorInputStream and the Super CSV API.
>
> You can download Spark, open spark-shell and run/debug the program on a
> single computer, and then run it on a cluster if needed (e.g. Amazon EMR
> can spin up a Spark cluster in 7 min).
>
>
>
> On Wed, Jan 13, 2016 at 4:13 PM, Gerber, Bryan W <bryan.ger...@pnnl.gov>
> wrote:
>
> 1. hdfs dfs -copyFromLocal /incoming/files/*.bz2 hdfs://host.name/data/stg/table/
>
> 2. CREATE EXTERNAL TABLE stg_<table> (cols…) ROW FORMAT SERDE
> 'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS TEXTFILE LOCATION
> '/data/stg/table/'
>
> 3. CREATE TABLE <table> (cols…) STORED AS ORC TBLPROPERTIES
> ("orc.compress"="ZLIB");
>
> 4. INSERT INTO TABLE <table> SELECT cols, udf1(cola),
> udf2(colb), functions(), etc. FROM stg_<table>
>
> 5. Delete files from hdfs://host.name/data/stg/table/
>
>
>
> This had been working quite well until our newest data arrived with fields
> containing embedded newlines.
>
>
>
> We are now looking into options further up the pipeline to see if we can
> condition the data earlier in the process.
>
>
>
> *From:* Mich Talebzadeh [mailto:m...@peridale.co.uk]
> *Sent:* Wednesday, January 13, 2016 10:34 AM
> *To:* user@hive.apache.org
> *Subject:* RE: Loading data containing newlines
>
>
>
> Thanks Bryan.
>
>
>
> Just to clarify do you use something like below?
>
>
>
> 1.  hdfs dfs -copyFromLocal /var/tmp/t.bcp hdfs://rhes564.hedat.net:9000/misc/t.bcp
>
> 2.  CREATE EXTERNAL TABLE <TABLE> name (col1 INT, col2 string, …) COMMENT
> 'load from bcp file' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED
> AS ORC
>
>
>
> Cheers,
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> *From:* Gerber, Bryan W [mailto:bryan.ger...@pnnl.gov]
> *Sent:* 13 January 2016 18:12
> *To:* user@hive.apache.org
> *Subject:* RE: Loading data containing newlines
>
>
>
> We are pushing the compressed text files into an HDFS directory backing a
> Hive EXTERNAL table, then running an INSERT into a table that uses ORC
> storage. We are letting Hive handle the ORC file creation process.
>
>
>
> *From:* Mich Talebzadeh [mailto:m...@peridale.co.uk]
> *Sent:* Tuesday, January 12, 2016 4:41 PM
> *To:* user@hive.apache.org
> *Subject:* RE: Loading data containing newlines
>
>
>
> Hi Bryan,
>
>
>
> As a matter of interest, are you loading text files into local directories
> in encrypted format at all and then pushing them into HDFS/Hive as ORC?
>
>
>
> Thanks
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> *From:* Gerber, Bryan W [mailto:bryan.ger...@pnnl.gov]
> *Sent:* 12 January 2016 17:41
> *To:* user@hive.apache.org
> *Subject:* Loading data containing newlines
>
>
>
> We are attempting to load CSV text files (compressed to bz2) containing
> newlines in fields using EXTERNAL tables and INSERT/SELECT into ORC format
> tables. Data volume is ~1TB/day, so we are really trying to avoid unpacking
> the files to condition the data.
>
>
>
> A few days of research has us ready to implement custom input/output
> formats to handle the ingest. Any other suggestions that may be less
> effort with low impact on load times?
>
>
>
> Thanks,
>
> Bryan G.
>
>
>
>
>
>
>
> Want to work at Handy? Check out our culture deck and open roles
> <http://www.handy.com/careers>
>
> Latest news <http://www.handy.com/press> at Handy
>
> Handy just raised $50m
> <http://venturebeat.com/2015/11/02/on-demand-home-service-handy-raises-50m-in-round-led-by-fidelity/>
>  led
> by Fidelity
>
>
>
>

