Thanks for your time Mich! I will try this one out. On Thursday, March 17, 2016, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> Then probably the easiest option would be in INSERT/SELECT from external > table to target table and make that column NULL > > Check the VAT column here that I made it NULL > > DROP TABLE IF EXISTS stg_t2; > CREATE EXTERNAL TABLE stg_t2 ( > INVOICENUMBER string > ,PAYMENTDATE string > ,NET string > ,VAT string > ,TOTAL string > ) > COMMENT 'from csv file from excel sheet xxxx' > ROW FORMAT serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde' > STORED AS TEXTFILE > LOCATION '/data/stg/table2' > TBLPROPERTIES ("skip.header.line.count"="1") > ; > --3) > DROP TABLE IF EXISTS t2; > CREATE TABLE t2 ( > INVOICENUMBER INT > ,PAYMENTDATE timestamp > ,NET DECIMAL(20,2) > ,VAT DECIMAL(20,2) > ,TOTAL DECIMAL(20,2) > ) > COMMENT 'from csv file from excel sheet xxxx' > CLUSTERED BY (INVOICENUMBER) INTO 256 BUCKETS > STORED AS ORC > TBLPROPERTIES ( "orc.compress"="ZLIB", > "transactional"="true") > ; > --4) Put data in target table. do the conversion and ignore empty rows > INSERT INTO TABLE t2 > SELECT > INVOICENUMBER > , CAST(UNIX_TIMESTAMP(paymentdate,'DD/MM/YYYY')*1000 as timestamp) > , CAST(REGEXP_REPLACE(net,'[^\\d\\.]','') AS DECIMAL(20,2)) > , NULL > , CAST(REGEXP_REPLACE(total,'[^\\d\\.]','') AS DECIMAL(20,2)) > FROM > stg_t2 > WHERE > -- INVOICENUMBER > 0 AND > CAST(REGEXP_REPLACE(total,'[^\\d\\.]','') AS DECIMAL(20,2)) > 0.0 > -- Exclude empty rows > > HTH > > Dr Mich Talebzadeh > > > > LinkedIn * > https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > > > > On 17 March 2016 at 15:32, Ajay Chander <hadoopde...@gmail.com > <javascript:_e(%7B%7D,'cvml','hadoopde...@gmail.com');>> wrote: > >> Mich, I am okay with replacing the columns data with some characters >> like asterisk. Thanks >> >> >> On Thursday, March 17, 2016, Mich Talebzadeh <mich.talebza...@gmail.com >> <javascript:_e(%7B%7D,'cvml','mich.talebza...@gmail.com');>> wrote: >> >>> Hi Ajay, >>> >>> Do you want to be able to unmask it (at any time) or just have it >>> totally scrambled (for example replace the column with random characters) >>> in Hive? >>> >>> Dr Mich Talebzadeh >>> >>> >>> >>> LinkedIn * >>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >>> >>> >>> >>> http://talebzadehmich.wordpress.com >>> >>> >>> >>> On 17 March 2016 at 15:14, Ajay Chander <hadoopde...@gmail.com> wrote: >>> >>>> Mich thbaks for looking into this. I have a 'csvfile.txt ' on hdfs. I >>>> have created an external table 'xyz' to load that data into it. One of the >>>> columns data 'ssn' needs to be masked. Is there any built in function is >>>> give that I could use? >>>> >>>> >>>> On Thursday, March 17, 2016, Mich Talebzadeh <mich.talebza...@gmail.com> >>>> wrote: >>>> >>>>> Are you loading your CSV file from an External table into Hive table.? >>>>> >>>>> Basically you want to scramble that column before putting into Hive >>>>> table? >>>>> >>>>> Dr Mich Talebzadeh >>>>> >>>>> >>>>> >>>>> LinkedIn * >>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >>>>> >>>>> >>>>> >>>>> http://talebzadehmich.wordpress.com >>>>> >>>>> >>>>> >>>>> On 17 March 2016 at 14:37, Ajay Chander <hadoopde...@gmail.com> wrote: >>>>> >>>>>> Tustin, Is there anyway I can deidentify it in hive ? >>>>>> >>>>>> >>>>>> On Thursday, March 17, 2016, Marcin Tustin <mtus...@handybook.com> >>>>>> wrote: >>>>>> >>>>>>> This is a classic transform-load problem. You'll want to anonymise >>>>>>> it once before making it available for analysis. >>>>>>> >>>>>>> On Thursday, March 17, 2016, Ajay Chander <hadoopde...@gmail.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Hi Everyone, >>>>>>>> >>>>>>>> I have a csv.file which has some sensitive data in a particular >>>>>>>> column in it. Now I have to create a table in hive and load the data >>>>>>>> into >>>>>>>> it. But when loading the data I have to make sure that the data is >>>>>>>> masked. >>>>>>>> Is there any built in function is used ch supports this or do I have to >>>>>>>> write UDF ? Any suggestions are appreciated. Thanks >>>>>>> >>>>>>> >>>>>>> Want to work at Handy? Check out our culture deck and open roles >>>>>>> <http://www.handy.com/careers> >>>>>>> Latest news <http://www.handy.com/press> at Handy >>>>>>> Handy just raised $50m >>>>>>> <http://venturebeat.com/2015/11/02/on-demand-home-service-handy-raises-50m-in-round-led-by-fidelity/> >>>>>>> led >>>>>>> by Fidelity >>>>>>> >>>>>>> >>>>> >>> >