---------- Forwarded message ----------
From: JOHN MILLER <jmill...@gmail.com>
Date: Fri, Dec 4, 2015 at 10:24 AM
Subject: DATA TRANSFORMATION PROBLEM
To: i...@data-artisans.com
Greetings,

I am writing to ask for an approach to a data transformation problem. I want to build a new dataset in a way that lets processing continue instead of crashing. The dataset I want to convert is a series of WARC files (currently read in as text; examples are attached):

CC-MAIN-TEXT-20130516092621-00003-ip-10-60-113-... <https://drive.google.com/file/d/0B5QdPKF22EFxMDlTX3BzSW9uTDg/view?usp=drive_web>

I am trying to parse out the field names and values and format a new dataset, which would then be converted to CSV or TSV.

The fields in question:

Header: {WARC-Type=warcinfo, WARC-Filename=CC-MAIN-TEXT-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz, reader-identifier=/home/jmill383/wdcdemobucket/CC-MAIN-TEXT-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz, WARC-Date=2013-11-22T14:51:12Z, absolute-offset=0, Content-Length=372, WARC-Record-ID=<urn:uuid:efdf19de-e663-4747-8a98-754bd224520f>, Content-Type=application/warc-fields}
URL: null

I am using Apache Flink, Scalding, and Scala, and I haven't been able to get very far yet. Please advise if you can assist with an approach to this problem.

John M
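
One possible starting point for the parsing step described above, as a minimal sketch in plain Scala with no Flink or Scalding dependency. It assumes every record of interest contains the text `Header:` followed by a `{name=value, name=value, ...}` block like the sample in this message, and that values never contain the `", "` separator. The object name `WarcHeaderToTsv`, the helpers `parseHeader` and `toTsv`, and the column order are all hypothetical and taken only from that one sample. Lines that do not match return `None` and are skipped, so a bad record should not stop processing.

```scala
object WarcHeaderToTsv {
  // Assumption: column names and order come from the single sample record in the message.
  val columns: Seq[String] = Seq(
    "WARC-Type", "WARC-Filename", "reader-identifier", "WARC-Date",
    "absolute-offset", "Content-Length", "WARC-Record-ID", "Content-Type")

  /** Parses one "Header: {k=v, k=v, ...}" line; returns None for any other line. */
  def parseHeader(line: String): Option[Map[String, String]] = {
    val start = line.indexOf('{')
    val end = line.lastIndexOf('}')
    if (!line.contains("Header:") || start < 0 || end <= start) None
    else {
      val pairs = line.substring(start + 1, end).split(", ").toSeq.flatMap { field =>
        // Split on the first '=' only, so values may themselves contain '='.
        val eq = field.indexOf('=')
        if (eq > 0) Some(field.substring(0, eq).trim -> field.substring(eq + 1).trim)
        else None // malformed field: drop it rather than throw
      }
      Some(pairs.toMap)
    }
  }

  /** Emits one tab-separated row in the fixed column order; missing fields become empty strings. */
  def toTsv(fields: Map[String, String]): String =
    columns.map(c => fields.getOrElse(c, "")).mkString("\t")

  def main(args: Array[String]): Unit = {
    // Shortened version of the sample record, plus one line that is not a header.
    val lines = Seq(
      "Header: {WARC-Type=warcinfo, absolute-offset=0, Content-Length=372, Content-Type=application/warc-fields}URL: null",
      "some unrelated text line")
    println(columns.mkString("\t"))
    // flatMap drops the lines for which parseHeader returned None.
    lines.flatMap(parseHeader).map(toTsv).foreach(println)
  }
}
```

Because `parseHeader` is a pure function from a line to an `Option`, the same logic could presumably be applied per line inside a Flink or Scalding job; how the lines are read in there is left open, since the message does not show that part.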