Thanks again for the detailed explanation, I'd like to go through it. In my case, I'm having to parse large-scale *.as2, *.P3193, *.edi and *.txt data, map it against the respective standards, and then build a JSON (so XML doesn't come into the picture). A small example of EDI:
ISA*00* *00* *ZZ*D00XXX *ZZ*00AA *070305*1832*^*00501*676048320*0*P*~
GS*BE*D00XXX*00AA*20150305*1832*260007982*X*005010X220A1~
ST*834*0001*005010X220A1~
BGN*00*88880070301 00*20150305*181245****4~
DTP*007*D8*20150301~
N1*P5*PAYER 1*FI*999999999~
N1*IN*KCMHSAS*FI*999999999~
INS*Y*18*030*XN*A*C **FT~
REF*0F*00389999~
REF*1L*000003409999~
REF*3H*K129999A~
DTP*356*D8*20150301~
NM1*IL*1*DOE*JOHN*A***34*999999999~
N3*777 ELM ST~
N4*ALLEGAN*MI*49010**CY*03~
DMG*D8*19670330*M**O~
LUI***ESSPANISH~
HD*030**AK*064703*IND~
DTP*348*D8*20150301~
AMT*P3*45.34~
REF*17*E 1F~
SE*20*0001~
GE*1*260007982~
IEA*1*676048320~

Thanks,
Aakash.

On Tue, Mar 13, 2018 at 6:37 PM, Darin McBeath <ddmcbe...@yahoo.com> wrote:

> I'm not familiar with EDI, but perhaps one option might be spark-xml-utils
> (https://github.com/elsevierlabs-os/spark-xml-utils). You could transform
> the XML to the XML format required by the xml-to-json function and then
> return the JSON. Spark-xml-utils wraps the open source Saxon project and
> supports XPath, XQuery, and XSLT. Spark-xml-utils doesn't parallelize the
> parsing of an individual document, but if you have your documents split
> across a cluster, the processing can be parallelized. We use this package
> extensively within our company to process millions of XML records. If you
> happen to be attending Spark Summit in a few months, someone will be
> presenting on this topic
> (https://databricks.com/session/mining-the-worlds-science-large-scale-data-matching-and-integration-from-xml-corpora).
>
> Below is a snippet for xquery.
> let $retval :=
>   <map>
>     <string key="doi">{$doi}</string>
>     <string key="cid">{$cid}</string>
>     <string key="pii">{$pii}</string>
>     <string key="contentType">{$content-type}</string>
>     <string key="srctitle">{$srctitle}</string>
>     <string key="documentType">{$document-type}</string>
>     <string key="documentSubtype">{$document-subtype}</string>
>     <string key="publicationDate">{$publication-date}</string>
>     <string key="articleTitle">{$article-title}</string>
>     <string key="issn">{$issn}</string>
>     <string key="isbn">{$isbn}</string>
>     <string key="lang">{$lang}</string>
>     {$tables}
>   </map>
>
> return xml-to-json($retval)
>
> Darin.
>
> On Tuesday, March 13, 2018, 8:52:42 AM EDT, Aakash Basu
> <aakash.spark....@gmail.com> wrote:
>
> Hi Jörn,
>
> Thanks for the quick revert. I already built an EDI-to-JSON parser from
> scratch using the 811 and 820 standard mapping documents. It can run on
> any standard and for any type of EDI. But my parser is written in native
> Python and doesn't leverage Spark's parallel processing, which I want to
> do for large amounts of EDI data.
>
> Any pointers on that?
>
> Thanks,
> Aakash.
>
> On Tue, Mar 13, 2018 at 3:44 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
> Maybe there are commercial ones. You could also use some of the open
> source parsers for XML.
>
> However, XML is very inefficient and you need to do a lot of tricks to
> make it run in parallel. This also depends on the type of EDI message,
> etc. Sophisticated unit testing and performance testing is key.
>
> Nevertheless, it is also not as difficult as I made it sound just now.
>
> > On 13. Mar 2018, at 10:36, Aakash Basu <aakash.spark....@gmail.com>
> > wrote:
> >
> > Hi,
> >
> > Has anyone built a parallel and large-scale X12 EDI parser to XML or
> > JSON using Spark?
> >
> > Thanks,
> > Aakash.
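
[Editor's note] Since each EDI interchange parses independently, the per-document work Aakash describes maps naturally onto Spark: split the raw text into `~`-terminated segments, split each segment into `*`-separated elements, and emit JSON. The sketch below is illustrative only, not Aakash's actual parser; the function names and the JSON shape are assumptions, and the Spark wiring is shown in a comment because it requires a running cluster.

```python
import json

def parse_edi(edi_text, segment_terminator="~", element_separator="*"):
    """Split a raw X12 interchange into segments; each segment becomes a
    dict with its segment ID and the list of data elements that follow."""
    segments = []
    for raw in edi_text.strip().split(segment_terminator):
        raw = raw.strip()
        if not raw:
            continue  # trailing terminator leaves an empty trailing chunk
        elements = raw.split(element_separator)
        segments.append({"id": elements[0], "elements": elements[1:]})
    return segments

def edi_to_json(edi_text):
    """Serialize one whole EDI document as a JSON string."""
    return json.dumps({"segments": parse_edi(edi_text)})

# With PySpark, the same function runs per file across the cluster, e.g.:
#   sc.wholeTextFiles("hdfs:///edi/*.edi") \
#     .mapValues(edi_to_json) \
#     .saveAsTextFile("hdfs:///edi-json")

sample = "ST*834*0001*005010X220A1~BGN*00*88880070301 00*20150305*181245****4~"
print(edi_to_json(sample))
```

Note that a production parser would also honor the delimiters declared in the ISA header (element separator at position 4, segment terminator at position 106 of the fixed-width ISA segment) rather than hard-coding `*` and `~`.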