Thanks again for the detailed explanation; I'd like to go through it.

In my case, I have to parse large-scale .as2, .P3193, .edi, and .txt
data, map it against the respective standards, and then build JSON (so
XML doesn't come into the picture). A small EDI example follows:

ISA*00*          *00*          *ZZ*D00XXX         *ZZ*00AA*070305*1832*^*00501*676048320*0*P*~
GS*BE*D00XXX*00AA*20150305*1832*260007982*X*005010X220A1~
ST*834*0001*005010X220A1~
BGN*00*88880070301  00*20150305*181245****4~
DTP*007*D8*20150301~
N1*P5*PAYER 1*FI*999999999~
N1*IN*KCMHSAS*FI*999999999~
INS*Y*18*030*XN*A*C   **FT~
REF*0F*00389999~
REF*1L*000003409999~
REF*3H*K129999A~
DTP*356*D8*20150301~
NM1*IL*1*DOE*JOHN*A***34*999999999~
N3*777 ELM ST~
N4*ALLEGAN*MI*49010**CY*03~
DMG*D8*19670330*M**O~
LUI***ESSPANISH~
HD*030**AK*064703*IND~
DTP*348*D8*20150301~
AMT*P3*45.34~
REF*17*E  1F~
SE*20*0001~
GE*1*260007982~
IEA*1*676048320~
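For reference, the example above can be split into segments and elements with
plain Python. This is only an illustrative sketch: the segment terminator `~`
and element separator `*` are the ones used in the sample, and the
`parse_edi` name is hypothetical (a real X12 parser would read the delimiters
from their fixed positions in the ISA envelope rather than hard-code them).

```python
import json

def parse_edi(text, segment_terminator="~", element_separator="*"):
    """Split raw X12 EDI text into a list of segments, each a dict
    holding the segment ID and its data elements."""
    segments = []
    for raw in text.replace("\n", "").split(segment_terminator):
        raw = raw.strip()
        if not raw:
            continue
        elements = raw.split(element_separator)
        segments.append({"id": elements[0], "elements": elements[1:]})
    return segments

sample = "ST*834*0001*005010X220A1~BGN*00*88880070301  00*20150305*181245****4~"
print(json.dumps(parse_edi(sample), indent=2))
```

Mapping each segment ID and element position onto the 834/811/820 standard
field names would then be a lookup-table step on top of this structure.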



Thanks,
Aakash.

On Tue, Mar 13, 2018 at 6:37 PM, Darin McBeath <ddmcbe...@yahoo.com> wrote:

> I'm not familiar with EDI, but perhaps one option might be spark-xml-utils
> (https://github.com/elsevierlabs-os/spark-xml-utils). You could
> transform the XML to the XML format required by the xml-to-json function
> and then return the JSON. Spark-xml-utils wraps the open source Saxon
> project and supports XPath, XQuery, and XSLT. Spark-xml-utils doesn't
> parallelize the parsing of an individual document, but if you have your
> documents split across a cluster, the processing can be parallelized. We
> use this package extensively within our company to process millions of XML
> records. If you happen to be attending Spark Summit in a few months,
> someone will be presenting on this topic
> (https://databricks.com/session/mining-the-worlds-science-large-scale-data-matching-and-integration-from-xml-corpora).
>
>
> Below is an XQuery snippet.
>
> let $retval :=
>      <map>
>        <string key="doi">{$doi}</string>
>        <string key="cid">{$cid}</string>
>        <string key="pii">{$pii}</string>
>        <string key="contentType">{$content-type}</string>
>        <string key="srctitle">{$srctitle}</string>
>        <string key="documentType">{$document-type}</string>
>        <string key="documentSubtype">{$document-subtype}</string>
>        <string key="publicationDate">{$publication-date}</string>
>        <string key="articleTitle">{$article-title}</string>
>        <string key="issn">{$issn}</string>
>        <string key="isbn">{$isbn}</string>
>        <string key="lang">{$lang}</string>
>        {$tables}
>      </map>
>
> return xml-to-json($retval)
>
>
> Darin.
>
> On Tuesday, March 13, 2018, 8:52:42 AM EDT, Aakash Basu <
> aakash.spark....@gmail.com> wrote:
>
>
> Hi Jörn,
>
> Thanks for the quick reply. I already built an EDI-to-JSON parser from
> scratch using the 811 and 820 standard mapping documents. It can run on any
> standard and for any type of EDI. But my build is in native Python and
> doesn't leverage Spark's parallel processing, which I want for large
> volumes of EDI data.
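>
> The simplest wiring I can think of is to keep the native parser as-is and
> map it over whole files with Spark; below is a minimal sketch under that
> assumption. `edi_to_json` is a hypothetical stand-in for the existing
> parser, and the paths and app name are placeholders, not real values.

```python
def edi_to_json(text):
    """Hypothetical stand-in for the existing native-Python EDI-to-JSON
    parser; here it just wraps the raw text so the example runs."""
    import json
    return json.dumps({"length": len(text)})

def run(sc, input_path, output_path):
    # Each EDI interchange is one file; wholeTextFiles yields
    # (path, content) pairs, so the per-file parser runs
    # independently on each executor.
    (sc.wholeTextFiles(input_path)
       .map(lambda kv: edi_to_json(kv[1]))
       .saveAsTextFile(output_path))

if __name__ == "__main__":
    from pyspark import SparkContext  # assumes pyspark is installed
    run(SparkContext(appName="edi2json"), "edi_input/", "json_output/")
```

> This parallelizes across documents, not within one, which matches how
> spark-xml-utils is described above.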
>
> Any pointers on that?
>
> Thanks,
> Aakash.
>
> On Tue, Mar 13, 2018 at 3:44 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
> Maybe there are commercial ones. You could also use one of the open source
> parsers for XML.
>
> However, XML is very inefficient and you need to do a lot of tricks to make
> it run in parallel. This also depends on the type of EDI message, etc.;
> sophisticated unit testing and performance testing is key.
>
> Nevertheless, it is also not as difficult as I made it sound just now.
>
> > On 13. Mar 2018, at 10:36, Aakash Basu <aakash.spark....@gmail.com>
> wrote:
> >
> > Hi,
> >
> > Did anyone build a parallel, large-scale X12 EDI parser to XML or JSON
> > using Spark?
> >
> > Thanks,
> > Aakash.
>
>
>
