Re: [EXT] Debugging a local spark executor in pycharm

2018-03-13 Thread Michael Mansour
Vitaliy, from what I understand, this is not possible. However, let me share my workaround with you. Assuming you have your debugger up and running in PyCharm, set a breakpoint at this line, then take/collect/sample your data (you could also consider a glom() if it's critical that the data remain …
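
A minimal sketch of that workaround, assuming a simple RDD job; the transform function, sample sizes, and names here are illustrative, not from the original thread:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("debug-demo").getOrCreate()
sc = spark.sparkContext

def transform(record):
    # the executor-side logic you actually want to step through
    return record * 2

rdd = sc.parallelize(range(100))

# Executor breakpoints don't fire in PyCharm, so pull a small sample back
# to the driver and run the same function locally, where the debugger works.
sample = rdd.take(5)                            # or rdd.sample(False, 0.01).collect()
local_results = [transform(x) for x in sample]  # set the breakpoint inside transform()

# glom() preserves partition boundaries if the logic is partition-sensitive.
first_partition = rdd.glom().take(1)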

Insufficient memory for Java Runtime

2018-03-13 Thread Shiyuan
Hi Spark users, I encountered an "insufficient memory" problem. The error is logged in a file named "hs_err_pid86252.log" (attached at the end of this email). I launched the Spark job with "spark-submit --driver-memory 40g --master yarn --deploy-mode client". The Spark session was created …
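
One detail worth noting for this launch mode (a hedged aside, not from the original post): in client mode the driver JVM is already running before any user code executes, so driver memory has to be set on the spark-submit command line as above; from code, only settings that apply after startup still take effect, e.g.:

from pyspark.sql import SparkSession

# Setting spark.driver.memory here would be ignored in client mode,
# because the driver JVM was sized when spark-submit launched it.
spark = (SparkSession.builder
         .appName("memory-demo")
         .config("spark.executor.memory", "8g")   # executor settings are still honoured from code
         .getOrCreate())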

Re: EDI (Electronic Data Interchange) parser on Spark

2018-03-13 Thread Jörn Franke
Ah, sorry, I thought you were using EDI XML. Then you would need to build your own Spark data source. Depending on the number of different message types, this will be more or less effort. I am not aware of any commercial or open source solution for it. > On 13 Mar 2018, at 13:52, Aakash Basu wrote: …

Re: EDI (Electronic Data Interchange) parser on Spark

2018-03-13 Thread Kurt Fehlhauer
If no pre-built solution exists, writing your own would not be that difficult. I suggest looking at a parser combinator such as FastParse to create your own: http://www.lihaoyi.com/fastparse/ Regards, Kurt On Tue, Mar 13, 2018 at 7:47 AM Aakash Basu wrote: > Thanks again for the detailed explanation …

Re: EDI (Electronic Data Interchange) parser on Spark

2018-03-13 Thread Aakash Basu
Thanks again for the detailed explanation; I would like to go through it. In my case, I'm having to parse large-scale *.as2*, *.P3193*, *.edi*, and *.txt* data, mapping it against the respective standards and then building JSON (so XML doesn't come into the picture), containing the following (small example …

Re: EDI (Electronic Data Interchange) parser on Spark

2018-03-13 Thread Darin McBeath
I'm not familiar with EDI, but perhaps one option might be spark-xml-utils (https://github.com/elsevierlabs-os/spark-xml-utils). You could transform the XML to the XML format required by the xml-to-json function and then return the JSON. Spark-xml-utils wraps the open source Saxon project and …

Re: EDI (Electronic Data Interchange) parser on Spark

2018-03-13 Thread Aakash Basu
Hi Jörn, thanks for the quick reply. I already built an EDI-to-JSON parser from scratch using the 811 and 820 standard mapping documents. It can run on any standard and for any type of EDI. But my build is in native Python and doesn't leverage Spark's parallel processing, which I want to do for large …
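
For reference, a sketch of one way to drop an existing pure-Python parser into Spark: parse one file per task via wholeTextFiles. Here parse_edi_to_json is only a stand-in for the parser described above, and the paths are placeholders:

import json
from pyspark.sql import SparkSession

def parse_edi_to_json(edi_text):
    # placeholder logic: split X12 segments on '~' and elements on '*';
    # the real parser with the 811/820 mappings would be called instead
    segments = [s.split("*") for s in edi_text.strip().split("~") if s]
    return json.dumps({"segments": segments})

spark = SparkSession.builder.appName("edi-parallel").getOrCreate()
sc = spark.sparkContext

# wholeTextFiles yields (path, content) pairs, one per EDI file, so each
# file is parsed independently on whichever executor holds it.
files = sc.wholeTextFiles("hdfs:///data/edi/*.edi")
json_docs = files.mapValues(parse_edi_to_json)

json_docs.values().saveAsTextFile("hdfs:///data/edi_json")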

Spark MongoDb $lookup aggregations

2018-03-13 Thread Laptere
Hello, I am new to Spark and am now trying to do some aggregations with Spark instead of just a shell. Does Spark allow $lookup-style aggregations? If so, can you please share resources or examples of how that could be done? Thanks a lot for any help in advance! Best regards, Alena
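
A hedged sketch of one approach: $lookup is essentially a left outer join, so the two collections can be loaded as DataFrames with the MongoDB Spark connector and joined. The URI, collection, and field names below are placeholders:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("mongo-lookup")
         .config("spark.mongodb.input.uri", "mongodb://localhost/shop")
         .getOrCreate())

def read_collection(name):
    # reads one MongoDB collection as a DataFrame via the connector
    return (spark.read
            .format("com.mongodb.spark.sql.DefaultSource")
            .option("collection", name)
            .load())

orders = read_collection("orders")
customers = read_collection("customers")

# Rough equivalent of:
#   {$lookup: {from: "customers", localField: "customerId",
#              foreignField: "_id", as: "customer"}}
joined = orders.join(customers,
                     orders.customerId == customers._id,
                     "left_outer")
joined.show()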

Re: EDI (Electronic Data Interchange) parser on Spark

2018-03-13 Thread Jörn Franke
Maybe there are commercial ones. You could also use one of the open source parsers for XML. However, XML is very inefficient and you need to do a lot of tricks to make it run in parallel. This also depends on the type of EDI message, etc. Sophisticated unit testing and performance testing are key. Never …

EDI (Electronic Data Interchange) parser on Spark

2018-03-13 Thread Aakash Basu
Hi, has anyone built a parallel, large-scale X12 EDI parser to XML or JSON using Spark? Thanks, Aakash.

Broadcast variables: destroy/unpersist unexpected behaviour

2018-03-13 Thread Sunil
I experienced the two cases below when unpersisting or destroying broadcast variables in PySpark, but the same works fine in the Spark Scala shell. Any clue why this happens? Is it a bug in PySpark? ***Case 1:*** >>> b1 = sc.broadcast([1,2,3]) >>> b1.value [1, 2, 3] >>> b1.destroy() …
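
For context, a short sketch of the documented semantics being compared here; the final comments describe the behaviour the thread reports, not something this sketch verifies:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("bcast").getOrCreate()
sc = spark.sparkContext

b1 = sc.broadcast([1, 2, 3])
assert b1.value == [1, 2, 3]

# unpersist() only drops cached copies on the executors; the variable can
# still be used and will be re-sent. destroy() releases all resources, and
# the variable must not be used again afterwards.
b1.destroy()

# In the Scala shell, using b1 after destroy() raises a SparkException.
# The thread reports that PySpark's driver-side b1.value may still return
# the locally cached list instead of failing, which is the unexpected
# behaviour being asked about.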