Apache Spark Meetup - Wednesday 1st July

2020-06-30 Thread Joe Davies
Good morning,

I hope this email finds you well. I am the host of an ongoing series of live
webinars/virtual meetups, and the next two weeks are focused on Apache Spark.
I was wondering if you could share this within your group?

It's free to sign up, and there will be a live Q&A throughout the presentation.
 Here is the link - 
https://www.meetup.com/OrbisConnect/events/271400656/
 Thanks,


Joe Davies
Senior Consultant

Direct: +44 (0)203 854 0015
Mobile: +44 (0)7391 650 347
2 Leman Street, We Work, Aldgate Tower, London E1 8FA
www.orbisconsultants.com

Orbis Consultants Limited · Company Reg No. 09749682 · Registered Office: 82 St John's Street, London, EC1M 4JN




Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-30 Thread Sanjeev Mishra
There is not much code; I am just using spark-shell and reading the data like so:

spark.time(spark.read.json("/data/small-anon/"))
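In passing: spark.read.json with no schema does a full schema-inference pass over the input, which is what this timing mostly measures. A minimal sketch of skipping that pass by supplying a schema up front (field names are taken from the DataFrame output later in this thread; the exact types, and dropping the nested "properties" struct, are assumptions for illustration):

import org.apache.spark.sql.types._

// With an explicit schema the JSON source reads only these columns and
// skips the inference scan entirely.
val schema = new StructType()
  .add("created", LongType)
  .add("id", StringType)
  .add("ts", LongType)

spark.time(spark.read.schema(schema).json("/data/small-anon/"))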

> On Jun 30, 2020, at 3:53 AM, Gourav Sengupta wrote:
> 
> Hi Sanjeev,
> 
> can you share the exact code that you are using to read the JSON files? 
> Currently I am getting only 11 records from the tar file that you have 
> attached with JIRA.
> 
> Regards,
> Gourav Sengupta
> 
> On Mon, Jun 29, 2020 at 5:46 PM ArtemisDev wrote:
> According to the spec, in addition to the line breaks, you should also put
> the nested object values in arrays instead of dictionaries. You may want to
> give it a try and see if this gives you better performance.
> 
> Nevertheless, this still doesn't explain why Spark 2.4.6 outperforms 3.0.  
> Hope the Databricks engineers will find an answer or bug fix soon.
> 
> -- ND
> 
> On 6/29/20 12:27 PM, Sanjeev Mishra wrote:
>> The tar file that I have attached has a bunch of .json.gz files, and these are the files being processed. Each line is self-contained JSON, as shown below:
>>
>> zcat < part-00010-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz | head -3
>> {"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":{"WANAccessType":"2","deviceClassifiers":["ARRIS HNC IGD","Annex F Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports Arris FastPath Speed Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","Arris.NVG4xx.Missing.CA ","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS HNC IGD EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx - First Contact"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1587765844155","lastInform":"1590624062260","lastPeriodic":"1590624062260","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"Frontier3136","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"0","SSID":"Guest3136","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier3136_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier3136_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"Frontier3136_5G","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest3136_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier3136_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier3136_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624062260}
>> {"id":"bf0448736d09e2e677ea383ef857d5bc","created":1517843609967,"properties":{"WANAccessType":"2","arrisNvgDbCheck":"1:success","deviceClassifiers":["ARRIS HNC IGD","Annex F Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","InternetGatewayDevice:1.4","Supports.TR98.Traceroute","Supports Arris FastPath Speed Test","Motorola.ServiceType.IP","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","Arris.NVG4xx.Missing.CA ","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS HNC IGD EUROPA","Arris.NVG.Wireless","VoiceService:1.0","WLAN.Radios.Action.Common.TR098","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1517843629132","groups":["Total Control","GPON_100M_100M","Self-Service Diagnostics","HSI","SLF-SRVC_DGNSTCS000","HS002","TTL_CNTRL000","GPN_100M_100M001"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1590196375084","lastInform":"1590624060253","lastPeriodic":"1590624060253","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisionin

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-30 Thread Gourav Sengupta
Hi Sanjeev,
that just gives 11 records from the sample that you have loaded to the JIRA
ticket; is that correct?


Regards,
Gourav Sengupta

On Tue, Jun 30, 2020 at 1:25 PM Sanjeev Mishra wrote:

> There is not much code; I am just using spark-shell and reading the data
> like so:
>
> spark.time(spark.read.json("/data/small-anon/"))

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-30 Thread Sanjeev Mishra
There are 11 files in total as part of the tar. You will have to untar it to
get to the actual files (.json.gz).

No, I am getting

Count: 33447

spark.time(spark.read.json("/data/small-anon/"))
Time taken: 431 ms
res73: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 2 more 
fields]

scala> res73.count()
res74: Long = 33447

ls -ltr
total 7592
-rw-r--r--  1 sanjeevmishra  staff  132413 Jun 29 08:40 part-00010-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  272767 Jun 29 08:40 part-00009-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  272314 Jun 29 08:40 part-00008-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  277158 Jun 29 08:40 part-00007-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  321451 Jun 29 08:40 part-00006-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  331419 Jun 29 08:40 part-00005-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  337195 Jun 29 08:40 part-00004-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  366346 Jun 29 08:40 part-00003-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  423154 Jun 29 08:40 part-00002-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  458187 Jun 29 08:40 part-00000-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  673836 Jun 29 08:40 part-00001-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff       0 Jun 29 08:40 _SUCCESS

> On Jun 30, 2020, at 5:37 AM, Gourav Sengupta wrote:
> 
> Hi Sanjeev,
> that just gives 11 records from the sample that you have loaded to the
> JIRA ticket; is that correct?

Re: File Not Found: /tmp/spark-events in Spark 3.0

2020-06-30 Thread Jeff Evans
This should only be needed if the spark.eventLog.enabled property was set
to true.  Is it possible the job configuration is different between your
two environments?
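For reference, the directory is configurable rather than fixed, so it does not
have to be created by hand; a minimal spark-defaults.conf sketch (the HDFS path
is a placeholder):

# Write event logs somewhere durable instead of the default file:///tmp/spark-events
spark.eventLog.enabled  true
spark.eventLog.dir      hdfs:///spark-logs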

On Mon, Jun 29, 2020 at 9:21 AM ArtemisDev  wrote:

> While launching a spark job from Zeppelin against a standalone spark
> cluster (Spark 3.0 with multiple workers without hadoop), we have
> encountered a Spark interpreter exception caused by a I/O File Not Found
> exception due to the non-existence of the /tmp/spark-events directory.
> We had to create the /tmp/spark-events directory manually in order to
> resolve the problem.
>
> As a reference, the same notebook code ran on Spark 2.4.6 (also a
> standalone cluster) without any problems.
>
> What is /tmp/spark-events for, and is there any way to pre-define this
> directory via some config parameter so we don't end up manually adding it
> in /tmp?
>
> Thanks!
>
> -- ND


Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-30 Thread Gourav Sengupta
Hi, Sanjeev,

I think that I did precisely that. Can you please download my IPython
notebook, have a look, and let me know where I am going wrong? It's
attached to the JIRA ticket.


Regards,
Gourav Sengupta

On Tue, Jun 30, 2020 at 1:42 PM Sanjeev Mishra wrote:

> There are 11 files in total as part of the tar. You will have to untar it
> to get to the actual files (.json.gz).
>
> No, I am getting
>
> Count: 33447

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-30 Thread Sanjeev Mishra
Hi Gourav,

Please check the comments on the ticket; it looks like the performance
degradation is attributed to the inferTimestamp option, which is true by
default (I have no idea why) in Spark 3.0. This forces Spark to scan the
entire text, hence the poor performance.
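The option can be turned off per read; a minimal sketch (note that Gourav
reports later in this thread that disabling it did not close the gap in his
test):

// Opt out of timestamp inference during JSON schema inference (a Spark 3.0 option)
spark.time(spark.read.option("inferTimestamp", false).json("/data/small-anon/"))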

Regards
Sanjeev

> On Jun 30, 2020, at 8:12 AM, Gourav Sengupta wrote:
> 
> Hi, Sanjeev,
> 
> I think that I did precisely that. Can you please download my IPython
> notebook, have a look, and let me know where I am going wrong? It's
> attached to the JIRA ticket.

XmlReader not Parsing the Nested elements in XML properly

2020-06-30 Thread mars76
Hi,

  I am trying to read XML data from a Kafka topic and using XmlReader to
convert the RDD[String] into a DataFrame conforming to a predefined schema.

  One issue I am running into: after saving the final DataFrame to Avro
format, most of the elements' data shows up in the Avro files. However, the
nested element, which is of array type, is not getting parsed properly and
is loaded as null into the DataFrame, so when I save it to Avro or to JSON
that field is always null.

  I am not sure why this element is not getting parsed.

  Here is the code I am using:


import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame
import com.databricks.spark.xml.XmlRelation

val kafkaValueAsStringDF = kafkaDF.selectExpr(
  "CAST(key AS STRING) msgKey",
  "CAST(value AS STRING) xmlString")

val parameters = collection.mutable.Map.empty[String, String]
parameters.put("rowTag", "Book")

kafkaValueAsStringDF.writeStream.foreachBatch {
  (batchDF: DataFrame, batchId: Long) =>
    val xmlStringDF: DataFrame = batchDF.selectExpr("xmlString")
    xmlStringDF.printSchema()

    val rdd: RDD[String] = xmlStringDF.as[String].rdd

    // Parse the batch's XML strings against the predefined schema
    val relation = XmlRelation(
      () => rdd,
      None,
      parameters.toMap,
      xmlSchema)(spark.sqlContext)

    logger.info(".convert() : XmlRelation Schema = " + relation.schema.treeString)
}
.start()
.awaitTermination()
  

Thanks
Sateesh






Re: XmlReader not Parsing the Nested elements in XML properly

2020-06-30 Thread Sean Owen
This is more a question about spark-xml, which is not part of Spark.
You can ask at https://github.com/databricks/spark-xml/ but if you do
please show some example of the XML input and schema and output.
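As a side note, spark-xml also exposes a from_xml column function, which
avoids dropping to RDDs inside foreachBatch. A minimal sketch, assuming
spark-xml 0.6+ on the classpath, the same predefined xmlSchema, and that each
Kafka value is a single Book record (untested against the poster's data):

import com.databricks.spark.xml.functions.from_xml
import org.apache.spark.sql.functions.col

// Parse each XML string straight into a struct column on the streaming DataFrame
val parsed = kafkaValueAsStringDF
  .withColumn("book", from_xml(col("xmlString"), xmlSchema))

parsed.printSchema()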

On Tue, Jun 30, 2020 at 11:39 AM mars76 wrote:
>
> Hi,
>
>   I am trying to read XML data from a Kafka topic and using XmlReader to
> convert the RDD[String] into a DataFrame conforming to a predefined schema.



Re: Metrics Problem

2020-06-30 Thread Srinivas V
Then it should be a permission issue. What kind of cluster is it, and which
user is running it? Does that user have HDFS permissions to access the folder
where the jar file is?

On Mon, Jun 29, 2020 at 1:17 AM Bryan Jeffrey 
wrote:

> Srinivas,
>
> Interestingly, I did have the metrics jar packaged as part of my main jar.
> It worked well both on the driver and locally, but not on executors.
>
> Regards,
>
> Bryan Jeffrey
>
>
> --
> *From:* Srinivas V 
> *Sent:* Saturday, June 27, 2020 1:23:24 AM
>
> *To:* Bryan Jeffrey 
> *Cc:* user 
> *Subject:* Re: Metrics Problem
>
> One option is to create your main jar included with metrics jar like a fat
> jar.
>
> On Sat, Jun 27, 2020 at 8:04 AM Bryan Jeffrey 
> wrote:
>
> Srinivas,
>
> Thanks for the insight. I had not considered a dependency issue as the
> metrics jar works well applied on the driver. Perhaps my main jar
> includes the Hadoop dependencies but the metrics jar does not?
>
> I am confused as the only Hadoop dependency also exists for the built in
> metrics providers which appear to work.
>
> Regards,
>
> Bryan
>
>
> --
> *From:* Srinivas V 
> *Sent:* Friday, June 26, 2020 9:47:52 PM
> *To:* Bryan Jeffrey 
> *Cc:* user 
> *Subject:* Re: Metrics Problem
>
> It should work when you are giving an HDFS path, as long as your jar exists
> at that path.
> Your error looks more like a security issue (Kerberos) or missing Hadoop
> dependencies, I think; your error says:
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation
>
>
> On Fri, Jun 26, 2020 at 8:44 PM Bryan Jeffrey 
> wrote:
>
> It may be helpful to note that I'm running in YARN cluster mode. My goal
> is to avoid having to manually distribute the JAR to all of the various
> nodes, as this makes versioning deployments difficult.
>
> On Thu, Jun 25, 2020 at 5:32 PM Bryan Jeffrey 
> wrote:
>
> Hello.
>
> I am running Spark 2.4.4. I have implemented a custom metrics producer. It
> works well when I run locally, or specify the metrics producer only for the
> driver.  When I ask for executor metrics I run into ClassNotFoundExceptions
>
> *Is it possible to pass a metrics JAR via --jars?  If so what am I
> missing?*
>
> Deploy driver stats via:
> --jars hdfs:///custommetricsprovider.jar
> --conf
> spark.metrics.conf.driver.sink.metrics.class=org.apache.spark.mycustommetricssink
>
> However, when I pass the JAR with the metrics provider to executors via:
> --jars hdfs:///custommetricsprovider.jar
> --conf
> spark.metrics.conf.executor.sink.metrics.class=org.apache.spark.mycustommetricssink
>
> I get ClassNotFoundException:
>
> 20/06/25 21:19:35 ERROR MetricsSystem: Sink class
> org.apache.spark.custommetricssink cannot be instantiated
> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
> at
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
> at
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
> at
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:281)
> at
> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> Caused by: java.lang.ClassNotFoundException:
> org.apache.spark.custommetricssink
> at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
> at
> org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:198)
> at
> org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:194)
> at
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
> at
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
> at
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
> at scala.collection.mutable.HashMap.foreach(HashMap.scala:130)
> at
> org.apache.spark.metrics.MetricsSystem.registerSinks(MetricsSystem.scala:194)
> at org.apache.spark.metrics.MetricsSystem.start(MetricsSystem.scala:102)
> at org.apache.spark.SparkEnv$.create(SparkEnv.scala:365)
> at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:201)
> at
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:221)
> at
> org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:65)
> at
> org.
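The stack trace shows MetricsSystem.start running inside
CoarseGrainedExecutorBackend startup, i.e. before jars shipped with --jars are
fetched, so the sink class is not yet on the executor classpath at that point.
A commonly suggested workaround (an assumption here, not something confirmed in
this thread) is to pre-stage the jar on every node and put it on the executor
launch classpath; the local path below is a placeholder:

spark-submit \
  --conf spark.executor.extraClassPath=/opt/spark/extra/custommetricsprovider.jar \
  --conf spark.metrics.conf.executor.sink.metrics.class=org.apache.spark.mycustommetricssink \
  your-app.jar   # placeholder application jar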

Question about 'maxOffsetsPerTrigger'

2020-06-30 Thread Eric Beabes
While running my Spark (stateful) Structured Streaming job I am setting the
'maxOffsetsPerTrigger' value to 10 million. I've noticed that messages are
processed faster if I use a large value for this property.

What I am also noticing is that until the batch is completely processed, no
messages are written to the output Kafka topic. The 'state timeout' is set to
10 minutes, so I am expecting to see at least some of the messages after 10
minutes or so, BUT messages are not written until processing of the next
batch has started.

Is there any property I can use to kinda 'flush' the messages that are
ready to be written? Please let me know. Thanks.
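For context, maxOffsetsPerTrigger is an option on the Kafka source; a minimal
sketch of where it is set (broker and topic names are placeholders; the 10
million figure mirrors the description above):

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")  // placeholder
  .option("subscribe", "input-topic")                // placeholder
  .option("maxOffsetsPerTrigger", 10000000L)         // cap on records per micro-batch
  .load()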


spark on kubernetes client mode

2020-06-30 Thread Pradeepta Choudhury
Hi team,
I am working on Spark on Kubernetes, on a scenario where I need to use Spark
on Kubernetes in client mode from a Jupyter notebook across two different
Kubernetes clusters. Is it possible in client mode to spin up the driver in
one k8s cluster and the executors in another k8s cluster?
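For what it's worth, a minimal client-mode sketch (API server, image, and
driver host are placeholder assumptions): executors are always created in
whichever cluster the k8s:// master URL points to, and in client mode the
driver (for example, a notebook process) only needs to be network-reachable
from those executor pods.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("k8s://https://k8s-apiserver.example.com:6443")            // placeholder API server
  .config("spark.kubernetes.container.image", "myrepo/spark:3.0.0")  // placeholder image
  .config("spark.executor.instances", "2")
  .config("spark.driver.host", "driver.example.com")                 // must be reachable from executor pods
  .getOrCreate()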


Pradeepta


Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-30 Thread Gourav Sengupta
Hi,

I think that the notebook clearly demonstrates that setting the
inferTimestamp option to False does not really help.

Is it really impossible for you to show how your own data can be loaded? It
should be simple: just open the notebook and see why the exact code you have
given does not work and shows only 11 records.


Regards,
Gourav Sengupta

On Tue, Jun 30, 2020 at 4:15 PM Sanjeev Mishra wrote:

> Hi Gourav,
>
> Please check the comments on the ticket; it looks like the performance
> degradation is attributed to the inferTimestamp option, which is true by
> default (I have no idea why) in Spark 3.0. This forces Spark to scan the
> entire text, hence the poor performance.

unsubscribe

2020-06-30 Thread Bartłomiej Niemienionek



Re: unsubscribe

2020-06-30 Thread Jeff Evans
That is not how you unsubscribe.  See here for instructions:
https://gist.github.com/jeff303/ba1906bb7bcb2f2501528a8bb1521b8e

On Tue, Jun 30, 2020 at 1:31 PM Bartłomiej Niemienionek <
b.niemienio...@gmail.com> wrote:

>


Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-30 Thread Sanjeev Mishra
Let me share the IPython notebook.

On Tue, Jun 30, 2020 at 11:18 AM Gourav Sengupta wrote:

> Hi,
>
> I think that the notebook clearly demonstrates that setting the
> inferTimestamp option to False does not really help.

Re: Question about 'maxOffsetsPerTrigger'

2020-06-30 Thread Jungtaek Lim
As Spark uses micro-batches for streaming, it's unavoidable to adjust the
batch size to achieve your desired trade-off between throughput and latency.
In particular, Spark uses a global watermark, which doesn't propagate
(change) during a micro-batch, so you'd want to make the batch relatively
small to make the watermark move forward faster.
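A minimal sketch of the kind of tuning this implies (values are illustrative,
not recommendations; df stands for the already-built streaming DataFrame):

// A smaller maxOffsetsPerTrigger on the source plus an explicit trigger keeps
// micro-batches short, so the global watermark (and state timeouts) advance sooner.
import org.apache.spark.sql.streaming.Trigger

val query = df.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")  // placeholder
  .option("topic", "output-topic")                   // placeholder
  .option("checkpointLocation", "/tmp/ckpt")         // placeholder
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start()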

On Wed, Jul 1, 2020 at 2:54 AM Eric Beabes wrote:

> While running my Spark (stateful) Structured Streaming job I am setting the
> 'maxOffsetsPerTrigger' value to 10 million. I've noticed that messages are
> processed faster if I use a large value for this property.