Hi Gourav, Please check the comments of the ticket, looks like the performance degradation is attributed to inferTimestamp option that is true by default (I have no idea why) in Spark 3.0. This forces Spark to scan entire text and so the poor performance.
Regards Sanjeev > On Jun 30, 2020, at 8:12 AM, Gourav Sengupta <gourav.sengu...@gmail.com> > wrote: > > Hi, Sanjeev, > > I think that I did precisely that, can you please download my ipython > notebook and have a look, and let me know where I am going wrong. its > attached with the JIRA ticket. > > > Regards, > Gourav Sengupta > > On Tue, Jun 30, 2020 at 1:42 PM Sanjeev Mishra <sanjeev.mis...@gmail.com > <mailto:sanjeev.mis...@gmail.com>> wrote: > There are total 11 files as part of tar. You will have to untar it to get to > actual files (.json.gz) > > No, I am getting > > Count: 33447 > > spark.time(spark.read.json(“/data/small-anon/")) > Time taken: 431 ms > res73: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 2 > more fields] > > scala> res73.count() > res74: Long = 33447 > > ls -ltr > total 7592 > -rw-r--r-- 1 sanjeevmishra staff 132413 Jun 29 08:40 > part-00010-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz > -rw-r--r-- 1 sanjeevmishra staff 272767 Jun 29 08:40 > part-00009-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz > -rw-r--r-- 1 sanjeevmishra staff 272314 Jun 29 08:40 > part-00008-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz > -rw-r--r-- 1 sanjeevmishra staff 277158 Jun 29 08:40 > part-00007-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz > -rw-r--r-- 1 sanjeevmishra staff 321451 Jun 29 08:40 > part-00006-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz > -rw-r--r-- 1 sanjeevmishra staff 331419 Jun 29 08:40 > part-00005-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz > -rw-r--r-- 1 sanjeevmishra staff 337195 Jun 29 08:40 > part-00004-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz > -rw-r--r-- 1 sanjeevmishra staff 366346 Jun 29 08:40 > part-00003-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz > -rw-r--r-- 1 sanjeevmishra staff 423154 Jun 29 08:40 > part-00002-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz > -rw-r--r-- 1 sanjeevmishra staff 458187 Jun 29 08:40 > part-00000-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz > -rw-r--r-- 1 sanjeevmishra staff 673836 Jun 29 08:40 > part-00001-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz > -rw-r--r-- 1 sanjeevmishra staff 0 Jun 29 08:40 _SUCCESS > >> On Jun 30, 2020, at 5:37 AM, Gourav Sengupta <gourav.sengu...@gmail.com >> <mailto:gourav.sengu...@gmail.com>> wrote: >> >> Hi Sanjeev, >> that just gives 11 records from the sample that you have loaded to the JIRA >> tickets is it correct? >> >> >> Regards, >> Gourav Sengupta >> >> On Tue, Jun 30, 2020 at 1:25 PM Sanjeev Mishra <sanjeev.mis...@gmail.com >> <mailto:sanjeev.mis...@gmail.com>> wrote: >> There is not much code, I am just using spark-shell and reading the data >> like so >> >> spark.time(spark.read.json("/data/small-anon/")) >> >>> On Jun 30, 2020, at 3:53 AM, Gourav Sengupta <gourav.sengu...@gmail.com >>> <mailto:gourav.sengu...@gmail.com>> wrote: >>> >>> Hi Sanjeev, >>> >>> can you share the exact code that you are using to read the JSON files? >>> Currently I am getting only 11 records from the tar file that you have >>> attached with JIRA. >>> >>> Regards, >>> Gourav Sengupta >>> >>> On Mon, Jun 29, 2020 at 5:46 PM ArtemisDev <arte...@dtechspace.com >>> <mailto:arte...@dtechspace.com>> wrote: >>> According to the spec, in addition to the line breaks, you should also put >>> the nested object values in arrays instead of dictionaries. You may want >>> to give a try and see if this would give you a better performance. >>> >>> Nevertheless, this still doesn't explain why Spark 2.4.6 outperforms 3.0. >>> Hope the Databricks engineers will find an answer or bug fix soon. >>> >>> -- ND >>> >>> On 6/29/20 12:27 PM, Sanjeev Mishra wrote: >>>> The tar file that I have attached has bunch of json.zip files and this is >>>> the file that is being processed. Each line is self contained JSON as >>>> shown below >>>> >>>> zcat < part-00010-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz | head >>>> -3 >>>> {"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":{"WANAccessType":"2","deviceClassifiers":["ARRIS >>>> HNC IGD","Annex F >>>> Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports >>>> Arris FastPath Speed >>>> Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","Arris.NVG4xx.Missing.CA >>>> >>>> <http://arris.nvg4xx.missing.ca/>","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS >>>> HNC IGD >>>> EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service >>>> Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx - First >>>> Contact"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1587765844155","lastInform":"1590624062260","lastPeriodic":"1590624062260","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"Frontier3136","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"0","SSID":"Guest3136","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier3136_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier3136_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"Frontier3136_5G","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest3136_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier3136_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier3136_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624062260} >>>> {"id":"bf0448736d09e2e677ea383ef857d5bc","created":1517843609967,"properties":{"WANAccessType":"2","arrisNvgDbCheck":"1:success","deviceClassifiers":["ARRIS >>>> HNC IGD","Annex F >>>> Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","InternetGatewayDevice:1.4","Supports.TR98.Traceroute","Supports >>>> Arris FastPath Speed >>>> Test","Motorola.ServiceType.IP","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","Arris.NVG4xx.Missing.CA >>>> >>>> <http://arris.nvg4xx.missing.ca/>","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS >>>> HNC IGD >>>> EUROPA","Arris.NVG.Wireless","VoiceService:1.0","WLAN.Radios.Action.Common.TR098","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1517843629132","groups":["Total >>>> Control","GPON_100M_100M","Self-Service >>>> Diagnostics","HSI","SLF-SRVC_DGNSTCS000","HS002","TTL_CNTRL000","GPN_100M_100M001"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1590196375084","lastInform":"1590624060253","lastPeriodic":"1590624060253","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"NE-TB12-GOAT-2G","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"1","SSID":"TP-Link_extender_2.4GHz","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier5360_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier5360_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"NE-TB12-GOAT-5G","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest5360_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier5360_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier5360_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624060253} >>>> {"id":"1512b1b8526211e9acf100505699063c","created":1553891682535,"properties":{"WANAccessType":"2","arrisNvgDbCheck":"1:success","deviceClassifiers":["ARRIS >>>> HNC IGD","Annex F >>>> Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","InternetGatewayDevice:1.4","Supports.TR98.Traceroute","Motorola.ServiceType.IP","Supports >>>> Arris FastPath Speed >>>> Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Arris.NVG4xx.Missing.CA >>>> >>>> <http://arris.nvg4xx.missing.ca/>","Device.Type.RG","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS >>>> HNC IGD >>>> EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","Arris.NVG4xx","CaptivePortal:1","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1553891708717","groups":["Total >>>> Control","HSI","Self-Service >>>> Diagnostics","SLF-SRVC_DGNSTCS000","HS004","TTL_CNTRL000","TCW - NVG4xx - >>>> First Contact","GPON_200M_200M","TCW >>>> Enabled","GPN_200M_200M000"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"1","lastBoot":"1590537703734","lastInform":"1590624061415","lastPeriodic":"1590624061415","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"Frontier7968","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"0","SSID":"Guest7968","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier7968_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier7968_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"Frontier7968","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest7968_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier7968_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier7968_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624061415} >>>> >>>> On Mon, Jun 29, 2020 at 9:20 AM ArtemisDev <arte...@dtechspace.com >>>> <mailto:arte...@dtechspace.com>> wrote: >>>> Could you please share your input file instead of output files on that >>>> ticket? Not sure if you were following the specific file format >>>> requirement for JSON in Spark. The following is a snippet from the Spark >>>> online doc: >>>> >>>> Note that the file that is offered as a json file is not a typical JSON >>>> file. Each line must contain a separate, self-contained valid JSON object. >>>> For more information, please see JSON Lines text format, also called >>>> newline-delimited JSON <http://jsonlines.org/>. >>>> >>>> For a regular multi-line JSON file, set the multiLine option to true. >>>> >>>> -- ND >>>> >>>> On 6/29/20 11:55 AM, Sanjeev Mishra wrote: >>>>> Done. https://issues.apache.org/jira/browse/SPARK-32130 >>>>> <https://issues.apache.org/jira/browse/SPARK-32130> >>>>> >>>>> >>>>> >>>>> On Mon, Jun 29, 2020 at 8:21 AM Maxim Gekk <maxim.g...@databricks.com >>>>> <mailto:maxim.g...@databricks.com>> wrote: >>>>> Hello Sanjeev, >>>>> >>>>> It is hard to troubleshoot the issue without input files. Could you open >>>>> an JIRA ticket at https://issues.apache.org/jira/projects/SPARK >>>>> <https://issues.apache.org/jira/projects/SPARK> and attach the JSON files >>>>> there (or samples or code which generates JSON files)? >>>>> >>>>> Maxim Gekk >>>>> Software Engineer >>>>> Databricks, Inc. >>>>> >>>>> >>>>> On Mon, Jun 29, 2020 at 6:12 PM Sanjeev Mishra <sanjeev.mis...@gmail.com >>>>> <mailto:sanjeev.mis...@gmail.com>> wrote: >>>>> It has read everything. As you notice the timing of count is still >>>>> smaller in Spark 2.4 >>>>> >>>>> Spark 2.4 >>>>> >>>>> scala> spark.time(spark.read.json("/data/20200528")) >>>>> Time taken: 19691 ms >>>>> res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... >>>>> 5 more fields] >>>>> >>>>> scala> spark.time(res61.count()) >>>>> Time taken: 7113 ms >>>>> res64: Long = 2605349 >>>>> >>>>> Spark 3.0 >>>>> scala> spark.time(spark.read.json("/data/20200528")) >>>>> 20/06/29 08:06:53 WARN package: Truncated the string representation of a >>>>> plan since it was too large. This behavior can be adjusted by setting >>>>> 'spark.sql.debug.maxToStringFields'. >>>>> Time taken: 849652 ms >>>>> res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 >>>>> more fields] >>>>> >>>>> scala> spark.time(res0.count()) >>>>> Time taken: 8201 ms >>>>> res2: Long = 2605349 >>>>> >>>>> >>>>> >>>>> >>>>> On Mon, Jun 29, 2020 at 7:45 AM ArtemisDev <arte...@dtechspace.com >>>>> <mailto:arte...@dtechspace.com>> wrote: >>>>> Could you share your code? Are you sure you Spark 2.4 cluster had indeed >>>>> read anything? Looks like the Input size field is empty under 2.4. >>>>> >>>>> -- ND >>>>> >>>>> On 6/27/20 7:58 PM, Sanjeev Mishra wrote: >>>>>> >>>>>> I have large amount of json files that Spark can read in 36 seconds but >>>>>> Spark 3.0 takes almost 33 minutes to read the same. On closer analysis, >>>>>> looks like Spark 3.0 is choosing different DAG than Spark 2.0. Does >>>>>> anyone have any idea what is going on? Is there any configuration >>>>>> problem with Spark 3.0. >>>>>> >>>>>> Here are the details: >>>>>> >>>>>> Spark 2.4 >>>>>> >>>>>> Summary Metrics for 2203 Completed Tasks >>>>>> <http://10.0.0.8:4040/stages/stage/?id=0&attempt=0#tasksTitle> >>>>>> Metric Min 25th percentile Median 75th percentile Max >>>>>> Duration 0.0 ms 0.0 ms 0.0 ms 1.0 ms 62.0 ms >>>>>> GC Time 0.0 ms 0.0 ms 0.0 ms 0.0 ms 11.0 ms >>>>>> Showing 1 to 2 of 2 entries >>>>>> Aggregated Metrics by Executor <> >>>>>> Show entries >>>>>> Search: >>>>>> Executor ID Logs Address Task Time Total Tasks Failed >>>>>> Tasks Killed Tasks Succeeded Tasks Blacklisted >>>>>> driver >>>>>> 10.0.0.8:49159 <http://10.0.0.8:49159/> 36 s 2203 0 0 >>>>>> 2203 false >>>>>> >>>>>> >>>>>> Spark 3.0 >>>>>> >>>>>> Summary Metrics for 8 Completed Tasks >>>>>> <http://10.0.0.8:4040/stages/stage/?id=1&attempt=0&task.eventTimelinePageNumber=1&task.eventTimelinePageSize=47#tasksTitle> >>>>>> Metric Min 25th percentile Median 75th percentile Max >>>>>> Duration 3.8 min 4.0 min 4.1 min 4.4 min 5.0 min >>>>>> GC Time 3 s 3 s 3 s 4 s 4 s >>>>>> Input Size / Records 15.6 MiB / 51028 16.2 MiB / 53303 >>>>>> 16.8 MiB / 55259 17.8 MiB / 58148 20.2 MiB / 71624 >>>>>> Showing 1 to 3 of 3 entries >>>>>> Aggregated Metrics by Executor <> >>>>>> Show entries >>>>>> Search: >>>>>> Executor ID Logs Address Task Time Total Tasks Failed >>>>>> Tasks Killed Tasks Succeeded Tasks Blacklisted Input Size / >>>>>> Records >>>>>> driver >>>>>> 10.0.0.8:50224 <http://10.0.0.8:50224/> 33 min 8 0 0 >>>>>> 8 false 136.1 MiB / 451999 >>>>>> >>>>>> >>>>>> The DAG is also different >>>>>> Spark 2.0 DAG >>>>>> >>>>>> <Screenshot 2020-06-27 16.30.26.png> >>>>>> >>>>>> Spark 3.0 DAG >>>>>> >>>>>> <Screenshot 2020-06-27 16.32.32.png> >>>>>> >>>>>> >> >