Just did some tests. I have 6000 files; each has 14K records and is roughly 900MB in size. In Spark SQL, it takes one task roughly 1 min to parse one file.
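(For context, this is roughly how I load it on the Spark SQL side. This is just a sketch: the path and table name are placeholders, and I'm assuming the built-in JSON reader here.)

    // Spark 1.4. sc is the SparkContext (predefined in spark-shell).
    // Path and table name below are placeholders.
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val events = sqlContext.read.json("hdfs:///path/to/json")
    events.registerTempTable("events")
    sqlContext.sql("SELECT count(*) FROM events").show()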
On the local machine, using the same Jackson lib that ships with Spark, I just parse it:

    import java.io.*;
    import com.fasterxml.jackson.databind.*;

    ObjectMapper mapper = new ObjectMapper();
    FileInputStream fstream = new FileInputStream("testfile");
    BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
    String strLine;
    long begin = System.currentTimeMillis();
    while ((strLine = br.readLine()) != null) {
        JsonNode s = mapper.readTree(strLine);
    }
    System.out.println(System.currentTimeMillis() - begin);

On JDK 8, it took *6270ms*.

The same code in Scala took *7486ms*:

    import scala.io.Source
    import com.fasterxml.jackson.databind.ObjectMapper
    import com.fasterxml.jackson.module.scala.DefaultScalaModule

    val begin = java.lang.System.currentTimeMillis()
    for (line <- Source.fromFile("testfile").getLines()) {
      val mapper = new ObjectMapper()
      mapper.registerModule(DefaultScalaModule)
      val s = mapper.readTree(line)
    }
    println(java.lang.System.currentTimeMillis() - begin)

One JSON record contains two fields: an ID and a List[Event]. I am guessing that putting all the events into the List takes up the remaining time. Any solution to speed this up?

Thanks a lot!

On Thu, Aug 27, 2015 at 7:45 PM, Sabarish Sasidharan <
sabarish.sasidha...@manthan.com> wrote:

> For your JSONs, can you tell us what your benchmark is when running on a
> single machine using just plain Java (without Spark and Spark SQL)?
>
> Regards
> Sab
>
> On 28-Aug-2015 7:29 am, "Gavin Yue" <yue.yuany...@gmail.com> wrote:
>
>> Hey
>>
>> I am using the Json4s-Jackson parser that comes with Spark and parsing
>> roughly 80M records with a total size of 900MB.
>>
>> But the speed is slow. It took my 50 nodes (16-core CPUs, 100GB memory)
>> roughly 30 minutes to parse the JSON for use in Spark SQL.
>>
>> Jackson's benchmarks say parsing should be at the millisecond level.
>>
>> Any way to increase the speed?
>>
>> I am using Spark 1.4 on Hadoop 2.7 with Java 8.
>>
>> Thanks a lot!