Just did some tests.

I have 6,000 files, each with 14K records, totaling about 900MB. In Spark
SQL it takes one task roughly 1 min to parse a file.

On my local machine, using the same Jackson library that ships with Spark,
I just parse the file:

            ObjectMapper mapper = new ObjectMapper();
            FileInputStream fstream = new FileInputStream("testfile");
            BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
            String strLine;
            long begin = System.currentTimeMillis();
            while ((strLine = br.readLine()) != null) {
                JsonNode s = mapper.readTree(strLine);
            }
            br.close();
            System.out.println(System.currentTimeMillis() - begin);

In Java (JDK 8), it took *6270ms*.

The same code in Scala takes *7486ms*:

    val begin = java.lang.System.currentTimeMillis()
    for (line <- Source.fromFile("testfile").getLines()) {
      val mapper = new ObjectMapper()
      mapper.registerModule(DefaultScalaModule)
      val s = mapper.readTree(line)
    }
    println(java.lang.System.currentTimeMillis() - begin)
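
One thing worth noting in the Scala loop above: a new ObjectMapper is
constructed (and the Scala module registered) for every line, which is
expensive. A minimal sketch of reusing a single mapper across records,
assuming Jackson 2.x on the classpath (the sample record is hypothetical):

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class ReuseMapper {
    public static void main(String[] args) throws Exception {
        // Construct the mapper once: it is thread-safe after configuration
        // and costly to build per record.
        ObjectMapper mapper = new ObjectMapper();

        String line = "{\"id\":1,\"events\":[\"a\",\"b\"]}"; // hypothetical record
        JsonNode node = mapper.readTree(line);
        System.out.println(node.get("id")); // the parsed "id" field
    }
}
```

Hoisting the mapper out of the loop should make the Scala version behave
like the Java version, which already reuses one mapper.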


One JSON record contains two fields: ID and List[Event].

I am guessing that building the List of events accounts for the remaining time.
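
If building the event list is indeed what dominates, Jackson's streaming
API can avoid materializing it at all. A rough sketch, assuming a
hypothetical record shape whose "events" field is an array:

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

public class StreamCount {
    public static void main(String[] args) throws Exception {
        // Walk tokens instead of building a JsonNode tree, so the event
        // list is never constructed as objects in memory.
        JsonFactory factory = new JsonFactory();
        String line = "{\"id\":\"x\",\"events\":[1,2,3]}"; // hypothetical record
        int events = 0;
        try (JsonParser p = factory.createParser(line)) {
            while (p.nextToken() != null) {
                if (p.getCurrentToken() == JsonToken.START_ARRAY) {
                    // Count array elements without deserializing them.
                    while (p.nextToken() != JsonToken.END_ARRAY) {
                        events++;
                    }
                }
            }
        }
        System.out.println(events);
    }
}
```

This only helps if you do not actually need the full event objects
downstream; it is a way to test how much of the time goes into tree
building versus raw tokenizing.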

Any solution to speed this up?

Thanks a lot!


On Thu, Aug 27, 2015 at 7:45 PM, Sabarish Sasidharan <
sabarish.sasidha...@manthan.com> wrote:

> For your JSONs, can you tell us your benchmark when running on a
> single machine using just plain Java (without Spark and Spark SQL)?
>
> Regards
> Sab
> On 28-Aug-2015 7:29 am, "Gavin Yue" <yue.yuany...@gmail.com> wrote:
>
>> Hey
>>
>> I am using the Json4s-Jackson parser that comes with Spark, parsing
>> roughly 80M records with a total size of 900MB.
>>
>> But the speed is slow. It took my 50 nodes (16-core CPU, 100GB mem)
>> roughly 30 mins to parse the JSON for use with Spark SQL.
>>
>> Jackson's benchmarks suggest parsing should be at the ms level.
>>
>> Any way to increase speed?
>>
>> I am using spark 1.4 on Hadoop 2.7 with Java 8.
>>
>> Thanks a lot !
