Just did some tests.
I have 6000 files, each with 14K records and a 900MB file size. In Spark
SQL, one task takes roughly 1 min to parse one file.
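For reference, here is roughly how the files are read on the Spark SQL side
(a minimal sketch using the built-in JSON source in Spark 1.4; the path and
table name are placeholders):

val df = sqlContext.read.json("hdfs:///path/to/files/*")  // placeholder path
df.registerTempTable("records")                           // hypothetical table name
sqlContext.sql("SELECT COUNT(*) FROM records").show()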
On my local machine, using the same Jackson lib that ships with Spark, I
just parse it:
import java.io.*;
import com.fasterxml.jackson.databind.*;

ObjectMapper mapper = new ObjectMapper();
FileInputStream fstream = new FileInputStream("testfile");
BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
String strLine;
long begin = System.currentTimeMillis();
while ((strLine = br.readLine()) != null) {
    JsonNode s = mapper.readTree(strLine);  // parse one JSON record per line
}
br.close();
System.out.println(System.currentTimeMillis() - begin);
On JDK 8, it took *6270ms*.
The same code in Scala took *7486ms*:
import scala.io.Source
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

// Create the mapper once; constructing an ObjectMapper per line is expensive.
val mapper = new ObjectMapper()
mapper.registerModule(DefaultScalaModule)

val begin = java.lang.System.currentTimeMillis()
for (line <- Source.fromFile("testfile").getLines()) {
  val s = mapper.readTree(line)
}
println(java.lang.System.currentTimeMillis() - begin)
One JSON record contains two fields: ID and List[Event].
I am guessing that materializing all the events into the List takes the
remaining time.
Any solution to speed this up? One direction I might try is sketched below.
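The idea is to bind each line directly to a case class with a single reused
mapper, instead of building a JsonNode tree first. This is only a sketch;
the Event fields below are invented, since the real schema is not shown:

import scala.io.Source
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

case class Event(name: String, ts: Long)            // hypothetical fields
case class Record(ID: String, events: List[Event])  // assumed field names

val mapper = new ObjectMapper()
mapper.registerModule(DefaultScalaModule)

val begin = java.lang.System.currentTimeMillis()
for (line <- Source.fromFile("testfile").getLines()) {
  val rec = mapper.readValue(line, classOf[Record])  // bind directly, no tree
}
println(java.lang.System.currentTimeMillis() - begin)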
Thanks a lot!
On Thu, Aug 27, 2015 at 7:45 PM, Sabarish Sasidharan <
[email protected]> wrote:
> For your jsons, can you tell us what your benchmark is when running on a
> single machine using just plain Java (without Spark and Spark SQL)?
>
> Regards
> Sab
> On 28-Aug-2015 7:29 am, "Gavin Yue" <[email protected]> wrote:
>
>> Hey
>>
>> I am using the json4s-jackson parser that comes with Spark, and I am
>> parsing roughly 80M records with a total size of 900MB.
>>
>> But the speed is slow. It took my 50 nodes (16-core CPU, 100GB memory
>> each) roughly 30 mins to parse the JSON for use in Spark SQL.
>>
>> Jackson's own benchmarks say parsing should be at the millisecond level.
>>
>> Any way to increase speed?
>>
>> I am using Spark 1.4 on Hadoop 2.7 with Java 8.
>>
>> Thanks a lot !