Hi there,

I am trying to process millions of social media records with Spark/Scala
integrated with Stanford NLP (3.4.1).
Because this is social media data, I need NLP for theme generation
(POS tagging) and sentiment calculation.

I have to handle Twitter data and non-Twitter data separately, so I have
two classes, one for each.

Each class loads Stanford NLP through a lazy val initialization.

Twitter class

  

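    import java.util.Properties
    import edu.stanford.nlp.pipeline.StanfordCoreNLP

    // Annotators for the Twitter pipeline
    val features: Seq[String] = Seq("tokenize", "ssplit", "pos", "parse", "sentiment")
    val props = new Properties()
    props.put("annotators", features.mkString(", "))
    // Twitter-specific POS model and shift-reduce parser model
    props.put("pos.model", "tagger/gate-EN-twitter.model")
    props.put("parse.model", "tagger/englishSR.ser.gz")
    val pipeline = new StanfordCoreNLP(props)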

Note: For the Twitter class above I am using a different POS model and the
shift-reduce parser model for parsing. I use the shift-reduce parser because,
for some of the junk data, the default PCFG model takes a long time at run
time and throws exceptions. The shift-reduce parser takes around 15 seconds
to load, but it is faster at run time while processing the data.

NonTwitter class 
    

    val features: Seq[String] = Seq("tokenize", "ssplit", "pos", "parse", "sentiment")
    val props = new Properties()
    props.put("annotators", features.mkString(", "))
    // Default POS model; only the parser model is overridden
    props.put("parse.model", "tagger/englishSR.ser.gz")

Here I am using the default POS model and the shift-reduce parser.

Problem: 

We are currently running on 8 nodes with 6 cores each (48 cores in total),
so I can run 48 partitions at once. Processing millions of records with the
above configuration works fine with a smaller number of partitions.

With 42 partitions on this cluster, the processing takes around 1 hour to
finish.

With the current configuration I need to scale to 200 partitions.

With 200 partitions on the same 48 cores, the job runs for around 2 hours
and finally fails with an exception, either saying that one node is lost or

    java.lang.IllegalArgumentException: annotator "sentiment" requires annotator "binarized_trees"

and similar errors.
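
One thing that may be worth checking for the "binarized_trees" error, offered
as a sketch and not a confirmed fix: as far as I know, the parse annotator
keeps binarized trees only when the parse.binaryTrees property is true, and
the sentiment annotator needs them. The object name PipelineProps and the
method build are hypothetical, and whether setting the property explicitly
removes the intermittent failure on 3.4.1 is an assumption I have not verified.

```scala
import java.util.Properties

// Hypothetical helper that builds the properties used by both classes.
object PipelineProps {
  // posModel is None when the default POS tagger should be used.
  def build(posModel: Option[String]): Properties = {
    val props = new Properties()
    val features = Seq("tokenize", "ssplit", "pos", "parse", "sentiment")
    props.setProperty("annotators", features.mkString(", "))
    posModel.foreach(model => props.setProperty("pos.model", model))
    props.setProperty("parse.model", "tagger/englishSR.ser.gz")
    // Assumption: asks the parse annotator to keep the binarized trees
    // that the sentiment annotator requires.
    props.setProperty("parse.binaryTrees", "true")
    props
  }
}
```

The Twitter class would call PipelineProps.build(Some("tagger/gate-EN-twitter.model"))
and the NonTwitter class PipelineProps.build(None).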

The problem only appears when we raise the number of partitions to 200 on
the 8 nodes, which have only 48 cores in total.

I suspect the cause is that the shift-reduce parser is loaded for each
partition. I thought of loading this class once and then broadcasting it,
but the Stanford NLP class is not serializable, so I cannot broadcast it.
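
One pattern that might avoid the serialization limit is to skip the
broadcast and keep the pipeline in a JVM-wide singleton, so that each
executor JVM builds it at most once no matter how many partitions it
processes. The sketch below is generic and uses only the standard library;
PerJvmCache and getOrCreate are hypothetical names, and it assumes Java 8 or
later and that sharing one pipeline across the task threads of an executor
is acceptable.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.function.{Function => JFunction}

// Hypothetical JVM-wide cache. A Scala object is never shipped inside a
// Spark closure; each JVM initializes its own copy on first access.
object PerJvmCache {
  private val instances = new ConcurrentHashMap[String, AnyRef]()

  // Returns the instance stored under key, building it on the first call.
  // computeIfAbsent applies the factory at most once per key, so concurrent
  // tasks in the same JVM wait for and then share a single instance.
  def getOrCreate[T <: AnyRef](key: String)(create: => T): T =
    instances.computeIfAbsent(key, new JFunction[String, AnyRef] {
      override def apply(k: String): AnyRef = create
    }).asInstanceOf[T]
}

// Hypothetical usage inside a Spark job (rdd and props are placeholders,
// this part is not compiled here):
//   rdd.mapPartitions { texts =>
//     val pipeline = PerJvmCache.getOrCreate("twitter")(new StanfordCoreNLP(props))
//     texts.map(text => /* annotate text with pipeline */ text)
//   }
```

With this shape only the Properties object travels with the closure, and
the 15-second model load would be paid once per executor instead of once
per partition.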

We need to scale to 200 partitions so that the data is processed in less
time. Any thoughts or suggestions would be really helpful.
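
As a rough back-of-envelope check on these numbers, assuming Spark's default
of one task per core and assuming the 15-second model load is paid once per
partition (my suspicion, not something I have measured), the sketch below
counts how many scheduling waves each partition count needs on 48 cores and
how much cumulative load time it implies. PartitionMath and its methods are
hypothetical names.

```scala
object PartitionMath {
  // Scheduling waves when every core runs one task at a time (ceiling division).
  def waves(partitions: Int, cores: Int): Int =
    (partitions + cores - 1) / cores

  // Cumulative seconds spent loading the model if each partition loads it once.
  def totalLoadSeconds(partitions: Int, loadSeconds: Int): Int =
    partitions * loadSeconds

  def main(args: Array[String]): Unit = {
    val cores = 8 * 6                        // 8 nodes x 6 cores
    println(waves(42, cores))                // prints 1
    println(waves(200, cores))               // prints 5
    println(totalLoadSeconds(42, 15))        // prints 630
    println(totalLoadSeconds(200, 15))       // prints 3000
  }
}
```

Under these assumptions, 200 partitions do not run more tasks at the same
time than 48 cores allow; they run in 5 waves, and every extra partition
adds another model load.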



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-partition-issue-with-Stanford-NLP-tp23048p23057.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
