sc.wholeTextFiles() is what you need. http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.wholeTextFiles
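Roughly something like the sketch below, assuming beautifulsoup4 is installed on the worker nodes; the directory path and the parse_html helper are just placeholders for illustration, not part of the Spark API:

    from bs4 import BeautifulSoup  # assumes beautifulsoup4 is available on the workers

    def parse_html(record):
        # wholeTextFiles yields (filename, file_content) pairs, one per file,
        # so the full HTML of each file arrives in a single record
        path, html = record
        soup = BeautifulSoup(html, "html.parser")
        return path, soup.get_text()

    # placeholder path to the directory of HTML files
    pages = sc.wholeTextFiles("file:///path/to/html/directory")
    parsed = pages.map(parse_html)

Returning plain strings (e.g. soup.get_text()) from the mapper, rather than the soup object itself, keeps the RDD contents easy to serialize.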
On Thu, Mar 12, 2015 at 9:26 AM, yh18190 <yh18...@gmail.com> wrote:
> Hi. I am very much fascinated by the Spark framework. I am trying to use PySpark +
> BeautifulSoup to parse HTML files, but I am having trouble loading an HTML file
> into BeautifulSoup.
>
> Example:
>
> filepath = "file:///path to html directory"
>
> def readhtml(inputhtml):
>     soup = BeautifulSoup(inputhtml)  # load the html content
>
> loaddata = sc.textFile(filepath).map(readhtml)
>
> The problem is that Spark treats the loaded file as a text file and processes it
> line by line. I want to load the entire HTML content into BeautifulSoup for
> further processing. Does anyone have an idea how to take the whole HTML file as
> input instead of processing it line by line?
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-consider-HTML-files-in-Spark-tp22017.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.