sc.wholeTextFiles() is what you need.

http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.wholeTextFiles
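
For example, a minimal sketch (untested; the directory path and the title extraction below are just placeholders for whatever parsing you actually need):

    from bs4 import BeautifulSoup
    from pyspark import SparkContext

    sc = SparkContext(appName="ParseHTML")

    # wholeTextFiles yields (filename, content) pairs, so each HTML
    # document arrives as one string instead of line by line.
    html_files = sc.wholeTextFiles("file:///path/to/html/directory")

    def parse_html(pair):
        path, html = pair
        soup = BeautifulSoup(html, "html.parser")
        # Placeholder: pull the <title>; replace with your own extraction.
        title = soup.title.string if soup.title else None
        return (path, title)

    results = html_files.map(parse_html).collect()

Note that collect() brings everything back to the driver; for a large corpus you would keep operating on the RDD instead.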

On Thu, Mar 12, 2015 at 9:26 AM, yh18190 <yh18...@gmail.com> wrote:
> Hi. I am very fascinated by the Spark framework. I am trying to use PySpark +
> BeautifulSoup to parse HTML files, but I am having trouble loading an HTML file
> into BeautifulSoup.
>
> Example:
>
> filepath = "file:///path to html directory"
>
> def readhtml(inputhtml):
>     soup = BeautifulSoup(inputhtml)  # load the HTML content
>
> loaddata = sc.textFile(filepath).map(readhtml)
>
> The problem is that Spark treats the loaded file as a text file and processes it
> line by line. I want to load the entire HTML content into BeautifulSoup for
> further processing. Does anyone have an idea how to take the whole HTML file as
> input instead of processing it line by line?
