Hello,

I was able to set up Zeppelin with Spark on an AWS EC2 m4.xlarge instance a few days ago. While designing the notebook, I tried to visualize the link structure with the following code:
val mayBegLinks = mayBegData.keepValidPages()
  .flatMap(r => ExtractLinks(r.getUrl, r.getContentString))
  .map(r => (ExtractDomain(r._1), ExtractDomain(r._2)))
  .filter(r => (r._1.equals("www.fangraphs.com") || r._1.equals("www.osnews.com") || r._1.equals("www.dailytech.com")))

val linkWtMap = mayBegLinks.map(r => (r, 1)).reduceByKey((x, y) => x + y)
linkWtMap.toDF().registerTempTable("LnkWtTbl")

Here 'mayBegData' is about 2 GB of WARC data from the first two segments of May. This paragraph runs smoothly, but in the next paragraph, which uses %sql with the following statement:

select W._1 as Links, W._2 as Weight from LnkWtTbl W

I always get java.lang.OutOfMemoryError, either "GC overhead limit exceeded" or "Java heap space", and the most recent one is the following:

org.apache.thrift.transport.TTransportException
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
        at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
        at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
        at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
        at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_interpret(RemoteInterpreterService.java:261)
        at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.interpret(RemoteInterpreterService.java:245)
        at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:312)
        at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:93)
        at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:271)
        at org.apache.zeppelin.scheduler.Job.run(Job.java:176)
        at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:329)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

I just wanted to know whether, even with an m4.xlarge instance, it is not possible to process this much (~2 GB) data, because the above code is relatively simple, I guess. This restricts the flexibility with which the notebook can be designed. Please provide some hints/suggestions, since I have been stuck on this since yesterday.

Thanks,
Anish.
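P.S. In case it helps the discussion, here is a rough sketch of a workaround I may also try. It assumes the heap fills up when the %sql paragraph pulls the whole (domain, domain) -> weight table back for visualization, so it registers only the heaviest edges instead of everything; the 500-edge cut-off and the name topEdges are arbitrary, and the sketch builds on the mayBegLinks RDD above:

import sqlContext.implicits._  // for toDF(); probably already in scope in the Spark interpreter

// Same weight counting as before, but only a small array of the heaviest
// (domain, domain) pairs is brought back to the driver.
val topEdges = mayBegLinks
  .map(r => (r, 1))
  .reduceByKey((x, y) => x + y)
  .sortBy(_._2, ascending = false)  // heaviest links first
  .take(500)                        // keep only the top 500 on the driver

// Register just this small table; the columns are still _1 and _2,
// so the existing %sql paragraph should not need to change.
sc.parallelize(topEdges).toDF().registerTempTable("LnkWtTbl")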
On Tue, Jul 5, 2016 at 12:28 PM, Alexander Bezzubov <b...@apache.org> wrote:

> That sounds great, Anish! Congratulations on getting a new machine.
>
> No worries, please take your time and keep us posted on your exploration! Quality is more important than quantity here.
>
> --
> Alex
>
> On Mon, Jul 4, 2016 at 10:40 PM, anish singh <anish18...@gmail.com> wrote:
>
> > Hello,
> >
> > Thanks Alex, I'm so glad that you helped. Here's an update: I've ordered a new machine with more RAM and a faster processor that should arrive by tomorrow. I will try to use it for the Common Crawl data and the AWS solution that you provided in the previous mail. I'm presently reading papers and publications on analysis of Common Crawl data. The Warcbase tool will definitely be used. I understand that the Common Crawl datasets are important and I will do everything it takes to build notebooks on them; the only concern is that it may take more time than the previous notebooks.
> >
> > Anish.
> >
> > On Mon, Jul 4, 2016 at 6:30 PM, Alexander Bezzubov <b...@apache.org> wrote:
> >
> > > Hi Anish,
> > >
> > > thanks for keeping us posted on your progress!
> > >
> > > CommonCrawl is an important dataset and it would be awesome if we could find a way for you to build some notebooks for it through this year's GSoC program.
> > >
> > > How about running Zeppelin on a single big enough node in AWS for the sake of this notebook? If you use a spot instance you can get even big instances for a really affordable price of 2-4$ a day; just make sure you persist your notebooks on S3 [1] to avoid losing the data, and shut the instance down for the night.
> > >
> > > AFAIK we do not have any free AWS credits for now, even for GSoC students. If somebody knows a way to provide/get some - please feel free to chime in, I know there are some Amazonian people on the list :)
> > >
> > > But so far AWS spot instances are the most cost-effective solution I can think of. Bonus: if you host your instance in region us-east-1, transfer from/to S3 will be free, as that's where the CommonCrawl dataset lives.
> > >
> > > One more thing - please check out the awesome WarcBase library [2] built by the internet preservation community. I find it really helpful when working with web archives.
> > >
> > > On the notebook design:
> > > - to understand the context of this dataset better, please do some research on how other people use it, what for, etc. That would be great material for the blog post
> > > - try to provide examples of all available formats: WARC, WET, WAT (maybe in the same or in different notebooks, it's up to you)
> > > - while using warcbase, mind that RDD persistence will not work until [3] is resolved, so avoid using it for now
> > >
> > > I understand that this can be a big task, so do not worry if it takes time (learning AWS, etc.) - just keep us posted on your progress weekly and I'll be glad to help!
> > >
> > > 1. http://zeppelin.apache.org/docs/0.6.0-SNAPSHOT/storage/storage.html#notebook-storage-in-s3
> > > 2. https://github.com/lintool/warcbase
> > > 3. https://github.com/lintool/warcbase/issues/227
> > >
> > > On Mon, Jul 4, 2016 at 7:00 PM, anish singh <anish18...@gmail.com> wrote:
> > >
> > > > Hello,
> > > >
> > > > (everything outside Zeppelin)
> > > > I had started work on the Common Crawl datasets and tried to first have a look at only the data for May 2016. Of the three formats available, I chose WET (the plain text format). The data for May alone is divided into segments and there are 24492 such segments. I downloaded only the first segment for May and got 432 MB of data. Now the problem is that my laptop is a very modest machine with a Core 2 Duo processor and 3 GB of RAM, such that even opening the downloaded data file in LibreWriter filled the RAM completely and hung the machine, and bringing the data directly into Zeppelin or analyzing it inside Zeppelin seems impossible.
> > > > As far as I know, there are two ways in which I can proceed:
> > > >
> > > > 1) Buying a new laptop with more RAM and a faster processor, OR
> > > > 2) Choosing another dataset
> > > >
> > > > I have no problem with either of the above ways, or anything else that you might suggest, but please let me know which way to proceed so that I can make quick progress. Meanwhile, I will read more papers and publications on the possibilities of analyzing Common Crawl data.
> > > >
> > > > Thanks,
> > > > Anish.
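(A small sketch related to the WET discussion quoted above, in case it is useful: Spark reads files lazily, so a downloaded WET segment never has to be opened whole the way LibreWriter tries to do. The path below is only a placeholder for wherever the segment was saved, and the record count relies on WET records being stored in WARC envelopes whose headers start with "WARC/1.0".)

// Peek at a downloaded Common Crawl WET segment without loading all of it into memory.
// The path is a placeholder; gzipped segments can be read directly by textFile.
val wetLines = sc.textFile("file:///path/to/first-may-segment.warc.wet.gz")

// Only the first few lines are actually read here.
wetLines.take(20).foreach(println)

// Each WET record begins with a "WARC/1.0" header line,
// so counting those lines gives the number of records in the segment.
val numRecords = wetLines.filter(_.startsWith("WARC/1.0")).count()
println(s"records in this segment: $numRecords")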