That sounds great, Anish! Congratulations on getting a new machine. No worries, please take your time and keep us posted on your exploration! Quality is more important than quantity here.
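Since you mention warcbase, here is roughly the shape of a first Zeppelin
paragraph I had in mind for it, along the lines of the "top domains" example
in the warcbase docs. I'm writing this from memory, so treat it as a sketch:
the archive path is only a placeholder, and the exact names (loadArchives,
keepValidPages, ExtractDomain, countItems) should be double-checked against
the warcbase wiki for the version you end up installing.

    // %spark paragraph, assuming the warcbase jar is already on the Spark
    // interpreter's classpath. The archive path below is a placeholder.
    import org.warcbase.spark.matchbox._
    import org.warcbase.spark.rdd.RecordRDD._

    val topDomains = RecordLoader
      .loadArchives("/path/to/segment/*.warc.gz", sc) // WARC files of one segment
      .keepValidPages()                               // keep only proper HTML pages
      .map(r => ExtractDomain(r.getUrl))              // domain of each crawled page
      .countItems()                                   // (domain, count) pairs
      .take(10)                                       // a small "top 10" table

    topDomains.foreach(println)

And as noted in my earlier mail quoted below, skip .cache()/.persist() on
these RDDs until [3] is resolved.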
--
Alex

On Mon, Jul 4, 2016 at 10:40 PM, anish singh <anish18...@gmail.com> wrote:
> Hello,
>
> Thanks Alex, I'm so glad that you helped. Here's an update: I've ordered a
> new machine with more RAM and a better processor that should arrive by
> tomorrow. I will use it for the Common Crawl data and the AWS solution that
> you suggested in the previous mail. I'm presently reading papers and
> publications on the analysis of Common Crawl data. The warcbase tool will
> definitely be used. I understand that the Common Crawl datasets are
> important and I will do everything it takes to build notebooks on them; my
> only worry is that it may take more time than the previous notebooks.
>
> Anish.
>
> On Mon, Jul 4, 2016 at 6:30 PM, Alexander Bezzubov <b...@apache.org> wrote:
> >
> > Hi Anish,
> >
> > thanks for keeping us posted about your progress!
> >
> > CommonCrawl is an important dataset and it would be awesome if we could
> > find a way for you to build some notebooks for it through this year's
> > GSoC program.
> >
> > How about running Zeppelin on a single big enough node in AWS for the
> > sake of this notebook?
> > If you use a spot instance you can get even big instances for a really
> > affordable price of $2-4 a day; just make sure you persist your notebooks
> > on S3 [1] to avoid losing the data, and shut the instance down for the
> > night.
> >
> > AFAIK we do not have any free AWS credits for now, even for GSoC
> > students. If somebody knows a way to provide or get some - please feel
> > free to chime in; I know there are some Amazonian people on the list :)
> >
> > But so far AWS spot instances are the most cost-effective solution I can
> > think of. Bonus: if you host your instance in region us-east-1, transfer
> > from/to S3 will be free, as that's where the CommonCrawl dataset lives.
> >
> > One more thing - please check out the awesome warcbase library [2] built
> > by the internet preservation community. I find it really helpful when
> > working with web archives.
> >
> > On the notebook design:
> > - to understand the context of this dataset better, please do some
> >   research on how other people use it, what for, etc. That would be
> >   great material for the blog post.
> > - try to provide examples of all available formats: WARC, WET, WAT
> >   (in the same or in different notebooks, it's up to you)
> > - while using warcbase, mind that RDD persistence will not work until
> >   [3] is resolved, so avoid using it for now
> >
> > I understand that this can be a big task, so do not worry if it takes
> > time (learning AWS, etc.) - just keep us posted on your progress weekly
> > and I'll be glad to help!
> >
> >
> > 1. http://zeppelin.apache.org/docs/0.6.0-SNAPSHOT/storage/storage.html#notebook-storage-in-s3
> > 2. https://github.com/lintool/warcbase
> > 3. https://github.com/lintool/warcbase/issues/227
> >
> > On Mon, Jul 4, 2016 at 7:00 PM, anish singh <anish18...@gmail.com> wrote:
> > > Hello,
> > >
> > > (everything outside Zeppelin)
> > > I had started work on the Common Crawl datasets and tried to first have
> > > a look at the data for May 2016 only. Of the three formats available, I
> > > chose WET (the plain-text format). The data for May alone is divided
> > > into segments and there are 24492 such segments. I downloaded only the
> > > first segment for May and got 432MB of data.
> > > Now the problem is that my laptop is a very modest machine with a
> > > Core 2 Duo processor and 3GB of RAM, so that even opening the
> > > downloaded data file in LibreOffice Writer filled the RAM completely
> > > and hung the machine, and bringing the data directly into Zeppelin or
> > > analyzing it inside Zeppelin seems impossible. As far as I know, there
> > > are two ways in which I can proceed:
> > >
> > > 1) Buying a new laptop with more RAM and a better processor. OR
> > > 2) Choosing another dataset
> > >
> > > I have no problem with either of the above ways, or with anything else
> > > that you might suggest, but please let me know which way to proceed so
> > > that I can work quickly. Meanwhile, I will read more papers and
> > > publications on the possibilities of analyzing Common Crawl data.
> > >
> > > Thanks,
> > > Anish.
> > >
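P.S. Once the 432MB segment you already downloaded is reachable from the
machine running Zeppelin, even a plain Spark paragraph (no warcbase needed)
is enough for a first look at the WET format. This is only a sketch under a
few assumptions: the path is a placeholder for wherever your file lives, and
filtering out header lines by their "WARC"/"Content-" prefixes is a rough
heuristic, not a proper WET parser. Also note that sc.textFile reads each
.gz file as a single partition, so keep it to one or two files on a small
machine.

    // Quick sanity check on a single Common Crawl WET file.
    val wet = sc.textFile("/path/to/CC-MAIN-*.warc.wet.gz")

    // Each extracted page starts with a small WARC header block; counting
    // the WARC-Target-URI lines gives the number of pages in the file.
    val pages = wet.filter(_.startsWith("WARC-Target-URI:")).count()

    // A very rough look at the text itself: the most frequent tokens,
    // ignoring the header lines.
    val topTokens = wet
      .filter(l => !l.startsWith("WARC") && !l.startsWith("Content-"))
      .flatMap(_.toLowerCase.split("\\W+"))
      .filter(_.nonEmpty)
      .map(w => (w, 1))
      .reduceByKey(_ + _)
      .map(_.swap)
      .sortByKey(ascending = false)
      .take(20)

    println(s"pages: $pages")
    topTokens.foreach { case (count, token) => println(s"$token\t$count") }

If even that is too heavy for the laptop, the same paragraph should run
comfortably on the spot instance.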