Hi Anish,

thank you for sharing your progress, and I totally know what you mean - that is the expected pain of working with real Big Data.
I would advise conducting a series of experiments:

*1 moderate machine*, Spark 1.6 in local mode, 1 WARC input file (~1 GB)
- Spark in local mode is a single JVM process, so fine-tune it and make sure it uses ALL the available memory (i.e. 16 GB)
- We are not going to use in-memory caching, so the storage part of the memory can be turned off [1], [2]
- AFAIK DataFrames use memory more efficiently than RDDs, though I am not sure we can benefit from that here (a rough DataFrame variant of your earlier snippet is sketched further below, after your quoted mail from Jul 12)
- Start with something simple, like `val mayBegLinks = mayBegData.keepValidPages().count()`, and make sure it works (see the sketch right after the quoted message below)
- Proceed further until a few more complex queries work

*Cluster of N machines*, Spark 1.6 in standalone cluster mode
- process a fraction of the whole dataset, i.e. 1 segment

I know this is not easy, but it is worth trying for 1 more week to see whether the approach outlined above works.

Last but not least - do not hesitate to reach out to the CommonCrawl community [3] for advice; there are people using Apache Spark there as well.

Please keep us posted!

1. http://spark.apache.org/docs/latest/tuning.html#memory-management-overview
2. http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
3. https://groups.google.com/forum/#!forum/common-crawl

--
Alex

On Wed, Jul 20, 2016 at 2:27 AM, anish singh <anish18...@gmail.com> wrote:

> Hello,
>
> The last two weeks have been tough and full of learning. The code in the
> previous mail, which performed only a simple transformation and reduceByKey()
> to count similar domain links, did not work even on the first segment (1005 MB)
> of data. So I studied and read extensively on the web - blogs (Cloudera,
> Databricks, Stack Overflow) and books on Spark - and tried all the options and
> configurations for memory and performance tuning, but the code still did not
> run. My current SPARK_SUBMIT_OPTIONS are set to
> "--driver-memory 9g --driver-java-options -XX:+UseG1GC -XX:+UseCompressedOops
> --conf spark.storage.memoryFraction=0.1", and even this does not work. Even
> simple operations such as rdd.count() after the transformations in the
> previous mail do not work. All this is on an m4.xlarge machine.
>
> Moreover, while trying to set up a standalone cluster on a single machine by
> following the instructions in the book 'Learning Spark', I messed up the
> '~/.ssh/authorized_keys' file, which locked me out of the instance, so I had
> to terminate it and start all over again after losing all the work done in
> one week.
>
> Today, I compared memory and CPU load values, based on the size of the data
> and the machine configurations, between two conditions: (when I worked on my
> local machine) vs. (a single m4.xlarge instance), where
>
> memory load = (data size) / (memory available for processing),
> cpu load = (data size) / (cores available for processing)
>
> The results of the comparison indicate that, given the amount of data, the
> AWS instance is 100 times more constrained than in the analysis that I
> previously did on my machine (for the calculations, please see sheet [0]).
> This has completely stalled the work, as I am unable to perform any further
> operations on the data sets. Further, choosing another instance type (such as
> one with 32 GiB) may also not be sufficient (as per the calculations in [0]).
> Please let me know if I am missing something, or how to proceed with this.
>
> [0]. https://drive.google.com/open?id=0ByXTtaL2yHBuYnJSNGt6T2U2RjQ
>
> Thanks,
> Anish.
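To make the first local-mode experiment concrete, here is a minimal, untested sketch of such a Zeppelin paragraph. It assumes warcbase's RecordLoader / keepValidPages API (as in the snippet quoted above); the file path and the memory numbers are placeholders to adjust for the actual machine.

// Rough sketch only (not tested). Driver memory for Zeppelin's Spark
// interpreter can be raised via conf/zeppelin-env.sh before starting the
// daemon, e.g. (the 12g value is a placeholder for a 16 GB box):
//
//   export SPARK_SUBMIT_OPTIONS="--driver-memory 12g --conf spark.storage.memoryFraction=0.1"
//
// Then, in a notebook paragraph, assuming warcbase is on the classpath:

import org.warcbase.spark.matchbox.RecordLoader
import org.warcbase.spark.rdd.RecordRDD._

// One local ~1 GB WARC file instead of a whole segment (placeholder path).
val mayBegData = RecordLoader.loadArchives("/data/commoncrawl/one-file.warc.gz", sc)

// Step 1: the simplest possible action - just count the valid pages.
val validPages = mayBegData.keepValidPages().count()
println(s"valid pages: $validPages")

Once this runs comfortably within the heap, the more complex queries can be added one at a time.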
>
> On Tue, Jul 12, 2016 at 12:35 PM, anish singh <anish18...@gmail.com> wrote:
>
> > Hello,
> >
> > I had been able to set up Zeppelin with Spark on an AWS EC2 m4.xlarge
> > instance a few days ago. In designing the notebook, I was trying to
> > visualize the link structure with the following code:
> >
> > val mayBegLinks = mayBegData.keepValidPages()
> >   .flatMap(r => ExtractLinks(r.getUrl, r.getContentString))
> >   .map(r => (ExtractDomain(r._1), ExtractDomain(r._2)))
> >   .filter(r => (r._1.equals("www.fangraphs.com")
> >     || r._1.equals("www.osnews.com") || r._1.equals("www.dailytech.com")))
> >
> > val linkWtMap = mayBegLinks.map(r => (r, 1)).reduceByKey((x, y) => x + y)
> > linkWtMap.toDF().registerTempTable("LnkWtTbl")
> >
> > where 'mayBegData' is some 2 GB of WARC for the first two segments of May.
> > This paragraph runs smoothly, but in the next paragraph, using %sql and the
> > following statement:
> >
> > select W._1 as Links, W._2 as Weight from LnkWtTbl W
> >
> > I get errors which are always java.lang.OutOfMemoryError - either GC
> > overhead limit exceeded or Java heap space - and the most recent one is
> > the following:
> >
> > org.apache.thrift.transport.TTransportException
> >   at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
> >   at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
> >   at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
> >   at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
> >   at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
> >   at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
> >   at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_interpret(RemoteInterpreterService.java:261)
> >   at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.interpret(RemoteInterpreterService.java:245)
> >   at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:312)
> >   at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:93)
> >   at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:271)
> >   at org.apache.zeppelin.scheduler.Job.run(Job.java:176)
> >   at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:329)
> >   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> >   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> >   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> >   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> >   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> >   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> >   at java.lang.Thread.run(Thread.java:745)
> >
> > I just wanted to know whether, even with an m4.xlarge instance, it is not
> > possible to process such a large amount (~2 GB) of data, because the above
> > code is relatively simple, I think. This is restricting the flexibility
> > with which the notebook can be designed. Please provide some
> > hints/suggestions, since I have been stuck on this since yesterday.
> >
> > Thanks,
> > Anish.
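Regarding the DataFrame idea from the experiment list above: here is a rough, untested variant of the quoted link-weight snippet that does the aggregation with groupBy/count on a DataFrame instead of reduceByKey on an RDD of pairs. It assumes warcbase's ExtractLinks/ExtractDomain and the mayBegData value from the quoted code, Spark 1.6's sqlContext in Zeppelin, and illustrative column names (src, dst, weight); whether it actually eases the memory pressure would need to be measured.

import org.warcbase.spark.matchbox.{ExtractDomain, ExtractLinks}
import org.warcbase.spark.rdd.RecordRDD._
import sqlContext.implicits._

val interestingDomains = Set("www.fangraphs.com", "www.osnews.com", "www.dailytech.com")

// Same extraction as in the quoted snippet, but ending up in a DataFrame.
val linksDF = mayBegData.keepValidPages()
  .flatMap(r => ExtractLinks(r.getUrl, r.getContentString))
  .map(r => (ExtractDomain(r._1), ExtractDomain(r._2)))
  .filter(r => interestingDomains.contains(r._1))
  .toDF("src", "dst")

// groupBy + count replaces the manual map(r => (r, 1)).reduceByKey(_ + _).
val linkWt = linksDF.groupBy("src", "dst").count().withColumnRenamed("count", "weight")
linkWt.registerTempTable("LnkWtTbl")

// In the next %sql paragraph, something like:
//   select src, dst, weight from LnkWtTbl order by weight desc limit 20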
> >
> > On Tue, Jul 5, 2016 at 12:28 PM, Alexander Bezzubov <b...@apache.org> wrote:
> >
> > > That sounds great, Anish!
> > > Congratulations on getting a new machine.
> > >
> > > No worries, please take your time and keep us posted on your exploration!
> > > Quality is more important than quantity here.
> > >
> > > --
> > > Alex
> > >
> > > On Mon, Jul 4, 2016 at 10:40 PM, anish singh <anish18...@gmail.com> wrote:
> > >
> > > > Hello,
> > > >
> > > > Thanks Alex, I'm so glad that you helped. Here's an update: I've ordered
> > > > a new machine with more RAM and a better processor that should arrive by
> > > > tomorrow. I will attempt to use it for the Common Crawl data and the AWS
> > > > solution that you suggested in the previous mail. I'm presently reading
> > > > papers and publications on the analysis of Common Crawl data. The
> > > > warcbase tool will definitely be used. I understand that the Common
> > > > Crawl datasets are important, and I will do everything it takes to build
> > > > notebooks on them; my only concern is that it may take more time than
> > > > the previous notebooks.
> > > >
> > > > Anish.
> > > >
> > > > On Mon, Jul 4, 2016 at 6:30 PM, Alexander Bezzubov <b...@apache.org> wrote:
> > > >
> > > > > Hi Anish,
> > > > >
> > > > > thanks for keeping us posted about your progress!
> > > > >
> > > > > CommonCrawl is an important dataset, and it would be awesome if we
> > > > > could find a way for you to build some notebooks for it through this
> > > > > year's GSoC program.
> > > > >
> > > > > How about running Zeppelin on a single big enough node in AWS for the
> > > > > sake of this notebook? If you use a spot instance you can get even big
> > > > > instances for the really affordable price of 2-4$ a day; just make
> > > > > sure you persist your notebooks on S3 [1] to avoid losing the data,
> > > > > and shut the instance down for the night.
> > > > >
> > > > > AFAIK we do not have any free AWS credits for now, even for GSoC
> > > > > students. If somebody knows a way to provide/get some - please feel
> > > > > free to chime in, I know there are some Amazonian people on the list :)
> > > > >
> > > > > But so far AWS spot instances are the most cost-effective solution I
> > > > > can think of. Bonus: if you host your instance in region us-east-1,
> > > > > transfer from/to S3 will be free, as that is where the CommonCrawl
> > > > > dataset lives.
> > > > >
> > > > > One more thing - please check out the awesome warcbase library [2]
> > > > > built by the internet preservation community. I find it really helpful
> > > > > when working with web archives.
> > > > >
> > > > > On the notebook design:
> > > > > - to understand the context of this dataset better, please do some
> > > > > research on how other people use it, what for, etc. That would be
> > > > > great material for the blog post
> > > > > - try to provide examples of all available formats: WARC, WET, WAT
> > > > > (maybe in the same or in different notebooks, it's up to you)
> > > > > - while using warcbase, mind that RDD persistence will not work until
> > > > > [3] is resolved, so avoid using it for now
> > > > >
> > > > > I understand that this can be a big task, so do not worry if it takes
> > > > > time (learning AWS, etc.) - just keep us posted on your progress
> > > > > weekly and I'll be glad to help!
> > > > >
> > > > > 1. http://zeppelin.apache.org/docs/0.6.0-SNAPSHOT/storage/storage.html#notebook-storage-in-s3
> > > > > 2. https://github.com/lintool/warcbase
> > > > > 3. https://github.com/lintool/warcbase/issues/227
> > > > >
> > > > > On Mon, Jul 4, 2016 at 7:00 PM, anish singh <anish18...@gmail.com> wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > (everything outside Zeppelin)
> > > > > > I had started work on the Common Crawl datasets and tried to first
> > > > > > have a look at only the data for May 2016. Out of the three formats
> > > > > > available, I chose WET (the plain-text format). The data for May
> > > > > > alone is divided into segments, and there are 24492 such segments.
> > > > > > I downloaded only the first segment for May and got 432 MB of data.
> > > > > > Now the problem is that my laptop is a very modest machine with a
> > > > > > Core 2 Duo processor and 3 GB of RAM, so that even opening the
> > > > > > downloaded data file in LibreOffice Writer filled the RAM completely
> > > > > > and hung the machine, and bringing the data directly into Zeppelin
> > > > > > or analyzing it inside Zeppelin seems impossible. As far as I know,
> > > > > > there are two ways in which I can proceed:
> > > > > >
> > > > > > 1) Buying a new laptop with more RAM and a better processor. OR
> > > > > > 2) Choosing another dataset.
> > > > > >
> > > > > > I have no problem with either of the above ways, or with anything
> > > > > > else that you might suggest, but please let me know which way to
> > > > > > proceed so that I can work at speed. Meanwhile, I will read more
> > > > > > papers and publications on the possibilities of analyzing Common
> > > > > > Crawl data.
> > > > > >
> > > > > > Thanks,
> > > > > > Anish.
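Finally, to tie together the warcbase and us-east-1 points from the quoted thread: below is a rough, untested sketch of reading a single file straight from the CommonCrawl bucket and computing a small top-N result, once the local-mode experiments pass. Everything bucket-related here is an assumption or a placeholder - the real file names come from the crawl's warc.paths listing, the s3n:// scheme requires the matching Hadoop S3 configuration and credentials (not shown), and no .cache()/.persist() is used because of the warcbase issue linked above (lintool/warcbase#227).

import org.warcbase.spark.matchbox.{ExtractDomain, RecordLoader}
import org.warcbase.spark.rdd.RecordRDD._

// Placeholder path: substitute one real entry from the May 2016 crawl's
// warc.paths file; running the instance in us-east-1 keeps the S3 transfer free.
val warcPath = "s3n://commoncrawl/crawl-data/<crawl-id>/segments/<segment-id>/warc/<file>.warc.gz"

// No .cache()/.persist() - RDD persistence is problematic with warcbase for now.
val records = RecordLoader.loadArchives(warcPath, sc)

// Top 20 domains by number of valid pages; takeOrdered keeps only a tiny
// result on the driver instead of collecting everything.
val topDomains = records.keepValidPages()
  .map(r => (ExtractDomain(r.getUrl), 1))
  .reduceByKey(_ + _)
  .takeOrdered(20)(Ordering.by[(String, Int), Int](-_._2))

topDomains.foreach(println)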