I'm starting a new discussion thread for the bigpetstore-flink integration ...

I took a closer look at the code you've posted. It seems to me that you are
generating a lot of data locally on the client before you actually submit a
job to Flink (both the "customers" and the "stores" are generated locally).
Is that only some "seed" data? I would try to generate as much data as
possible in the cluster itself, which makes the generator very scalable.
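Something like this should work (an untested sketch; the map function is
just a placeholder for the real bigpetstore generator classes):

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class ParallelGeneratorSketch {

      public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // generateSequence() is split across all parallel instances, so
        // each TaskManager only materializes its own slice of the seed range
        DataSet<String> customers = env
            .generateSequence(0, 1000000)
            .map(new MapFunction<Long, String>() {
              @Override
              public String map(Long seed) {
                // placeholder: the real generator would build a Customer
                // object deterministically from this seed
                return "customer-" + seed;
              }
            });

        // the output path is just an example
        customers.writeAsText("hdfs:///tmp/customers");
        env.execute("bigpetstore generator sketch");
      }
    }

That way the client only ships the job graph, and the records are created
on the machines that process them.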
I don't think that you need to register a Kryo serializer for the Product
and Transaction types. I was able to run the code without the serializer
registration.

---------- Forwarded message ----------
From: jay vyas <jayunit100.apa...@gmail.com>
Date: Wed, Sep 2, 2015 at 2:56 PM
Subject: Re: Hardware requirements and learning resources
To: user@flink.apache.org

We're also working on a bigpetstore implementation for Flink, which will
help onboard Spark/MapReduce folks. I have prototypical code that runs a
simple in-memory job at https://github.com/bigpetstore/bigpetstore-flink;
contributions are welcome. Right now there is a serialization error.

On Wed, Sep 2, 2015 at 8:50 AM, Robert Metzger <rmetz...@apache.org> wrote:

> Hi Juan,
>
> I think the recommendations in the Spark guide are quite good, and they
> are similar to what I would recommend for Flink as well.
> Depending on the workloads you are interested in running, you can
> certainly use Flink with less than 8 GB per machine. I think you can
> start Flink TaskManagers with 500 MB of heap space and they'll still be
> able to process some GB of data.
>
> Anything above 2 GB is probably good enough for some initial
> experimentation (again depending on your workloads, network, disk speed,
> etc.).
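>
> For a small setup, that is a one-line change in conf/flink-conf.yaml (a
> sketch, assuming the configuration keys of the current 0.9.x releases):
>
>     # conf/flink-conf.yaml
>     taskmanager.heap.mb: 500    # JVM heap per TaskManager, in megabytes
>
> After changing it, bin/start-cluster.sh brings the cluster up as usual.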
>
> On Wed, Sep 2, 2015 at 2:30 PM, Kostas Tzoumas <ktzou...@apache.org>
> wrote:
>
>> Hi Juan,
>>
>> Flink is quite nimble with hardware requirements; people have run it on
>> old-ish laptops and also on the largest instances available from cloud
>> providers. I will let others chime in with more details.
>>
>> I am not aware of anything along the lines of the cheatsheet you
>> mention. If you actually try to write one, I would love to see it, and
>> it might be useful to others as well. Both systems use similar
>> abstractions at the API level (i.e., parallel collections), so if you
>> stay true to the functional paradigm and don't try to "abuse" the
>> system by exploiting knowledge of its internals, things should be
>> straightforward. This applies to the batch APIs; the streaming API in
>> Flink follows a true streaming paradigm, where you get an unbounded
>> stream of records and operators on these streams.
>>
>> Funny that you ask about a video for the DataStream slides. There is a
>> Flink training happening as we speak, and a video is being recorded
>> right now :-) Hopefully it will be made available soon.
>>
>> Best,
>> Kostas
>>
>> On Wed, Sep 2, 2015 at 1:13 PM, Juan Rodríguez Hortalá <
>> juan.rodriguez.hort...@gmail.com> wrote:
>>
>>> Answering my own question: I have found some nice training material at
>>> http://dataartisans.github.io/flink-training. There are even YouTube
>>> videos for some of the slides:
>>>
>>> - http://dataartisans.github.io/flink-training/overview/intro.html
>>>   https://www.youtube.com/watch?v=XgC6c4Wiqvs
>>>
>>> - http://dataartisans.github.io/flink-training/dataSetBasics/intro.html
>>>   https://www.youtube.com/watch?v=0EARqW15dDk
>>>
>>> The third lecture,
>>> http://dataartisans.github.io/flink-training/dataSetAdvanced/intro.html,
>>> roughly corresponds to https://www.youtube.com/watch?v=1yWKZ26NQeU, but
>>> not exactly, and there are more lessons at
>>> http://dataartisans.github.io/flink-training, on stream processing and
>>> the Table API, for which I haven't found a video. Does anyone have
>>> pointers to the missing videos?
>>>
>>> Greetings,
>>>
>>> Juan
>>>
>>> 2015-09-02 12:50 GMT+02:00 Juan Rodríguez Hortalá <
>>> juan.rodriguez.hort...@gmail.com>:
>>>
>>>> Hi list,
>>>>
>>>> I'm new to Flink, and I find this project very interesting. I have
>>>> experience with Apache Spark, and from what I've seen so far, Flink
>>>> provides an API at a similar abstraction level, but based on
>>>> single-record processing instead of batch processing. I've read on
>>>> Quora that Flink extends stream processing to batch processing, while
>>>> Spark extends batch processing to streaming. That makes Flink
>>>> especially attractive for low-latency stream processing. Anyway, I
>>>> would appreciate it if someone could point me to a list of hardware
>>>> requirements for the worker nodes in a Flink cluster, something along
>>>> the lines of
>>>> https://spark.apache.org/docs/latest/hardware-provisioning.html.
>>>> Spark is known for having quite high minimum memory requirements
>>>> (8 GB of RAM and 8 cores minimum), and I was wondering whether that
>>>> is also the case for Flink. Lower memory requirements would be very
>>>> interesting for building small Flink clusters for educational
>>>> purposes, or for small projects.
>>>>
>>>> Apart from that, I wonder whether there is a blog post by the
>>>> community about transitioning from Spark to Flink. I think it could
>>>> be interesting, as there are some similarities in the APIs but also
>>>> deep differences in the underlying approaches. I was thinking of
>>>> something like Breeze's cheatsheet comparing its matrix operations
>>>> with those available in Matlab and NumPy
>>>> (https://github.com/scalanlp/breeze/wiki/Linear-Algebra-Cheat-Sheet),
>>>> or like http://rosettacode.org/wiki/Factorial. Just an idea. Also,
>>>> any pointer to an online course, book, or training for Flink besides
>>>> the official programming guides would be much appreciated.
>>>>
>>>> Thanks in advance for your help.
>>>>
>>>> Greetings,
>>>>
>>>> Juan

--
jay vyas