I'm starting a new discussion thread for the bigpetstore-flink integration ...

I took a closer look at the code you've posted. It seems to me that you are
generating a lot of data locally on the client before you actually submit a
job to Flink (both the "customers" and the "stores" are generated locally).
Is that only some "seed" data? I would try to generate as much data as
possible in the cluster itself, which makes the generator very scalable.
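Something like this should work (an untested sketch; the map function is
just a placeholder for the real bigpetstore generator classes):

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class ParallelGeneratorSketch {

      public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // generateSequence() is split across all parallel instances, so
        // each TaskManager only materializes its own slice of the seed range
        DataSet<String> customers = env
            .generateSequence(0, 1000000)
            .map(new MapFunction<Long, String>() {
              @Override
              public String map(Long seed) {
                // placeholder: the real generator would build a Customer
                // object deterministically from this seed
                return "customer-" + seed;
              }
            });

        // the output path is just an example
        customers.writeAsText("hdfs:///tmp/customers");
        env.execute("bigpetstore generator sketch");
      }
    }

That way the client only ships the job graph, and the records are created
on the machines that process them.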
I don't think that you need to register a Kryo serializer for the Product
and Transaction types. I was able to run the code without the serializer
registration.

---------- Forwarded message ----------
From: jay vyas <jayunit100.apa...@gmail.com>
Date: Wed, Sep 2, 2015 at 2:56 PM
Subject: Re: Hardware requirements and learning resources
To: user@flink.apache.org

We're also working on a bigpetstore implementation for Flink, which will
help onboard Spark/MapReduce folks. I have prototypical code that runs a
simple in-memory job at https://github.com/bigpetstore/bigpetstore-flink;
contributions are welcome. Right now there is a serialization error.

On Wed, Sep 2, 2015 at 8:50 AM, Robert Metzger <rmetz...@apache.org> wrote:

> Hi Juan,
>
> I think the recommendations in the Spark guide are quite good, and they
> are similar to what I would recommend for Flink as well.
> Depending on the workloads you are interested in running, you can
> certainly use Flink with less than 8 GB per machine. I think you can
> start Flink TaskManagers with 500 MB of heap space and they'll still be
> able to process some GB of data.
>
> Anything above 2 GB is probably good enough for some initial
> experimentation (again depending on your workloads, network, disk speed,
> etc.).
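>
> For a small setup, that is a one-line change in conf/flink-conf.yaml (a
> sketch, assuming the configuration keys of the current 0.9.x releases):
>
>     # conf/flink-conf.yaml
>     taskmanager.heap.mb: 500    # JVM heap per TaskManager, in megabytes
>
> After changing it, bin/start-cluster.sh brings the cluster up as usual.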
>
> On Wed, Sep 2, 2015 at 2:30 PM, Kostas Tzoumas <ktzou...@apache.org>
> wrote:
>
>> Hi Juan,
>>
>> Flink is quite nimble with hardware requirements; people have run it on
>> old-ish laptops and also on the largest instances available from cloud
>> providers. I will let others chime in with more details.
>>
>> I am not aware of anything along the lines of the cheatsheet you
>> mention. If you actually try to write one, I would love to see it, and
>> it might be useful to others as well. Both systems use similar
>> abstractions at the API level (i.e., parallel collections), so if you
>> stay true to the functional paradigm and don't try to "abuse" the
>> system by exploiting knowledge of its internals, things should be
>> straightforward. This applies to the batch APIs; the streaming API in
>> Flink follows a true streaming paradigm, where you get an unbounded
>> stream of records and operators on these streams.
>>
>> Funny that you ask about a video for the DataStream slides. There is a
>> Flink training happening as we speak, and a video is being recorded
>> right now :-) Hopefully it will be made available soon.
>>
>> Best,
>> Kostas
>>
>> On Wed, Sep 2, 2015 at 1:13 PM, Juan Rodríguez Hortalá <
>> juan.rodriguez.hort...@gmail.com> wrote:
>>
>>> Answering my own question: I have found some nice training material at
>>> http://dataartisans.github.io/flink-training. There are even YouTube
>>> videos for some of the slides:
>>>
>>> - http://dataartisans.github.io/flink-training/overview/intro.html
>>>   https://www.youtube.com/watch?v=XgC6c4Wiqvs
>>>
>>> - http://dataartisans.github.io/flink-training/dataSetBasics/intro.html
>>>   https://www.youtube.com/watch?v=0EARqW15dDk
>>>
>>> The third lecture,
>>> http://dataartisans.github.io/flink-training/dataSetAdvanced/intro.html,
>>> roughly corresponds to https://www.youtube.com/watch?v=1yWKZ26NQeU, but
>>> not exactly, and there are more lessons at
>>> http://dataartisans.github.io/flink-training, on stream processing and
>>> the Table API, for which I haven't found a video. Does anyone have
>>> pointers to the missing videos?
>>>
>>> Greetings,
>>>
>>> Juan
>>>
>>> 2015-09-02 12:50 GMT+02:00 Juan Rodríguez Hortalá <
>>> juan.rodriguez.hort...@gmail.com>:
>>>
>>>> Hi list,
>>>>
>>>> I'm new to Flink, and I find this project very interesting. I have
>>>> experience with Apache Spark, and from what I've seen so far, Flink
>>>> provides an API at a similar abstraction level, but based on
>>>> single-record processing instead of batch processing. I've read on
>>>> Quora that Flink extends stream processing to batch processing, while
>>>> Spark extends batch processing to streaming. That makes Flink
>>>> especially attractive for low-latency stream processing. Anyway, I
>>>> would appreciate it if someone could point me to a list of hardware
>>>> requirements for the worker nodes in a Flink cluster, something along
>>>> the lines of
>>>> https://spark.apache.org/docs/latest/hardware-provisioning.html.
>>>> Spark is known for having quite high minimum memory requirements
>>>> (8 GB of RAM and 8 cores minimum), and I was wondering whether that
>>>> is also the case for Flink. Lower memory requirements would be very
>>>> interesting for building small Flink clusters for educational
>>>> purposes, or for small projects.
>>>>
>>>> Apart from that, I wonder whether there is a blog post by the
>>>> community about transitioning from Spark to Flink. I think it could
>>>> be interesting, as there are some similarities in the APIs but also
>>>> deep differences in the underlying approaches. I was thinking of
>>>> something like Breeze's cheatsheet comparing its matrix operations
>>>> with those available in Matlab and NumPy
>>>> (https://github.com/scalanlp/breeze/wiki/Linear-Algebra-Cheat-Sheet),
>>>> or like http://rosettacode.org/wiki/Factorial. Just an idea. Also,
>>>> any pointer to an online course, book, or training for Flink besides
>>>> the official programming guides would be much appreciated.
>>>>
>>>> Thanks in advance for your help.
>>>>
>>>> Greetings,
>>>>
>>>> Juan

--
jay vyas