Hmmm, interesting, looks to be working magically now... :) I must have written
some code late at night that magically fixed it and forgot. The original errors
I was getting were Kryo-related.

The objects aren't being serialized to anything useful on write, but that's an
easy fix, I'm sure. Onward and upward!
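In case it helps, the kind of "easy fix" meant here would just be mapping the
POJOs to explicit text before writing, instead of relying on the default
toString(). A minimal sketch, assuming a hypothetical Transaction type standing
in for the actual bigpetstore-flink classes:

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class WriteTransactionsSketch {

    // Hypothetical stand-in for the bigpetstore-flink transaction type.
    public static class Transaction {
        public String customerId;
        public String productId;
        public Transaction() {}
        public Transaction(String customerId, String productId) {
            this.customerId = customerId;
            this.productId = productId;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Transaction> transactions = env.fromElements(
                new Transaction("c1", "p1"),
                new Transaction("c2", "p2"));

        // Turn each POJO into an explicit CSV line so the written files
        // are human-readable, instead of whatever toString() produces.
        transactions.map(new MapFunction<Transaction, String>() {
            @Override
            public String map(Transaction t) {
                return t.customerId + "," + t.productId;
            }
        }).writeAsText("/tmp/transactions");

        env.execute("write transactions sketch");
    }
}
```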
On Wed, Sep 2, 2015 at 9:33 AM, Robert Metzger <rmetz...@apache.org> wrote:

> Okay, I see.
>
> As I said before, I was not able to reproduce the serialization issue
> you've reported.
> Can you maybe post the exception you are seeing?
>
> On Wed, Sep 2, 2015 at 3:32 PM, jay vyas <jayunit100.apa...@gmail.com>
> wrote:
>
>> Hey, thanks!
>>
>> Those are just seeds; the files aren't large.
>>
>> The scale-out data is the transactions.
>>
>> The seed data needs to be the same, shipped to ALL nodes, and then
>> the nodes generate transactions.
>>
>> On Wed, Sep 2, 2015 at 9:21 AM, Robert Metzger <rmetz...@apache.org>
>> wrote:
>>
>>> I'm starting a new discussion thread for the bigpetstore-flink
>>> integration ...
>>>
>>> I took a closer look into the code you've posted.
>>> It seems to me that you are generating a lot of data locally on the
>>> client, before you actually submit a job to Flink (both "customers"
>>> and "stores" are generated locally).
>>> Is that only some "seed" data?
>>>
>>> I would actually try to generate as much data as possible in the
>>> cluster, making the generator very scalable.
>>>
>>> I don't think that you need to register a Kryo serializer for the
>>> Product and Transaction types.
>>> I was able to run the code without the serializer registration.
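A rough sketch of the cluster-side generation suggested here: generateSequence
gives one element per transaction to create, fanned out across the cluster,
while a broadcast set ships the identical seed data to all nodes, matching the
"seeds everywhere, transactions at scale" split described above. The seed
values and the generation logic are hypothetical placeholders, not the actual
bigpetstore code:

```java
import java.util.List;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;

public class ClusterSideGenerationSketch {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Small seed data; fine to create on the client because it is tiny.
        DataSet<String> stores = env.fromElements("store-1", "store-2");

        // The scale-out part: one sequence element per transaction,
        // generated in parallel on the cluster, not on the client.
        DataSet<String> transactions = env.generateSequence(0, 1_000_000)
                .map(new RichMapFunction<Long, String>() {
                    private List<String> seedStores;

                    @Override
                    public void open(Configuration parameters) {
                        // The broadcast set is shipped, identical, to ALL nodes.
                        seedStores = getRuntimeContext().getBroadcastVariable("stores");
                    }

                    @Override
                    public String map(Long id) {
                        // Placeholder generation logic; the real generator
                        // would build a transaction from the id and the seeds.
                        String store = seedStores.get((int) (id % seedStores.size()));
                        return "txn-" + id + "@" + store;
                    }
                })
                .withBroadcastSet(stores, "stores");

        transactions.writeAsText("/tmp/generated-transactions");
        env.execute("cluster-side generation sketch");
    }
}
```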
>>>
>>> ---------- Forwarded message ----------
>>> From: jay vyas <jayunit100.apa...@gmail.com>
>>> Date: Wed, Sep 2, 2015 at 2:56 PM
>>> Subject: Re: Hardware requirements and learning resources
>>> To: user@flink.apache.org
>>>
>>> We're also working on a bigpetstore implementation for Flink which
>>> will help onboard Spark/MapReduce folks.
>>>
>>> I have prototypical code here that runs a simple job in memory;
>>> contributions welcome. Right now there is a serialization error:
>>> https://github.com/bigpetstore/bigpetstore-flink .
>>>
>>> On Wed, Sep 2, 2015 at 8:50 AM, Robert Metzger <rmetz...@apache.org>
>>> wrote:
>>>
>>>> Hi Juan,
>>>>
>>>> I think the recommendations in the Spark guide are quite good, and
>>>> are similar to what I would recommend for Flink as well.
>>>> Depending on the workloads you are interested in running, you can
>>>> certainly use Flink with less than 8 GB per machine. I think you can
>>>> start Flink TaskManagers with 500 MB of heap space and they'll still
>>>> be able to process some GB of data.
>>>>
>>>> Everything above 2 GB is probably good enough for some initial
>>>> experimentation (again, depending on your workloads, network, disk
>>>> speed, etc.).
>>>>
>>>> On Wed, Sep 2, 2015 at 2:30 PM, Kostas Tzoumas <ktzou...@apache.org>
>>>> wrote:
>>>>
>>>>> Hi Juan,
>>>>>
>>>>> Flink is quite nimble with hardware requirements; people have run
>>>>> it on old-ish laptops and also on the largest instances available
>>>>> from cloud providers. I will let others chime in with more details.
>>>>>
>>>>> I am not aware of something along the lines of the cheatsheet you
>>>>> mention. If you actually try to do this, I would love to see it,
>>>>> and it might be useful to others as well. Both use similar
>>>>> abstractions at the API level (i.e., parallel collections), so if
>>>>> you stay true to the functional paradigm and don't try to "abuse"
>>>>> the system by exploiting knowledge of its internals, things should
>>>>> be straightforward. These remarks apply to the batch APIs; the
>>>>> streaming API in Flink follows a true streaming paradigm, where you
>>>>> get an unbounded stream of records and operators on these streams.
>>>>>
>>>>> Funny that you ask about a video for the DataStream slides. There
>>>>> is a Flink training happening as we speak, and a video is being
>>>>> recorded right now :-) Hopefully it will be made available soon.
>>>>>
>>>>> Best,
>>>>> Kostas
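To make that batch/streaming distinction concrete, a minimal, hypothetical
side-by-side of the two APIs; the socket source and port below are arbitrary
choices for illustration:

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BatchVsStreamingSketch {

    public static void main(String[] args) throws Exception {
        // Batch (DataSet API): a bounded, parallel collection.
        // The job runs to completion and the program moves on.
        ExecutionEnvironment batchEnv = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> words = batchEnv.fromElements("flink", "spark", "flink");
        words.print(); // triggers execution of the bounded job

        // Streaming (DataStream API): an unbounded stream of records.
        // Operators run continuously; the job runs until cancelled.
        // (Feed it with e.g. `nc -lk 9999` on the same host.)
        StreamExecutionEnvironment streamEnv =
                StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> lines = streamEnv.socketTextStream("localhost", 9999);
        lines.print();
        streamEnv.execute("streaming half of the sketch");
    }
}
```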
>>>>>
>>>>> On Wed, Sep 2, 2015 at 1:13 PM, Juan Rodríguez Hortalá <
>>>>> juan.rodriguez.hort...@gmail.com> wrote:
>>>>>
>>>>>> Answering to myself, I have found some nice training material at
>>>>>> http://dataartisans.github.io/flink-training. There are even
>>>>>> videos on YouTube for some of the slides:
>>>>>>
>>>>>> - http://dataartisans.github.io/flink-training/overview/intro.html
>>>>>>   https://www.youtube.com/watch?v=XgC6c4Wiqvs
>>>>>>
>>>>>> - http://dataartisans.github.io/flink-training/dataSetBasics/intro.html
>>>>>>   https://www.youtube.com/watch?v=0EARqW15dDk
>>>>>>
>>>>>> The third lecture,
>>>>>> http://dataartisans.github.io/flink-training/dataSetAdvanced/intro.html,
>>>>>> more or less corresponds to
>>>>>> https://www.youtube.com/watch?v=1yWKZ26NQeU, but not exactly, and
>>>>>> there are more lessons at
>>>>>> http://dataartisans.github.io/flink-training, for stream
>>>>>> processing and the Table API, for which I haven't found a video.
>>>>>> Does anyone have pointers to the missing videos?
>>>>>>
>>>>>> Greetings,
>>>>>>
>>>>>> Juan
>>>>>>
>>>>>> 2015-09-02 12:50 GMT+02:00 Juan Rodríguez Hortalá <
>>>>>> juan.rodriguez.hort...@gmail.com>:
>>>>>>
>>>>>>> Hi list,
>>>>>>>
>>>>>>> I'm new to Flink, and I find this project very interesting. I
>>>>>>> have experience with Apache Spark, and from what I've seen so
>>>>>>> far, Flink provides an API at a similar abstraction level, but
>>>>>>> based on single-record processing instead of batch processing.
>>>>>>> I've read on Quora that Flink extends stream processing to batch
>>>>>>> processing, while Spark extends batch processing to streaming.
>>>>>>> Therefore I find Flink especially attractive for low-latency
>>>>>>> stream processing. Anyway, I would appreciate it if someone could
>>>>>>> point me to a list of hardware requirements for the slave nodes
>>>>>>> in a Flink cluster, something along the lines of
>>>>>>> https://spark.apache.org/docs/latest/hardware-provisioning.html.
>>>>>>> Spark is known for having quite high minimum memory requirements
>>>>>>> (8 GB of RAM and 8 cores minimum), and I was wondering whether
>>>>>>> that is also the case for Flink. Lower memory requirements would
>>>>>>> be very interesting for building small Flink clusters for
>>>>>>> educational purposes, or for small projects.
>>>>>>>
>>>>>>> Apart from that, I wonder if there is some blog post by the
>>>>>>> community about transitioning from Spark to Flink. I think it
>>>>>>> could be interesting, as there are some similarities in the APIs,
>>>>>>> but also deep differences in the underlying approaches. I was
>>>>>>> thinking of something like Breeze's cheatsheet comparing its
>>>>>>> matrix operations with those available in Matlab and NumPy,
>>>>>>> https://github.com/scalanlp/breeze/wiki/Linear-Algebra-Cheat-Sheet,
>>>>>>> or like http://rosettacode.org/wiki/Factorial. Just an idea
>>>>>>> anyway. Also, any pointer to some online course, book or training
>>>>>>> for Flink besides the official programming guides would be much
>>>>>>> appreciated.
>>>>>>>
>>>>>>> Thanks in advance for the help.
>>>>>>>
>>>>>>> Greetings,
>>>>>>>
>>>>>>> Juan

--
jay vyas