Hmmm, interesting, looks to be working magically now... :) I must have written
some code late at night that magically fixed it and forgot. The original errors
I was getting were Kryo-related.

The objects aren't being serialized to anything useful on write, but that's an
easy fix, I'm sure. Onward and upward!
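In case it helps, the kind of "easy fix" meant here would just be mapping the
POJOs to explicit text before writing, instead of relying on the default
toString(). A minimal sketch, assuming a hypothetical Transaction type standing
in for the actual bigpetstore-flink classes:

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class WriteTransactionsSketch {

    // Hypothetical stand-in for the bigpetstore-flink transaction type.
    public static class Transaction {
        public String customerId;
        public String productId;
        public Transaction() {}
        public Transaction(String customerId, String productId) {
            this.customerId = customerId;
            this.productId = productId;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Transaction> transactions = env.fromElements(
                new Transaction("c1", "p1"),
                new Transaction("c2", "p2"));

        // Turn each POJO into an explicit CSV line so the written files
        // are human-readable, instead of whatever toString() produces.
        transactions.map(new MapFunction<Transaction, String>() {
            @Override
            public String map(Transaction t) {
                return t.customerId + "," + t.productId;
            }
        }).writeAsText("/tmp/transactions");

        env.execute("write transactions sketch");
    }
}
```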
On Wed, Sep 2, 2015 at 9:33 AM, Robert Metzger <rmetz...@apache.org> wrote:

> Okay, I see.
>
> As I said before, I was not able to reproduce the serialization issue
> you've reported.
> Can you maybe post the exception you are seeing?
>
> On Wed, Sep 2, 2015 at 3:32 PM, jay vyas <jayunit100.apa...@gmail.com>
> wrote:
>
>> Hey, thanks!
>>
>> Those are just seeds; the files aren't large.
>>
>> The scale-out data is the transactions.
>>
>> The seed data needs to be the same, shipped to ALL nodes, and then
>> the nodes generate transactions.
>>
>> On Wed, Sep 2, 2015 at 9:21 AM, Robert Metzger <rmetz...@apache.org>
>> wrote:
>>
>>> I'm starting a new discussion thread for the bigpetstore-flink
>>> integration ...
>>>
>>> I took a closer look into the code you've posted.
>>> It seems to me that you are generating a lot of data locally on the
>>> client, before you actually submit a job to Flink (both "customers"
>>> and "stores" are generated locally).
>>> Is that only some "seed" data?
>>>
>>> I would actually try to generate as much data as possible in the
>>> cluster, making the generator very scalable.
>>>
>>> I don't think that you need to register a Kryo serializer for the
>>> Product and Transaction types.
>>> I was able to run the code without the serializer registration.
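A rough sketch of the cluster-side generation suggested here: generateSequence
gives one element per transaction to create, fanned out across the cluster,
while a broadcast set ships the identical seed data to all nodes, matching the
"seeds everywhere, transactions at scale" split described above. The seed
values and the generation logic are hypothetical placeholders, not the actual
bigpetstore code:

```java
import java.util.List;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;

public class ClusterSideGenerationSketch {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Small seed data; fine to create on the client because it is tiny.
        DataSet<String> stores = env.fromElements("store-1", "store-2");

        // The scale-out part: one sequence element per transaction,
        // generated in parallel on the cluster, not on the client.
        DataSet<String> transactions = env.generateSequence(0, 1_000_000)
                .map(new RichMapFunction<Long, String>() {
                    private List<String> seedStores;

                    @Override
                    public void open(Configuration parameters) {
                        // The broadcast set is shipped, identical, to ALL nodes.
                        seedStores = getRuntimeContext().getBroadcastVariable("stores");
                    }

                    @Override
                    public String map(Long id) {
                        // Placeholder generation logic; the real generator
                        // would build a transaction from the id and the seeds.
                        String store = seedStores.get((int) (id % seedStores.size()));
                        return "txn-" + id + "@" + store;
                    }
                })
                .withBroadcastSet(stores, "stores");

        transactions.writeAsText("/tmp/generated-transactions");
        env.execute("cluster-side generation sketch");
    }
}
```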
>>>
>>> ---------- Forwarded message ----------
>>> From: jay vyas <jayunit100.apa...@gmail.com>
>>> Date: Wed, Sep 2, 2015 at 2:56 PM
>>> Subject: Re: Hardware requirements and learning resources
>>> To: user@flink.apache.org
>>>
>>> We're also working on a bigpetstore implementation for Flink which
>>> will help onboard Spark/MapReduce folks.
>>>
>>> I have prototypical code here that runs a simple job in memory;
>>> contributions welcome. Right now there is a serialization error:
>>> https://github.com/bigpetstore/bigpetstore-flink .
>>>
>>> On Wed, Sep 2, 2015 at 8:50 AM, Robert Metzger <rmetz...@apache.org>
>>> wrote:
>>>
>>>> Hi Juan,
>>>>
>>>> I think the recommendations in the Spark guide are quite good, and
>>>> are similar to what I would recommend for Flink as well.
>>>> Depending on the workloads you are interested in running, you can
>>>> certainly use Flink with less than 8 GB per machine. I think you can
>>>> start Flink TaskManagers with 500 MB of heap space and they'll still
>>>> be able to process some GB of data.
>>>>
>>>> Everything above 2 GB is probably good enough for some initial
>>>> experimentation (again, depending on your workloads, network, disk
>>>> speed, etc.).
>>>>
>>>> On Wed, Sep 2, 2015 at 2:30 PM, Kostas Tzoumas <ktzou...@apache.org>
>>>> wrote:
>>>>
>>>>> Hi Juan,
>>>>>
>>>>> Flink is quite nimble with hardware requirements; people have run
>>>>> it on old-ish laptops and also on the largest instances available
>>>>> from cloud providers. I will let others chime in with more details.
>>>>>
>>>>> I am not aware of something along the lines of the cheatsheet you
>>>>> mention. If you actually try to do this, I would love to see it,
>>>>> and it might be useful to others as well. Both use similar
>>>>> abstractions at the API level (i.e., parallel collections), so if
>>>>> you stay true to the functional paradigm and don't try to "abuse"
>>>>> the system by exploiting knowledge of its internals, things should
>>>>> be straightforward. These remarks apply to the batch APIs; the
>>>>> streaming API in Flink follows a true streaming paradigm, where you
>>>>> get an unbounded stream of records and operators on these streams.
>>>>>
>>>>> Funny that you ask about a video for the DataStream slides. There
>>>>> is a Flink training happening as we speak, and a video is being
>>>>> recorded right now :-) Hopefully it will be made available soon.
>>>>>
>>>>> Best,
>>>>> Kostas
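To make that batch/streaming distinction concrete, a minimal, hypothetical
side-by-side of the two APIs; the socket source and port below are arbitrary
choices for illustration:

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BatchVsStreamingSketch {

    public static void main(String[] args) throws Exception {
        // Batch (DataSet API): a bounded, parallel collection.
        // The job runs to completion and the program moves on.
        ExecutionEnvironment batchEnv = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> words = batchEnv.fromElements("flink", "spark", "flink");
        words.print(); // triggers execution of the bounded job

        // Streaming (DataStream API): an unbounded stream of records.
        // Operators run continuously; the job runs until cancelled.
        // (Feed it with e.g. `nc -lk 9999` on the same host.)
        StreamExecutionEnvironment streamEnv =
                StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> lines = streamEnv.socketTextStream("localhost", 9999);
        lines.print();
        streamEnv.execute("streaming half of the sketch");
    }
}
```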
>>>>>
>>>>> On Wed, Sep 2, 2015 at 1:13 PM, Juan Rodríguez Hortalá <
>>>>> juan.rodriguez.hort...@gmail.com> wrote:
>>>>>
>>>>>> Answering to myself, I have found some nice training material at
>>>>>> http://dataartisans.github.io/flink-training. There are even
>>>>>> videos on YouTube for some of the slides:
>>>>>>
>>>>>> - http://dataartisans.github.io/flink-training/overview/intro.html
>>>>>>   https://www.youtube.com/watch?v=XgC6c4Wiqvs
>>>>>>
>>>>>> - http://dataartisans.github.io/flink-training/dataSetBasics/intro.html
>>>>>>   https://www.youtube.com/watch?v=0EARqW15dDk
>>>>>>
>>>>>> The third lecture,
>>>>>> http://dataartisans.github.io/flink-training/dataSetAdvanced/intro.html,
>>>>>> more or less corresponds to
>>>>>> https://www.youtube.com/watch?v=1yWKZ26NQeU, but not exactly, and
>>>>>> there are more lessons at
>>>>>> http://dataartisans.github.io/flink-training, for stream
>>>>>> processing and the Table API, for which I haven't found a video.
>>>>>> Does anyone have pointers to the missing videos?
>>>>>>
>>>>>> Greetings,
>>>>>>
>>>>>> Juan
>>>>>>
>>>>>> 2015-09-02 12:50 GMT+02:00 Juan Rodríguez Hortalá <
>>>>>> juan.rodriguez.hort...@gmail.com>:
>>>>>>
>>>>>>> Hi list,
>>>>>>>
>>>>>>> I'm new to Flink, and I find this project very interesting. I
>>>>>>> have experience with Apache Spark, and from what I've seen so
>>>>>>> far, Flink provides an API at a similar abstraction level, but
>>>>>>> based on single-record processing instead of batch processing.
>>>>>>> I've read on Quora that Flink extends stream processing to batch
>>>>>>> processing, while Spark extends batch processing to streaming.
>>>>>>> Therefore I find Flink especially attractive for low-latency
>>>>>>> stream processing. Anyway, I would appreciate it if someone could
>>>>>>> point me to a list of hardware requirements for the slave nodes
>>>>>>> in a Flink cluster, something along the lines of
>>>>>>> https://spark.apache.org/docs/latest/hardware-provisioning.html.
>>>>>>> Spark is known for having quite high minimum memory requirements
>>>>>>> (8 GB of RAM and 8 cores minimum), and I was wondering whether
>>>>>>> that is also the case for Flink. Lower memory requirements would
>>>>>>> be very interesting for building small Flink clusters for
>>>>>>> educational purposes, or for small projects.
>>>>>>>
>>>>>>> Apart from that, I wonder if there is some blog post by the
>>>>>>> community about transitioning from Spark to Flink. I think it
>>>>>>> could be interesting, as there are some similarities in the APIs,
>>>>>>> but also deep differences in the underlying approaches. I was
>>>>>>> thinking of something like Breeze's cheatsheet comparing its
>>>>>>> matrix operations with those available in Matlab and NumPy,
>>>>>>> https://github.com/scalanlp/breeze/wiki/Linear-Algebra-Cheat-Sheet,
>>>>>>> or like http://rosettacode.org/wiki/Factorial. Just an idea
>>>>>>> anyway. Also, any pointer to some online course, book or training
>>>>>>> for Flink besides the official programming guides would be much
>>>>>>> appreciated.
>>>>>>>
>>>>>>> Thanks in advance for the help.
>>>>>>>
>>>>>>> Greetings,
>>>>>>>
>>>>>>> Juan

--
jay vyas