That's quite a clever approach. So under your conception I'd still be using db:push to get the data up, right? It'd just be sitting in some tables that aren't used by the application yet? Interesting, so the site can access its database and function in the middle of a db:push?
I'm really impressed you wrote something as long as your first post on a BlackBerry!

On Mar 12, 9:54 am, Carl Fyffe <[email protected]> wrote:
> Let me clarify something (I am writing on a BlackBerry, so I couldn't
> read your post and write at the same time) that I mistakenly implied. I
> said "update your live data", and what I meant to say was this:
>
> Copy your live data as you merge it with the new dataset into the new
> tables you have created. So basically:
>
> * Bring the app down
> * Push the migrations for processing
> * Bring the app back online (shouldn't take but a minute)
> * Push the new dataset up while the app is live. None of the current
>   code base can see that table, so it should be fine
> * Load the queue with the records that need to be processed (all records)
> * Start the background job
> ** The job looks in the queue for a record to process
> ** The job grabs the record, merges it with the new dataset, and saves
>    it into the new table
> ** Repeat until the queue is down to the approximate number of records
>    that are updated daily
> * Put the app in maintenance mode
> * Process the remaining records as quickly as possible
> * Ensure data integrity
> * Push the latest code
> * Migrate the new tables to their final resting spot
> * Come out of maintenance mode
>
> I think that is more clear than my last attempt :-)
>
> On 3/12/10, Carl Fyffe <[email protected]> wrote:
> > This is just an idea:
> >
> > Instead of bringing the data down and turning your app off for
> > multiple days, you could leave the app up and do all of the processing
> > on Heroku. You will want to create a branch from your current
> > production code, and in this branch you will create the migrations for
> > the new tables. Then you create a background job that goes through
> > your live data and updates it. You should probably have a table that
> > holds a list of the data that needs to be updated. On first load this
> > will have every record in the db, and as your background job works
> > through the data it removes each record.
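The queue-drain job described above can be sketched in plain Ruby. Everything here is a hypothetical stand-in (no ActiveRecord, no real tables); in an actual app the queue and both tables would be database-backed models and the loop would run in a background worker:

```ruby
# Plain-Ruby sketch of the background job above. All names here
# (merge_with_new_dataset, the hashes standing in for tables) are
# made up; a real app would use ActiveRecord models and a worker.

# Queue of record ids still to be processed, seeded with every record.
queue = (1..100).to_a

# Stand-ins for the live table and the new, app-invisible table.
live_records = Hash[queue.map { |id| [id, "live-#{id}"] }]
new_table    = {}

# Hypothetical merge of a live row with the new dataset.
def merge_with_new_dataset(id, live_value)
  "#{live_value}+merged"
end

# Drain the queue until only about a day's worth of churn remains;
# the stragglers get processed during the short maintenance window.
DAILY_CHURN = 5
while queue.size > DAILY_CHURN
  id = queue.shift
  new_table[id] = merge_with_new_dataset(id, live_records[id])
end
```

In production the loop would also sleep or throttle between records so it doesn't starve the live app, per Carl's downside about not going as fast as possible.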
> > The best part is, whenever your live application makes a change, you
> > can have an after filter that puts an entry back into the queue table
> > for that data to be migrated (if it isn't already in there because it
> > hadn't been processed yet). When the jobs get close to being completed
> > (say, an hour's worth of work remains), you take the system down, let
> > the background job complete, and do a check of the data. You will
> > still have all of the old data and the new data at the same time.
> > Then you can push all of the other changes to the code base and turn
> > your system back on!
> >
> > The benefits are:
> >
> > * Your system stays live while the work is being done
> > * You don't have to worry about bringing data down and then back up
> > * The update can take as long as you need to get it right
> > * You can watch the progress by keeping an eye on the queue
> > * You know all of the data will be migrated because of the queue
> >
> > The downsides:
> >
> > * It will probably cost more, because you should bump your db size on
> >   Heroku to the max so users don't notice the impact
> > * It might take longer, because you don't want it to go as fast as
> >   possible, since that might impact your live system
> > * Lots of moving parts
> >
> > Just an idea. Good luck!
> >
> > On 3/12/10, Mike <[email protected]> wrote:
> >> I'm going to be adding a number of discrete but enormous (maybe many
> >> gigs each) datasets to my Heroku app's database. In many ways, I'm
> >> in a similar situation to the one Tobin faces in another current
> >> post, but with a different question:
> >> http://groups.google.com/group/heroku/browse_thread/thread/141c3ef84b...
> >>
> >> Right now I still haven't merged the datasets into my database.
> >> What's the best way for me to approach this?
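The "after filter" trick above — putting a changed record back on the queue unless it is already waiting — can be sketched the same way. In a Rails app it would live in an after_save callback or a controller after filter; the names below are hypothetical:

```ruby
require "set"

# Sketch of the re-enqueue-on-change idea. In a Rails app this would
# be an after_save callback (or a controller after filter) on the live
# model; enqueue_for_migration is a made-up name.

queue = Set.new([10, 11, 12])  # ids still waiting to be migrated

# Put a changed record back on the queue. Set#add? returns nil when
# the id is already present, so a record that hasn't been processed
# yet is never enqueued twice. Returns true if the id was newly added.
def enqueue_for_migration(queue, record_id)
  !queue.add?(record_id).nil?
end

enqueue_for_migration(queue, 42)  # freshly changed record: enqueued
enqueue_for_migration(queue, 10)  # not yet processed: left as-is
```

The idempotent enqueue is what makes the queue a reliable progress meter: its size only ever reflects distinct unmigrated records.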
> >> The lack of the ability to push individual tables with taps suggests
> >> to me that I'm going to want to do this as a one-shot deal, rather
> >> than doing each dataset sequentially and testing each one before
> >> proceeding to the next. I'm thinking about doing a db:pull to get
> >> the current state of my database, then shutting down my application
> >> in maintenance mode, running a local merge of the datasets (maybe
> >> taking days, I'm guessing, just to process the enormous things),
> >> doing some exhaustive local testing on the result, and then doing a
> >> push back to Heroku (maybe taking days again) before reactivating my
> >> app. Because of their massive size, it seems like after I've done
> >> one, any further db:pulls are going to be basically impossible. Just
> >> the idea of possibly having made a mistake in merging the datasets
> >> that I don't catch until after it's been pushed to the site gives me
> >> the shivers. Overall, I wonder if there is a better way that I'm
> >> overlooking.
> >>
> >> One possible alternative I thought of: would it be possible to do
> >> something involving creating a local bundle from my database using
> >> YamlDB? But then I'm not sure how to get the bundle back onto the
> >> server and restore from it? The documentation on Heroku doesn't seem
> >> to really talk about that possibility.
> >>
> >> Also, in my case this data is integral to the application, so I'm
> >> not going to be able to split it out into a separate Heroku
> >> application as in Tobin's case. Is there going to be any practical
> >> way for me to pull just the non-dataset data from the server to use
> >> on a development machine?
> >>
> >> Does anyone have any ideas on how they would approach this problem?
> >> If so, I'd be filled with gratitude.
> >>
> >> Mike
> >>
> >> --
> >> You received this message because you are subscribed to the Google
> >> Groups "Heroku" group.
> >> To post to this group, send email to [email protected].
> >> To unsubscribe from this group, send email to
> >> [email protected].
> >> For more options, visit this group at
> >> http://groups.google.com/group/heroku?hl=en.
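For reference, the pull → local merge → push workflow Mike describes maps onto the legacy `heroku` gem's taps-based commands roughly as below. This is a sketch only: the command set is from the 2010-era CLI, timings are guesses, and note that db:push replaces the remote database wholesale, which is exactly why the maintenance window matters.

```shell
# Sketch of the workflow from Mike's post, using the legacy heroku gem
# (taps-based db commands, circa 2010). Durations are placeholders.

heroku db:pull            # copy the production database down (slow for many-GB data)
heroku maintenance:on     # put the app in maintenance mode; no more writes

# ... merge the new datasets into the local copy and test exhaustively
#     (possibly days of processing) ...

heroku db:push            # replace the production database with the merged copy
heroku maintenance:off    # bring the app back up
```

Carl's plan avoids most of this window by doing the merge server-side in hidden tables, leaving only the final table swap inside maintenance mode.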
