Let me clarify something I mistakenly implied (I'm writing on a
BlackBerry, so I couldn't read your post and write at the same time).
I said "update your live data"; what I meant to say was this:

Copy your live data into the new tables you have created, merging it
with the new dataset as you go. So basically:

* bring app down
* push migrations for processing
* bring app back online (shouldn't take but a minute)
* push the new dataset up while the app is live. None of the current
code base can see that table, so it should be safe
* load queue with records that need to be processed (all records)
* start background job
** job looks in queue for a record to process
** job grabs record, merges with new dataset and saves into new table
** repeat until the queue is down to roughly the number of records
that are updated daily
* put app in maintenance mode
* process remaining records as quickly as possible
* ensure data integrity
* push latest code
* migrate new tables to final resting spot
* come out of maintenance mode
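The merge step of the background job above could be sketched like this
(a minimal sketch in plain Ruby with in-memory stand-ins for the
tables; `live`, `dataset`, `merged`, and `queue` are all hypothetical
names, not anything Heroku or Rails gives you):

```ruby
# In-memory stand-ins for the real tables (hypothetical data).
live    = { 1 => { name: "a" }, 2 => { name: "b" } }   # current live table
dataset = { 1 => { extra: "x" }, 2 => { extra: "y" } } # new dataset, invisible to live code
merged  = {}                                           # the new table being filled

# Load the queue with every record that needs processing.
queue = live.keys.dup

# Background job: grab a record, merge it with the new dataset,
# save into the new table, repeat until the queue is empty.
until queue.empty?
  id = queue.shift
  merged[id] = live[id].merge(dataset.fetch(id, {}))
end
```

In the real job the loop would stop once the queue shrinks to roughly
a day's worth of updates, and the remainder gets processed during the
maintenance window.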

I think that is more clear than my last attempt :-)
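The after-filter trick from your mail below (re-queueing any record
the live app touches) could look roughly like this. Again plain Ruby:
`enqueue_for_migration` is a made-up helper, and a real app would use
a queue table with a uniqueness check instead of a Set:

```ruby
require "set"

# Pending record ids awaiting migration (hypothetical).
queue = Set.new([42])

# Called from an after filter whenever the live app changes a record:
# re-queue it unless it is already waiting to be processed.
def enqueue_for_migration(queue, record_id)
  queue.add(record_id)  # Set silently ignores ids already present
end

enqueue_for_migration(queue, 42)  # already queued, stays a single entry
enqueue_for_migration(queue, 7)   # freshly changed record gets queued
```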

On 3/12/10, Carl Fyffe <[email protected]> wrote:
> This is just an idea:
>
> Instead of bringing the data down, and turning your app off for
> multiple days you could leave the app up and do all of the processing
> on Heroku. You will want to create a branch from your current
> production code, and in this branch you will create the migrations for
> the new tables. Then you create a background job that goes through
> your live data and updates it. You should probably have a table that
> has a list of the data that needs to be updated. On first load this
> will have every record in the db, and as your background job works
> through the data it removes the record. The best part is, whenever
> your live application makes a change you can have an after filter that
> puts an entry back into the queue table for that data to be migrated
> (if it isn't already in there because it hadn't been processed yet).
> When the jobs get close to being completed (say an hour's worth of
> work remains), then you take the system down, let the background job
> complete, and do a check of the data. You will have all of the old
> data still and the new data at the same time. Then you can push all of
> the other changes to the code base and turn your system back on!
>
> The benefits are:
>
> * Your system stays live while the work is being done
> * you don't have to worry about bringing data down and then back up
> * the update can take as long as you need to get it right
> * you can watch the progress by keeping an eye on the queue
> * you know all of the data will be migrated because of the queue
>
> The down sides:
> * it will probably cost more, because you should bump your db size on
> Heroku up to the max so users don't notice the impact
> * it might take longer because you don't want it to go as fast as
> possible because it might impact your live system
> * lots of moving parts
>
> Just an idea. Good luck!
>
>
> On 3/12/10, Mike <[email protected]> wrote:
>> I'm going to be adding a number of discrete, but enormous (maybe many
>> gigs each), datasets to my Heroku app's database.  In many ways, I'm
>> in a similar situation faced by Tobin in another current post, but
>> with a different question:
>> http://groups.google.com/group/heroku/browse_thread/thread/141c3ef84b22fc18
>>
>> Right now I still haven't merged the datasets into my database yet.
>> What's the best way for me to approach this?
>>
>> The lack of ability to push individual tables with taps suggests to me
>> I'm going to want to do this probably as a one shot deal, rather than
>> doing each dataset sequentially and testing that one before proceeding
>> to the next.  I'm thinking about doing a db:pull to get the current
>> state of my database, and then shutting down my application in
>> maintenance mode, running a local merge of the datasets (maybe taking
>> days I'm guessing just to process the enormous things), doing some
>> exhaustive local testing on the result, and then doing a push back to
>> Heroku (maybe taking days again), before reactivating my app.  Because
>> of their massive size, it seems like after I've done one, doing any
>> further db:pulls is going to be basically impossible.  Just the idea
>> of possibly having made a mistake in merging the datasets that I don't
>> catch until after it's been pushed to the site gives me the shivers.
>> Overall, I wonder if there could be a better way that I'm overlooking.
>>
>> One possible alternative I thought of: would it be possible to
>> create a local bundle from my database using YamlDB?  But then I'm
>> not sure how to get the bundle back onto the server and then to
>> restore from it?  The documentation on Heroku
>> server and then to restore from it?  The documentation on Heroku
>> doesn't seem to really talk about that possibility.
>>
>> Also, in my case this data is integral to the application, so I'm not
>> going to be able to split it up into a separate Heroku application
>> like in Tobin's case.  Is there going to be any practical way for me
>> to be pulling just the non-dataset data from the server in order to
>> use on a development machine?
>>
>> Does anyone have any ideas on how they would approach this problem?
>> If so, I'd be filled with gratitude.
>>
>> Mike
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Heroku" group.
>> To post to this group, send email to [email protected].
>> To unsubscribe from this group, send email to
>> [email protected].
>> For more options, visit this group at
>> http://groups.google.com/group/heroku?hl=en.
>>
>>
>
