On Fri, Apr 29, 2011 at 4:34 PM, Nischal Shetty <[email protected]>wrote:
> But my use case is a little different. It's not so much about displaying > data as much it is about performing operations on your follower and > following list to find deltas between the two. > > I have these two use cases that need to be satisfied : > > 1. 80% of the time, I would need to see if a set of say 100 ids is present > among the user's "follower" ids that are saved in my datastore > Is this just a binary test? Or do you need the intersection of the two sets? > 2. 20% of the time, I would need to pull all the "follower" ids and > "following" ids and find the delta of the two (basically find the > intersection) > > Relation index wouldn't help me with case 1, would it? > > And case 2 would be fast if I saved the ids in a list but case1 fails. > > Any thoughts on the best thing for this would help. At the moment, from all > the help that I've been getting from all the wonderful people on this > mailing list has made me think of this approach : > > 1. Save the hundreds of thousands of "follower" ids in sharded lists / > array (more inclined towards array due to less size and I can save lot more > ids than I can in lists so less sharding) > Right - this is the relation index approach (at least if you use lists, and make them child entities of the user in question), more or less. > 2. Have a bloom filter and use it to satisfy case 1 that is to search for a > set of 100 ids and see if they are present in the datastore > > This seems good, however, bloom filters give false positives. I'm not very > particular about them if the error rate is under 10%, but I've not been able > to figure out if that's the case with regards to bloom filters when I have > users with 200,000+ ids. > Yes, that seems like a reasonable approach. Bloom filters are very configurable; to get a false positive rate under 10%, you need about 5 bits per entry; to get one under 1% you need about 10. You could store a master bloom filter against the user, along with a count, and expand the bloom filter when it gets over a certain fullness (doubling it in size like you would a hashtable, which will require re-hashing all existing entires, but will only occur rarely). If you get a bloom filter hit, you'll need to load the shards to check for an actual intersection. An even better approach, which shouldn't require much more space, would be to store one bloom filter per shard in the master record. When you want to check for intersections, you can first perform the tests against each bloom filter, loading only the ones that had at least one hit. This has the advantage of reducing the number of shards you need to load for heavy users, as well as not requiring you to resize the bloom filter, since you know the maximum size of a shard. In the worst case, though, where the IDs you're intersecting are evenly distributed amongst the shards, you won't gain a great deal with this approach. -Nick Johnson > > On 29 April 2011 10:32, Brandon Wirtz <[email protected]> wrote: > >> A really small font. >> >> >> >> >> >> *From:* [email protected] [mailto: >> [email protected]] *On Behalf Of *Nick Johnson (Google) >> *Sent:* Thursday, April 28, 2011 9:36 PM >> >> *To:* [email protected] >> *Subject:* Re: [google-appengine] Appropriate way to save hundreds of >> thousands of ids per user >> >> >> >> The question you need to ask, then, is how you want to display your data. >> You'll notice twitter doesn't provide Stephen Fry with a page that shows all >> million followers all at once; presumably you won't try and do that either >> (since it'd be totally unmanageable). Structure your data so you can fetch >> it in similar units to the way it'll be displayed. >> >> >> >> For followers, that probably means either a relation entity, or the >> relation index pattern. >> >> >> >> -Nick Johnson >> >> On Thu, Apr 28, 2011 at 1:44 PM, Nischal Shetty < >> [email protected]> wrote: >> >> I have users with more than half a million followers using my app. My own >> app account has 140k followers. I personally know a few of my app users who >> have more than 200,000 followers :) >> >> >> >> >> >> On 28 April 2011 08:44, Brandon Wirtz <[email protected]> wrote: >> >> Why not? My two biggest projects have 180k and 90k friends. >> >> >> >> *From:* [email protected] [mailto: >> [email protected]] *On Behalf Of *Nick Johnson (Google) >> *Sent:* Wednesday, April 27, 2011 7:40 PM >> >> >> *To:* [email protected] >> *Subject:* Re: [google-appengine] Appropriate way to save hundreds of >> thousands of ids per user >> >> >> >> Hi David, >> >> >> >> Can you elaborate on your exact use-case? You mentioned twitter friends, >> but I'm fairly sure no users have 200,000 friends on Twitter. >> >> >> >> -Nick Johnson >> >> On Mon, Apr 25, 2011 at 2:54 PM, David Parks <[email protected]> >> wrote: >> >> I did indeed mean pulling back a result set of say 200,000 rows. If I’m >> following the conversation correctly then what you described was storing all >> IDs, querying that one field and de-serializing all IDs into an array that >> you can then search for the ID’s you need. >> >> >> >> I like that idea. But I certainly can’t tell you if the overhead of >> reading all values, and deserializing them will be better or worse than the >> overhead of scrolling through a large result set and loading the database >> with hundreds of millions of rows. Of all databases you could be using, >> googles big table is certainly well designed for large data sets. >> >> >> >> It seems that your proposed method makes great sense when you need the >> entire result set (or close to it) for one or more users. But when you only >> need 100 results of 150,000, then the deserialization process is going to >> constitute a measurable overhead. Also, I can’t say for sure how the google >> datastore will perform when you commit hundreds of millions of rows to it. >> Of course, if small queries like are rare, then maybe it’s not so important >> to consider them. >> >> >> >> Anyway, I guess you could write, in perhaps a day or less, a very simple >> test case that populate the datastore with both scenarios and profile them. >> >> >> >> Doing the profiling work will probably give you some very useful insight >> and experience on how things will really perform in reality. >> >> >> >> I’d also suggest that you encapsulate this functionality so that you can >> easily replace one strategy with another without changing code unrelated to >> the data store (e.g. design your code using proper data access objects to >> keep this code separate from the rest of your code, and code to interfaces >> up front). >> >> >> >> >> >> >> >> *From:* [email protected] [mailto: >> [email protected]] *On Behalf Of *Nischal Shetty >> *Sent:* Monday, April 25, 2011 10:34 AM >> >> >> *To:* [email protected] >> *Subject:* Re: [google-appengine] Appropriate way to save hundreds of >> thousands of ids per user >> >> >> >> @David >> >> >> >> Querying the whole group would mean having 200,000 results for few of my >> users. Pulling all that and then searching, wouldn't that be inefficient? or >> are you talking about sharded ListProperty here? >> >> >> >> >> >> >> >> On 25 April 2011 05:41, David Parks <[email protected]> wrote: >> >> That seems like a reasonable approach. But I think you should do both >> tests. 1) let google do the work and store a lot of records, 2) query the >> whole group and parse it into an array and search the array. It wouldn’t be >> too hard to created a simple test case that populates the data for whatever >> # of users you need to plan for and profile the lookup and storage speeds of >> both. >> >> >> >> I’d love to know your results if you do test both approaches. >> >> >> >> >> >> *From:* [email protected] [mailto: >> [email protected]] *On Behalf Of *Nischal Shetty >> *Sent:* Friday, April 22, 2011 3:10 PM >> >> >> *To:* [email protected] >> >> *Subject:* Re: [google-appengine] Appropriate way to save hundreds of >> thousands of ids per user >> >> >> >> @David >> >> >> >> Thanks for the input. Every reply gives me some more insight into how I >> achieve this. My use case is as below : >> >> >> >> 1. At times I would need all the IDs at the same time in memory >> >> 2. Most of the times I would need to check if a set of IDs as input by the >> user (say 100 IDs) are present in the datastore >> >> >> >> I've been thinking of doing the following : >> >> >> >> 1. Persisting all the IDs by putting them into an array (I will probably >> have shards where each array would hold 50k IDs) >> >> 2. Implementing a bloom filter to search for the set of IDs if they exist >> in the datastore. >> >> >> >> >> >> On 22 April 2011 09:34, David Parks <[email protected]> wrote: >> >> I don’t know your intended use of these ID’s, my thoughts here are limited >> to assumed use, feel free to ignore thoughts that are off base for your use >> case. >> >> >> >> If, when you query for the IDs you are looking for **all** the IDs, then >> just serialize them into one field and retrieve them as one record and >> de-serialize them in a way that doesn’t require they all fit into memory at >> the same time (a tokenized CSV list is most straight forward example, but >> you can do more compact serializations). >> >> >> >> If you need to query for some subset of these IDs, then storing them in >> the datastore is indeed the way to go I suspect. You can batch many >> inserts/updates. You’ll have a large table, but that isn’t likely to be a >> problem with this data store, but do test it. If lookup times degrade with >> size you could consider partitioning your users into different groups >> (simple example: 1 group of users IDs that end in even #’s, another that >> ends in odd #’s), this can reduce the size of indexes and improve >> performance on some systems (I don’t have personal experience to tell you >> whether this is necessary in this system, but it’s a thought to consider). >> >> >> >> Again, I just offer this as food for thought. If you describe your >> intended access patterns it will probably help guide the discussion. Good >> luck. >> >> >> >> >> >> *From:* [email protected] [mailto: >> [email protected]] *On Behalf Of *nischalshetty >> *Sent:* Tuesday, April 19, 2011 1:15 PM >> *To:* [email protected] >> *Subject:* [google-appengine] Appropriate way to save hundreds of >> thousands of ids per user >> >> >> >> Every user in my app would have thousands of ids corresponding to them. I >> would need to look up these ids often. >> >> Two things I could think of: >> >> 1. Put them into Lists - (drawback is that lists have a maximum capacity >> of 5000(hope I'm right here) and I have users who would need to save more >> than 150,000 ids) >> 2. Insert each id as a unique record in the datastore (too much of data? >> as it would be user * ids of all users). Can I batch put 5000 records at a >> time? Can I batch get at least 100 - 500 records at a time? >> >> Is there any other way to do this? I hope my question's clear. Your >> suggestions are greatly appreciated. >> >> -- >> You received this message because you are subscribed to the Google Groups >> "Google App Engine" group. >> To post to this group, send email to [email protected]. >> To unsubscribe from this group, send email to >> [email protected]. >> For more options, visit this group at >> http://groups.google.com/group/google-appengine?hl=en. >> ------------------------------ >> >> No virus found in this message. >> Checked by AVG - www.avg.com >> Version: 10.0.1209 / Virus Database: 1500/3582 - Release Date: 04/18/11 >> >> -- >> You received this message because you are subscribed to the Google Groups >> "Google App Engine" group. >> To post to this group, send email to [email protected]. >> To unsubscribe from this group, send email to >> [email protected]. >> For more options, visit this group at >> http://groups.google.com/group/google-appengine?hl=en. >> >> >> >> >> -- >> -Nischal >> >> +91-9920240474 >> >> twitter: NischalShetty <http://twitter.com/nischalshetty> >> >> facebook: Nischal <http://facebook.com/nischal> >> >> >> >> <http://www.justunfollow.com> >> >> <http://www.justunfollow.com> >> >> -- >> You received this message because you are subscribed to the Google Groups >> "Google App Engine" group. >> To post to this group, send email to [email protected]. >> To unsubscribe from this group, send email to >> [email protected]. >> For more options, visit this group at >> http://groups.google.com/group/google-appengine?hl=en.<http://www.justunfollow.com> >> >> <http://www.justunfollow.com> >> ------------------------------ >> <http://www.justunfollow.com> >> >> >> >> No virus found in this message. >> Checked by AVG - www.avg.com <http://www.justunfollow.com> >> >> Version: 10.0.1209 / Virus Database: 1500/3589 - Release Date: >> 04/21/11<http://www.justunfollow.com> >> >> -- >> You received this message because you are subscribed to the Google Groups >> "Google App Engine" group. >> To post to this group, send email to [email protected]. >> To unsubscribe from this group, send email to >> [email protected]. >> For more options, visit this group at >> http://groups.google.com/group/google-appengine?hl=en.<http://www.justunfollow.com> >> >> >> >> <http://www.justunfollow.com> >> >> <http://www.justunfollow.com> >> >> <http://www.justunfollow.com> >> >> <http://www.justunfollow.com> >> >> -- >> You received this message because you are subscribed to the Google Groups >> "Google App Engine" group. >> To post to this group, send email to [email protected]. >> To unsubscribe from this group, send email to >> [email protected]. >> For more options, visit this group at >> http://groups.google.com/group/google-appengine?hl=en.<http://www.justunfollow.com> >> >> <http://www.justunfollow.com> >> ------------------------------ >> <http://www.justunfollow.com> >> >> >> >> No virus found in this message. >> Checked by AVG - www.avg.com <http://www.justunfollow.com> >> >> Version: 10.0.1209 / Virus Database: 1500/3595 - Release Date: >> 04/24/11<http://www.justunfollow.com> >> >> -- >> You received this message because you are subscribed to the Google Groups >> "Google App Engine" group. >> To post to this group, send email to [email protected]. >> To unsubscribe from this group, send email to >> [email protected]. >> For more options, visit this group at >> http://groups.google.com/group/google-appengine?hl=en.<http://www.justunfollow.com> >> >> >> >> >> -- >> Nick Johnson, Developer Programs Engineer, App Engine >> >> <http://www.justunfollow.com> >> >> -- >> You received this message because you are subscribed to the Google Groups >> "Google App Engine" group. >> To post to this group, send email to [email protected]. >> To unsubscribe from this group, send email to >> [email protected]. >> For more options, visit this group at >> http://groups.google.com/group/google-appengine?hl=en.<http://www.justunfollow.com> >> >> -- >> You received this message because you are subscribed to the Google Groups >> "Google App Engine" group. >> To post to this group, send email to [email protected]. >> To unsubscribe from this group, send email to >> [email protected]. >> For more options, visit this group at >> http://groups.google.com/group/google-appengine?hl=en. >> >> >> >> >> -- >> -Nischal >> >> +91-9920240474 >> >> twitter: NischalShetty <http://twitter.com/nischalshetty> >> >> facebook: Nischal <http://facebook.com/nischal> >> >> >> >> <http://www.justunfollow.com> >> >> <http://www.justunfollow.com> >> >> <http://www.justunfollow.com> >> >> -- >> You received this message because you are subscribed to the Google Groups >> "Google App Engine" group. >> To post to this group, send email to *[email protected]*. >> To unsubscribe from this group, send email to * >> [email protected]*. >> For more options, visit this group at * >> http://groups.google.com/group/google-appengine?hl=en*.<http://www.justunfollow.com> >> >> >> >> >> -- >> Nick Johnson, Developer Programs Engineer, App Engine >> >> <http://www.justunfollow.com> >> >> -- >> You received this message because you are subscribed to the Google Groups >> "Google App Engine" group. >> To post to this group, send email to *[email protected]*. >> To unsubscribe from this group, send email to * >> [email protected]*. >> For more options, visit this group at * >> http://groups.google.com/group/google-appengine?hl=en*.<http://www.justunfollow.com> >> >> -- >> You received this message because you are subscribed to the Google Groups >> "Google App Engine" group. >> To post to this group, send email to [email protected]. >> To unsubscribe from this group, send email to >> [email protected]. >> For more options, visit this group at >> http://groups.google.com/group/google-appengine?hl=en. >> > > > > -- > -Nischal > twitter: NischalShetty <http://twitter.com/nischalshetty> > facebook: Nischal <http://facebook.com/nischal> > > > > > -- > You received this message because you are subscribed to the Google Groups > "Google App Engine" group. > To post to this group, send email to [email protected]. > To unsubscribe from this group, send email to > [email protected]. > For more options, visit this group at > http://groups.google.com/group/google-appengine?hl=en. > -- Nick Johnson, Developer Programs Engineer, App Engine -- You received this message because you are subscribed to the Google Groups "Google App Engine" group. To post to this group, send email to [email protected]. To unsubscribe from this group, send email to [email protected]. For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.
