Social network data / Graph properties of Riak

2011-11-18 Thread Jeroen van Dijk
Hi all,

I'm currently evaluating whether Riak would fit as the main storage of my
current project, a social network. The reason I am attracted to Riak and
less to a Graph database as main storage is that I want the easy horizontal
scalability and multi-site replication that Riak provides. The only thing I
doubt is whether the key-value/link model of Riak is flexible enough to be
able to store a property graph (http://arxiv.org/abs/1006.2361). I am not
asking whether the querying/graph traversing will be easy; I'm probably
going to use a graph database or a Pregel-like platform (e.g.
http://www.goldenorbos.org/) for that problem. I guess my main question is
whether it would be easy/feasible to import and export a property graph
into and out of Riak? Has someone done this before?

I realize the above might be too specific, so here are two more questions
that I think are relevant:

- Is there a known upper limit to the number of links that can be stored? (I
don't want to add them all at once, so 1000 per request is fine,
http://lists.basho.com/pipermail/riak-users_lists.basho.com/2010-March/000786.html
)
- Is there a way to add metadata to links (edges)? E.g. weights and other
attributes.

Any other ideas or advice are also highly appreciated.

Cheers,

Jeroen
___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


RE: Social network data / Graph properties of Riak

2011-11-18 Thread Jeroen van Dijk
I didn't include the riak list...

-- Forwarded message --
From: Jeroen van Dijk 
Date: Fri, Nov 18, 2011 at 8:36 PM
Subject: Re: Social network data / Graph properties of Riak
To: Eric Moritz 




On Fri, Nov 18, 2011 at 5:06 PM, Eric Moritz wrote:

> If you're storing the graph in a graph database where you're simply
> connecting keys to other keys and then you store the actual properties
> in Riak identified by those keys, then yes Riak will be a fine
> solution for you.
>
> For instance, you traverse the graph using a graph db, collecting a
> list of keys you want the properties for and then load the property
> documents for the keys you collected.
>

> Is that what you're envisioning?
>
>
Thank you for this suggestion. I might end up with this kind of solution,
but it is not completely what I'm envisioning. I would like to have the
horizontal scalability that Riak offers. If I were to depend on a graph
database to store the relationships, I think it would at some point turn
out to be the bottleneck, unless there are property graph databases I'm
unaware of that scale the way Riak does.

So I basically want to be able to store the complete graph/dataset in Riak
without losing any information on edges or vertices. The database should be
able to scale reads and writes horizontally by just adding machines, as
Riak is supposed to do. Advanced queries will be done in the graph
database, which should be fed with the data stored in Riak; this process
should not be too slow. I'm thinking of having my application write to both
Riak and a graph database, and then syncing from Riak to the graph database
at set times to make sure no data is lost in the graph database. I guess
this syncing part is also non-trivial for big datasets.

Does the above make sense to you?

Jeroen


> Eric.
>
> On Fri, Nov 18, 2011 at 10:38 AM, Jeroen van Dijk
>  wrote:
> > Hi all,
> > I'm currently evaluating whether Riak would fit as the main storage of my
> > current project, a social network. The reason I am attracted to Riak and
> > less to a Graph database as main storage is that I want the easy
> horizontal
> > scalability and multi-site replication that Riak provides. The only
> thing I
> > doubt is whether the key-value/link model of Riak is flexible enough to
> be
> > able to store a property graph (http://arxiv.org/abs/1006.2361). I am
> not
> > asking whether the querying/graph traversing will be easy; I'm probably
> > going to use a graph database or a Pregel like platform (e.g.
> > http://www.goldenorbos.org/) for that problem. I guess my main question
> is
> > whether it would be easy/feasible to import and export a property graph
> in
> > and from Riak? Has someone done this before?
> > I realize the above might be too specific, so here are two more questions
> > that I think are relevant:
> > - Is there a known upper limit of links that can be stored (I don't want
> to
> > add them all at once so 1000 per request is fine,
> >
> http://lists.basho.com/pipermail/riak-users_lists.basho.com/2010-March/000786.html
> )
> > - Is there a way to add metadata to links (edges)? E.g. weights and
> other
> > attributes.
> > Any other ideas or advice are also highly appreciated.
> > Cheers,
> > Jeroen
> > ___
> > riak-users mailing list
> > riak-users@lists.basho.com
> > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> >
> >
>
___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


RE: Social network data / Graph properties of Riak

2011-11-18 Thread Jeroen van Dijk
And I also didn't include the riak user list for this reply:


On Fri, Nov 18, 2011 at 7:04 PM, Aphyr  wrote:

> Depending on whether you think it will be more efficient to store the
> graph or its dual, consider each node a vertex and write the adjacency list
> as a part of its data. You can store whatever weights, etc. you need on the
> edges there.
>
> Don't use links; they're just a thin layer on top of mapreduce, so there's
> really not much advantage to using them. Links are also subject to HTTP
> header length limits, so storing more than a thousand on any node is likely
> to break down.
>
>
Thank you for this suggestion. Also thanks for the warning about not using
links for what I want. So you are saying each vertex will have a list of
the other vertices that it is connected to, and each edge is saved as a
key/value pair? Or are you saying each vertex should embed the adjacent
edges, meaning duplicated edges?

I'm guessing you mean the former, because that makes sense to me. So,
assuming users and items, you would save a graph as the following
key/value pairs:

//Vertices
user1: properties
user2: properties
item1: properties

//Edges
user1-owns-item1: properties
user1-follows-user2: properties
user2-follows-user1: properties

To be able to find the available edges, each vertex would need to
reference the keys of its edges. Is this what you mean?

If so, one more question about a possible problem. Say I have an item with
many outgoing edges, so it needs to embed all these references. I assume
this would make it really costly to fetch the item from Riak, even if you
are only interested in its normal properties. Wouldn't that mean you have
to save the properties separately from the edge references to make it
feasible?
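To make my question concrete, this is roughly the layout I have in mind, with a plain dict standing in for Riak buckets and all keys and field names made up by me:

```python
# A plain dict stands in for Riak here; in reality each entry would be a
# JSON object stored under its own key in some bucket.
store = {
    # Vertices: core properties kept separate from the (potentially huge)
    # list of edge references, so fetching the properties stays cheap.
    "user1": {"name": "Alice"},
    "user1/edges": ["user1-owns-item1", "user1-follows-user2"],
    "user2": {"name": "Bob"},
    "user2/edges": ["user2-follows-user1"],
    "item1": {"title": "Some item"},
    "item1/edges": ["user1-owns-item1"],

    # Edges as first-class objects, so they can carry properties
    # (weights and other attributes).
    "user1-owns-item1": {"type": "owns", "since": "2011-11-18"},
    "user1-follows-user2": {"type": "follows", "weight": 0.9},
    "user2-follows-user1": {"type": "follows", "weight": 0.4},
}

def neighbours(vertex_key):
    """Fetch the edge objects adjacent to a vertex via its reference list."""
    return [store[edge_key] for edge_key in store[vertex_key + "/edges"]]
```

So loading a vertex's properties is one get, and only a traversal pays for the edge-reference object.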

Did I grasp what you were proposing Kyle?

Thanks,
Jeroen



> --Kyle
>
>
> On 11/18/2011 07:38 AM, Jeroen van Dijk wrote:
>
>> Hi all,
>>
>> I'm currently evaluating whether Riak would fit as the main storage of
>> my current project, a social network. The reason I am attracted to Riak
>> and less to a Graph database as main storage is that I want the easy
>> horizontal scalability and multi-site replication that Riak provides.
>> The only thing I doubt is whether the key-value/link model of Riak is
>> flexible enough to be able to store a property graph
>> (http://arxiv.org/abs/1006.2361). I
>> am not asking whether the
>> querying/graph traversing will be easy; I'm probably going to use a
>> graph database or a Pregel like platform (e.g.
>> http://www.goldenorbos.org/) for that problem. I guess my main question
>> is whether it would be easy/feasible to import and export a property
>> graph in and from Riak? Has someone done this before?
>>
>> I realize the above might be too specific, so here are two more
>> questions that I think are relevant:
>>
>> - Is there a known upper limit of links that can be stored (I don't want
>> to add them all at once so 1000 per request is fine,
>> http://lists.basho.com/pipermail/riak-users_lists.basho.com/2010-March/000786.html
>> )
>> - Is there a way to add metadata to links (edges)? E.g. weights and
>> other attributes.
>>
>> Any other ideas or advice are also highly appreciated.
>>
>> Cheers,
>>
>> Jeroen
>>
>>
>>
>> ___
>> riak-users mailing list
>> riak-users@lists.basho.com
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>
>
___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


Re: Social network data / Graph properties of Riak

2011-11-19 Thread Jeroen van Dijk
On Fri, Nov 18, 2011 at 9:28 PM, Aphyr  wrote:

> On 11/18/2011 11:50 AM, Jeroen van Dijk wrote:
>
>> And I also didn't include the riak user list for this reply:
>>
>>
>> On Fri, Nov 18, 2011 at 7:04 PM, Aphyr <ap...@aphyr.com> wrote:
>>
>>Depending on whether you think it will be more efficient to store
>>the graph or its dual, consider each node a vertex and write the
>>adjacency list as a part of its data. You can store whatever
>>weights, etc. you need on the edges there.
>>
>>Don't use links; they're just a thin layer on top of mapreduce, so
>>there's really not much advantage to using them. Links are subject
>>to the HTTP header lengths too, so storing more than a thousand on
>>any node is likely to break down.
>>
>>
>> Thank you for this suggestion. Also thanks for the warning on not using
>> links for what I want. So you are saying each vertex will have a list of
>> the other vertices that it is connected to? And save each edge as
>> key/value pair? Or are you saying each vertex should embed the adjacent
>> edges, meaning duplicated edges?
>>
>
> Depends on whether you want to store the graph or its dual. If you're
> dealing with a sparse DAG where the input keys are likely to be vertex
> names, the natural choice is to store each vertex as an object in riak and
> each outbound edge from it as a property of that vertex.
>
> /users/user1:
>  owns: [item1, ...]
>  follows: [user2, ...]
>
> Of course, this method doesn't work well when you need to follow edges in
> reverse, so you may need to store the reciprocal relationship on target
> nodes:
>
> /items/item1:
>  owner: user1,
>
> /users/user2:
>  followed-by: [user1, ...]
>
> and so forth. That requires two writes and potentially lossy conflict
> resolution strategies, e.g. a follows b but b is not followed by a. We use
> processes which walk the graph continuously and enforce relationship
> integrity as they go. We have a social graph of several hundred million
> objects stored this way in Riak, and it works reasonably well.
>
> Naturally, you'll want to choose an encoding which is fast and
> space-efficient for a large dataset. Depending on your needs, JSON,
> protocol buffers, or fixed-length record entries might work well.
>
> Also consider the depth of traversals you'll need to perform. It may be
> best to use a graph database like neo4j for deep traversals. You could
> store your primary dataset in Riak and replicate only the important graph
> information to a graph DB via post-commit hooks. That would solve the
> reciprocal consistency problem (at the cost of replication lag), and could
> reduce the amount of data you need to put into the graph DB.
>
> Given that Linkedin has this problem, you might look into their tools as
> well.
>
>
Thanks for your detailed reply. I'm very happy to hear that you are using
Riak for a similar use case. My dataset will be of a similar magnitude. It
will be a graph where edges have properties (e.g. user 1 rates item x with
2 stars), but I think saving these edges as separate objects will work.
I'll use post-commit hooks to get all this data in a graph database
(probably Neo4j) and use this for graph traversing.
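As a sanity check for myself, I sketched the reciprocal-edge scheme (two writes per edge, plus a repair pass like the walker processes you mention) in toy Python, with a dict standing in for Riak; all names are my own:

```python
graph = {}  # key -> object; a dict stands in for Riak here

def put_edge(src, rel, dst, reciprocal_rel):
    """Write an edge on the source vertex and its reciprocal on the target.
    These are two separate writes, so under failures they can diverge."""
    graph.setdefault(src, {}).setdefault(rel, set()).add(dst)
    graph.setdefault(dst, {}).setdefault(reciprocal_rel, set()).add(src)

def repair(rel, reciprocal_rel):
    """Walk the whole graph and re-add any missing reciprocal entries,
    like the continuous integrity-enforcing processes described above."""
    for key, obj in list(graph.items()):
        for target in obj.get(rel, set()):
            graph.setdefault(target, {}).setdefault(reciprocal_rel, set()).add(key)

put_edge("user1", "follows", "user2", "followed-by")
# Simulate a lost reciprocal write, then let the walker fix it:
graph["user2"]["followed-by"].discard("user1")
repair("follows", "followed-by")
```

The real repair process would of course walk the keyspace incrementally rather than in one pass.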

A side question Kyle, how do you make sure the graph database is always
(eventually) consistent with Riak? E.g. how do you correct potential errors
during the post-commit hooks for instance? And say this database gets
corrupt how do you recover this graph database and get it in sync again
with Riak?

Cheers,
Jeroen

> --Kyle
>
>  I'm guessing you mean the former, because that makes sense to me. So you
>> would save a graph assuming users and items like the following key/value
>> pairs:
>>
>> //Vertices
>> user1: properties
>> user2: properties
>> item1: properties
>>
>> //Edges
>> user1-owns-item1: properties
>> user1-follows-user2: properties
>> user2-follows-user1: properties
>>
>> To be able to find the available edges, each vertex would need to
>> reference the keys of its edges. Is this what you mean?
>>
>> If so, one more question about a possible problem. Say I have an item
>> with many outgoing edges, so it needs to embed all these references.
>> This would make it really costly to fetch this item from Riak I assume,
>> even if you are only interested in normal properties. Wouldn't that mean
>> you have to save the properties separately from the edge references to
>> make it feasible?
>>
>> Did I grasp what you were proposing Kyle?
>>
>> Thanks,
>> Jeroen

Using (parts of) Bitcask backend as a cache store

2011-11-29 Thread Jeroen van Dijk
Hi all,

I'm currently investigating of how to structure my data in Riak. I'm
thinking of having buckets that have the purpose of storing the raw data
and having buckets that store certain views on this data to minimize
lookups and mapreduce operations at runtime. So the latter would in effect
be a cache store of these views.

I saw http://wiki.basho.com/Bitcask.html#Automatic-Expiration and I would
conclude that I could use bitcask storage to achieve this. I'm currently
just wondering whether the expiry_secs setting is a global one. Is it
possible to set this expiry per bucket? Otherwise, I assume, I would have
to give up using Bitcask for scenarios other than caching.

Cheers,
Jeroen
___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


Re: Using (parts of) Bitcask backend as a cache store

2011-11-29 Thread Jeroen van Dijk
Kresten,

Thank you very much for the quick response and the confirmation of this use
case of Bitcask. I'll soon experiment with this setup.
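For my own reference, I assume the relevant app.config section would look something along these lines; the backend names and the one-day expiry are placeholders I made up:

```erlang
%% Sketch of a multi_backend setup: one plain bitcask for raw data,
%% one with automatic expiration for the cache/view buckets.
{riak_kv, [
    {storage_backend, riak_kv_multi_backend},
    {multi_backend_default, <<"bitcask_data">>},
    {multi_backend, [
        {<<"bitcask_data">>,  riak_kv_bitcask_backend, []},
        {<<"bitcask_cache">>, riak_kv_bitcask_backend,
         [{expiry_secs, 86400}]}  %% cached views expire after a day
    ]}
]}
```

A cache bucket would then be pointed at the expiring backend via its bucket properties (the backend property), if I understand the docs correctly.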

Cheers,
Jeroen

On Tue, Nov 29, 2011 at 11:28 AM, Kresten Krab Thorup wrote:

> Jeroen,
>
> You can run multiple bitcask backends using the multi_backend, and
> configure them differently (one with a timeout and one without).  That's
> what we do when we need this.  The only issue is that you need to watch the
> number of file descriptors, since even one bitcask is pretty fd-hungry :-)
>
> Kresten
>
>
> Mobile: + 45 2343 4626 | Skype: krestenkrabthorup | Twitter: @drkrab
> Trifork A/S  |  Margrethepladsen 4  | DK- 8000 Aarhus C |  Phone : +45
> 8732 8787  |  www.trifork.com
>
>
>
> On Nov 29, 2011, at 10:37 AM, Jeroen van Dijk wrote:
>
> Hi all,
>
> I'm currently investigating of how to structure my data in Riak. I'm
> thinking of having buckets that have the purpose of storing the raw data
> and having buckets that store certain views on this data to minimize
> lookups and mapreduce operations at runtime. So the latter would in effect
> be a cache store of these views.
>
> I saw http://wiki.basho.com/Bitcask.html#Automatic-Expiration and I would
> conclude that I could use bitcask storage to achieve this. I'm currently
> just wondering whether the expiry_secs setting is a global one. Is it
> possible to set this expiry per bucket? Otherwise, I assume, I would have
> to give up using Bitcask for scenarios other than caching.
>
> Cheers,
> Jeroen
> ___
> riak-users mailing list
> riak-users@lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
>
___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


Riak search performance FUD

2011-11-30 Thread Jeroen van Dijk
Hi all,

I'm currently evaluating the search functionality of Riak. This involves
porting an application from Postgres/Sphinx to possibly only Riak. The
application I'm porting doesn't need advanced search, but it does need a
level of search that, I have come to believe, isn't provided in a feasible
way by Riak Search out of the box. I've also seen some sources that make me
worry about the performance of search [1, 2]. I hope to be proven wrong
here, or to get some advice on how to work around this, so I can just use
Riak Search without an external search facility. As a disclaimer, I haven't
done any benchmarks yet; this is just based on what I have read so far.

The use case I'm talking about is looking for a term that is very common
and thus will yield many results. My understanding of the implementation of
Riak Search [citation needed] is that a search is divided into a few
phases. The first one is collecting results for each term. After that come
merging, sorting, and limiting the result set. So for this particular case,
collecting all results would be infeasible and would kill performance, even
when a limit is set, because limiting comes in a phase after the collecting
and merging of results.
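To make concrete what I mean, here is a toy sketch of the phases as I understand them (this is emphatically not Riak's actual code; the index contents are invented):

```python
# Toy inverted index: term -> the full list of matching document keys.
index = {
    "common": ["doc%d" % i for i in range(100000)],
    "rare": ["doc1", "doc42"],
}

def search(terms, limit=10):
    # Phase 1: collect the *complete* result list for each term.
    # This is the expensive part for common terms, regardless of the limit.
    collected = [set(index.get(t, [])) for t in terms]
    # Phase 2: merge (here: intersect) the per-term results.
    merged = set.intersection(*collected) if collected else set()
    # Phase 3: only now are sorting and the limit applied.
    return sorted(merged)[:limit]
```

Even `search(["common"], limit=10)` pays for walking all 100k postings before the limit does anything, which is exactly my worry.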

The first question is, can the above be confirmed? I've read about Riak
Search performance optimization here [3], but that seems to be for a
different problem.

I've read here [1] that one can use search_fold to interrupt the collecting
phase when enough results have been fetched. I would like to know if this
is a best/official practice and if it really solves the issue?

I guess what I'm missing is a wiki page of "when and when not to use Riak
Search" or "how and how not to use Riak search". If this already exists I
completely missed it.

Cheers,
Jeroen

[1] http://blog.inagist.com/searching-with-riaksearch
[2]
http://www.productionscale.com/home/2011/11/20/building-an-application-upon-riak-part-1.html#axzz1enL4I6KTl
[3]
http://basho.com/blog/technical/2011/07/18/Boosting-Riak-Search-Query-Performance-With-Inline-Fields/


http://wiki.basho.com/Riak-Search.html
___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


Re: Riak search performance FUD

2011-12-01 Thread Jeroen van Dijk
FYI, I got this reaction from Elias (I'm forwarding it to the list so it
will be archived correctly. Thank you, Elias, btw):

On Wed, Nov 30, 2011 at 5:45 PM, Elias Levy wrote:

> On Wed, Nov 30, 2011 at 6:01 AM, wrote:
>
>> From: Jeroen van Dijk 
>>
>> The use case I'm talking about is when you are looking for a term that is
>> very common and thus will yield many results. My understanding of the
>> implementation of Riak [citation needed] is that the search is divided
>> into
>> a few phases. The first one is collecting results for each term. After
>> that
>> comes merging, sorting and limiting the result set. So for this particular
>> case collecting all results would be infeasible and would kill
>> performance.
>> Even when a limit is set because limiting comes in a phase after
>> collecting
>> and the merging of results.
>>
>
> That's correct.  We have similar issues.  We've resorted to creating the
> equivalent of multicolumn indexes by joining certain fields together and
> indexing those.  That is only possible because most of the data we want to
> index is structured or semi-structured.  You'd have to determine whether
> such an approach is feasible for your purposes.
>
> We also found 2i to be faster than Search, at the expense of requiring our
> app to perform tokenization for some of the fields we want to index, but
> we've stuck with Search as we need composable queries, which 2i does not
> yet provide.
>
> I've read here [1] that one can use search_fold to interrupt the collecting
>> phase when enough results are fetched. I would like to know if this a
>> best/official practice and if it really solves the issue?
>>
>
> Search_fold will only be useful if you plan on developing in Erlang and,
> if my understanding is correct, if you don't care about the order of the
> results (i.e. no scoring or field sorting).  Actually, the results may be
> partially ordered, as the merge_index backend may store the postings sorted
> by the inverse of time.
>
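The joined-field ("multicolumn index") trick Elias describes would, if I understand it correctly, amount to something like this toy sketch (field names are made up):

```python
def index_terms(doc):
    """Emit both plain and joined ('multicolumn') index terms for a record."""
    terms = {"gender:%s" % doc["gender"], "zip:%s" % doc["zip"]}
    # Joined field: selective even when each component is common on its own,
    # so a query can hit one short posting list instead of intersecting two
    # huge ones.
    terms.add("gender_zip:%s_%s" % (doc["gender"], doc["zip"]))
    return terms

doc = {"gender": "male", "zip": "94107"}
```

This only works when the query combinations are known up front, which fits structured or semi-structured data.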
___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


Re: Riak search performance FUD

2011-12-01 Thread Jeroen van Dijk
Hi Rusty,

On Wed, Nov 30, 2011 at 5:49 PM, Rusty Klophaus  wrote:

> Hi Jeroen,
>
> Your understanding is correct, the search query is parsed into a tree,
> where each leaf of the tree corresponds to a term. Each leaf sends back all
> matching terms, and results are intersected (or unioned) where the branches
> come together. So yes, if you were to run a search on a term with a large
> number of results, the system reads the entire list of keys (not objects)
> for that result.
>
> You may want to take another look at inline fields. They allow you to
> limit the results at the leaf level, and can greatly improve performance
> for common terms.
>
> The example I generally use to illustrate inline fields is to imagine
> searching for all males living in a specific zip code or postal code. In a
> normal search, a query on zip code would return ~100k results, and a query
> on "male" would return roughly half of the world's population. However, you
> can mark gender as an inline field, and then structure your query as two
> parts: a primary query on the zip code, and a filter on the gender. The
> filter is applied directly after the data is fetched from disk, before it
> is streamed through the rest of the system, so it is a very fast way to
> limit your results.
>
>
I will have to test this out; I currently don't see filters that would
apply to this case. I could maybe simplify the problem by ignoring these
common terms in the parts where they are common and regarding them as
stopwords there. To make the example more concrete: I would allow searching
for the term in titles, but not in the description, where it is common. I'm
guessing this will be a custom solution where one needs to manipulate the
query before sending it to Riak Search.
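For my own understanding, here is a toy sketch of how I picture the inline-field filtering (made-up data, not Riak's implementation):

```python
# Toy index: primary term -> postings, each carrying its inline field values.
index = {
    "zip:94107": {"u1": {"gender": "male"}, "u2": {"gender": "female"}},
}

def query(primary_term, inline_filter):
    """The primary term drives the lookup; the inline filter is applied to
    each posting as it is read, before anything is merged or streamed on."""
    postings = index.get(primary_term, {})
    return [key for key, inline in postings.items()
            if all(inline.get(f) == v for f, v in inline_filter.items())]
```

So the ~100k zip-code postings are scanned once, and the very common "male" term never produces a posting list of its own.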


> That said, there are currently known issues around sorting and pagination
> in Riak Search, the upshot is that if you apply sorting and pagination at
> the same time, it can give incorrect or unpredictable results; this might
> be something to consider while planning your application. (
> https://issues.basho.com/show_bug.cgi?id=867)
>

Thanks for pointing this out. I'll keep an eye on this issue.



> I would recommend against using search_fold because it could break in the
> future, it is not intended to be a part of the public API.
>

Thanks for this advice :)



> Hope that helps,
>

Definitely, thank you.

Cheers,
Jeroen



> Best,
> Rusty
>
>
>
> On Wed, Nov 30, 2011 at 5:01 AM, Jeroen van Dijk <
> jeroentjevand...@gmail.com> wrote:
>
>> Hi all,
>>
>> I'm currently evaluating the search functionality of Riak. This involves
>> porting an application from Postgres/Sphinx to possibly only Riak. The
>> application I'm porting doesn't need advanced search, but it does need a
>> level of search I have come to believe this isn't provided in a feasible
>> way by Riak Search out of the box. I've also seen some sources that make me
>> worry about the performance of search [1, 2]. I hope to be proved wrong
>> here or get some advice how to work around this so I can just use Riak
>> Search and without an external search facility. As a disclaimer, I haven't
>> done any benchmarks yet and this is just based on what I have read so far.
>>
>> The use case I'm talking about is when you are looking for a term that is
>> very common and thus will yield many results. My understanding of the
>> implementation of Riak [citation needed] is that the search is divided into
>> a few phases. The first one is collecting results for each term. After that
>> comes merging, sorting and limiting the result set. So for this particular
>> case collecting all results would be infeasible and would kill performance.
>> Even when a limit is set because limiting comes in a phase after collecting
>> and the merging of results.
>>
>> The first question is, can the above be confirmed? I've read about Riak
>> Search performance optimization here [3], but that seems to be for a
>> different problem.
>>
>> I've read here [1] that one can use search_fold to interrupt the
>> collecting phase when enough results are fetched. I would like to know if
>> this a best/official practice and if it really solves the issue?
>>
>> I guess what I'm missing is a wiki page of "when and when not to use Riak
>> Search" or "how and how not to use Riak search". If this already exists I
>> completely missed it.
>>
>> Cheers,
>> Jeroen
>>
>> [1] http://blog.inagist.com/searching-with-riaksearch
>> [2]
>> http://www.productionscale.com/home/2011/11/20/building-an-applicat