Re: external http calls from map functions?

2010-04-16 Thread Eric Gaumer
Check out:

http://github.com/kungfooguru/erlastic_search

Which is an Erlang ElasticSearch client. You may also want to look at the
pre/post commit hooks in Riak because they offer a nice integration point
(as opposed to building the index all at once).

Keep in mind that, regardless, HTTP has high overhead when it comes to
message passing semantics. We've achieved indexing speeds of about 10K
docs/sec (over HTTP), but we *always* use batching. I don't think ES has a
batch submission endpoint. In this case SOLR may be a better fit because
your map phase can build/submit batches and reduce the number of HTTP calls
(batching will no doubt provide better performance than mere parallelism of
single document submissions).
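
To make the batching point concrete, here is a rough Python sketch (the
Solr URL, field names, and the fetch_map_results iterator are all
hypothetical; the XML update format is Solr's) that accumulates documents
and submits one request per batch rather than one request per document:

=
import urllib2
from xml.sax.saxutils import escape

SOLR_UPDATE = 'http://localhost:8983/solr/update'  # assumed Solr endpoint
BATCH_SIZE = 500  # larger batches amortize the per-request HTTP overhead

def post_batch(docs):
    # One HTTP round trip for the whole batch instead of len(docs) calls.
    body = '<add>%s</add>' % ''.join(
        '<doc><field name="id">%s</field><field name="body">%s</field></doc>'
        % (escape(doc_id), escape(text)) for doc_id, text in docs)
    req = urllib2.Request(SOLR_UPDATE, body, {'Content-Type': 'text/xml'})
    urllib2.urlopen(req).read()

batch = []
for doc_id, text in fetch_map_results():  # hypothetical result iterator
    batch.append((doc_id, text))
    if len(batch) >= BATCH_SIZE:
        post_batch(batch)
        batch = []
if batch:
    post_batch(batch)
=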

Of course you'll have to partition the data yourself which can be quite a
pain when you want to grow/shrink the Solr cluster (i.e., no elasticity).
Adding a batch API to ES shouldn't be too difficult. You might want to hit
Shay up and see what he says.
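
On the partitioning point, even the simplest scheme shows where the pain
comes from. A naive modulo hash (illustrative Python; the shard URLs are
made up) remaps nearly every key whenever the shard count changes:

=
SOLR_SHARDS = ['http://solr1:8983/solr', 'http://solr2:8983/solr']  # assumed

def shard_for(doc_id):
    # Modulo partitioning: growing or shrinking SOLR_SHARDS remaps almost
    # every document to a new shard, hence the lack of elasticity.
    return SOLR_SHARDS[hash(doc_id) % len(SOLR_SHARDS)]
=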

Of course you could sit tight and wait for Riak Search.

-Eric


On Fri, Apr 16, 2010 at 9:06 AM, Kevin Smith  wrote:

> The only restrictions on map/reduce functions are a) they must return lists
> and b) the entire job must execute before the timeout period elapses (60
> seconds is the default).  Javascript functions have the additional
> restriction of not being able to call back into Erlang code due to the
> current state of Erlang/Javascript integration.
>
> The easiest way to do this would be to write your function in Erlang and
> use either httpc (packaged with Erlang) or ibrowse (
> http://github.com/cmullaparthi/ibrowse) to make the HTTP calls. If you're
> comfortable with Erlang & OTP I'd recommend making a separate OTP
> application to handle the HTTP calls and provide an API for your map/reduce
> functions to use. This design moves the HTTP calls out of the query flow and
> prevents a hanging HTTP call from timing out a query.
>
> --Kevin
> On Apr 15, 2010, at 11:20 PM, Colin Surprenant wrote:
>
> > I'll rephrase my question:
> >
> > Is it possible to call external http services from a map function in
> > JavaScript and/or Erlang? Any comments/pointers appreciated.
> >
> > Thanks,
> > Colin
> >
> > On Thu, Apr 15, 2010 at 12:44 PM, Colin Surprenant 
> wrote:
> >> Hi,
> >>
> >> I am trying to figure what the fastest way would be to send a
> >> mapreduce result set for indexing into a search engine system like
> >> elasticsearch.
> >>
> >> Of course, the trivial way to do it would be to simply gather the
> >> result set and push it back into the indexer using their http/rest
> >> api.
> >>
> >> Now, elasticsearch is distributed by nature and will allow parallel
> >> queries for document insertion for indexing. One way to leverage this
> >> would be to actually directly push a document from within a map
> >> function into the indexer using their rest api. This would completely
> >> distribute the index creation process and leverage the parallelism of
> >> elasticsearch.
> >>
> >> Would this be possible?
> >>
> >> Is this something I could do using the JavaScript mapreduce? and/or
> Erlang?
> >>
> >> Thanks,
> >> Colin
> >>
> >
>
>


Re: external http calls from map functions?

2010-04-16 Thread Eric Gaumer
On Fri, Apr 16, 2010 at 10:04 AM, Colin Surprenant wrote:

>
> I definitely have to do some experimentation with this idea and see if
> adding such triggers directly from map functions can have a
> significant impact on the data post-processing throughput out of the
> mapreduce framework. Another option, in line with your separate OTP
> application suggestion, would be to use intermediate queuing
> (rabbitmq, redis, ...) and just queue results which would be picked up
> by another external process in charge of feeding into the
> elasticsearch indexer. The process can then be tuned independently to
> parallelize document inserts and optimize this for your specific
> elasticsearch cloud characteristics.
>
> I think this approach could be more efficient, in the case of very
> large result sets, than doing a simple result set aggregation and
> re-feeding. There is also the option of chunked/streaming results set
> to consider. In the same line of thoughts I could just setup a
> listener on the result stream and feed it back into the intermediate
> queuing.
>

We're prototyping an ingest framework using RabbitMQ and RabbitHub
(PubSubHubBub & Webhooks) for a client. The idea is to allow internal
applications to publish documents and then have various back-ends subscribe
to those syndications via Webhooks using RabbitHub.

We've seen promising results but we still don't use this to "bootstrap" any
systems. The problem we're trying to solve is to reduce all the
point-to-point communication (decouple things), eliminate batch oriented
behavior (near real-time trickle feeding), and provide some sense of
durability when a backend is offline for any reason (apps just keep
feeding).

Check out RabbitHub which provides a general implementation of PubSubHubBub
(more than just Atom). Then you can just use a Webhook to have any new
document published to the broker automatically pushed to ES by adding ES
as a subscriber. Very cool stuff.

http://github.com/tonyg/rabbithub
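
For the intermediate-queue variant Colin describes, the consumer side can
stay very small. Here is a minimal sketch using a Redis list as the queue
(the queue name, payload shape, and ES index/type layout are all
assumptions):

=
import json
import urllib2
import redis  # redis-py; a Redis list serves as a simple work queue

ES_URL = 'http://localhost:9200/riak/docs/%s'  # assumed index/type layout
r = redis.Redis(host='localhost', port=6379)

while True:
    # BLPOP blocks until a (queue name, payload) pair is available.
    _, payload = r.blpop('riak-results')
    doc = json.loads(payload)  # assume {"id": ..., "body": {...}} payloads
    req = urllib2.Request(ES_URL % doc['id'], json.dumps(doc['body']),
                          {'Content-Type': 'application/json'})
    urllib2.urlopen(req).read()
=

A consumer like this can be tuned (more processes, batching) independently
of the map/reduce job, which is exactly the decoupling benefit described
above.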

Regards,
-Eric


Riak 0.10rc1 Startup Issues

2010-04-20 Thread Eric Gaumer
The build seems to have gone fine, but I get a core dump when I start Riak.

Slogan: Kernel pid terminated (application_controller)
({application_start_failure,erlang_js,{{bad_return_value,{error,{load_error,[70,97,105,108,101,100,32,116,111,32,1
System version: Erlang R13B04 (erts-5.7.5) [source] [64-bit] [smp:2:2]
[rq:2] [async-threads:5] [hipe] [kernel-poll:true]

This is on OS X 10.6.

Has anyone pulled off a successful build using R13B04?

The only thing (I see) out of the ordinary is some linker warnings while
compiling erlang_js:

ld: warning: in
/usr/local/lib/erlang/lib/erl_interface-3.6.5/lib/liberl_interface.a, file
is not of required architecture
ld: warning: in /usr/local/lib/erlang/lib/erl_interface-3.6.5/lib/libei.a,
file is not of required architecture
ld: warning: in c_src/system/lib/libjs.a, file is not of required
architecture
ld: warning: in c_src/system/lib/libnspr4.a, file is not of required
architecture

gcc version 4.0.1 (Apple Inc. build 5493)

-Eric


Re: Riak 0.10rc1 Startup Issues

2010-04-20 Thread Eric Gaumer
On Tue, Apr 20, 2010 at 2:23 PM, Kevin Smith  wrote:

> Do you have 64 or 32 bit Erlang installed? My guess is erlang_js is
> compiling the Spidermonkey VM for 64 bit but you have 32 bit Erlang
> installed.
>

I'm running a 64 bit version of Erlang. Could it have something to do with
hipe being enabled?

Erlang R13B04 (erts-5.7.5) [source] [64-bit] [smp:2:2] [rq:2]
[async-threads:0] [hipe] [kernel-poll:false]

-Eric


Re: Riak 0.10rc1 Startup Issues

2010-04-20 Thread Eric Gaumer
On Tue, Apr 20, 2010 at 3:01 PM, Kevin Smith  wrote:

> Hipe shouldn't make a difference. The linker errors:
>
> ld: warning: in
> /usr/local/lib/erlang/lib/erl_interface-3.6.5/lib/liberl_interface.a, file
> is not of required architecture
> ld: warning: in /usr/local/lib/erlang/lib/erl_interface-3.6.5/lib/libei.a,
> file is not of required architecture
> ld: warning: in c_src/system/lib/libjs.a, file is not of required
> architecture
> ld: warning: in c_src/system/lib/libnspr4.a, file is not of required
> architecture
>
> indicate the "bit-ness" of your Erlang install and the Spidermonkey and
> Netscape runtime do not agree. This causes the subsequent error when Riak
> tries to load the shared library into the Erlang VM.
>
> How old is the Mac you're using? I've also seen problems where Erlang is
> compiled for 64 bits but Spidermonkey's build detects 32 bit based on the
> hardware.



Sweet. So this got me pointed in the right direction.

The NSPR and SpiderMonkey libs being generated were in fact 32-bit. After
some digging around in the underlying Makefiles and rebar.conf, everything
looked proper. I watched the build run by, and all the appropriate flags
were being used, but I was still getting 32-bit libs. WTF?

It turns out that Apple's version of gcc 4.0.1 only supports i386, ppc, and
ppc64 as -arch flags, so -arch x86_64 was being ignored. I switched to gcc
4.2.1 and everything builds and runs fine.

I had been using 4.0.1 due to some bugs in building Python C extension
modules (upstream Python builds use 4.0.1 on OS X).

Thanks for the pointers,

-Eric


Re: bitbucket problems today and current code tree

2010-04-20 Thread Eric Gaumer
On Tue, Apr 20, 2010 at 6:28 PM, richard bucker wrote:

> I cloned riak from HG this evening and started to build.  That's when I got
> the following errors.  It seems to me that the URLs of the dependencies are
> wrong.
>
> /r
>
>
> rbuc...@ubudev:riak$(default)$ make rel
> ./rebar compile generate
> ==> luke (compile)
> Compiled src/luke_flow_sup.erl
> Compiled src/luke_flow.erl
> Compiled src/luke_phase.erl
> Compiled src/luke_phase_sup.erl
> Compiled src/luke.erl
> Compiled src/luke_sup.erl
> Compiled src/luke_phases.erl
> ==> riak_core (compile)
> Dependency not available: webmachine-"1.*" ({hg,
>   "http://bitbucket.org/basho/webmachine", "139"})
> Dependency not available: mochiweb-"0.02" ({hg,
>   "http://bitbucket.org/basho/mochiweb", "115"})
> make: *** [rel] Error 1
>

Bitbucket is back online now. Run `make all rel` instead.

-Eric


Protobufs & HTTP

2010-04-20 Thread Eric Gaumer
If I add some data using the protobufs interface, can I retrieve that data
over HTTP? I'm getting an error when I try to GET the id but the return_body
looks proper.

=
import socket
from struct import pack, unpack
import riakclient_pb2

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('127.0.0.1', 8087))

request = riakclient_pb2.RpbPutReq()
request.bucket = 'test'
request.key = '1'
request.content.value = '{"foo":"bar"}'
request.return_body = True

data = request.SerializeToString()
msg_len = len(data)

# Frame: 4-byte big-endian length (payload plus 1 for the message code),
# then the message code (11 = RpbPutReq), then the serialized payload.
s.send(pack('!lb%is' % msg_len, msg_len + 1, 11, data))

# Response frame: 4-byte length and 1-byte message code, then the payload.
(length, code) = unpack('!lb', s.recv(5))
pbm = s.recv(length - 1)
s.close()

resp = riakclient_pb2.RpbPutResp()
resp.ParseFromString(pbm)
print 'Return Body:', resp.contents[0].value
=

egau...@deus:(src)$ python riakclient.py
Return Body: {"foo":"bar"}

=
egau...@deus:(src)$ curl -v -H "Content-Type: application/json"
http://127.0.0.1:8098/riak/test/1
* About to connect() to 127.0.0.1 port 8098 (#0)
*   Trying 127.0.0.1... connected
* Connected to 127.0.0.1 (127.0.0.1) port 8098 (#0)
> GET /riak/test/1 HTTP/1.1
> User-Agent: curl/7.19.4 (universal-apple-darwin10.0) libcurl/7.19.4
OpenSSL/0.9.8l zlib/1.2.3
> Host: 127.0.0.1:8098
> Accept: */*
> Content-Type: application/json
>
< HTTP/1.1 500 Internal Server Error
< Server: MochiWeb/1.1 WebMachine/1.6 (eat around the stinger)
< Date: Wed, 21 Apr 2010 03:41:25 GMT
< Content-Type: text/html
< Content-Length: 1430
<
500 Internal Server Error
The server encountered an error while processing this
request:

[{webmachine_decision_core,'-decision/1-lc$^1/1-1-',
 [{error,
  {error,badarg,
  [{dict,fetch,
   [<<"content-type">>,
{dict,2,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
{{[],[],[],[],[],[],[],[],[],[],
  [[<<"X-Riak-VTag">>,52,70,122,120,114,66,87,55,51,
105,99,74,114,50,86,89,72,122,101,55,70,82]],
  [],[],

 [[<<"X-Riak-Last-Modified">>|{1271,821233,40580}]],
  [],[]}}}]},
   {riak_kv_wm_raw,content_types_provided,2},
   {webmachine_resource,resource_call,3},
   {webmachine_resource,do,3},
   {webmachine_decision_core,resource_call,1},
   {webmachine_decision_core,decision,1},
   {webmachine_decision_core,handle_request,2},
   {webmachine_mochiweb,loop,1}]}}]},
 {webmachine_decision_core,decision,1},
 {webmachine_decision_core,handle_request,2},
 {webmachine_mochiweb,loop,1},
 {mochiweb_http,headers,5},
* Connection #0 to host 127.0.0.1 left intact
* Closing connection #0
 {proc_lib,init_p_do_apply,3}]
=

-Eric


Re: Protobufs & HTTP

2010-04-20 Thread Eric Gaumer
On Tue, Apr 20, 2010 at 11:49 PM, Jon Meredith  wrote:

>  Hi Eric,
>
> For the HTTP interface to work correctly you need to supply a content type
> when you do your update - if you add something like
>
> request.content.content_type = 'application/json'
>
> Then you should be in business.  The python client at
> http://bitbucket.org/basho/riak-python-client includes support for
> protocol buffers now if you'd like to take a look at that.
>


Thanks Jon, I knew it had to be something simple I was missing.
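
For completeness, the corrected request from the earlier script, with the
one missing line added:

=
request = riakclient_pb2.RpbPutReq()
request.bucket = 'test'
request.key = '1'
request.content.value = '{"foo":"bar"}'
request.content.content_type = 'application/json'  # the missing piece
request.return_body = True
=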

-Eric


Re: one-to-very-many link associations

2010-04-22 Thread Eric Gaumer
Don't fall into the trap of "one size fits all". Riak is an amazing product
that can solve a number of tough problems. I don't think this is one of
them. You need/want a triple store to (correctly) model this sort of
problem. You need flexible schemas, ontologies, and a graph based query
language.

Take a look at: http://www.bigdata.com/

Regards,
-Eric


On Thu, Apr 22, 2010 at 4:23 AM, Orlin Bozhinov  wrote:

>  Riak Users,
>
> Thinking about a data modeling pattern that will allow one to not worry
> about how many links can be had with one-to-many (or many-to-many)
> scenarios.  This question has come up before in various places.  One answer
> I like is Sean's from this thread
> http://riak.markmail.org/thread/6e7ypt5ndjzjk7mr saying: "... at the point
> where you have that many links it becomes necessary to consider other
> options, including intermediary objects or alternative ways of representing
> the relationship".  I wonder if an _intermediary way_ could be baked into
> Ripple (or your client library of choice).  This is for the cases when
> one-to-many can become one-to-very-many.
>
> To make it more interesting, let's say we want to add metadata to the
> relationship as described in the second-to-last paragraph of
> http://blog.basho.com/2010/03/25/schema-design-in-riak---relationships/.
> Here is what I have in mind: {from}->{from_association}->{association}->{to}
> -- the {curlied} are bucket / objects and -> are links.  For example if
> {from} = "user"; and {association} = "interest"; and {to} = {whatever} there
> is interest in - e.g. "event", "place", "story", another "user" or even
> self-interest :)  But I'm getting ahead of myself.  Let's use a recent
> example from Basho's blog where a "user" links {to} = "task".  So we get:
> user -has--> user_interest -meta--> interest -in--> task.
>
> The "interest" association could imply "ownership" but maybe the
> application allows its "users" to express interest another's "task".  Maybe
> it's a collaborative effort...  Reverse-linking from the many interests /
> tasks to their respective owners is easy because it's just a single link for
> task -of--> user or interest -of--> user.  In the interests bucket I want to
> put all kinds of useful metadata.  There I would embed (via Composition as
> Ripple calls it) not only all the "tags", but also "notes", "star", etc.
> Think delicious bookmarks or google reader items and so on.  It seems like a
> common pattern.  Something that may fit the use case of @botanicus too.  One
> could represent all possible links (various associations) between two
> objects as metadata contained in a given "interest".  Ownership can be a
> type of interest for the sake of link-walking.
>
> There are three things happening here:
> 1. the "very many" (links through intermediary objects)
> 2. optional metadata (yet another intermediary object) - multiple
> associations between any two objects can be expressed through extra metadata
> rather than extra links
> 3. reusing the "very-many" and / or metadata intermediaries -linking--> to
> objects in different buckets
>
> The real issue (that #1 solves) is not having an easy ability to do "very
> many" links originating from the same object.  The #2 metadata object vs a
> few extra links for tags / notes (which are insignificant compared to the
> many interests a user can have) - makes it easier (in my eyes) to put in
> Redis for filtering...  Of course interests (#2) could be specialized
> (different metadata models) with regards to what they are about (#3).  On
> delicious that's just bookmarks.  I've got close to 6,000 of them.  Does
> that approach "very many" in terms of Riak?  If "very many" were easier to
> do (with client-library models or otherwise Riak itself) #2 & #3 would be
> indifferent about which intermediary leads to them (an extra link-walk step)
> as they are already possible anyway.  How could we step (automagically)
> through an intermediary object (the user_interest "very many" enabler
> bucket) - having a specific target object in mind?
>
> I think it may already be possible with current link-walking.  Then it's
> all a matter of managing the intermediary bucket / objects.  Not exactly
> sure how the max links are calculated.  According to one formula from the
> mailing list I may get 1000 headers (limit in mochiweb) * 200 links ("around
> 40 chars per link") = 200,000 links max?  That seems like "very many", but
> there was also something about performance burden...  If we took those 200
> with just a single header, pointing to 200 intermediary objects, each
> pointing to another 200 target objects we would get 40,000 links.  That's
> quite a few.  Of course that number could easily get much, much bigger
> (square the default limit).  What decides how many links per intermediary
> object is ideal?  Is it a setting that Basho could recommend a default for?
> Could Ripple automate that?  Some link creation logic is needed and if Riak
> d

Re: one-to-very-many link associations

2010-04-22 Thread Eric Gaumer
On Thu, Apr 22, 2010 at 9:01 AM, Sean Cribbs  wrote:

>
> One of the "other options" that I didn't mention was a graph database.  If
> your model seems to beg lots and lots of links, you might be better off
> looking at something that fits the traversal model better, like Neo4J,
> AllegroGraph, etc.  You could still use Riak for storing the primary
> objects, but keep your tightly interrelated stuff in the graph DB.
>  Remember, nosql is about choice and using the best tool for the job!
>
>
Well said and very classy of you to point out.

Regards,
-Eric


Re: one-to-very-many link associations

2010-04-22 Thread Eric Gaumer
On Thu, Apr 22, 2010 at 4:47 PM, Orlin Bozhinov  wrote:

>  Eric,
>
> Thanks for the bigdata.com link.  It's something I had missed spotting.
>
> The Semantic Web doesn't fit it all either.  I anticipate I'd much rather
> search riak (with its upcoming query language) than rely entirely on
> sparql.  Though I've always had the option in mind.  If you look at my
> previous reply you'll see I'm considering it.  I wonder what you think about
> the combined idea...
>
> If riak won't also become a real graph backend, then I'll likely go with an
> additional hosted database.  It could be a relational db (the most readily
> available kind), mongodb (i.e. mongohq), talis, etc.
>
>

Given today's landscape, and lessons learned, anyone who isn't thinking in
terms of diverse storage infrastructures isn't listening. There is no single
magic solution and in fact, the NoSQL ideology frees us from that mentality
(directed toward the relational database).

So yes, a combination of diverse components sounds like a reasonable
architecture.

Is it the ideal architecture? That's something you'll have to work out given
your technical requirements.

In terms of storage adapters, you're misunderstanding things. What they're
referring to is the ability to get RDF statements in/out of these
backends. This doesn't imply that they transform them into graph databases.
The graph aspect would actually be something that RDF.rb would have to
provide on top of the underlying storage mechanism (via a SPARQL query
engine).

Without such, you have no means of traversal. RDF.rb does not support RDFS,
OWL, or SPARQL. It's essentially an RDF serializer that can retrieve/store
RDF statements via storage adapters. Not to imply that's a bad thing; you're
just not going to get graph-like behavior from it.

Regards,
-Eric


Re: m/r error "All nodes failed"

2010-04-24 Thread Eric Gaumer
On Sat, Apr 24, 2010 at 7:31 AM, Alexander Sicular wrote:

> Hi,
>
> Here is some curl output that should get me a 200 OK, but instead I'm
> getting a 500:
>
> macpro1:dev siculars$ curl -i http://localhost:8098/riak/test-bucket-1/10
> HTTP/1.1 200 OK
> X-Riak-Vclock: a85hYGBgymDKBVIsbLoK3hlMiYx5rAytlu5H+SDCbM1JbOZG+VCJXpBEFgA=
> Vary: Accept-Encoding
> Server: MochiWeb/1.1 WebMachine/1.6 (eat around the stinger)
> Link: ; rel="up"
> Last-Modified: Sat, 24 Apr 2010 09:44:45 GMT
> ETag: 5RdHI4GyS3P3d9BovqQEcs
> Date: Sat, 24 Apr 2010 11:19:04 GMT
> Content-Type: application/json
> Content-Length: 84
>
>
> {"numVal":65486915222,"staticStringVal":"test_string","dynStringVal":"'roeNiGNJCS'"}
>
> macpro1:dev siculars$ curl -v -d '{"inputs":"test-bucket-1",
> "query":[{"map":{"language":"javascript","bucket":"test-bucket-1","key":"10",
> "keep":true}}]}' -H "Content-Type: application/json"
> http://127.0.0.1:8098/mapred
>
>

Not sure with regard to the airport-test itself, but in your mapred call
above, "test-bucket-1/10" should point to a JavaScript function (stored in
that Riak object).
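
If the intent was just to map over that one object with an ad hoc function,
the function body belongs in "source" instead. A sketch of the inline form
in Python (the function body itself is illustrative):

=
import json
import urllib2

job = {
    'inputs': [['test-bucket-1', '10']],  # a single bucket/key input pair
    'query': [{'map': {'language': 'javascript',
                       # inline source instead of a bucket/key reference
                       'source': 'function(v) { return [v.values[0].data]; }',
                       'keep': True}}],
}
req = urllib2.Request('http://127.0.0.1:8098/mapred', json.dumps(job),
                      {'Content-Type': 'application/json'})
print urllib2.urlopen(req).read()
=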

Regards,
-Eric


Re: building on OS X 10.6.2 Dependency not available: webmachine-"1.*" .. Dependency not available: mochiweb-"0.02

2010-04-24 Thread Eric Gaumer
On Sat, Apr 24, 2010 at 11:51 PM, Norman Khine  wrote:

> hello,
> i am trying to build riak on osx, but get the following dependency issues:
>
> $ make rel
> ./rebar compile generate
> ==> luke (compile)
> Compiled src/luke.erl
> Compiled src/luke_flow_sup.erl
> Compiled src/luke_phase.erl
> Compiled src/luke_flow.erl
> Compiled src/luke_phase_sup.erl
> Compiled src/luke_sup.erl
> Compiled src/luke_phases.erl
> ==> riak_core (compile)
> Dependency not available: webmachine-"1.*" ({hg,
>   "http://bitbucket.org/basho/webmachine", "139"})
> Dependency not available: mochiweb-"0.02" ({hg,
>   "http://bitbucket.org/basho/mochiweb", "115"})
> make: *** [rel] Error 1
>
>
> any help much appreciated.
>
>
Run 'make all rel'

-Eric


Re: Riak Search release date

2010-05-13 Thread Eric Gaumer
On Thu, May 13, 2010 at 11:02 AM, Senthilkumar Peelikkampatti <
senthilkumar.peelikkampa...@gmail.com> wrote:

> Hi,
>   It has been a while since Basho announced Riak Search (I guess it was
> almost a year back, if I remember correctly). Do you have any time frame for
> a release? Or will it be available only to Enterprise DS customers?
>


I've been holding back on the same question. I've got a Fortune 100 client
that I'm currently working with to prototype a system for doing distributed
transaction analysis over several billion transactions. We're obviously
leveraging the Map/Reduce framework in Riak. At times, our goal is to run
stats across the whole data set but we'd also like to run against subsets
that match some search criteria. Using filters in the map/reduce job is a
solution but requires visiting every document.
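
To be concrete about what filtering in the map phase looks like, a job along
these lines (field name and threshold are illustrative) still streams every
object in the bucket through the function and discards the non-matches:

=
job = {
    'inputs': 'transactions',  # full-bucket input: every object is visited
    'query': [{'map': {
        'language': 'javascript',
        'source': ('function(v) {'
                   '  var t = JSON.parse(v.values[0].data);'
                   '  return t.amount > 10000 ? [t] : [];'
                   '}'),
        'keep': True}}],
}
=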

Riak Search may be able to help. We've got a generous amount of funding due
to the complexity and urgency of the problem. Since we're prototyping, we
can live with bugs. Is the beta trial closed? If we can put a solution
together using Riak then my client will be looking at the enterprise support
and monitoring.

Regards,
-Eric


Re: Riak Search release date

2010-05-21 Thread Eric Gaumer
On Fri, May 21, 2010 at 12:02 PM, Mark Phillips  wrote:

> Hey All,
>
> As promised, here is the blog post on Riak Search.
>
> http://blog.basho.com/2010/05/21/riak-search/
>
>
Nice. Thanks for the update Mark.

It sounds like you're using Java in places (e.g., Lucene analyzers). Is that
assumption correct? I saw John mention work on an Erlang NIF to run an
embedded JVM. Probably unrelated but one has to wonder.

-Eric


Re: migration from mysql

2010-08-24 Thread Eric Gaumer
On Tue, Aug 24, 2010 at 10:42 AM, Jonathan Moore wrote:

> Hello there,
>
> I am new to Riak, but we are thinking of migrating some of our data from
> mysql into it and running with it for some of our website.
>
> Temporarily we would need to keep the data in sync whilst we make other
> changes. So for some time we would be using riak in parallel and
> synchronising the data. So there are two processes we need to create:
>
> 1) full data import
> 2) synchronising changes to the data
>
> We use solr which has a very usable DataImport handler to get many millions
> of mysql rows indexed, we also use this for delta-imports based on lists of
> unique IDs. Is there any similar technique for Riak? We have 16 million
> documents and counting, so we would rather not open a socket and push over
> HTTP. Currently the dataimporter selects them, and indexes in about 2 hours
> which, as we don't do this often, we can live with. Incremental
> synchronisation would be much smaller sets of documents (<1000 per 10 min)
> so I am less worried there.
>
> I have seen the PBC API which looks promising but I'd still need to fetch
> the rows and push. Does the node you connect to handle the consistent
> hashing in this case? Are there any benchmarks for this?
>
> Is there anything else out there for migrating this amount of data?
>
>

We specialize in enterprise search where this sort of data
integration/consolidation is common practice. Given Riak's distributed
architecture, you should be able to achieve excellent write capacity.

I would definitely use the Protocol Buffers interface if you're more
interested in performance. You can write a simple connector that iterates
over the rows in the database and uses the PBC API to publish to Riak. If
you need more throughput try adding more threads in your client.
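
As a rough sketch of such a connector (this uses the riak-python-client
mentioned earlier in this archive; check the transport class and store()
arguments against the client version you install, and note the table,
bucket, and column names are made up):

=
import MySQLdb
import MySQLdb.cursors
import riak  # http://bitbucket.org/basho/riak-python-client

# The PB transport avoids HTTP overhead; 8087 is the default PB port.
client = riak.RiakClient(port=8087, transport_class=riak.RiakPbcTransport)
bucket = client.bucket('documents')

# A server-side cursor streams the 16 million rows instead of loading them
# into memory at once.
db = MySQLdb.connect(host='localhost', user='app', db='app',
                     cursorclass=MySQLdb.cursors.SSCursor)
cursor = db.cursor()
cursor.execute('SELECT id, title, body FROM documents')

for row_id, title, body in cursor:
    obj = bucket.new(str(row_id), data={'title': title, 'body': body})
    obj.store(w=1, dw=0)  # relaxed write quorum for the bulk load
=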

We typically do a lot of data transformation from source to target (entity
recognition, classification, normalization, etc.) so our bottleneck is
usually the CPU bound transformation pipeline. With that said we typically
write to a cluster of transformation nodes in order to distribute the work
and maintain write throughput to the target system.

We designed an asynchronous, event-driven, transformational data integration
tool called pypes, which is open source. We have a
Riak publisher component that leverages the PB interface. I haven't done any
official benchmarking but it's noticeably faster than using HTTP. At the
moment, we're doing some prototype work with Riak under an NDA so I can't
provide much detail.

On the initial bootstrap, be sure to tune your write quorum.

Regards,
-Eric