By the way - if you're going this route, see https://github.com/datastax/spark-cassandra-connector

On Thu, Jul 16, 2015 at 2:40 PM Vetle Leinonen-Roeim <[email protected]> wrote:

> You'll probably have to install it separately.
>
> On Thu, Jul 16, 2015 at 2:29 PM Jem Tucker <[email protected]> wrote:
>
>> Hi Vetle,
>>
>> IndexedRDD is persisted in the same way RDDs are as far as I am aware.
>> Are you aware if Cassandra can be built into my application or has to be a
>> standalone database which is installed separately?
>>
>> Thanks,
>>
>> Jem
>>
>> On Thu, Jul 16, 2015 at 12:59 PM Vetle Leinonen-Roeim <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> Not sure how IndexedRDD is persisted, but perhaps you're better off
>>> using a NoSQL database for lookups (perhaps using Cassandra, with the
>>> Cassandra connector)? That should give you good performance on lookups, but
>>> persisting those billion records sounds like something that will take some
>>> time in any case.
>>>
>>> Regards,
>>> Vetle
>>>
>>>
>>> On Thu, Jul 16, 2015 at 10:02 AM Jem Tucker <[email protected]>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> I have been using IndexedRDD as a large lookup (1 billion records) to
>>>> join with small tables (1 million rows). The performance of IndexedRDD is
>>>> great until it has to be persisted on disk. Are there any alternatives to
>>>> IndexedRDD or any changes to how I use it to improve performance with big
>>>> data volumes?
>>>>
>>>> Kindest Regards,
>>>>
>>>> Jem
>>>>
>>>
