Not applicable to your problem, but interesting enough to share on this thread: http://basepub.dauphine.fr/bitstream/handle/123456789/5260/SD-Rtree.PDF?sequence=2
On Mon Dec 01 2014 at 3:48:14 PM andy petrella <andy.petre...@gmail.com> wrote:

> Indeed. However, I guess the important load and stress is in the processing of the 3D data (DEM or the like) into geometries/shades/whatever. Hence you can use Spark (GeoTrellis can be tricky for 3D, poke @lossyrob for more info) to perform these operations, then keep an RDD of only the resulting geometries.
>
> Those geometries probably won't be that heavy, so it might be possible to coalesce(1, true) to have the whole thing on one node (or, if your driver is beefier, do a collect/foreach) to create the index.
>
> You could also create a GeoJSON of the geometries and build the R-tree on it (not sure about this one).
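A minimal sketch of the collect-and-index idea above, assuming the processing step yields an RDD of JTS geometries and using a JTS STRtree as a stand-in for the R-tree (the function name, RDD name and geometry type are placeholders, not anything prescribed in the thread):

    import com.vividsolutions.jts.geom.Geometry
    import com.vividsolutions.jts.index.strtree.STRtree
    import org.apache.spark.rdd.RDD

    // geometries: the (comparatively small) results of the heavy 3D processing step.
    def buildIndex(geometries: RDD[Geometry]): STRtree = {
      val index = new STRtree()
      // Bring all geometries back to the driver and insert them into a single
      // in-memory R-tree, keyed by each geometry's bounding envelope.
      geometries.collect().foreach(g => index.insert(g.getEnvelopeInternal, g))
      index
    }

The coalesce(1, true) variant is the same idea, except the index is built inside a foreachPartition on the single remaining partition (i.e. on one executor) instead of on the driver.
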
> On Mon Dec 01 2014 at 3:38:00 PM Stadin, Benjamin <benjamin.sta...@heidelberg-mobil.com> wrote:
>
>> Thank you for mentioning GeoTrellis. I hadn't heard of this before. We have many custom tools and steps; I'll check whether our tools fit in. The end result is actually a 3D map for native OpenGL based rendering on iOS / Android [1].
>>
>> I'm using GeoPackage, which is basically SQLite with an R-tree and a small library around it (more lightweight than SpatiaLite). I want to avoid accessing the SQLite db from any other machine or task; that's why I thought I could use a long-running task which is the only process responsible for updating a locally stored SQLite db file. As you also said, SQLite (or almost any other file-based db) won't work well over the network. This isn't limited to the R-tree but is an expected limitation because of file locking issues, as also documented by SQLite.
>>
>> I also thought to do the same thing when rendering the (web) maps: in combination with the db handler which does the actual changes, run a map server instance on each node and configure it to add the database location as a map source once the task starts.
>>
>> Cheers
>> Ben
>>
>> [1] http://www.deep-map.com
>>
>> From: andy petrella <andy.petre...@gmail.com>
>> Date: Monday, 1 December 2014 15:07
>> To: Benjamin Stadin <benjamin.sta...@heidelberg-mobil.com>, "user@spark.apache.org" <user@spark.apache.org>
>> Subject: Re: Is Spark the right tool for me?
>>
>> Not quite sure which geo processing you're doing: is it raster or vector? More info would be appreciated for me to help you further.
>>
>> Meanwhile I can try to give some hints. For instance, did you consider GeoMesa <http://www.geomesa.org/2014/08/05/spark/>?
>> Since you need a WMS (or the like), did you consider GeoTrellis <http://geotrellis.io/> (go to the batch processing)?
>>
>> When you say SQLite, do you mean that you're using SpatiaLite? Or is your db not a geo one, just plain SQLite? In case you need an R-tree (or related) index, your headaches will come from congestion within your database transactions... unless you go to a dedicated database like Vertica (just mentioning).
>>
>> kr,
>> andy
>>
>> On Mon Dec 01 2014 at 2:49:44 PM Stadin, Benjamin <benjamin.sta...@heidelberg-mobil.com> wrote:
>>
>>> Hi all,
>>>
>>> I need some advice on whether Spark is the right tool for my zoo. My requirements share commonalities with "big data", workflow coordination and "reactive" event-driven data processing (as in, for example, Haskell Arrows), which doesn't make it any easier to decide on a tool set.
>>>
>>> NB: I have asked a similar question on the Storm mailing list, but was deferred to Spark. I previously thought Storm was closer to my needs, but maybe neither is.
>>>
>>> To explain my needs it's probably best to give an example scenario:
>>>
>>> - A user uploads small files (typically 1-200 files, file size typically 2-10 MB per file).
>>> - Files should be converted in parallel and on available nodes. The conversion is actually done via native tools, so there is not much big data processing required, but rather dynamic parallelization (for example, splitting the conversion step into as many conversion tasks as there are files). The conversion typically takes between several minutes and a few hours.
>>> - The converted files are gathered and stored in a single database (containing geometries for rendering).
>>> - Once the db is ready, a web map server is (re-)configured and the user can make small updates to the data set via a web UI.
>>> - … Some other data processing steps which I leave out for brevity …
>>> - There will initially be only a few concurrent users, but the system shall be able to scale if needed.
>>>
>>> My current thoughts:
>>>
>>> - I should avoid uploading files into the distributed storage during conversion; instead, each conversion filter should probably download the file it is actually converting from a shared place. Otherwise it's bad for scalability (too many redundant copies of the same temporary files if there are many concurrent users and many cluster nodes).
>>> - Apache Oozie seems an option for chaining my pipes together into a workflow. But is it a good fit with Spark? What options do I have with Spark to chain a workflow from pipes?
>>> - Apache Crunch seems to make it easy to dynamically parallelize tasks (Oozie itself can't do this). But I may not need Crunch after all if I have Spark, and it also doesn't seem to fit my last problem below.
>>> - The part that causes me the most headache is the user-interactive db update: I am considering Kafka as a message bus to broker between the web UI and a custom db handler (NB: the db is a SQLite file). But what about update responsiveness, won't Spark cause some lag (as opposed to Storm)?
>>> - The db handler probably has to be implemented as a long-running, continuing task, so that when a user sends some changes the handler writes them to the db file. However, I want this to be decoupled from the job, so these file updates should be done locally, only on the machine that started the job, for the whole lifetime of this user interaction. Does Spark allow creating such long-running tasks dynamically, so that when another (web) user starts a new task, a new long-running task is created and run on the same node, which eventually ends and triggers the next task? Also, is it possible to identify a running task, so that a long-running task can be bound to a session (the db handler working on local db updates until the task is done), and eventually restarted / recreated on failure?
>>>
>>> ~Ben
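
For the conversion step in Ben's scenario above (one native-tool run per uploaded file, spread over whatever nodes are free), a minimal Spark sketch could look like the following. The tool name "convert_tool", the path handling and the one-partition-per-file split are assumptions for illustration, not anything prescribed in the thread:

    import org.apache.spark.{SparkConf, SparkContext}
    import scala.sys.process._

    object ConvertUploadedFiles {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("file-conversion"))

        // Paths of the uploaded files on the shared storage (placeholder values).
        val files = args.toSeq

        // One partition per file, so each conversion task can be scheduled on any free node.
        val results = sc.parallelize(files, files.length).map { path =>
          // Each task fetches its own file from the shared place and shells out
          // to the native converter ("convert_tool" is a placeholder name).
          val exitCode = Seq("convert_tool", path, path + ".converted").!
          (path, exitCode)
        }.collect()

        results.filter(_._2 != 0).foreach { case (path, code) =>
          System.err.println("Conversion failed for " + path + " (exit code " + code + ")")
        }
        sc.stop()
      }
    }

The converted outputs could then be gathered and written into the single GeoPackage/SQLite file by one dedicated writer process, as Ben describes, rather than letting the parallel tasks touch the database directly.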