Hi Cody,
Thanks for the reply. Yeah, we thought of possibly doing this in a UDX in
Vertica somehow to get the lower-level cooperation, but it's a bit daunting.
We want to do this because there are things we want to do with the
result set in Spark that are not possible in Vertica. The DStream receive
Have you already tried using the Vertica Hadoop input format with Spark? I
don't know how it's implemented, but I'd hope that it has some notion of
Vertica-specific shard locality (which JdbcRDD does not).
If you're really constrained to consuming the result set in a single
thread, whatever proce
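For reference, a rough sketch of what wiring that input format into Spark might look like. The connector class names, key/value types, and mapred.vertica.* property keys below are assumptions from memory of the hadoop-vertica connector, not verified API, so check them against whatever your connector version actually ships:

    // Rough sketch only: the Vertica-specific class names and property keys
    // are assumptions; substitute what your Vertica Hadoop connector provides.
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.LongWritable
    import org.apache.spark.{SparkConf, SparkContext}

    object VerticaInputFormatSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("vertica-input-format"))

        val conf = new Configuration()
        // Illustrative property keys -- verify against the connector docs.
        conf.set("mapred.vertica.hostnames", "vertica-node1,vertica-node2")
        conf.set("mapred.vertica.database", "mydb")
        conf.set("mapred.vertica.username", "dbadmin")
        conf.set("mapred.vertica.password", "secret")
        conf.set("mapred.vertica.input.query", "SELECT id, payload FROM events")

        // Assumed input format / record classes shipped with the connector.
        val rdd = sc.newAPIHadoopRDD(
          conf,
          classOf[com.vertica.hadoop.VerticaInputFormat],
          classOf[LongWritable],
          classOf[com.vertica.hadoop.VerticaRecord])

        println(rdd.count())
        sc.stop()
      }
    }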
Yes, exactly.
The temp table is an approach, but then we need to manage its deletion, etc.
I'm sure we won't be the only people with this crazy use case.
If there isn't a feasible way to do this "within the framework", then that's
okay. But if there is a way, we are happy to write the code an
What you're saying is that, due to the intensity of the query, you need
to run a single query and partition the results, versus running one
query for each partition.
I assume it's not viable to throw the query results into another table
in your database and then query that using the normal app
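A rough sketch of that staging-table idea, with hypothetical names (events_staging, row_id, the events source table) and connection details: the expensive query runs exactly once to materialize the staging table, JdbcRDD then only does cheap row_id range scans, and the final DROP is the cleanup step that has to be managed somewhere:

    import java.sql.{Connection, DriverManager}
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.JdbcRDD

    object StagingTableSketch {
      // Hypothetical connection details.
      def getConnection(): Connection =
        DriverManager.getConnection("jdbc:vertica://vertica-host:5433/mydb", "dbadmin", "secret")

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("staging-table-read"))

        // 1. Run the expensive query exactly once, materializing the result with
        //    a numeric row_id so it can be range-partitioned later.
        val setup = getConnection()
        val totalRows: Long = try {
          val st = setup.createStatement()
          st.execute(
            """CREATE TABLE events_staging AS
              |SELECT ROW_NUMBER() OVER (ORDER BY event_time) AS row_id, t.*
              |FROM (SELECT user_id, event_time, payload FROM events /* stand-in for the heavy query */) t""".stripMargin)
          val rs = st.executeQuery("SELECT COUNT(*) FROM events_staging")
          rs.next()
          rs.getLong(1)
        } finally setup.close()

        // 2. Read the staging table back in parallel; each partition runs a cheap
        //    row_id range scan instead of re-running the heavy query.
        val rdd = new JdbcRDD(
          sc,
          () => getConnection(),
          "SELECT payload FROM events_staging WHERE row_id >= ? AND row_id <= ?",
          1L,          // lowerBound
          totalRows,   // upperBound
          16,          // numPartitions
          rs => rs.getString("payload"))

        println(rdd.count())

        // 3. Drop the staging table when done -- the cleanup that has to be managed.
        val cleanup = getConnection()
        try cleanup.createStatement().execute("DROP TABLE events_staging")
        finally cleanup.close()

        sc.stop()
      }
    }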
Jorn: Vertica
Cody: I posited the LIMIT just as an example of how JdbcRDD could be used least
invasively. Let's say we partitioned on a time field -- we would still
need to have N executions of those queries. The queries we have are very
intense and concurrency is an issue even if the
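To make that concrete, a small sketch (hypothetical table, column, and connection details, with event_ts assumed to be stored as epoch seconds, pasted into spark-shell where sc already exists): JdbcRDD substitutes each partition's slice of [lowerBound, upperBound] into the two '?' placeholders, so the expensive query body executes once per partition, and those executions run concurrently against the database:

    import java.sql.DriverManager
    import org.apache.spark.rdd.JdbcRDD

    // 24 partitions => 24 separate, concurrent executions of the query body,
    // each with its own event_ts range bound into the '?' placeholders.
    val rdd = new JdbcRDD(
      sc,
      () => DriverManager.getConnection("jdbc:vertica://vertica-host:5433/mydb", "dbadmin", "secret"),
      "SELECT user_id, payload FROM events WHERE event_ts >= ? AND event_ts <= ?",
      1425081600L,   // lowerBound: start of the day, epoch seconds
      1425167999L,   // upperBound: end of the day, epoch seconds
      24,            // numPartitions
      rs => (rs.getLong("user_id"), rs.getString("payload")))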
What database are you using?
On Feb 28, 2015, at 18:15, "Michal Klos" wrote:
> Hi Spark community,
>
> We have a use case where we need to pull huge amounts of data from a SQL
> query against a database into Spark. We need to execute the query against
> our huge database and not a substitute (Spa
I'm a little confused by your comments regarding LIMIT. There's nothing
about JdbcRDD that depends on LIMIT. You just need to be able to partition
your data in some way such that it has numeric upper and lower bounds.
Primary key range scans, not LIMIT, would ordinarily be the best way to do
that.
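A minimal sketch of that pattern (hypothetical table and column names and connection details, again in spark-shell where sc is defined): the query carries two '?' placeholders, and JdbcRDD fills them in per partition with sub-ranges of the primary key bounds:

    import java.sql.DriverManager
    import org.apache.spark.rdd.JdbcRDD

    val rdd = new JdbcRDD(
      sc,
      () => DriverManager.getConnection("jdbc:vertica://vertica-host:5433/mydb", "dbadmin", "secret"),
      // The two '?' are filled per partition with a sub-range of [lowerBound, upperBound].
      "SELECT id, payload FROM events WHERE id >= ? AND id <= ?",
      1L,          // lowerBound: smallest primary key value
      10000000L,   // upperBound: largest primary key value
      8,           // numPartitions
      rs => (rs.getLong("id"), rs.getString("payload")))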
Hi Spark community,
We have a use case where we need to pull huge amounts of data from a SQL
query against a database into Spark. We need to execute the query against
our huge database and not a substitute (SparkSQL, Hive, etc.) because of a
couple of factors including custom functions used in the