We need your help to make the Apache Washington DC Roadshow on Dec 4th a
success.
What do we need most? Speakers!
We're bringing a unique DC flavor to this event by mixing talks about
Apache projects with talks on OSS CyberSecurity, OSS in Government, and
OSS Careers.
That only works assuming that Spark is the only client of the table. It
will be impossible to force an outside user to respect the special metadata
table when reading, so they will still see all of the data in transit.
Additionally, this would force the incoming data to be written only into
new partitions.
So if Spark and the destination datastore are both non-transactional, you
will have to resort to an external mechanism for “transactionality”.
Here are some options for both RDBMS and non-transactional datastore
destinations. For now, assume that Spark is used in batch mode (and not
streaming mode).
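As a concrete example of the RDBMS option: let Spark bulk-write into a
staging table, then have the driver publish it in a single database
transaction, so outside readers see either none or all of the new rows. A
minimal sketch, assuming a JDBC-reachable database with transactional DML;
the URL, input path, and the events/events_staging table names are all
placeholders:

```scala
import java.sql.DriverManager
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("staging-commit").getOrCreate()
val jdbcUrl = "jdbc:postgresql://localhost/mydb" // placeholder

// Phase 1: tasks write (non-transactionally) into the staging table.
val df = spark.read.parquet("/data/incoming")    // placeholder input
df.write
  .format("jdbc")
  .option("url", jdbcUrl)
  .option("dbtable", "events_staging")
  .mode("overwrite")
  .save()

// Phase 2: the driver publishes the whole batch in one transaction.
val conn = DriverManager.getConnection(jdbcUrl)
try {
  conn.setAutoCommit(false)
  val st = conn.createStatement()
  st.execute("INSERT INTO events SELECT * FROM events_staging")
  st.execute("TRUNCATE TABLE events_staging")
  conn.commit()
} catch {
  case e: Exception => conn.rollback(); throw e
} finally {
  conn.close()
}
```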
I'm still not sure how the staging table helps for databases which do not
have such atomicity guarantees. For example, in Cassandra, if you wrote all
of the data temporarily to a staging table, we would still have the same
problem in moving the data from the staging table into the real table.
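To make that concrete, here is what the "move" would look like with the
spark-cassandra-connector (keyspace and table names are placeholders). The
copy is just another distributed, row-by-row write, so there is no single
point at which it becomes atomically visible:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Read what the tasks staged, then copy it into the real table.
val staged = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "events_staging"))
  .load()

staged.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "events"))
  .mode("append")
  .save()
// If the job dies mid-copy, "events" is left holding a partial batch, and
// concurrent readers can also observe that partial state while the copy
// runs: exactly the problem the staging table was supposed to solve.
```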
I was trying to enable CBO on one of our jobs (using Spark 2.3.1 with
partitioned parquet data), but it seemed that the rowCount statistics were
being ignored. I found this JIRA, which seems to describe the same issue:
https://issues.apache.org/jira/browse/SPARK-25185, but it has had no
response so far.
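In case it helps anyone reproduce this, the sequence I would expect to be
sufficient looks like the following (run in spark-shell; table and column
names are placeholders), yet the collected rowCount still did not seem to
be used for the partitioned parquet table:

```scala
// Enable the cost-based optimizer (off by default in Spark 2.3.x).
spark.conf.set("spark.sql.cbo.enabled", "true")

// Collect table-level and column-level statistics.
spark.sql("ANALYZE TABLE events COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE events COMPUTE STATISTICS FOR COLUMNS id")

// Inspect what the optimizer actually sees; with CBO in effect, rowCount
// should be populated here instead of being estimated from sizeInBytes.
println(spark.table("events").queryExecution.optimizedPlan.stats)
```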
>Some say it is exactly-once when the output is eventually exactly-once,
whereas others say there should be no side effects, e.g. a consumer
shouldn't see a partial write. I guess 2PC is the former, since some
partitions can commit earlier while other partitions fail to commit for
some time.
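To illustrate the distinction with a toy two-phase-commit sink (purely
illustrative, not a real Spark API): prepare makes each partition's output
durable but invisible, and the per-partition commits are where the
partial-write window opens up:

```scala
// Each partition stages its data durably in prepare(), then makes it
// visible in commit(). The two phases run across independent partitions.
trait PartitionWriter {
  def prepare(): Unit // stage data, invisible to readers
  def commit(): Unit  // make this partition's data visible
}

def writeJob(partitions: Seq[PartitionWriter]): Unit = {
  partitions.foreach(_.prepare()) // phase 1: all partitions stage
  partitions.foreach(_.commit())  // phase 2: not atomic as a group
  // A failure or delay between individual commit() calls is exactly the
  // window in which a reader sees a partial write: "eventually
  // exactly-once" holds, "no visible side effects" does not.
}
```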
Hello,
I recently started studying Spark's memory management system.
More specifically, I want to understand how Spark uses off-heap memory.
Internally, I saw that there are two off-heap memory pools
(offHeapExecutionMemoryPool and offHeapStorageMemoryPool).
How does Spark use each of these pools?
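For context, both pools only get capacity when off-heap memory is
explicitly enabled; a minimal sketch of the relevant configuration (app
name and size are arbitrary):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("offheap-demo")
  .master("local[*]")
  .config("spark.memory.offHeap.enabled", "true") // both pools are empty without this
  .config("spark.memory.offHeap.size", "1g")      // must be > 0 when enabled
  .getOrCreate()

// Inside UnifiedMemoryManager, that 1g is split between
// offHeapStorageMemoryPool and offHeapExecutionMemoryPool according to
// spark.memory.storageFraction, and execution can reclaim unused capacity
// from the storage pool.
```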
Hi,
Thanks all for the comments and discussion regarding the API! It sounds
like the current expectation for database systems is to populate a staging
table in the tasks, and the driver moves that data when commit is called.
That would work for many use cases that our users have with the MongoDB
connector.
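For what it's worth, a hedged sketch of how that could look for MongoDB:
tasks append to a staging collection through the connector, and the
driver-side commit publishes it with a single renameCollection admin
command. The URI and collection names are placeholders, and whether
renameCollection provides the visibility guarantees a given deployment
needs is an assumption to verify against the MongoDB docs:

```scala
import com.mongodb.client.MongoClients
import org.bson.Document
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val df = spark.read.parquet("/data/incoming") // placeholder input

// Tasks write into the staging collection via the connector.
df.write
  .format("com.mongodb.spark.sql.DefaultSource")
  .option("uri", "mongodb://localhost/mydb.events_staging")
  .mode("append")
  .save()

// Driver-side "commit": one admin command swaps the staging collection in.
val client = MongoClients.create("mongodb://localhost")
try {
  client.getDatabase("admin").runCommand(
    new Document("renameCollection", "mydb.events_staging")
      .append("to", "mydb.events")
      .append("dropTarget", true)
  )
} finally {
  client.close()
}
```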