When I try to do a Broadcast Hash Join on a bigger table (6M rows) I get
an incorrect result of 0 rows.
val rightDF = spark.read.format("parquet").load("table-a")
val leftDF = spark.read.format("parquet").load("table-b")
  // needed to activate the dynamic pruning subquery
  .where('part_ts === 2021)
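For context, a minimal sketch of how such a join would be wired up, assuming table-a is partitioned by part_ts and that part_ts is also the join key (the join key is not visible in the excerpt, so that is an assumption):
import org.apache.spark.sql.functions.broadcast

// assumption: both tables share the partition column part_ts as the join key
val joined = rightDF.join(broadcast(leftDF), Seq("part_ts"))
joined.count()   // this is where the unexpected 0 rows shows up

// explain(true) should show BroadcastHashJoin plus a
// dynamicpruningexpression(...) subquery in the partition filters
joined.explain(true)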
Hello,
I'm using spark-3.0.0-bin-hadoop3.2 with a custom Hive metastore DB
(Postgres). I'm setting the "autoCreateAll" flag to true, so Hive
creates its relational schema on first use. The problem is there is a
deadlock and the query hangs forever:
*Tx1* (*holds lock on the TBLS relation*, wait_event …
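For reference, a sketch of the kind of configuration being described, assuming the metastore is wired up through the standard Hive/DataNucleus JDO properties passed via spark.hadoop.* (connection details are placeholders, not from the thread):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-metastore-postgres")
  .enableHiveSupport()
  // Postgres-backed metastore; URL, user and password are placeholders
  .config("spark.hadoop.javax.jdo.option.ConnectionURL", "jdbc:postgresql://db-host:5432/metastore")
  .config("spark.hadoop.javax.jdo.option.ConnectionDriverName", "org.postgresql.Driver")
  .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "hive")
  .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "***")
  // lets DataNucleus create the metastore schema (TBLS etc.) on first use
  .config("spark.hadoop.datanucleus.schema.autoCreateAll", "true")
  .getOrCreate()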
I forgot to mention an important part: I'm issuing the same query to both
parquets - selecting only one column:
df.select(sum('amount))
BR,
Tomas
On Thu 19 Sep 2019 at 18:10, Tomas Bartalos wrote:
> Hello,
>
> I have 2 parquets (each containing 1 file):
>
> - parquet…
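To make the comparison concrete, a small sketch of issuing that same single-column aggregation against both parquets (paths are placeholders, not from the thread):
import org.apache.spark.sql.functions.sum

val wide   = spark.read.format("parquet").load("/data/parquet-wide")   // placeholder path
val narrow = spark.read.format("parquet").load("/data/parquet-narrow") // placeholder path

// the same single-column aggregation against both files;
// sum("amount") is the string-column form of sum('amount) above
wide.select(sum("amount")).show()
narrow.select(sum("amount")).show()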
Hello,
I have 2 parquets (each containing 1 file):
- parquet-wide - schema has 25 top level cols + 1 array
- parquet-narrow - schema has 3 top level cols
Both files have the same data for the given columns.
When I read from parquet-wide, Spark reports *read 52.6 KB*; from
parquet-narrow, *only 2.6 K…*
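One way to reproduce those reported read sizes programmatically is to sum the task input metrics with a listener; a sketch, with nothing taken from the original mail beyond the aggregation itself:
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
import org.apache.spark.sql.functions.sum
import java.util.concurrent.atomic.AtomicLong

val bytesRead = new AtomicLong(0L)
spark.sparkContext.addSparkListener(new SparkListener {
  // accumulate bytes read by every finished task
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
    bytesRead.addAndGet(taskEnd.taskMetrics.inputMetrics.bytesRead)
})

spark.read.parquet("/data/parquet-wide").select(sum("amount")).collect()
println(s"bytes read from parquet-wide: ${bytesRead.get}")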
Hello,
I have 2 parquet tables:
stored - table of 10 M records
data - table of 100K records
*This is fast:*
val dataW = data.where("registration_ts in (20190516204l, 20190515143l, 20190510125l, 20190503151l)")
dataW.count
res44: Long = 42
// takes 3 seconds
stored.join(broadcast(dataW), Seq("registration_ts"))
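To double-check that the fast variant really is a broadcast hash join, the physical plan can be inspected; a short sketch, assuming the truncated column name completes to registration_ts as in the filter above:
import org.apache.spark.sql.functions.broadcast

val joined = stored.join(broadcast(dataW), Seq("registration_ts"))
// the plan should show BroadcastHashJoin rather than SortMergeJoin;
// spark.sql.autoBroadcastJoinThreshold controls when Spark broadcasts on its own
joined.explain()
joined.count()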
Sean Owen wrote:
> A cached DataFrame isn't supposed to change, by definition.
> You can re-read each time or consider setting up a streaming source on
> the table which provides a result that updates as new data comes in.
>
> On Fri, May 17, 2019 at 1:44 PM Tomas Bartalos wrote:
Hello,
I have a cached dataframe:
spark.read.format("delta").load("/data").groupBy(col("event_hour")).count.cache
I would like to access the "live" data for this data frame without deleting
the cache (using unpersist()). Whatever I do I always get the cached data
on subsequent queries. Even addi…
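A sketch of the streaming approach mentioned in the reply above, assuming Delta's streaming source is available for the same path; the memory sink and its name are illustrative, not from the thread:
import org.apache.spark.sql.functions.col

// streaming variant of the same aggregation; the result table keeps
// updating as new data lands under /data
val liveCounts = spark.readStream.format("delta").load("/data")
  .groupBy(col("event_hour")).count()

val query = liveCounts.writeStream
  .outputMode("complete")
  .format("memory")        // queryable via spark.sql("select * from live_counts")
  .queryName("live_counts")
  .start()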
Hello,
I have parquet files partitioned by the "event_hour" column.
After reading the parquet files into Spark:
spark.read.format("parquet").load("...")
Files from the same parquet partition are scattered across many Spark
partitions.
Example of the mapping Spark partition -> parquet partition:
Spark partit…
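One way to see that mapping is with spark_partition_id and input_file_name; a small sketch (the path is a placeholder):
import org.apache.spark.sql.functions.{spark_partition_id, input_file_name}

val df = spark.read.format("parquet").load("/data/events")

// which parquet files (and hence which event_hour directories) feed each
// Spark partition after the read
df.select(spark_partition_id().as("spark_partition"), input_file_name().as("file"))
  .distinct()
  .orderBy("spark_partition")
  .show(100, truncate = false)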
Hello,
I've contributed a PR: https://github.com/apache/spark/pull/23749/. I think
it is an interesting feature that might be of use to a lot of folks from the
Kafka community. Our company already uses this feature for real-time
reporting based on Kafka events.
I was trying to strictly follow the contri…
… is more like a one-time SQL statement.
>> Kafka doesn't support predicates in the way it's integrated with Spark. What
>> can be done from the Spark perspective is to look up the offset for a specific
>> lowest timestamp and start reading from there.
>>
>> BR,
>>
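As an illustration of reading from a timestamp, recent Spark versions expose startingOffsetsByTimestamp on the structured streaming Kafka source; whether this is exactly what the PR above added is an assumption, and the broker/topic/timestamp values are placeholders:
// start each partition at the first offset whose timestamp >= the given ms value
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // placeholder
  .option("subscribe", "events")                       // placeholder topic
  .option("startingOffsetsByTimestamp",
    """{"events": {"0": 1568908800000, "1": 1568908800000}}""")
  .load()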
Hello Spark folks,
I'm reading a compacted Kafka topic with Spark 2.4, using a direct stream -
KafkaUtils.createDirectStream(...). I have configured the necessary options for
a compacted stream, so it's processed with CompactedKafkaRDDIterator.
It works well; however, in case of many gaps in the topic, the pr…
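For completeness, a minimal sketch of the setup being described, with placeholder broker/topic/group values; the only setting specific to compacted topics is the allowNonConsecutiveOffsets flag:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.kafka.common.serialization.StringDeserializer

val conf = new SparkConf()
  .setAppName("compacted-topic-reader")
  // tolerate the offset gaps left behind by compaction, which is what
  // routes reads through CompactedKafkaRDDIterator
  .set("spark.streaming.kafka.allowNonConsecutiveOffsets", "true")

val ssc = new StreamingContext(conf, Seconds(30))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",
  "key.deserializer"  -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"          -> "compacted-reader",
  "auto.offset.reset" -> "earliest"
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("compacted-topic"), kafkaParams)
)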