Hi,
I use the HCatalog Streaming Mutation API to write data to a Hive
transactional table, and then I use SparkSQL to read data from the Hive
transactional table. I get the right result.
However, SparkSQL takes more time to read the Hive ORC bucketed transactional
table, because SparkSQL rea…
Hi Paras,
Check out this link: Spark Scala: DateDiff of two columns by hour or minute
> I have two timestamp columns in a dataframe that I'd like to get the minute
> difference of, or alternatively…
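A hedged sketch of one way to answer the linked question: convert both
timestamps to epoch seconds and divide. The dataframe df and the column names
start_ts and end_ts are assumptions, not from the original post.

    import org.apache.spark.sql.functions._

    // Minute difference between two timestamp columns; divide by 3600 for hours.
    val withDiff = df.withColumn(
      "diff_minutes",
      (unix_timestamp(col("end_ts")) - unix_timestamp(col("start_ts"))) / 60
    )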
Hi Khaled,
I have attached the Spark streaming config below in (a).
In the 100-vcore run (see the initial email), I used 50 executors, each with
2 vcores and 3g of memory. For the 70-vcore case, 35 executors; for the
80-vcore case, 40 executors.
In the YARN config (yarn-site.xml, (b) bel…
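For reference, the sizing described above (the 100-vcore run: 50 executors,
2 vcores and 3g each) maps onto the standard Spark properties like this; a
sketch, not the actual attached config:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("streaming-job")
      .config("spark.executor.instances", "50") // 50 executors
      .config("spark.executor.cores", "2")      // 2 vcores each
      .config("spark.executor.memory", "3g")    // 3g heap each
      .getOrCreate()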
I just discovered https://issues.apache.org/jira/browse/SPARK-25738 with
some more testing. I only marked it as critical, but it seems pretty bad --
I'll defer to others' opinions.
On Sat, Oct 13, 2018 at 4:15 PM Dongjoon Hyun
wrote:
> Yes. From my side, it's -1 for RC3.
>
> Bests,
> Dongjoon.
>
> On
Hi Peter,
What parameters are you putting in your Spark streaming configuration? What
are you setting as the number of executor instances, and how many cores per
executor are you using in your Spark job?
Best,
Khaled
On Mon, Oct 15, 2018 at 9:18 PM Peter Liu wrote:
> Hi there,
>
> I have a syste
Hi there,
I have a system with 80 vcores and a relatively light Spark streaming
workload. Overcommitting the vcore resource (i.e. > 80) in the config (see
(a) below) seems to help improve the average Spark batch time (see (b)
below).
Is there any best practice guideline on resource overcommit wit…
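Using the executor numbers from the reply above, the overcommit in the
100-vcore run works out as follows (back-of-envelope arithmetic only):

    val physicalVcores   = 80
    val executors        = 50
    val coresPerExecutor = 2
    val requestedVcores  = executors * coresPerExecutor     // 100
    val ratio = requestedVcores.toDouble / physicalVcores   // 1.25x overcommit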
Hi Fokko,
Spark fires it off for many other things. It does so for ML pipelines, and
it does make the information available for data frames.
We use S3 in this case; I just simplified the example. It is important to
know which process took which action. Only Spark knows this, and it does
supply this informa…
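One hedged sketch of how "only Spark knows this" can be tapped for data
frames: a QueryExecutionListener sees the analyzed plan, which names the
relations an action read from.

    import org.apache.spark.sql.execution.QueryExecution
    import org.apache.spark.sql.util.QueryExecutionListener

    class LineageListener extends QueryExecutionListener {
      // Called after every successful dataframe action.
      override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
        println(s"action=$funcName sources:\n${qe.analyzed.treeString}")

      override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = ()
    }

    // Registered per session:
    // spark.listenerManager.register(new LineageListener)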
Hi Bolke,
I would argue that Spark is not the right level of abstraction for doing
this. I would create a wrapper around the particular filesystem:
http://hadoop.apache.org/docs/r2.8.0/api/org/apache/hadoop/fs/FileSystem.html
That way you can write a wrapper around the LocalFileSystem if data will…
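A sketch of that wrapper idea, assuming Hadoop's FilterFileSystem as the
base class (the logging sink is a placeholder):

    import org.apache.hadoop.fs.{FSDataInputStream, FilterFileSystem, Path}

    // Records every file open for lineage; everything else delegates to the
    // wrapped filesystem. Wire it in via the fs.<scheme>.impl Hadoop setting.
    class LineageFileSystem extends FilterFileSystem {
      override def open(f: Path, bufferSize: Int): FSDataInputStream = {
        println(s"lineage: read from $f") // placeholder for a real lineage sink
        super.open(f, bufferSize)
      }
    }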
Hi,
Apologies upfront if this should have gone to user@, but it seems a developer
question, so here goes.
We are trying to improve a listener to track lineage across our platform. This
requires tracking where data comes from and where it goes. E.g.
sc.setLogLevel("INFO")
val data = sc.textFile(…
I realize it is unlikely all data will be local to tasks, so placement will
not be optimal and there will be some network traffic, but is this the same
as a shuffle?
In CoalescedRDD it shows a NarrowDependency, which I thought meant it could
be implemented without a shuffle.
On Mon, Oct 15, 2018 a…
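A quick way to see the difference in a spark-shell (assumes the usual sc):
coalesce keeps a NarrowDependency, repartition forces a shuffle.

    val rdd = sc.parallelize(1 to 1000, numSlices = 100)

    println(rdd.coalesce(10).toDebugString)    // no shuffle stage in the lineage
    println(rdd.repartition(10).toDebugString) // shows a ShuffledRDD boundary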