Hello guys,
I use Spark Streaming to receive data from Kafka and need to store the data
into Hive. I have seen the following ways to insert data into Hive on the Internet:
1. Use a temp table
TmpDF = spark.createDataFrame(RDD, schema)
TmpDF.createOrReplaceTempView('TmpData')
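A minimal sketch of how this first option could be completed end to end (the Hive table name target_db.target_table, the save_batch helper and the enableHiveSupport() session are illustrative assumptions, not from the original mail):

from pyspark.sql import SparkSession

# Sketch only: assumes a Hive-enabled session and a pre-existing Hive table.
spark = (SparkSession.builder
         .appName("kafka-to-hive")
         .enableHiveSupport()   # required so spark.sql() can see Hive tables
         .getOrCreate())

def save_batch(rdd, schema):
    """Write one micro-batch into Hive via a temporary view."""
    if not rdd.isEmpty():
        tmp_df = spark.createDataFrame(rdd, schema)
        tmp_df.createOrReplaceTempView('TmpData')
        # target_db.target_table is a placeholder for the real Hive table
        spark.sql("INSERT INTO TABLE target_db.target_table SELECT * FROM TmpData")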
Hi All,
How can we set Spark conf parameters in a Databricks notebook? My cluster
doesn't take into account any spark.conf.set properties... it creates 8
worker nodes (that is, executors) but doesn't honor the supplied conf
parameters. Any idea?
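A rough sketch of the usual distinction, assuming the properties being set are cluster-level ones (the values below are purely illustrative):

# Session/runtime confs can be changed from a running notebook:
spark.conf.set("spark.sql.shuffle.partitions", "64")
print(spark.conf.get("spark.sql.shuffle.partitions"))

# Cluster-sizing properties (spark.executor.instances, spark.executor.memory,
# spark.executor.cores, ...) are read when the cluster starts, so setting them
# via spark.conf.set from a notebook has no effect; they belong in the
# cluster's Spark config instead.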
--
Regards,
Rishi Shah
Hi All,
At times when there's a data node failure, the running Spark job doesn't fail -
it gets stuck and doesn't return. Is there any setting that can help here? I would
ideally like the job to get terminated, or the executors running on those data
nodes to fail...
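A minimal sketch of settings that are commonly tuned so a hung executor or fetch eventually fails instead of blocking the job (whether they cover this specific data-node hang is an assumption; the values are illustrative defaults):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.network.timeout", "120s")  # fail hung RPCs/fetches after a timeout
         .config("spark.task.maxFailures", "4")    # give up on a task after repeated failures
         .config("spark.speculation", "true")      # re-launch slow tasks on other executors
         .getOrCreate())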
--
Regards,
Rishi Shah
With pyarrow version 0.12.1 and arrow jar version 0.10, it can run correctly. With
pyarrow version 0.12.1 and arrow jar version 0.12, this exception occurs:
> *Expected schema message in stream, was null or length 0*
Pyarrow was upgraded to version 0.12.1 by another package dependency,
resulting in an inconsistency between the pyarrow and arrow jar versions.
This has been fixed and was included in the release 0.3 last week. We will be
making another release (0.4) in the next 24 hours to include more features also.
On Tue, Apr 30, 2019 at 12:42 AM, Manu Zhang < owenzhang1...@gmail.com > wrote:
>
> Hi,
>
>
> It seems koalas.DataFrame can't be displayed
Thanks for the reply, Jean.
In my project, I'm working on a higher abstraction layer on top of Spark
Streaming to build a data processing product, and I'm trying to provide a common
API for Java and Scala developers.
You can see the abstract class defined here:
https://github.com/InterestingLab/waterdrop
The slides for the talks have been uploaded to
https://analytics-zoo.github.io/master/#presentations/.
Thanks,
-Jason
On Fri, Apr 19, 2019 at 9:55 PM Khare, Ankit wrote:
> Thanks for sharing.
>
> Sent from my iPhone
>
> On 19. Apr 2019, at 01:35, Jason Dai wrote:
>
> Hi all,
>
>
> Please see b
Hi Jacek,
Thanks for your reply. The case you provided was actually the same as the second
option in my original email. What I was wondering about was the difference between
those two regarding query performance or efficiency.
On Tue, May 14, 2019 at 3:51 PM Jacek Laskowski wrote:
> Hi,
>
> For this particular
Hi,
For this particular case I'd use Column.substr (
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Column),
e.g.
val ns = Seq(("hello world", 1, 5)).toDF("w", "b", "e")
scala> ns.select($"w".substr($"b", $"e" - $"b" + 1) as "demo").show
+-----+
| demo|
+-----+
|hello|
+-----+
Hi Suket, Anastasios
Many thanks for your time and your suggestions!
I tried again with various settings for the watermarks and the trigger time
- watermark 20sec, trigger 2sec
- watermark 10sec, trigger 1sec
- watermark 20sec, trigger 0sec
I also tried continuous processing mode, but since I w
// Continuous trigger with one-second checkpointing interval
df.writeStream
.format("console")
.trigger(Trigger.Continuous("1 second"))
.start()
On Tue, 14 May 2019 at 22:10, suket arora wrote:
> Hey Austin,
>
> If you truly want to process as a stream, use continuous streaming in
> spark
df = inputStream.withWatermark("eventtime", "20 seconds").groupBy("sharedId", window("20 seconds", "10 seconds"))
// ProcessingTime trigger with a two-second micro-batch interval
df.writeStream
.format("console")
.trigger(Trigger.ProcessingTime("2 seconds"))
.start()
On Tue, 14 May 201
For example, I have a dataframe with 3 columns: URL, START, END. For each
URL from the URL column, I want to fetch a substring of it starting at START
and ending at END.
+---+-----+---+
|URL|START|END|
+---+-----+---+
|
There are a few more than the ones in the list you specified. Nevertheless, some
data types are not directly compatible between Scala and Java and require
conversion, so it's good not to pollute your code with plenty of conversions and to
focus on using the straight API.
I don’t remember from the top
Hi Anastasios
On 5/14/19 4:15 PM, Anastasios Zouzias wrote:
> Hi Joe,
>
> How often do you trigger your mini-batch? Maybe you can specify the trigger
> time explicitly to a low value or even better set it off.
>
> See:
> https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers
Hi Suket
Sorry, this was a typo in the pseudo-code I sent. Of course, what you
suggested (using the same eventtime attribute for the watermark and the window)
is what my code does in reality. Sorry to confuse people.
On 5/14/19 4:14 PM, suket arora wrote:
> Hi Joe,
> As per the spark struc
Hi all,
I am wondering why we need Java-Friendly APIs in Spark. Why can't we
just use the Scala APIs in Java code? What's the difference?
Some examples of Java-Friendly APIs commented in Spark code are as follows:
JavaDStream
JavaInputDStream
JavaStreamingContext
JavaSparkContext
Hi Joe,
How often do you trigger your mini-batch? Maybe you can specify the trigger
time explicitly to a low value or even better set it off.
See:
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers
Best,
Anastasios
On Tue, May 14, 2019 at 3:49 PM Joe Amm
Hi Joe,
As per the spark structured streaming documentation and I quote
"withWatermark must be called on the same column as the timestamp column
used in the aggregate. For example, df.withWatermark("time", "1
min").groupBy("time2").count() is invalid in Append output mode, as
watermark is defined on a different column from the aggregation column."
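A minimal PySpark sketch of the valid pattern the documentation describes, with the watermark and the window both on the same event-time column (the column names and the inputStream streaming DataFrame are taken from the pseudo-code in this thread and are illustrative):

from pyspark.sql.functions import window

# Watermark and window both use "eventtime", so the aggregation is valid
# in Append output mode.
agg = (inputStream
       .withWatermark("eventtime", "20 seconds")
       .groupBy("sharedId", window("eventtime", "20 seconds", "10 seconds"))
       .count())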
Hi all
I'm fairly new to Spark Structured Streaming and I'm only starting to develop
an understanding of watermark handling.
Our application reads data from a Kafka input topic, and as one of the first
steps, it has to group incoming messages. Those messages come in bursts, e.g. 5
messages