So if you want to test some behavior on
> OutputMode.Append you would need to add a dummy record to move watermark
> forward.
>
> Hope this helps.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> On Mon, Jul 27, 2020 at 8:10 PM Phillip Henry
> wrote:
>
>> Sorry,
still struggling. Can anybody please help?
How do people test their SSS code if you have to put a message on Kafka to
get Spark to consume a batch?
Kind regards,
Phillip
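Jungtaek's dummy-record trick above can be illustrated with a toy model of Append-mode watermark semantics (plain Python, not Spark's actual implementation; the 10-second window and 15-second lateness are arbitrary assumptions):

```python
def closed_windows(event_times, delay_s, window_s):
    """Toy model of Append mode: a window is only emitted once the
    watermark (max event time minus allowed lateness) passes its end."""
    watermark = max(event_times) - delay_s
    window_ends = {(t // window_s + 1) * window_s for t in event_times}
    return sorted(end for end in window_ends if end <= watermark)

# With events at t=0,10,20 and a 15s delay, the watermark is only at 5,
# so nothing is emitted yet:
print(closed_windows([0, 10, 20], 15, 10))      # []
# A later "dummy" record at t=40 moves the watermark to 25, releasing
# the first two windows:
print(closed_windows([0, 10, 20, 40], 15, 10))  # [10, 20]
```

This is why, in a test, the batch you care about stays stuck until something later arrives to push the watermark forward.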
On Sun, Jul 12, 2020 at 4:55 PM Phillip Henry
wrote:
> Hi, folks.
>
> I noticed that SSS won't process a waiting
Hi, folks.
I noticed that SSS won't process a waiting batch if there are no batches
after that. To put it another way, Spark must always leave one batch on
Kafka waiting to be consumed.
There is a JIRA for this at:
https://issues.apache.org/jira/browse/SPARK-24156
that says it's resolved in 2.4
Strongly suspect you're mutating an object at the point in time it is
Serialized.
I suggest you remove all mutation from your code.
HTH.
Phillip
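The point-in-time hazard described above can be sketched language-agnostically. Here is a minimal Python/pickle analogue (the Spark case involves Java serialization, but the principle is the same):

```python
import pickle

state = {"count": 0}
snapshot = pickle.dumps(state)   # the task/closure is serialized here...
state["count"] = 99              # ...and the object is mutated afterwards

# The deserialized copy reflects the point-in-time snapshot, not the later
# mutation. If the mutation instead happens *while* the object graph is
# being written, the stream itself is corrupted -- which in Java surfaces
# as errors like OptionalDataException.
restored = pickle.loads(snapshot)
print(restored["count"])         # 0
```

Making the shared objects immutable removes the race entirely.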
On Fri, Dec 13, 2019 at 7:19 PM Ravi Aggarwal
wrote:
> Hi,
>
>
>
> We are encountering java OptionalDataException in one of our spark jobs.
>
> All
I saw a large improvement in my GraphX processing by:
- using fewer partitions
- using fewer executors but with much more memory.
YMMV.
Phillip
On Mon, 25 Nov 2019, 19:14 mahzad kalantari,
wrote:
> Thanks for your answer, my use case is friend recommendation for 200
> million profiles.
>
> Le
Hi, Arwin.
If I understand you correctly, this is totally expected behaviour.
I don't know much about saving to S3 but maybe you could write to HDFS
first then copy everything to S3? I think the write to HDFS will probably
be much faster as Spark/HDFS will write locally or to a machine on the sam
Too little information to give an answer, if indeed an answer a priori is
possible.
However, I would do the following on your test instances:
- Run jstat -gc on all your nodes. It might be that the GC is taking a lot
of time.
- Poll with jstack semi frequently. I can give you a fairly good idea
Hi,
I'm in a somewhat similar situation. Here's what I do (it seems to be
working so far):
1. Stream in the JSON as a plain string.
2. Feed this string into a JSON library to validate it (I use Circe).
3. Using the same library, parse the JSON and extract fields X, Y and Z.
4. Create a dataset wi
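An analogous sketch of steps 2 and 3 using Python's stdlib json module (the original approach uses Circe in Scala; the field names x, y, z are placeholders):

```python
import json

def parse_record(raw: str):
    # Step 2: validate -- json.loads raises ValueError on malformed input.
    try:
        doc = json.loads(raw)
    except ValueError:
        return None  # in practice, route bad records to a dead-letter sink
    # Step 3: extract the fields of interest.
    return doc.get("x"), doc.get("y"), doc.get("z")

print(parse_record('{"x": 1, "y": 2, "z": 3}'))  # (1, 2, 3)
print(parse_record('not json'))                  # None
```

Keeping the raw payload as a plain string until validation means a single bad message cannot fail the whole micro-batch.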
Hi, Denis.
It should be a String. Even if it looks like a number when you do hadoop fs
-ls ..., it's a String representation of a date/time.
Phillip
On Thu, Jan 10, 2019 at 2:00 PM ddebarbieux wrote:
> scala> spark.read.schema(StructType(Seq(StructField("_1",StringType,false),
> StructField("_2
Hi,
I write a stream of (String, String) tuples to HDFS partitioned by the
first ("_1") member of the pair.
Everything looks great when I list the directory via "hadoop fs -ls ...".
However, when I try to read all the data as a single dataframe, I get
unexpected results (see below).
I notice th
Hi, folks.
I am using Spark 2.2.0 and a combination of Spark ML's LinearSVC and
OneVsRest to classify some documents that are originally read from HDFS
using sc.wholeTextFiles.
When I use it on small documents, I get the results in a few minutes. When
I use it on the same number of large document
A back-of-a-beermat calculation says if you have, say, 20 boxes, saving 1TB
should take approximately 15 minutes (with a replication factor of 1 since
you don't need it higher for ephemeral data that is relatively easy to
generate).
This isn't much if the whole job takes hours. You get the added b
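The arithmetic behind that estimate, assuming roughly 60 MB/s of sustained write throughput per box (the throughput figure is an assumption, not from the original thread):

```python
total_mb   = 1_000_000   # 1 TB expressed in MB
boxes      = 20
mb_per_sec = 60          # assumed sustained write speed per box

per_box_mb = total_mb / boxes              # 50,000 MB per box
minutes    = per_box_mb / mb_per_sec / 60  # ~14 minutes
print(round(minutes, 1))
```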
functionality of
> spark is spill to disk. So it still doesn't provide DB like reliability in
> execution. In case of DBs, queries get slow but they don't fail or go out
> of memory, specifically in concurrent user scenarios.
>
> Regards,
> Ashish
>
> On Nov 12,
Agree with Jorn. The answer is: it depends.
In the past, I've worked with data scientists who are happy to use the
Spark CLI. Again, the answer is "it depends" (in this case, on the skills
of your customers).
Regarding sharing resources, different teams were limited to their own
queue so they cou
Hi, John.
I've had similar problems. IIRC, the driver was GCing madly. I don't know
why the driver was doing so much work but I quickly implemented an
alternative approach. The code I wrote belongs to my client but I wrote
something that should be equivalent. It can be found at:
https://github.co
Looks like a standard "not enough memory" issue. I can only recommend the
usual advice of increasing the number of partitions to give you a quick-win.
Also, your JVMs have an enormous amount of memory. This may cause long GC
pause times. You might like to try reducing the memory to about 20gb and
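A sketch of those two knobs on the command line (the numbers are illustrative only, not tuned values; `your-job.jar` is a placeholder):

```shell
# Smaller heaps (~20g) shorten GC pauses; more shuffle partitions
# spread the work thinner for a quick win.
spark-submit \
  --conf spark.executor.memory=20g \
  --conf spark.sql.shuffle.partitions=800 \
  your-job.jar
```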
Hi,
I notice that some DistributedMatrix represent the number of columns with
an Int rather than a Long (RowMatrix etc). This limits the number of
columns to about 2 billion.
We're approaching that limit. What do people recommend we do to mitigate
the problem? Are there plans to use a larger data
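For reference, the "about 2 billion" ceiling is simply the range of a signed 32-bit Int, which RowMatrix uses for its column count:

```python
# Maximum value of a signed 32-bit integer (Scala's Int).
int_max = 2**31 - 1
print(int_max)  # 2147483647, i.e. ~2.15 billion columns
```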