Hi Raju,
Have you tried setNumPartitions with a larger number?
2017-03-07 0:30 GMT-08:00 Eli Super :
> Hi
>
> This is a broad area of knowledge; you will need to spend several hours
> reading about it online.
>
> What is your programming language?
>
> Try searching online: "machine learning binning %my_programing_la
To add to what Michael said, my experience was that Structured Streaming in
2.0 was half-baked / alpha, but in 2.1 it is significantly more robust.
Also, a lot of its "missing functionality" was not available in Spark
Streaming either way.
HOWEVER, you mentioned that you are thinking about rewriting your
Hi Yuhao,
I have tried numPartitions values of (numExecutors * numExecutorCores),
1000, 2000, and 1, and did not see much improvement.
Having more partitions solved some performance issues, but I did not see any
improvement when I set a lower minSupport.
It is generating 260 million frequent item sets with 6
Thanks a lot Michal & Ofir for your insights.
To Ofir - I have not yet finalized my Spark Streaming code; it is still a
work in progress. Now that Structured Streaming is available, I thought I
would rewrite it to gain maximum benefit in the future. As of now, there are
no specific functional or performanc
On 14 Mar 2017 4:19 p.m., Gaurav Pandya wrote:
On 13 Mar 2017, at 13:24, Sam Elamin <hussam.ela...@gmail.com> wrote:
Hi Jorn
Thanks for the prompt reply. Really, we have 2 main concerns with CD:
ensuring tests pass, and linting the code.
I'd add "providing diagnostics when tests fail", which is a combination of:
tests providing
I agree the reporting is an important aspect. SonarQube (or a similar tool)
can report over time, but it does not support Scala directly (only
indirectly via JaCoCo). In the end, you will need to think about a dashboard
that displays results over time.
> On 14 Mar 2017, at 12:44, Steve Loughran wrote:
Thank you both
Steve, that's a very interesting point. I have to admit I have never
thought of doing analysis over time on the tests, but it makes sense, as the
failures over time tell you quite a bit about your data platform.
Thanks for highlighting! We are using PySpark for now, so I hope some
frame
Hi everybody,
I am running some experiments with the Spark k-means implementation of the
new DataFrame API. I am comparing clustering results of different runs with
different parameters. I noticed that for the random initialization mode, the
seed value is the same every time. How is it calculated? In my
unders
I'm sorry, I missed some important information. I use Spark version 2.0.2
with Scala 2.11.8.
2017-03-14 13:44 GMT+01:00 Julian Keppel :
> Hi everybody,
>
> I make some experiments with the Spark kmeans implementation of the new
> DataFrame-API. I compare clustering results of different runs with
>
Thanks, Kwon. The goal is to preserve whitespace, not to alter data in
general or to do so via user-provided options. It's causing our downstream
jobs to fail.
On Mon, Mar 13, 2017 at 7:23 PM, Hyukjin Kwon wrote:
> Hi, all the options are documented in https://spark.apache.org/
> docs/latest/api/scala
I am hoping to open a discussion around the cross validation in mllib. I
found that I often wanted to evaluate multiple estimators/pipelines (with
different algorithms) or the same estimator with different parameter grids.
The CrossValidator and TrainValidationSplit only allow a single estimator
an
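The workaround today is to wrap one CrossValidator per estimator and compare the results yourself. The plain-Python sketch below shows that manual loop in miniature, with hypothetical stand-in "estimators" (a constant predictor and a mean predictor) scored by negative mean absolute error; it is the pattern, not MLlib code.

```python
# Pattern sketch: evaluate several (estimator, grid) pairs with the same
# folds and keep the best. All names here are illustrative stand-ins.
from itertools import product

def cross_val_score(fit, score, params, folds):
    """Average score of fit(train, params) over (train, test) folds."""
    scores = [score(fit(train, params), test) for train, test in folds]
    return sum(scores) / len(scores)

def select_best(candidates, folds):
    """candidates: list of (name, fit, score, grid) tuples, where grid
    maps a parameter name to a list of values. Returns (score, name, params)."""
    best = None
    for name, fit, score, grid in candidates:
        keys = list(grid)
        for values in product(*(grid[k] for k in keys)):
            params = dict(zip(keys, values))
            s = cross_val_score(fit, score, params, folds)
            if best is None or s > best[0]:
                best = (s, name, params)
    return best

# Toy demo: a constant predictor vs. a mean predictor.
fit_const = lambda train, params: params["c"]            # ignores the data
fit_mean = lambda train, params: sum(train) / len(train)
score_mae = lambda model, test: -sum(abs(model - x) for x in test) / len(test)

folds = [([1.0, 2.0], [3.0, 4.0]), ([3.0, 4.0], [1.0, 2.0])]
candidates = [
    ("const", fit_const, score_mae, {"c": [0.0, 2.5, 10.0]}),
    ("mean", fit_mean, score_mae, {}),
]
best = select_best(candidates, folds)
```

A CrossValidator that accepted such a list of candidates directly would remove exactly this boilerplate.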
To work around an out-of-space issue in a Direct Kafka Streaming
application, we create topics with a low retention policy (retention.ms=30),
which works fine from the Kafka perspective. However, this results in an
OffsetOutOfRangeException in the Spark job (red line below). Is there any
configuration in S
Thanks, TD, for the response. Could you please provide more explanation? I
am running multiple streams in the Spark Streaming application (Spark 2.0.2
using DStreams). I know many people use this setting, so your explanation
will help a lot of people.
Thanks
On Fri, Mar 10, 2017 at 6:24 PM, Tathag
Hi All,
I am currently on Spark 1.6, and I was doing a SQL join on two tables that
are over 100 million rows each, and I noticed that it was spawning 3+ tasks
(this is the progress meter that we are seeing show up). We tried
coalesce, repartition, and shuffle partitions to drop the number of ta
Hi,
Try using this parameter: --conf spark.sql.shuffle.partitions=1000
Thanks,
Mohini
On Tue, Mar 14, 2017 at 3:30 PM, kpeng1 wrote:
> Hi All,
>
> I am currently on Spark 1.6 and I was doing a sql join on two tables that
> are over 100 million rows each and I noticed that it was spawn 3+ ta
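For reference, the suggested setting can be applied at submit time or at runtime; 200 is the default, and 1000 is only an example value (the job file name below is hypothetical).

```shell
# Raise the number of post-shuffle partitions at submit time.
spark-submit --conf spark.sql.shuffle.partitions=1000 my_join_job.py

# Or at runtime, from Spark SQL:
#   SET spark.sql.shuffle.partitions=1000;
```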
This setting allows multiple Spark jobs generated through multiple
foreachRDD calls to run concurrently, even if they are across batches. So
output op2 from batch X can run concurrently with op1 of batch X+1.
This is not safe because it breaks the checkpointing logic in subtle ways.
Note that this was ne
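For readers following along, this is the undocumented setting under discussion; given TD's warning that it can break checkpointing in subtle ways, it should be treated as experimental (the job file name is hypothetical).

```shell
# Allows output operations from different batches to run concurrently.
# Experimental: unsafe in combination with checkpoint-based recovery.
spark-submit --conf spark.streaming.concurrentJobs=2 my_streaming_job.py
```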
How can we scale, or possibly auto-scale, a Spark Streaming application
consuming from Kafka using Kafka direct streams? We are using Spark 1.6.3
and cannot move to 2.x unless there is a strong reason.
Scenario:
Kafka topic with 10 partitions
Standalone cluster running on kubernetes with 1 master and 2
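With the direct stream, read parallelism is capped by the topic's partition count (10 here), so scaling out usually means adding Kafka partitions and matching executor cores. What Spark 1.6 does offer is rate control; a hedged sketch of those settings (the job file name and rate value are illustrative):

```shell
# Backpressure adapts the ingest rate to the observed processing rate;
# maxRatePerPartition caps records/sec read from each Kafka partition.
spark-submit \
  --conf spark.streaming.backpressure.enabled=true \
  --conf spark.streaming.kafka.maxRatePerPartition=1000 \
  my_streaming_job.py
```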
Hi Jinhong,
Based on the error message, your second collection of vectors has a
dimension of 804202, while the dimension of your training vectors
was 144109. So please make sure your test dataset is of the same dimension
as the training data.
From the test dataset you posted, the vector dimens
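The check being suggested can be done up front in plain Python before calling predict/transform; the helper below is a hypothetical sketch, with the thread's dimensions appearing only in the comment.

```python
# Hedged sketch: fail fast if any test vector's dimension differs from
# the dimension the model was trained with.
def check_dimensions(training_dim, vectors):
    """Raise ValueError on the first vector whose length != training_dim."""
    for i, v in enumerate(vectors):
        if len(v) != training_dim:
            raise ValueError(
                "vector %d has dimension %d, expected %d"
                % (i, len(v), training_dim))

# In the thread, the model was trained on 144109-dimensional vectors,
# so an 804202-dimensional test vector would be rejected here.
check_dimensions(3, [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])  # passes silently
```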
Hello there,
I have one or more Parquet files to read and perform some aggregate queries
on using Spark DataFrames. I would like to find a reasonably fast datastore
that allows me to write the results for subsequent (simpler) queries.
I did attempt to use Elasticsearch to write the query results using
Hello,
I have a schema and the names of columns to apply a UDF to. The column
names are user input, and they can differ in number for each input.
Is there a way to apply UDFs to N columns in a DataFrame?
Thanks!
Hi Julian,
Thanks for reporting this. This is a valid issue and I created
https://issues.apache.org/jira/browse/SPARK-19957 to track it.
Right now the seed is set to this.getClass.getName.hashCode.toLong by
default, which indeed stays the same across multiple fits. Feel free to
leave your comments