Dear Spark Users and Developers,
(We apologize if you receive multiple copies of this email; we are resending it
because we found that our earlier message was not delivered to the user mailing list correctly.)
We are happy to announce the release of XGBoost4J
(http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed
Just a quick question,
When using textFileStream, I did not see any events in the web UI.
Actually, I am uploading files to S3 every 5 seconds,
and the mini-batch duration is 30 seconds.
On the web UI:
*Input Rate*
Avg: 0.00 events/sec
But the scheduling time and processing time are correct, and the ou
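For reference, a minimal sketch of how such a stream is typically wired up (the bucket path is a placeholder, and the 30 s batch duration matches the description above):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("S3FileStream")
val ssc = new StreamingContext(conf, Seconds(30))              // 30 s mini-batch duration

// Monitors the directory for files that appear after the stream starts.
val lines = ssc.textFileStream("s3n://my-bucket/incoming/")    // hypothetical bucket/path
lines.count().print()                                          // drives the streaming job

ssc.start()
ssc.awaitTermination()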
Hi,
The parameters that control the stripe and row group are configurable via the
ORC table creation script:
CREATE TABLE dummy (
ID INT
, CLUSTERED INT
, SCATTERED INT
, RANDOMISED INT
, RANDOM_STRING VARCHAR(50)
, SMALL_VC VARCHAR(10)
, PADDING VARCHAR(10)
)
CLUSTERED BY (ID) INT
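For example, the stripe size and row index stride can be set as table properties at creation time. A sketch of the same idea driven from a Spark HiveContext follows; the table name, columns, and property values are illustrative, with orc.stripe.size given in bytes and orc.row.index.stride as the number of rows per row group (index entry):

// Assumes hiveContext is an org.apache.spark.sql.hive.HiveContext.
hiveContext.sql(
  """CREATE TABLE dummy_orc (id INT, random_string VARCHAR(50))
    |STORED AS ORC
    |TBLPROPERTIES (
    |  'orc.compress' = 'SNAPPY',
    |  'orc.stripe.size' = '268435456',
    |  'orc.row.index.stride' = '10000'
    |)""".stripMargin)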
Hi Emmanuel, looking for a similar solution. For now found only:
https://github.com/truecar/mleap
Thanks,
Peter Rudenko
On 3/16/16 12:47 AM, Emmanuel wrote:
Hello,
In MLlib with Spark 1.4, I was able to evaluate a model by loading it and
calling `predict` on a vector of features.
I would train on
Actually, it's unnecessary to convert a CSV row to a LabeledPoint, because we
use DataFrame as the standard data format when training a model with Spark ML.
What you should do is convert the double attributes into a Vector column named
"feature". Then you can train the ML model by specifying the featureCol and
labelCol.
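A minimal sketch of that flow with the spark.ml API (Spark 1.5+); the column names are hypothetical, and df stands for the DataFrame parsed from your CSV:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.classification.LogisticRegression

// Assemble the double attribute columns into a single Vector column.
val assembler = new VectorAssembler()
  .setInputCols(Array("attr1", "attr2", "attr3"))   // hypothetical attribute columns
  .setOutputCol("features")
val assembled = assembler.transform(df)

// Point the estimator at the feature and label columns.
val lr = new LogisticRegression()
  .setFeaturesCol("features")
  .setLabelCol("label")
val model = lr.fit(assembled)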
If you want to avoid failures of existing jobs while restarting the NM, you could
enable work-preserving restart for the NM. In this case, restarting the NM will
not affect the running containers (they can keep running), which alleviates the
NM restart problem.
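For reference, NM work-preserving restart is usually turned on in yarn-site.xml along these lines (a sketch; the recovery directory path and port are placeholders, and the NM must bind to a fixed port so restarted NMs keep the same address):

<property>
  <name>yarn.nodemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.recovery.dir</name>
  <value>/var/lib/hadoop-yarn/nm-recovery</value>
</property>
<property>
  <name>yarn.nodemanager.address</name>
  <value>0.0.0.0:45454</value>
</property>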
Thanks
Saisai
On Wed, Mar 16, 2016 at 6:30 PM, Alex D
Thank you Jeff. However, I am more looking for fine grained access control.
For example: something like Ranger. Do you know if the Spark thriftserver is
supported by Ranger or Sentry? Or something similar? Much appreciated.
On Wed, Mar 16, 2016 at 1:49 PM, Jeff Zhang wrote:
> It's same as hive thrif
Hi Vinay,
I believe it's not possible, as the spark-shuffle code should run in the
same JVM process as the Node Manager. I haven't heard anything about
on-the-fly bytecode loading in the Node Manager.
Thanks, Alex.
On Wed, Mar 16, 2016 at 10:12 AM, Vinay Kashyap wrote:
> Hi all,
>
> I am using *
Hi all,
I have learned that ORC provides three levels of indexes within each file: file
level, stripe level, and row level.
The file and stripe level statistics are stored in the file footer, so they are
easy to access when determining whether the rest of the file needs to be read at all.
Row level indexes i
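For what it's worth, in Spark SQL you typically have to opt in to ORC predicate pushdown for those statistics to be consulted when reading (a sketch; spark.sql.orc.filterPushdown is off by default in the 1.x releases as far as I know, and the path is a placeholder):

// Assumes sqlContext is a HiveContext so the ORC data source is available.
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")

val df = sqlContext.read.format("orc").load("/path/to/orc/table")
df.filter("id > 1000000").count()   // the filter can now be pushed down to the ORC reader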
Hi all,
I am using *Spark 1.5.1* in *yarn-client* mode along with *CDH 5.5*
As per the documentation, to enable dynamic allocation of executors in Spark,
it is required to add the shuffle service jar to the YARN Node Manager's
classpath and restart the YARN Node Manager.
Is there any way to dynami
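For reference, once the shuffle service is registered on the Node Managers, the Spark-side settings usually look like this (a sketch; assumes the YARN shuffle service is already configured as the "spark_shuffle" aux-service in yarn-site.xml, and the executor bounds are illustrative):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "20")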
Any ideas ?
Feel free to ask me more details, if my questions are not clear.
Thank you.
On Mon, Mar 7, 2016 at 3:38 PM, Hao Ren wrote:
> I want to understand the advantage of using a windowed stream.
>
> For example,
>
> Stream 1:
> initial duration = 5 s,
> and then transformed into a stream wi
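For what it's worth, the windowing being asked about usually looks like this (a sketch; the source and window/slide durations are illustrative, with the 5 s batch interval quoted above):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))         // 5 s batch interval
val base = ssc.socketTextStream("localhost", 9999)     // hypothetical source

// Each RDD of `windowed` covers the last 30 s of data, computed every 10 s.
val windowed = base.window(Seconds(30), Seconds(10))
windowed.count().print()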
Hi,
I'm just trying to process the data that comes from the Kafka source in my
Spark Streaming application. What I want to do is get the (topic, message) pair
as a tuple from the message stream.
Here is my streams:
val streams = KafkaUtils.createDirectStream[String, Array[Byte],
StringDeco
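In case it helps, the createDirectStream overload that takes a messageHandler can emit (topic, message) tuples directly; this variant needs explicit starting offsets. A sketch, with the broker, topic, and offsets made up:

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.{DefaultDecoder, StringDecoder}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(30))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

// Start from offset 0 of partition 0 of a hypothetical topic.
val fromOffsets = Map(TopicAndPartition("myTopic", 0) -> 0L)

// Emit (topic, message) tuples instead of (key, value) pairs.
val messageHandler =
  (mmd: MessageAndMetadata[String, Array[Byte]]) => (mmd.topic, mmd.message)

val streams = KafkaUtils.createDirectStream[
  String, Array[Byte], StringDecoder, DefaultDecoder, (String, Array[Byte])](
  ssc, kafkaParams, fromOffsets, messageHandler)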
If you have lots of small files, distcp should handle that well -- it's
supposed to distribute the transfer of files across the nodes in your
cluster. Conductor looks interesting if you're trying to distribute the
transfer of single, large file(s)...
right?
--
Chris Miller
On Wed, Mar 16, 2016 a
Short answer: Nope
Less short answer: Spark is not designed to maintain sort order in this
case... it *may*, but there's no guarantee... generally, it would not be in
the same order unless you implement something to order by and then sort the
result based on that.
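A sketch of the kind of thing I mean (the RDD and names are illustrative): remember the original position explicitly, then sort by it when you need the order back.

val indexed = rdd.zipWithIndex()                 // (element, original position)
// ... transformations that may reshuffle the data go here ...
val restored = indexed.sortBy(_._2).map(_._1)    // recover the original order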
--
Chris Miller
On Wed, Mar 16,
+1 for Sab's thoughtful answer...
Yasemin: As Gourav said, using IAM roles is considered best practice and
generally will give you fewer headaches in the end... but you may have a
reason for doing it the way you are, and certainly the way you posted
should be supported and not cause the error you
Hi,
Thanks a lot all, I understand my problem came from the *hadoop version*; I
moved to the Spark 1.6.0 *hadoop 2.4* build and there is no problem.
Best,
yasemin
2016-03-15 17:31 GMT+02:00 Gourav Sengupta :
> Once again, please use roles, there is no way that you have to specify the
> access keys
Please try export PYSPARK_PYTHON=
On Wed, Mar 16, 2016 at 3:00 PM, ram kumar wrote:
> Hi,
>
> I get the following error when running a job as pyspark,
>
> {{{
> An error occurred while calling
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : org.apache.spark.SparkException: Job ab
Hi,
I get the following error when running a job as pyspark,
{{{
An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in