[Package Release] Widely used XGBoost now available in Spark

2016-03-16 Thread Nan Zhu
Dear Spark Users and Developers, (we apologize if you receive multiple copies of the email, we are resending because we found that our email was not delivered to user mail list correctly) We are happy to announce the release of XGBoost4J (http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed

[Streaming] textFileStream has no events shown in web UI

2016-03-16 Thread Hao Ren
Just a quick question: when using textFileStream, I did not see any events via the web UI. Actually, I am uploading files to S3 every 5 seconds, and the mini-batch duration is 30 seconds. On the web UI: *Input Rate* Avg: 0.00 events/sec. But the scheduling time and processing time are correct, and the ou

Re: The built-in indexes in ORC files do not work.

2016-03-16 Thread Mich Talebzadeh
Hi, The parameters that control the stripe and row-group sizes are configurable via the ORC creation script: CREATE TABLE dummy ( ID INT , CLUSTERED INT , SCATTERED INT , RANDOMISED INT , RANDOM_STRING VARCHAR(50) , SMALL_VC VARCHAR(10) , PADDING VARCHAR(10) ) CLUSTERED BY (ID) INT
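A hedged sketch of the kind of DDL being described, reconstructed in Hive-style SQL: stripe size, row-index stride, and index creation are set through TBLPROPERTIES. The table and column names follow the truncated example in the message; the bucket count and property values are illustrative assumptions, not recommendations.

```sql
CREATE TABLE dummy (
  ID INT,
  CLUSTERED INT,
  SCATTERED INT,
  RANDOMISED INT,
  RANDOM_STRING VARCHAR(50),
  SMALL_VC VARCHAR(10),
  PADDING VARCHAR(10)
)
CLUSTERED BY (ID) INTO 256 BUCKETS   -- bucket count is a placeholder
STORED AS ORC
TBLPROPERTIES (
  "orc.create.index"    = "true",       -- build the built-in indexes
  "orc.stripe.size"     = "268435456",  -- 256 MB stripes (illustrative)
  "orc.row.index.stride" = "10000"      -- rows per row-group index entry
);
```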

Re: spark.ml : eval model outside sparkContext

2016-03-16 Thread Peter Rudenko
Hi Emmanuel, looking for a similar solution. For now found only: https://github.com/truecar/mleap Thanks, Peter Rudenko On 3/16/16 12:47 AM, Emmanuel wrote: Hello, In MLLib with Spark 1.4, I was able to eval a model by loading it and using `predict` on a vector of features. I would train on

Re: Reg: Reading a CSV file with a String label into LabeledPoint

2016-03-16 Thread Yanbo Liang
Actually it's unnecessary to convert each CSV row to a LabeledPoint, because we use DataFrame as the standard data format when training a model with Spark ML. What you should do is convert the double attributes to a Vector named "feature". Then you can train the ML model by specifying the featureCol and labelC
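A minimal sketch (Spark 1.x spark.ml API) of the approach described: index the String label, assemble the numeric columns into a features Vector, and train by naming the features/label columns. The file path, column names, and the spark-csv reader are assumptions for illustration.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

val df = sqlContext.read
  .format("com.databricks.spark.csv")    // spark-csv package, assumed available
  .option("header", "true")
  .option("inferSchema", "true")
  .load("data.csv")                      // hypothetical path

// Turn the String label into a numeric index
val indexer = new StringIndexer()
  .setInputCol("label")                  // hypothetical label column
  .setOutputCol("labelIndex")

// Assemble the double attributes into a single Vector column
val assembler = new VectorAssembler()
  .setInputCols(Array("x1", "x2", "x3")) // hypothetical numeric columns
  .setOutputCol("features")

// Train by pointing the estimator at the assembled columns
val lr = new LogisticRegression()
  .setFeaturesCol("features")
  .setLabelCol("labelIndex")

val model = new Pipeline().setStages(Array(indexer, assembler, lr)).fit(df)
```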

Re: Enabling spark_shuffle service without restarting YARN Node Manager

2016-03-16 Thread Saisai Shao
If you want to avoid failing existing jobs while restarting the NM, you could enable work-preserving restart for the NM; in this case, restarting the NM will not affect the running containers (they can keep running). That could alleviate the NM restart problem. Thanks Saisai On Wed, Mar 16, 2016 at 6:30 PM, Alex D
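A sketch of the yarn-site.xml settings behind the work-preserving restart suggestion; the recovery directory and port are placeholders.

```xml
<property>
  <name>yarn.nodemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.recovery.dir</name>
  <value>/var/lib/hadoop-yarn/nm-recovery</value> <!-- placeholder path -->
</property>
<property>
  <!-- a fixed (non-ephemeral) port so containers can reconnect after restart -->
  <name>yarn.nodemanager.address</name>
  <value>0.0.0.0:45454</value>
</property>
```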

Re: Spark Thriftserver

2016-03-16 Thread ayan guha
Thank you Jeff. However, I am looking for more fine-grained access control, for example something like Ranger. Do you know if the Spark Thrift Server is supported by Ranger or Sentry? Or something similar? Much appreciated On Wed, Mar 16, 2016 at 1:49 PM, Jeff Zhang wrote: > It's same as hive thrif

Re: Enabling spark_shuffle service without restarting YARN Node Manager

2016-03-16 Thread Alex Dzhagriev
Hi Vinay, I believe it's not possible, as the spark-shuffle code should run in the same JVM process as the Node Manager. I haven't heard anything about on-the-fly bytecode loading in the Node Manager. Thanks, Alex. On Wed, Mar 16, 2016 at 10:12 AM, Vinay Kashyap wrote: > Hi all, > > I am using *

The built-in indexes in ORC files do not work.

2016-03-16 Thread Joseph
Hi all, I know that ORC provides three levels of indexes within each file: file level, stripe level, and row level. The file- and stripe-level statistics are in the file footer, so they are easy to access to determine whether the rest of the file needs to be read at all. Row-level indexes i

Enabling spark_shuffle service without restarting YARN Node Manager

2016-03-16 Thread Vinay Kashyap
Hi all, I am using *Spark 1.5.1* in *yarn-client* mode along with *CDH 5.5*. As per the documentation, to enable Dynamic Allocation of Executors in Spark, it is required to add the shuffle service jar to the YARN Node Manager's classpath and restart the YARN Node Manager. Is there any way to dynami
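For context, a sketch of the yarn-site.xml entries the documentation refers to; after adding these (and the spark-&lt;version&gt;-yarn-shuffle.jar to the NM classpath), the Node Manager has to be restarted to pick them up, which is the step the question is trying to avoid.

```xml
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
```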

Re: [Streaming] Difference between windowed stream and stream with large batch size?

2016-03-16 Thread Hao Ren
Any ideas? Feel free to ask me for more details if my questions are not clear. Thank you. On Mon, Mar 7, 2016 at 3:38 PM, Hao Ren wrote: > I want to understand the advantage of using a windowed stream. > > For example, > > Stream 1: > initial duration = 5 s, > and then transformed into a stream wi
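A sketch of the two setups being compared, per the durations in the message (only one StreamingContext can be active per JVM; they are shown side by side purely for comparison, and the socket source is a placeholder):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Stream 1: 5 s batches, then a 30 s window sliding every 5 s
val ssc1 = new StreamingContext(new SparkConf().setAppName("windowed"), Seconds(5))
val windowed = ssc1.socketTextStream("localhost", 9999)
  .window(Seconds(30), Seconds(5))

// Stream 2: a single 30 s batch interval, no window
val ssc2 = new StreamingContext(new SparkConf().setAppName("large-batch"), Seconds(30))
val largeBatch = ssc2.socketTextStream("localhost", 9999)
```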

Get Pair of Topic and Message from Kafka + Spark Streaming

2016-03-16 Thread Imre Nagi
Hi, I'm just trying to process the data that come from the kafka source in my spark streaming application. What I want to do is get the pair of topic and message in a tuple from the message stream. Here is my streams: val streams = KafkaUtils.createDirectStream[String, Array[Byte], > StringDeco
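One way to get (topic, message) pairs with the Spark 1.x direct Kafka API is the createDirectStream overload that takes a messageHandler: it exposes MessageAndMetadata, which carries the topic. A hedged sketch, with key/value types matching the snippet in the message; the broker address and starting offsets are placeholders.

```scala
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.{DefaultDecoder, StringDecoder}
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")  // placeholder
val fromOffsets = Map(TopicAndPartition("my-topic", 0) -> 0L)      // placeholder offsets

// Each record becomes a (topic, message) tuple via the messageHandler
val pairs = KafkaUtils.createDirectStream[
    String, Array[Byte], StringDecoder, DefaultDecoder, (String, Array[Byte])](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, Array[Byte]]) => (mmd.topic, mmd.message()))
```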

Re: newbie HDFS S3 best practices

2016-03-16 Thread Chris Miller
If you have lots of small files, distcp should handle that well -- it's supposed to distribute the transfer of files across the nodes in your cluster. Conductor looks interesting if you're trying to distribute the transfer of single, large file(s)... right? -- Chris Miller On Wed, Mar 16, 2016 a

Re: Does parallelize and collect preserve the original order of list?

2016-03-16 Thread Chris Miller
Short answer: Nope. Less short answer: Spark is not designed to maintain sort order in this case... it *may*, but there's no guarantee... generally, it would not be in the same order unless you implement something to order by and then sort the result based on that. -- Chris Miller On Wed, Mar 16,
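A sketch of the "implement something to order by" suggestion: pair each element with its original index up front, then sort by that index after collecting. The input list is illustrative.

```scala
val data = List("a", "b", "c", "d")
val rdd = sc.parallelize(data).zipWithIndex()          // (element, originalIndex)
val restored = rdd.collect().sortBy(_._2).map(_._1)    // back in original order
```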

Re: reading file from S3

2016-03-16 Thread Chris Miller
+1 for Sab's thoughtful answer... Yasemin: As Gourav said, using IAM roles is considered best practice and generally will give you fewer headaches in the end... but you may have a reason for doing it the way you are, and certainly the way you posted should be supported and not cause the error you

Re: reading file from S3

2016-03-16 Thread Yasemin Kaya
Hi, Thanks a lot all. I understand my problem came from the *hadoop version*, and after I moved to the Spark 1.6.0 *hadoop 2.4* build there is no problem. Best, yasemin 2016-03-15 17:31 GMT+02:00 Gourav Sengupta : > Once again, please use roles, there is no way that you have to specify the > access keys

Re: exception while running job as pyspark

2016-03-16 Thread Jeff Zhang
Please try export PYSPARK_PYTHON= On Wed, Mar 16, 2016 at 3:00 PM, ram kumar wrote: > Hi, > > I get the following error when running a job as pyspark, > > {{{ > An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.collectAndServe. > : org.apache.spark.SparkException: Job ab
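A sketch of the suggested fix: point PySpark at a specific interpreter before submitting the job. The interpreter path is a hypothetical placeholder; substitute the Python your cluster nodes actually have.

```shell
# Placeholder path -- use the interpreter installed on your nodes
export PYSPARK_PYTHON=/usr/bin/python2.7
# Often set alongside it so the driver matches the executors
export PYSPARK_DRIVER_PYTHON=$PYSPARK_PYTHON
```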

exception while running job as pyspark

2016-03-16 Thread ram kumar
Hi, I get the following error when running a job as pyspark, {{{ An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in