Re: Windows shared folder

2015-11-27 Thread Jörn Franke
Your approach still requires that all data be sent via the master every time you want to process it. You probably want to use Hadoop HDFS as a distributed file system and exploit its data-locality features; this seems very suitable for your scenario. However, I recommend using it with a
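
[A minimal sketch of what reading from HDFS looks like in Spark; the namenode host, port, and path are hypothetical:]

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("hdfs-read"))
    // With HDFS, each executor reads the blocks stored on its own node,
    // so the data no longer has to flow through the master.
    val lines = sc.textFile("hdfs://namenode:8020/data/input") // hypothetical URI
    println(lines.count())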

Give parallelize a dummy ArrayList of length N to control RDD size?

2015-11-27 Thread Jim
Hello there. (Part of my problem is that docs saying "undocumented" on parallelize leave me reading books for examples that don't always pertain
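
[For reference, a small sketch of the idea in the subject line: the length of the sequence sets the element count, and the optional second argument to parallelize sets the partition count. Names and sizes here are made up:]

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("parallelize-size"))
    val n = 1000      // desired number of elements (hypothetical)
    val numSlices = 8 // desired number of partitions (hypothetical)
    // Seq.fill(n)(0) plays the role of the "dummy ArrayList of length N".
    val rdd = sc.parallelize(Seq.fill(n)(0), numSlices)
    assert(rdd.count() == n)
    assert(rdd.partitions.length == numSlices)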

Windows shared folder

2015-11-27 Thread Shuo Wang
Hi, I am trying to build a small home Spark cluster on Windows. I have a question regarding how to share the data files for the master node and worker nodes to process. The data files are pretty large, a few hundred GB. Can I just use a Windows shared folder as the file path for my driver/master, and wo
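
[For context, a sketch of what a shared-folder path would look like in the driver; the UNC path is hypothetical, and a file:// path must be readable at the same location from every worker:]

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("shared-folder"))
    // Hypothetical UNC path to a Windows share; every worker node must be
    // able to read the same path for file:// URIs to work.
    val data = sc.textFile("file:////MASTER-PC/shared/data.txt")
    println(data.count())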

Re: possible bug spark/python/pyspark/rdd.py portable_hash()

2015-11-27 Thread Ted Yu
ec2/spark-ec2 calls ./ec2/spark_ec2.py. I don't see PYTHONHASHSEED defined in any of these scripts. Andy reported this for an EC2 cluster. I think a JIRA should be opened. On Fri, Nov 27, 2015 at 11:01 AM, Felix Cheung wrote: > May I ask how you are starting Spark? > It looks like PYTHONHASHSEED

RE: possible bug spark/python/pyspark/rdd.py portable_hash()

2015-11-27 Thread Felix Cheung
May I ask how you are starting Spark? It looks like PYTHONHASHSEED is being set: https://github.com/apache/spark/search?utf8=%E2%9C%93&q=PYTHONHASHSEED Date: Thu, 26 Nov 2015 11:30:09 -0800 Subject: possible bug spark/python/pyspark/rdd.py portable_hash() From: a...@santacruzintegration.com To:
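
[One way to pin the seed on all executors is Spark's per-executor environment config; a sketch, with an arbitrary seed value, assuming the issue is PySpark's portable_hash needing PYTHONHASHSEED to agree across the cluster:]

    import org.apache.spark.SparkConf

    // spark.executorEnv.<NAME> sets an environment variable on each executor.
    val conf = new SparkConf()
      .setAppName("pin-hash-seed")
      .set("spark.executorEnv.PYTHONHASHSEED", "0") // arbitrary fixed seed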

Re: Spark Streaming on mesos

2015-11-27 Thread Nagaraj Chandrashekar
Hi Renjie, I have not set up Spark Streaming on Mesos, but there is something called reservations in Mesos. It supports both static and dynamic reservations, and both types must have a role defined. You may want to explore these options. Excerpts from the Apache Mesos documentation.

Spark Streaming on mesos

2015-11-27 Thread Renjie Liu
Hi, all: I'm trying to run Spark Streaming on Mesos, and it seems that neither scheduler mode is suitable for it. The fine-grained scheduler starts an executor for each task, which significantly increases latency, while coarse-grained mode can only set the max core numbers and executor memory
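
[For reference, a sketch of the coarse-grained settings being referred to; the master URL and numbers are hypothetical:]

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("streaming-on-mesos")
      .setMaster("mesos://zk://zk1:2181/mesos") // hypothetical Mesos master
      .set("spark.mesos.coarse", "true")        // coarse-grained mode
      .set("spark.cores.max", "8")              // the "max core numbers" knob
      .set("spark.executor.memory", "4g")       // the executor-memory knob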

RE: Hive using Spark engine alone

2015-11-27 Thread Mich Talebzadeh
Thanks, Jörn, for your interest; I appreciate your helpful comments. I am primarily interested in this case in making Hive work with the Spark engine. It may well be that it is work in progress and we have to wait for it. Regards, Mich From: Jörn Franke [mailto:jornfra...@gmail.com] Sent

Re: Can't start master on Windows 7

2015-11-27 Thread Shuo Wang
Hi, yeah, not much there actually, like the following. Spark Command: C:\Program Files (x86)\Java\jre1.8.0_60\bin\java -cp C:/spark-1.5.2-bin-hadoop2.6/sbin/../conf\;C:/spark-1.5.2-bin-hadoop2.6/lib/spark-assembly-1.5.2-hadoop2.6.0.jar;C:\spark-1.5.2-bin-hadoop2.6\lib\datanucleus-api-jdo-3.2.6.ja

Re: Can't start master on Windows 7

2015-11-27 Thread Ted Yu
Have you checked the contents of spark--org.apache.spark.deploy.master.Master-1-My-PC.out? Cheers On Fri, Nov 27, 2015 at 7:27 AM, Shuo Wang wrote: > Hi, > > I am trying to use the start-master.sh script on windows 7. But it failed > to start master, and give the following error, > > ps: unkn

Can't start master on Windows 7

2015-11-27 Thread Shuo Wang
Hi, I am trying to use the start-master.sh script on Windows 7. But it fails to start the master and gives the following error: ps: unknown option -- o Try `ps --help' for more information. starting org.apache.spark.deploy.master.Master, logging to /c/spark-1.5.2-bin-hadoop2.6/sbin/../logs/spark--or

Re: Hive using Spark engine alone

2015-11-27 Thread Jörn Franke
Hi, I recommend using the latest version of Hive. You may also wait for Hive on Tez with Tez version >= 0.8 and Hive > 1.2. Before that, I recommend first trying other Hive optimizations and having a look at the storage format together with storage indexes (not the regular ones), bloom filters

Re: In yarn-client mode, is it the driver or application master that issues commands to executors?

2015-11-27 Thread Nisrina Luthfiyati
Hi Mich, thank you for the answer. Regarding the diagrams, I'm specifically referring to the direct line from the Spark YARN client to the Spark executor in the first diagram, which implies direct communication with the executor when issuing application commands. And the 'Application commands' & 'Issue applica

RE: RE: error while creating HiveContext

2015-11-27 Thread Chandra Mohan, Ananda Vel Murugan
Hi Sun, I could connect to Hive from the Spark command line and run SQL commands, so I don't think it is a problem with the Hive config file. Regards, Anand.C From: fightf...@163.com [mailto:fightf...@163.com] Sent: Friday, November 27, 2015 3:25 PM To: Chandra Mohan, Ananda Vel Murugan ; user Subjec
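
[For reference, a minimal sketch of creating a HiveContext, which picks up the hive-site.xml from the conf directory on the classpath; the query is just an example:]

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("hive-ctx"))
    // Reads hive-site.xml from the classpath to locate the metastore.
    val hiveCtx = new HiveContext(sc)
    hiveCtx.sql("SHOW TABLES").collect().foreach(println)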

RE: In yarn-client mode, is it the driver or application master that issues commands to executors?

2015-11-27 Thread Mich Talebzadeh
Hi, In general, YARN is used as the resource scheduler regardless of the execution engine, whether it is MapReduce or Spark. YARN will create a resource container for the submitted job (that is, the Spark client) and will execute it in the default engine (in this case Spark). There will be

In yarn-client mode, is it the driver or application master that issues commands to executors?

2015-11-27 Thread Nisrina Luthfiyati
Hi all, I'm trying to understand how yarn-client mode works and found these two diagrams: In the first diagram, it looks like the driver running in the client directly communicates with executors to issue application commands, while in the second diagram it looks like application commands are sent t

Hive using Spark engine alone

2015-11-27 Thread Mich Talebzadeh
Hi, As a matter of interest, has anyone installed and configured Spark to be used as the execution engine for Hive? This is in contrast to installing and configuring Spark as an application. Hive by default uses MapReduce as its execution engine, which is more suited to batch processing. Th

Permanent RDD growing with Kafka DirectStream

2015-11-27 Thread u...@moosheimer.com
Hi, we have some strange behavior with KafkaUtils DirectStream and the size of the MapPartitionsRDDs. We use a permanent direct stream where we consume about 8,500 JSON messages/sec. The JSON messages are read, some information is extracted, and the result of each JSON is a string which collect/gr
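
[For context, a sketch of the direct-stream setup being described, using the Spark 1.5-era Kafka API; broker and topic names are hypothetical:]

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("kafka-direct")
    val ssc = new StreamingContext(conf, Seconds(1))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // hypothetical
    val topics = Set("json-events")                                 // hypothetical
    val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
    stream.map(_._2).count().print() // e.g. count messages per batch
    ssc.start()
    ssc.awaitTermination()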

Re: RE: error while creating HiveContext

2015-11-27 Thread fightf...@163.com
Could you provide your hive-site.xml file info? Best, Sun. fightf...@163.com From: Chandra Mohan, Ananda Vel Murugan Date: 2015-11-27 17:04 To: fightf...@163.com; user Subject: RE: error while creating HiveContext Hi, I verified and I could see hive-site.xml in spark conf directory. Re

Re: thought experiment: use spark ML to real time prediction

2015-11-27 Thread Nick Pentreath
Yup, I agree that Spark (or whatever other ML system) should be focused on model training rather than real-time scoring. And yes, in most cases trained models easily fit on a single machine. I also agree that, while there may be a few use cases out there, Spark Streaming is generally not well-suite

RE: error while creating HiveContext

2015-11-27 Thread Chandra Mohan, Ananda Vel Murugan
Hi, I verified and I could see hive-site.xml in the Spark conf directory. Regards, Anand.C From: fightf...@163.com [mailto:fightf...@163.com] Sent: Friday, November 27, 2015 12:53 PM To: Chandra Mohan, Ananda Vel Murugan ; user Subject: Re: error while creating HiveContext Hi, I think you just wa

Re: WARN MemoryStore: Not enough space

2015-11-27 Thread Gylfi
"spark.storage.memoryFraction 0.05" If you want to store a lot of memory I think this must be a higher fraction. The default is 0.6 (not 0.0X). To change the output directory you can set "spark.local.dir=/path/to/dir" and you can even specify multiple directories (for example if you have multi

Re: thought experiment: use spark ML to real time prediction

2015-11-27 Thread Nick Pentreath
Thanks for that link Vincenzo. PFA definitely seems interesting - though I see it is quite wide in scope, almost like its own mini math/programming language. Do you know if there are any reference implementations in code? I don't see any on the web site or the DMG github. On Sun, Nov 22, 2015 at

Re: Millions of entities in custom Hadoop InputFormat and broadcast variable

2015-11-27 Thread Jeff Zhang
Where do you load all the IDs of your dataset? In your custom InputFormat#getSplits? getSplits is invoked on the driver side to build the Partitions, which are serialized to the executors as part of each task. Do you put all the IDs in the InputSplit? That would make it pretty large. In your case, I
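
[A sketch of the alternative being hinted at: load the IDs once on the driver and broadcast them, instead of packing them into every InputSplit. The input paths and record layout are hypothetical:]

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("broadcast-ids"))
    // Hypothetical: read the ID list once, on the driver.
    val ids: Set[Long] = sc.textFile("hdfs:///ids.txt").map(_.toLong).collect().toSet
    // Shipped to each executor once, instead of once per task via the splits.
    val idsBc = sc.broadcast(ids)
    val records = sc.textFile("hdfs:///records.txt") // hypothetical CSV records
    val matched = records.filter(line => idsBc.value.contains(line.split(",")(0).toLong))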

how to using local repository in spark[dev]

2015-11-27 Thread lihu
Hi, All: I modified the Spark code and am trying to use some extra jars in Spark. The extra jars are published in my local Maven repository using *mvn install*. However, sbt cannot find these jar files, even though I can find them under */home/myname/.m2/repository*. I can guarantee tha
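
[A sketch of the usual build.sbt lines for this; the artifact coordinates are hypothetical:]

    // build.sbt
    // Resolve artifacts installed with `mvn install` from the local Maven repo.
    resolvers += Resolver.mavenLocal
    // Or point at it explicitly:
    resolvers += "Local Maven" at "file:///home/myname/.m2/repository"
    // Hypothetical coordinates of the locally installed jar:
    libraryDependencies += "com.example" %% "my-extra-lib" % "0.1.0"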