Re: unsubscribe

2015-08-02 Thread Akhil Das
LOL Brandon! @ziqiu See http://spark.apache.org/community.html. You need to send an email to user-unsubscr...@spark.apache.org. Thanks Best Regards On Fri, Jul 31, 2015 at 2:06 AM, Brandon White wrote: > https://www.youtube.com/watch?v=JncgoPKklVE > > On Thu, Jul 30, 2015 at 1:30 PM, wrote:

Re: Does Spark Streaming need to list all the files in a directory?

2015-08-02 Thread Akhil Das
I guess it goes through those 500k files the first time and then uses a filter from the next time onward. Thanks Best Regards On Fri, Jul 31, 2015 at 4:39 AM, Tathagata Das
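A minimal sketch of that filter-based approach, assuming a text-file source, Spark 1.x streaming, and an illustrative directory and predicate (none of these come from the thread):

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(60))
    // newFilesOnly = true skips files that already exist when the stream starts;
    // the filter lets Spark ignore paths known to be irrelevant
    val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
      "hdfs:///incoming/logs",
      (path: Path) => !path.getName.startsWith("."),
      newFilesOnly = true
    ).map(_._2.toString)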

Re: About memory leak in spark 1.4.1

2015-08-02 Thread Barak Gitsis
Hi, reducing spark.storage.memoryFraction did the trick for me. The heap doesn't get filled because it is reserved. My reasoning is: I give the executor all the memory I can give it, so that makes it a boundary. From here I try to make the best use of memory I can. storage.memoryFraction is in a sense us
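For reference, a minimal sketch of setting that fraction programmatically; the values here are only illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    // illustrative numbers: fix the executor heap, then shrink the slice
    // reserved for cached RDDs so more of the heap is left for execution
    val conf = new SparkConf()
      .set("spark.executor.memory", "8g")
      .set("spark.storage.memoryFraction", "0.3")   // Spark 1.x default is 0.6
    val sc = new SparkContext(conf)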

Re: Encryption on RDDs or in-memory/cache on Apache Spark

2015-08-02 Thread Akhil Das
Currently RDDs are not encrypted. I think you can go ahead and open a JIRA to add this feature, and maybe in a future release it could be added. Thanks Best Regards On Fri, Jul 31, 2015 at 1:47 PM, Matthew O'Reilly wrote: > Hi, > > I am currently working on the latest version of Apache Spark (1.4

Re: About memory leak in spark 1.4.1

2015-08-02 Thread Sea
Hi Barak, It is OK with spark 1.3.0; the problem is with spark 1.4.1. I don't think spark.storage.memoryFraction will make any difference, because that is still inside the heap. ------ Original ------ From: "Barak Gitsis"; Date: Sun, Aug 2, 2015, 4:11; To:

Re: About memory leak in spark 1.4.1

2015-08-02 Thread Ted Yu
http://spark.apache.org/docs/latest/tuning.html does mention spark.storage.memoryFraction in two places. One is under the Cache Size Tuning section. FYI On Sun, Aug 2, 2015 at 2:16 AM, Sea <261810...@qq.com> wrote: > Hi, Barak > It is ok with spark 1.3.0, the problem is with spark 1.4.1. > I

Re: Encryption on RDDs or in-memory/cache on Apache Spark

2015-08-02 Thread Jörn Franke
I think your use case can already be implemented with HDFS encryption and/or SealedObject, if you are looking for something like Altibase. If you create a JIRA you may want to set the bar a little bit higher and propose something like MIT CryptDB: https://css.csail.mit.edu/cryptdb/ On Fri, Jul 31, 2015 at 10:17, Mat
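As a rough illustration of the SealedObject route (a minimal sketch only; key management and cipher/mode choices are deliberately left out):

    import javax.crypto.{Cipher, KeyGenerator, SealedObject}

    // generate a throwaway AES key and seal a serializable record with it
    val key = KeyGenerator.getInstance("AES").generateKey()
    val cipher = Cipher.getInstance("AES")
    cipher.init(Cipher.ENCRYPT_MODE, key)
    val sealed = new SealedObject("sensitive record", cipher)

    // an RDD would hold SealedObject instances; unseal only where needed
    val recovered = sealed.getObject(key).asInstanceOf[String]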

Re: About memory leak in spark 1.4.1

2015-08-02 Thread Sea
spark.storage.memoryFraction applies to heap memory, but in my situation the memory used is more than the heap! Does anyone else use spark 1.4.1 in production? ------ Original ------ From: "Ted Yu"; Date: Sun, Aug 2, 2015, 5:45; To: "Sea" <261810...@qq.co

Re: About memory leak in spark 1.4.1

2015-08-02 Thread Barak Gitsis
Spark uses a lot more than heap memory; it is the expected behavior. In 1.4, off-heap memory usage is supposed to grow in comparison to 1.3. Better to use as little memory as you can for the heap, and since you are not fully utilizing it already, it is safe for you to reduce it. memoryFraction helps you optimize
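Purely as an illustration of that shape (the numbers are made up, and spark.yarn.executor.memoryOverhead only applies on YARN):

    bin/spark-submit \
      --executor-memory 24g \
      --conf spark.storage.memoryFraction=0.2 \
      --conf spark.yarn.executor.memoryOverhead=4096 \
      your-app.jar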

spark no output

2015-08-02 Thread Pa Rö
Hi community, I have run my k-means spark application on 1 million data points. The program works, but no output is generated in HDFS. When it runs on 10,000 points, an output is written. Maybe someone has an idea? Best regards, Paul

Re: spark no output

2015-08-02 Thread Ted Yu
Can you provide some more detail:
- the release of Spark you're using
- were you running in standalone or YARN cluster mode
- have you checked the driver log?
Cheers On Sun, Aug 2, 2015 at 7:04 AM, Pa Rö wrote: > hi community, > > i have run my k-means spark application on 1million data points. the > progra

Re: spark no output

2015-08-02 Thread Connor Zanin
I agree with Ted. Could you please post the log file? On Aug 2, 2015 10:13 AM, "Ted Yu" wrote: > Can you provide some more detail: > > release of Spark you're using > were you running in standalone or YARN cluster mode > have you checked driver log ? > > Cheers > > On Sun, Aug 2, 2015 at 7:04 AM,

Re: TCP/IP speedup

2015-08-02 Thread Michael Segel
This may seem like a silly question… but in following Mark's link, the presentation talks about the TPC-DS benchmark. Here's my question… which benchmark results? If you go over to the TPC.org website, they have no TPC-DS benchmarks listed (either audited or unaudited). So

Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Sujit Pal
No one has any ideas? Is there some more information I should provide? I am looking for ways to increase the parallelism among workers. Currently I just see the number of simultaneous connections to Solr equal to the number of workers. My number of partitions is (2.5x) larger than the number of workers,
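For what it's worth, a minimal sketch of spreading the Solr calls across partitions so every executor core can run one concurrently (inputRdd and the partition count are made up for illustration):

    // illustrative: keep at least as many partitions as total executor cores,
    // so every core can run a Solr-querying task at the same time
    val queries = inputRdd.repartition(40)
    queries.foreachPartition { docs =>
      // build one Solr client per partition and reuse it for every record
      docs.foreach(doc => ())   // issue the Solr request for each doc here
    }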

Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Igor Berman
What kind of cluster? How many cores on each worker? Is there a config for the HTTP Solr client? I remember the standard HttpClient has a limit per route/host. On Aug 2, 2015 8:17 PM, "Sujit Pal" wrote: > No one has any ideas? > > Is there some more information I should provide? > > I am looking for ways to
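If that limit is the bottleneck, a sketch of raising it, assuming Apache HttpClient 4.3+ (the numbers are arbitrary):

    import org.apache.http.impl.client.HttpClients

    // the default pool allows only a handful of connections per route,
    // which can serialize requests that were meant to run in parallel
    val pooledClient = HttpClients.custom()
      .setMaxConnPerRoute(32)
      .setMaxConnTotal(128)
      .build()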

how to ignore MatchError then processing a large json file in spark-sql

2015-08-02 Thread fuellee lee
I'm trying to process a bunch of large JSON log files with Spark, but it fails every time with `scala.MatchError`, whether I give it a schema or not. I just want to skip lines that do not match the schema, but I can't find how in the Spark docs. I know I could write a JSON parser and map it over the JSON file RDD c
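A rough sketch of one possible workaround, assuming json4s (which Spark bundles) for a cheap validity check and an illustrative HDFS path; note this only drops lines that are not valid JSON at all, and rows that parse but conflict with the inferred schema would still need separate handling:

    import scala.util.Try
    import org.json4s._
    import org.json4s.jackson.JsonMethods._

    // keep only lines that parse as JSON, then let Spark SQL infer the schema
    val raw = sc.textFile("hdfs:///logs/*.json")
    val wellFormed = raw.filter(line => Try(parse(line)).isSuccess)
    val df = sqlContext.read.json(wellFormed)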

Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Sujit Pal
Hi Igor, The cluster is a Databricks Spark cluster. It consists of 1 master + 4 workers; each worker has 60GB RAM and 4 CPUs. The original mail has some more details (also, the reference to HttpSolrClient in there should be HttpSolrServer, sorry about that, a mistake while writing the email). Th

RE: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Silvio Fiorito
Can you share the transformations up to the foreachPartition? From: Sujit Pal Sent: 8/2/2015 4:42 PM To: Igor Berman Cc: user Subject: Re: How to increase parallelism of a

Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Igor Berman
So how many cores do you configure per node? Do you have something like a --total-executor-cores or maybe a --num-executors config (I'm not sure what kind of cluster the Databricks platform provides; if it's standalone then the first option should be used)? If you have 4 cores in total, then even though you have 4
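For a standalone master, those knobs are spark-submit flags; a purely illustrative invocation (the master URL and values are assumptions):

    bin/spark-submit \
      --master spark://master:7077 \
      --total-executor-cores 16 \
      --executor-memory 48g \
      your-app.jar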

Re: TCP/IP speedup

2015-08-02 Thread Steve Loughran
On 1 Aug 2015, at 18:26, Ruslan Dautkhanov <dautkha...@gmail.com> wrote: If your network is bandwidth-bound, you'll see that setting jumbo frames (MTU 9000) may increase bandwidth by up to ~20%. http://docs.hortonworks.com/HDP2Alpha/index.htm#Hardware_Recommendations_for_Hadoop.htm "Enabling Jum
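For reference, on Linux the interface MTU can usually be raised like this (the interface name is an assumption, and every NIC and switch on the path must support jumbo frames end to end):

    ip link set dev eth0 mtu 9000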

Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Steve Loughran
On 2 Aug 2015, at 13:42, Sujit Pal <sujitatgt...@gmail.com> wrote: There is no additional configuration on the external Solr host from my code; I am using the default HttpClient provided by HttpSolrServer. According to the Javadocs, you can pass in an HttpClient object as well. Is there
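A minimal sketch of that constructor, assuming SolrJ 4.x, Apache HttpClient 4.3+, and an illustrative Solr URL and pool size:

    import org.apache.http.impl.client.HttpClients
    import org.apache.solr.client.solrj.impl.HttpSolrServer

    // hand HttpSolrServer a pre-configured client instead of its internal default
    val client = HttpClients.custom().setMaxConnPerRoute(32).setMaxConnTotal(128).build()
    val solr = new HttpSolrServer("http://solr-host:8983/solr/collection1", client)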

Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Abhishek R. Singh
I don't know if (your assertion/expectation that) workers will process things (multiple partitions) in parallel is really valid. Or if having more partitions than workers will necessarily help (unless you are memory bound - so partitions are essentially helping your work size rather than executio

Re: About memory leak in spark 1.4.1

2015-08-02 Thread Sea
"spark uses a lot more than heap memory, it is the expected behavior." It didn't exist in spark 1.3.x What does "a lot more than" means? It means that I lose control of it! I try to apply 31g, but it still grows to 55g and continues to grow!!! That is the point! I have tried set memoryFraction

Re: spark cluster setup

2015-08-02 Thread Sonal Goyal
What do the master logs show? Best Regards, Sonal Founder, Nube Technologies Check out

Cannot Import Package (spark-csv)

2015-08-02 Thread billchambers
I am trying to import the spark-csv package while using the Scala spark shell, with Spark 1.4.1 and Scala 2.11. I am starting the shell with: bin/spark-shell --packages com.databricks:spark-csv_2.11:1.1.0 --jars ../sjars/spark-csv_2.11-1.1.0.jar --master local I then try to run it and get the following

Re: Cannot Import Package (spark-csv)

2015-08-02 Thread Ted Yu
The command you ran and the error you got were not visible. Mind sending them again? Cheers On Sun, Aug 2, 2015 at 8:33 PM, billchambers wrote: > I am trying to import the spark csv package while using the scala spark > shell. Spark 1.4.1, Scala 2.11 > > I am starting the shell with: > > bin/

Re: Cannot Import Package (spark-csv)

2015-08-02 Thread billchambers
Sure, the commands are: scala> val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("cars.csv") and I get the following error: java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv at scala.sys.package

Re: Cannot Import Package (spark-csv)

2015-08-02 Thread Ted Yu
I tried the following command on the master branch: bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3 --jars ../spark-csv_2.10-1.0.3.jar --master local I didn't reproduce the error with your command. FYI On Sun, Aug 2, 2015 at 8:57 PM, Bill Chambers < wchamb...@ischool.berkeley.edu> wro

Checkpoint file not found

2015-08-02 Thread Anand Nalya
Hi, I'm writing a Streaming application in Spark 1.3. After running for some time, I'm getting the following exception. I'm sure that no other process is modifying the HDFS file. Any idea what might be the cause of this? 15/08/02 21:24:13 ERROR scheduler.DAGSchedulerEventProcessLoop: DAGSchedulerEv

Extremely poor predictive performance with RF in mllib

2015-08-02 Thread pkphlam
Hi, This might be a long shot, but has anybody run into very poor predictive performance using RandomForest with MLlib? Here is what I'm doing:
- Spark 1.4.1 with PySpark
- Python 3.4.2
- ~30,000 tweets of text
- 12289 1s and 15956 0s
- Whitespace tokenization and then the hashing trick for feature s
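A rough Scala sketch of that pipeline, in case it helps compare setups (the poster is on PySpark; the sample data and every parameter value here are illustrative only):

    import org.apache.spark.mllib.feature.HashingTF
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.RandomForest

    // hypothetical input: (label, text) pairs already loaded as an RDD
    val data = sc.parallelize(Seq((1.0, "spark is fast"), (0.0, "slow day today")))
    val hashingTF = new HashingTF(numFeatures = 1 << 18)
    val points = data.map { case (label, text) =>
      LabeledPoint(label, hashingTF.transform(text.split("\\s+").toSeq))
    }

    val model = RandomForest.trainClassifier(
      points, numClasses = 2, categoricalFeaturesInfo = Map[Int, Int](),
      numTrees = 100, featureSubsetStrategy = "auto",
      impurity = "gini", maxDepth = 8, maxBins = 32)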

Re: spark cluster setup

2015-08-02 Thread Sonal Goyal
Your master log files will be in the Spark home folder/logs on the master machine. Do they show an error? Best Regards, Sonal Founder, Nube Technologies Check out Reifier at Spark Summit 2015