Optimization module in Python mllib

2015-06-07 Thread martingoodson
Am I right in thinking that Python mllib does not contain the optimization module? Are there plans to add this to the Python API?

Re: hiveContext.sql NullPointerException

2015-06-07 Thread Cheng Lian
Hi, This is expected behavior. HiveContext.sql (and also DataFrame.registerTempTable) is only expected to be invoked on driver side. However, the closure passed to RDD.foreach is executed on executor side, where no viable HiveContext instance exists. Cheng On 6/7/15 10:06 AM, patcharee wrot
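A minimal sketch of the pattern Cheng describes (the table name and context setup are hypothetical): issue the SQL once on the driver, then let the resulting DataFrame's computation run on the executors.

```scala
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)  // assumes an existing SparkContext `sc`

// Broken: this closure runs on the executors, where no HiveContext
// instance exists, so the call throws a NullPointerException:
// rdd.foreach { x => hiveContext.sql(s"SELECT * FROM logs WHERE id = $x") }

// Works: run the query once on the driver; the DataFrame's computation
// is then distributed across the executors automatically.
val df = hiveContext.sql("SELECT * FROM logs")
df.foreach(row => println(row))
```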

Monitoring Spark Jobs

2015-06-07 Thread SamyaMaiti
Hi All, I have a Spark SQL application that fetches data from Hive, with an Akka layer on top to run multiple queries in parallel. *Please suggest a mechanism to figure out the number of Spark jobs running in the cluster at a given instant.* I need to do the above as, I see the ave

Re: Running SparkSql against Hive tables

2015-06-07 Thread Cheng Lian
On 6/6/15 9:06 AM, James Pirz wrote: I am pretty new to Spark; using Spark 1.3.1, I am trying to use Spark SQL to run some SQL scripts on the cluster. I realized that for better performance it is a good idea to use Parquet files. I have 2 questions regarding that: 1) If I wanna us

Re: Avro or Parquet ?

2015-06-07 Thread Cheng Lian
Usually Parquet can be more efficient because of its columnar nature. Say your table has 10 columns but your join query only touches 3 of them, Parquet only reads those 3 columns from disk while Avro must load all data. Cheng On 6/5/15 3:00 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: We currently have data in a
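A hedged illustration of the column-pruning point (paths and column names are made up): only the columns the query references are read from the Parquet files, while an Avro scan would have to deserialize whole records.

```scala
// Two hypothetical Parquet data sets, each with ~10 columns.
val users  = sqlContext.parquetFile("hdfs:///data/users.parquet")
val orders = sqlContext.parquetFile("hdfs:///data/orders.parquet")

// Only the columns actually referenced (id, name, uid, total) are read
// from disk; the remaining columns are skipped entirely.
users.select("id", "name")
  .join(orders.select("uid", "total"), users("id") === orders("uid"))
  .collect()
```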

Re: hiveContext.sql NullPointerException

2015-06-07 Thread patcharee
Hi, Is there any way to work with HiveContext on the executors? If only the driver can see the HiveContext, does that mean I have to collect all the datasets (very large) to the driver and use HiveContext there? That would overload the driver's memory and fail. BR, Patcharee On 07. juni 2015 11:51, Che

Re: Problem reading Parquet from 1.2 to 1.3

2015-06-07 Thread Cheng Lian
This issue has been fixed recently in Spark 1.4 https://github.com/apache/spark/pull/6581 Cheng On 6/5/15 12:38 AM, Marcelo Vanzin wrote: I talked to Don outside the list and he says that he's seeing this issue with Apache Spark 1.3 too (not just CDH Spark), so it seems like there is a real i

Re: NullPointerException SQLConf.setConf

2015-06-07 Thread Cheng Lian
Are you calling hiveContext.sql within an RDD.map closure or something similar? In this way, the call actually happens on executor side. However, HiveContext only exists on the driver side. Cheng On 6/4/15 3:45 PM, patcharee wrote: Hi, I am using Hive 0.14 and spark 0.13. I got java.lang.Nu

Re: Does Apache Spark maintain a columnar structure when creating RDDs from Parquet or ORC files?

2015-06-07 Thread Cheng Lian
For the following code: val df = sqlContext.parquetFile(path) `df` remains columnar (actually it just reads from the columnar Parquet file on disk). For the following code: val cdf = df.cache() `cdf` is also columnar but that's different from Parquet. When a DataFrame is cached, Spa
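Putting Cheng's two snippets together in one place (the path is hypothetical):

```scala
val df = sqlContext.parquetFile(path)  // columnar on disk (Parquet format)

val cdf = df.cache()  // also columnar in memory, but in Spark SQL's own
                      // in-memory columnar format, not Parquet
cdf.count()           // the first action materializes the cache
```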

Re: columnar structure of RDDs from Parquet or ORC files

2015-06-07 Thread Cheng Lian
Interesting, just posted on another thread asking exactly the same question :) My answer there quoted below: > For the following code: > > val df = sqlContext.parquetFile(path) > > `df` remains columnar (actually it just reads from the columnar Parquet file on disk). For the following code:

Re: spark sql - reading data from sql tables having space in column names

2015-06-07 Thread Cheng Lian
You can use backticks to quote the column names. Cheng On 6/3/15 2:49 AM, David Mitchell wrote: I am having the same problem reading JSON. There does not seem to be a way of selecting a field that has a space, "Executor Info" from the Spark logs. I suggest that we open a JIRA ticket to ad
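A small sketch of the backtick quoting (the file path and schema are hypothetical, modelled on David's "Executor Info" field from the Spark event logs):

```scala
val logs = sqlContext.jsonFile("hdfs:///spark-events.json")
logs.registerTempTable("logs")

// Backticks quote a column name that contains a space:
sqlContext.sql("SELECT `Executor Info` FROM logs").show()

// The same quoting works through the DataFrame API:
logs.selectExpr("`Executor Info`").show()
```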

Re: SparkSQL: How to specify replication factor on the persisted parquet files?

2015-06-07 Thread Cheng Lian
Were you using HiveContext.setConf()? "dfs.replication" is a Hadoop configuration, but setConf() is only used to set Spark SQL specific configurations. You may set it in your Hadoop core-site.xml instead. Cheng On 6/2/15 2:28 PM, Haopu Wang wrote: Hi, I'm trying to save SparkSQL DataFrame

Re: Caching parquet table (with GZIP) on Spark 1.3.1

2015-06-07 Thread Cheng Lian
Is it possible that some Parquet files of this data set have a different schema from the others? Especially those reported in the exception messages. One way to confirm this is to use [parquet-tools] [1] to inspect these files: $ parquet-schema Cheng [1]: https://github.com/apache/parquet

Re: hiveContext.sql NullPointerException

2015-06-07 Thread Cheng Lian
Spark SQL supports Hive dynamic partitioning, so one possible workaround is to create a Hive table partitioned by zone, z, year, and month dynamically, and then insert the whole dataset into it directly. In 1.4, we also provide dynamic partitioning support for non-Hive environments, and you ca
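A hedged sketch of that dynamic-partitioning workaround (the table and column names are hypothetical): one INSERT routes every row to its partition, so no per-partition loop on the driver is needed.

```scala
hiveContext.sql("SET hive.exec.dynamic.partition = true")
hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

hiveContext.sql("""
  CREATE TABLE IF NOT EXISTS wrf_out (value DOUBLE)
  PARTITIONED BY (zone INT, z INT, year INT, month INT)
""")

// Hive picks the target partition of each row from the trailing
// zone/z/year/month columns of the SELECT.
hiveContext.sql("""
  INSERT OVERWRITE TABLE wrf_out PARTITION (zone, z, year, month)
  SELECT value, zone, z, year, month FROM raw_data
""")
```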

Re: Spark Streaming Stuck After 10mins Issue...

2015-06-07 Thread Cody Koeninger
What is the code used to set up the kafka stream? On Sat, Jun 6, 2015 at 3:23 PM, EH wrote: > And here is the Thread Dump, where seems every worker is waiting for > Executor > #6 Thread 95: sparkExecutor-akka.actor.default-dispatcher-22 (RUNNABLE) to > be complete: > > Thread 41: BLOCK_MANAGER c

Re: Spark ML decision list

2015-06-07 Thread Debasish Das
What is a decision list? An in-order traversal (or some other traversal) of a fitted decision tree? On Jun 5, 2015 1:21 AM, "Sateesh Kavuri" wrote: > Is there an existing way in SparkML to convert a decision tree to a > decision list? > > On Thu, Jun 4, 2015 at 10:50 PM, Reza Zadeh wrote: > >> The close

Re: Accumulator map

2015-06-07 Thread Akhil Das
Another approach would be to use ZooKeeper. If you have ZooKeeper running somewhere in the cluster you can simply create a path like */dynamic-list* in it and then write objects/values to it; you can even create/access nested objects. Thanks Best Regards On Fri, Jun 5, 2015 at 7:06 PM, Cosmin
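A sketch of the ZooKeeper idea using Apache Curator (connection string and paths are hypothetical): each executor can append entries under a shared znode and read them back.

```scala
import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.ExponentialBackoffRetry

val client = CuratorFrameworkFactory.newClient(
  "zk-host:2181", new ExponentialBackoffRetry(1000, 3))
client.start()

// Append one entry to the shared list, creating parent znodes as needed.
client.create().creatingParentsIfNeeded()
  .forPath("/dynamic-list/item-1", "some-value".getBytes("UTF-8"))

// Read the list back: the children of /dynamic-list.
val entries = client.getChildren.forPath("/dynamic-list")
client.close()
```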

Re: Monitoring Spark Jobs

2015-06-07 Thread Akhil Das
It could be a CPU, IO, or network bottleneck; you need to figure out where exactly it's choking. You can use certain monitoring utilities (like top) to understand it better. Thanks Best Regards On Sun, Jun 7, 2015 at 4:07 PM, SamyaMaiti wrote: > Hi All, > > I have a Spark SQL application to fetch

Re: Spark Streaming Stuck After 10mins Issue...

2015-06-07 Thread Akhil Das
Which consumer are you using? If you can paste the complete code then may be i can try reproducing it. Thanks Best Regards On Sun, Jun 7, 2015 at 1:53 AM, EH wrote: > And here is the Thread Dump, where seems every worker is waiting for > Executor > #6 Thread 95: sparkExecutor-akka.actor.default

Re: Not understanding manually building EC2 cluster

2015-06-07 Thread Akhil
- Remove localhost from the conf/slaves file and add the slaves' private IPs.
- Make sure the master and slave machines are in the same security group (that way all ports will be accessible to all machines).
- In the conf/spark-env.sh file, place export SPARK_MASTER_IP=MASTER-NODES-PUBLIC-OR-PRIVATE-IP and remo
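The points above might look like this in a standalone EC2 setup (all IPs are placeholders):

```
# conf/slaves on the master -- one slave private IP per line, no localhost:
10.0.0.11
10.0.0.12

# conf/spark-env.sh on every node:
export SPARK_MASTER_IP=10.0.0.10   # the master's public or private IP
```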

Re: Problem reading Parquet from 1.2 to 1.3

2015-06-07 Thread Don Drake
Thanks Cheng, we have a workaround in place for Spark 1.3 (remove .metadata directory), good to know it will be resolved in 1.4. -Don On Sun, Jun 7, 2015 at 8:51 AM, Cheng Lian wrote: > This issue has been fixed recently in Spark 1.4 > https://github.com/apache/spark/pull/6581 > > Cheng > > >

Re: Monitoring Spark Jobs

2015-06-07 Thread Otis Gospodnetić
Hi Sam, Have a look at Sematext's SPM for your Spark monitoring needs. If the problem is CPU, IO, network, etc. as Akhil mentioned, you'll see that in SPM, too. As for the number of jobs running, you can see a chart with that at http://sematext.com/spm/integrations/spark-monitoring.html Otis --

Driver crash at the end with InvocationTargetException when running SparkPi

2015-06-07 Thread Dong Lei
Hi spark users: After I submitted a SparkPi job to spark, the driver crashed at the end of the job with the following log: WARN EventLoggingListener: Event log dir file:/d:/data/SparkWorker/work/driver-20150607200517-0002/logs/event does not exists, will newly create one. Exception in thread "

Examples of flatMap in dataFrame

2015-06-07 Thread Dimp Bhat
Hi, I'm trying to write a custom transformer in Spark ML and since that uses DataFrames, am trying to use flatMap function in DataFrame class in Java. Can you share a simple example of how to use the flatMap function to do word count on single column of the DataFrame. Thanks Dimple

FlatMap in DataFrame

2015-06-07 Thread dimple
Hi, I'm trying to write a custom transformer in Spark ML and since that uses DataFrames, am trying to use flatMap function in DataFrame class in Java. Can you share a simple example of how to use the flatMap function to do word count on single column of the DataFrame. Thanks. Dimple -- View thi

Re: columnar structure of RDDs from Parquet or ORC files

2015-06-07 Thread kiran lonikar
Thanks for replying twice :) I think I sent this question by email and somehow thought I had not sent it, hence created the other one on the web interface. Let's retain this thread since you have provided more details here. Great, it confirms my intuition about DataFrame. It's similar to Shark colu

Good Spark consultants?

2015-06-07 Thread jakeheller
I was wondering if there were any consultants in high standing in the community. We are considering using Spark, and we'd love to have someone with a lot of experience help us get up to speed and reimplement a preexisting data pipeline on Spark (and perhaps first help answer the question of wheth

How to obtain ActorSystem and/or ActorFlowMaterializer in updateStateByKey

2015-06-07 Thread algermissen1971
Hi, I am writing some code inside an update function for updateStateByKey that flushes data to a remote system using akka-http. For the akka-http request I need an ActorSystem and an ActorFlowMaterializer. Can anyone share a pattern or insights that address the following questions: - Where and
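One commonly used pattern for this kind of problem (names here are hypothetical) is a lazily initialized singleton object: because the object is initialized on first use inside a task, the ActorSystem is created once per executor JVM rather than serialized from the driver.

```scala
import akka.actor.ActorSystem

object FlushClient {
  lazy val system: ActorSystem = {
    val s = ActorSystem("flush-client")
    sys.addShutdownHook(s.shutdown())  // tear down when the executor JVM exits
    s
  }
}

// Inside the updateStateByKey update function (which runs on executors):
// val system = FlushClient.system
// ... build the ActorFlowMaterializer and the akka-http request from it ...
```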

Re: Examples of flatMap in dataFrame

2015-06-07 Thread Ram Sriharsha
Hi You are looking for the explode method (in the DataFrame API starting with 1.3, I believe) https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L1002 Ram On Sun, Jun 7, 2015 at 9:22 PM, Dimp Bhat wrote: > Hi, > I'm trying to write a custom transform
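A hedged sketch of a single-column word count built on that explode method (the input path and column names are hypothetical):

```scala
// Assumes a data set with a single string column named "text".
val sentences = sqlContext.jsonFile("hdfs:///sentences.json")

// Explode each line into one row per word, in a new "word" column.
val words = sentences.explode("text", "word") {
  (line: String) => line.split(" ")
}

words.groupBy("word").count().show()
```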

RE: SparkSQL: How to specify replication factor on the persisted parquet files?

2015-06-07 Thread Haopu Wang
Cheng, thanks for the response. Yes, I was using HiveContext.setConf() to set "dfs.replication". However, I cannot change the value in Hadoop core-site.xml because that will change every HDFS file. I only want to change the replication factor of some specific files. -Original Message- Fro
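One commonly suggested workaround for exactly this situation (an untested sketch; the path and replication factor are hypothetical) is to set "dfs.replication" on the Hadoop Configuration carried by the SparkContext, which affects only files written through that context rather than every HDFS file:

```scala
// Applies to subsequent writes from this SparkContext only.
sc.hadoopConfiguration.set("dfs.replication", "2")
df.saveAsParquetFile("hdfs:///warehouse/my_table")
```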