Am I right in thinking that Python MLlib does not contain the optimization
module? Are there plans to add this to the Python API?
Hi,
This is expected behavior. HiveContext.sql (and also
DataFrame.registerTempTable) is only expected to be invoked on the driver
side. However, the closure passed to RDD.foreach is executed on the executor
side, where no viable HiveContext instance exists.
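To make the distinction concrete, here is a minimal sketch (not from the
original reply; it assumes an existing SparkContext `sc`, and the table and
column names are made up):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Problematic: the closure runs on executors, where no usable HiveContext exists
// someRdd.foreach { id => hiveContext.sql(s"SELECT * FROM logs WHERE id = $id") }

// Driver-side alternative: issue the query once on the driver, then do the
// per-record work on the resulting RDD
val logs = hiveContext.sql("SELECT id, message FROM logs")
logs.rdd.foreach(row => println(row.getString(1)))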
Cheng
On 6/7/15 10:06 AM, patcharee wrote:
Hi All,
I have a Spark SQL application to fetch data from Hive, and on top of it I
have an akka layer to run multiple queries in parallel.
Please suggest a mechanism to figure out the number of Spark jobs
running in the cluster at a given instant of time.
I need to do the above as I see the ave
On 6/6/15 9:06 AM, James Pirz wrote:
I am pretty new to Spark. Using Spark 1.3.1, I am trying to use
Spark SQL to run some SQL scripts on the cluster. I realized that
for better performance, it is a good idea to use Parquet files. I
have 2 questions regarding that:
1) If I wanna us
Usually Parquet can be more efficient because of its columnar nature.
Say your table has 10 columns but your join query only touches 3 of
them; Parquet only reads those 3 columns from disk, while Avro must load
all the data.
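As an illustration (a hypothetical sketch; the file path and column names
are made up, and an existing SQLContext `sqlContext` is assumed):

val events = sqlContext.parquetFile("/data/events.parquet")

// Even if the file has 10 columns, only these 3 column chunks are read
// from disk, thanks to Parquet's columnar layout
events.select("user_id", "ts", "url").count()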
Cheng
On 6/5/15 3:00 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote:
We currently have data in a
Hi,
How can I work with HiveContext on the executors? If only the
driver can see HiveContext, does it mean I have to collect all datasets
(very large) to the driver and use HiveContext there? That would overload
the driver's memory and fail.
BR,
Patcharee
On 07 June 2015 at 11:51, Che
This issue has been fixed recently in Spark 1.4
https://github.com/apache/spark/pull/6581
Cheng
On 6/5/15 12:38 AM, Marcelo Vanzin wrote:
I talked to Don outside the list and he says that he's seeing this
issue with Apache Spark 1.3 too (not just CDH Spark), so it seems like
there is a real i
Are you calling hiveContext.sql within an RDD.map closure or something
similar? If so, the call actually happens on the executor side.
However, HiveContext only exists on the driver side.
Cheng
On 6/4/15 3:45 PM, patcharee wrote:
Hi,
I am using Hive 0.14 and Spark 0.13. I got
java.lang.Nu
For the following code:
val df = sqlContext.parquetFile(path)
`df` remains columnar (actually it just reads from the columnar Parquet
file on disk). For the following code:
val cdf = df.cache()
`cdf` is also columnar but that's different from Parquet. When a
DataFrame is cached, Spa
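A combined sketch of the two cases above (the path is hypothetical, and an
existing SQLContext `sqlContext` is assumed):

val df = sqlContext.parquetFile("/path/to/data.parquet")  // lazily reads Parquet

// cache() stores the data in Spark SQL's in-memory columnar format,
// which is separate from the on-disk Parquet representation
val cdf = df.cache()
cdf.count()  // the first action materializes the in-memory columnar cache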
Interesting, exactly the same question was just posted on another thread :)
My answer there is quoted below:
> For the following code:
>
> val df = sqlContext.parquetFile(path)
>
> `df` remains columnar (actually it just reads from the columnar
Parquet file on disk). For the following code:
You can use backticks to quote the column names.
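For example (a hypothetical sketch; the table, DataFrame, and column names
are only illustrative):

sqlContext.sql("SELECT `Executor Info` FROM events")

// The same quoting works with expression strings in the DataFrame API
df.selectExpr("`Executor Info`")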
Cheng
On 6/3/15 2:49 AM, David Mitchell wrote:
I am having the same problem reading JSON. There does not seem to be
a way of selecting a field that has a space in its name, such as "Executor
Info", from the Spark logs.
I suggest that we open a JIRA ticket to ad
Were you using HiveContext.setConf()?
"dfs.replication" is a Hadoop configuration property, but setConf() is only
used to set Spark SQL specific configurations. You may set it in your
Hadoop core-site.xml instead.
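As a sketch of the distinction (the sc.hadoopConfiguration line below is an
untested alternative I am assuming, not something settled in this thread):

// setConf() only affects Spark SQL configurations, for example:
hiveContext.setConf("spark.sql.shuffle.partitions", "200")

// "dfs.replication" is a Hadoop property; one possible programmatic
// alternative to core-site.xml is the SparkContext's Hadoop configuration
sc.hadoopConfiguration.set("dfs.replication", "2")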
Cheng
On 6/2/15 2:28 PM, Haopu Wang wrote:
Hi,
I'm trying to save SparkSQL DataFrame
Is it possible that some Parquet files of this data set have a different
schema from the others? Especially the ones reported in the exception messages.
One way to confirm this is to use [parquet-tools] [1] to inspect these
files:
$ parquet-schema
Cheng
[1]: https://github.com/apache/parquet
Spark SQL supports Hive dynamic partitioning, so one possible workaround
is to create a Hive table partitioned by zone, z, year, and month
dynamically, and then insert the whole dataset into it directly.
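A rough sketch of that workaround (table and column names are made up, and
the exact DDL may need adjusting for your Hive setup):

hiveContext.sql("SET hive.exec.dynamic.partition=true")
hiveContext.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

hiveContext.sql(
  """CREATE TABLE measurements_partitioned (value DOUBLE)
    |PARTITIONED BY (zone INT, z INT, year INT, month INT)
    |STORED AS PARQUET""".stripMargin)

// Dynamic partition columns must come last in the SELECT list
hiveContext.sql(
  """INSERT INTO TABLE measurements_partitioned
    |PARTITION (zone, z, year, month)
    |SELECT value, zone, z, year, month FROM measurements""".stripMargin)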
In 1.4, we also provide dynamic partitioning support for non-Hive
environments, and you ca
What is the code used to set up the kafka stream?
On Sat, Jun 6, 2015 at 3:23 PM, EH wrote:
> And here is the Thread Dump, where seems every worker is waiting for
> Executor
> #6 Thread 95: sparkExecutor-akka.actor.default-dispatcher-22 (RUNNABLE) to
> be complete:
>
> Thread 41: BLOCK_MANAGER c
What is a decision list? An in-order traversal (or some other traversal) of a
fitted decision tree?
On Jun 5, 2015 1:21 AM, "Sateesh Kavuri" wrote:
> Is there an existing way in SparkML to convert a decision tree to a
> decision list?
>
> On Thu, Jun 4, 2015 at 10:50 PM, Reza Zadeh wrote:
>
>> The close
Another approach would be to use ZooKeeper. If you have ZooKeeper
running somewhere in the cluster you can simply create a path like
*/dynamic-list* in it and then write objects/values to it; you can even
create/access nested objects.
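A minimal sketch with the plain ZooKeeper client (the connection string,
paths, and values are placeholders):

import org.apache.zookeeper.{CreateMode, WatchedEvent, Watcher, ZooDefs, ZooKeeper}

val zk = new ZooKeeper("zk-host:2181", 5000, new Watcher {
  override def process(event: WatchedEvent): Unit = ()
})

// create the parent node once
zk.create("/dynamic-list", Array.empty[Byte],
  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)

// add an entry and read it back
zk.create("/dynamic-list/item-1", "some-value".getBytes("UTF-8"),
  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)
val value = new String(zk.getData("/dynamic-list/item-1", false, null), "UTF-8")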
Thanks
Best Regards
On Fri, Jun 5, 2015 at 7:06 PM, Cosmin
It could be a CPU, IO, or network bottleneck; you need to figure out where
exactly it's choking. You can use monitoring utilities (like top)
to understand it better.
Thanks
Best Regards
On Sun, Jun 7, 2015 at 4:07 PM, SamyaMaiti
wrote:
> Hi All,
>
> I have a Spark SQL application to fetch
Which consumer are you using? If you can paste the complete code then maybe
I can try reproducing it.
Thanks
Best Regards
On Sun, Jun 7, 2015 at 1:53 AM, EH wrote:
> And here is the Thread Dump, where seems every worker is waiting for
> Executor
> #6 Thread 95: sparkExecutor-akka.actor.default
- Remove localhost from the conf/slaves file and add the slaves' private IPs.
- Make sure the master and slave machines are in the same security group (this
way all ports will be accessible to all machines)
- In conf/spark-env.sh file, place export
SPARK_MASTER_IP=MASTER-NODES-PUBLIC-OR-PRIVATE-IP and remo
Thanks Cheng, we have a workaround in place for Spark 1.3 (remove
.metadata directory), good to know it will be resolved in 1.4.
-Don
On Sun, Jun 7, 2015 at 8:51 AM, Cheng Lian wrote:
> This issue has been fixed recently in Spark 1.4
> https://github.com/apache/spark/pull/6581
>
> Cheng
>
>
>
Hi Sam,
Have a look at Sematext's SPM for your Spark monitoring needs. If the
problem is CPU, IO, network, etc. as Akhil mentioned, you'll see that in
SPM, too.
As for the number of jobs running, you can see a chart with that at
http://sematext.com/spm/integrations/spark-monitoring.html
Otis
--
Hi spark users:
After I submitted a SparkPi job to spark, the driver crashed at the end of the
job with the following log:
WARN EventLoggingListener: Event log dir
file:/d:/data/SparkWorker/work/driver-20150607200517-0002/logs/event does not
exists, will newly create one.
Exception in thread "
Hi,
I'm trying to write a custom transformer in Spark ML, and since that uses
DataFrames, I am trying to use the flatMap function of the DataFrame class in
Java. Can you share a simple example of how to use the flatMap function to do
a word count on a single column of the DataFrame? Thanks
Dimple
Hi,
I'm trying to write a custom transformer in Spark ML, and since that uses
DataFrames, I am trying to use the flatMap function of the DataFrame class in
Java. Can you share a simple example of how to use the flatMap function to do a
word count on a single column of the DataFrame? Thanks.
Dimple
Thanks for replying twice :) I think I sent this question by email and
somehow thought I had not sent it, hence created the other one on the web
interface. Let's retain this thread since you have provided more details
here.
Great, it confirms my intuition about DataFrame. It's similar to Shark
colu
I was wondering if there were any consultants in high standing in the
community. We are considering using Spark, and we'd love to have someone
with a lot of experience help us get up to speed and implement a preexisting
data pipeline to use Spark (and perhaps first help answer the question of
wheth
Hi,
I am writing some code inside an update function for updateStateByKey that
flushes data to a remote system using akka-http.
For the akka-http request I need an ActorSystem and an ActorFlowMaterializer.
Can anyone share a pattern or insights that address the following questions:
- Where and
Hi
You are looking for the explode method (in the DataFrame API starting in 1.3,
I believe):
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L1002
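A minimal Scala sketch of using explode for a word count over a single column
(it assumes a DataFrame `df` with a string column named "text"; the names are
hypothetical):

val words = df.explode("text", "word") { text: String => text.split(" ") }
val counts = words.groupBy("word").count()
counts.show()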
Ram
On Sun, Jun 7, 2015 at 9:22 PM, Dimp Bhat wrote:
> Hi,
> I'm trying to write a custom transform
Cheng, thanks for the response.
Yes, I was using HiveContext.setConf() to set "dfs.replication".
However, I cannot change the value in Hadoop core-site.xml because that
will change every HDFS file.
I only want to change the replication factor of some specific files.
-Original Message-
Fro