Thanks Himanshu and RahulKumar!
The databricks forum post was extremely useful. It is great to see an
article that clearly details how and when shuffles are cleaned up.
Can you try repartitioning the RDD after creating the (K, V) pairs? Also, when
calling rdd1.join(rdd2, ...), pass the number-of-partitions argument too.
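A minimal sketch of that suggestion, assuming sc is an existing SparkContext; the data and the partition count of 200 are made up for illustration:
val numPartitions = 200
val rddA = sc.parallelize(Seq((1, "a"), (2, "b"))).repartition(numPartitions)  // (K, V) pairs
val rddB = sc.parallelize(Seq((1, "x"), (2, "y"))).repartition(numPartitions)
val joined = rddA.join(rddB, numPartitions)  // pass the partition count to join as well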
Thanks
Best Regards
On Wed, Jun 17, 2015 at 12:15 PM, Al M wrote:
> I have 2 RDDs I want to Join. We will call them RDD A and RDD B. RDD A
> has
> 1 billio
Not sure why spark-submit isn't shipping your project jar (maybe try with
--jars). You can also do sc.addJar("/path/to/your/project.jar"); that should
solve it.
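For example (a sketch; the jar path is a placeholder), either pass the jar at submit time or add it from the driver once the SparkContext exists:
// spark-submit ... --jars /path/to/your/project.jar ...
sc.addJar("/path/to/your/project.jar")  // ships the jar to executors for subsequent tasks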
Thanks
Best Regards
On Wed, Jun 17, 2015 at 6:37 AM, Yana Kadiyska
wrote:
> Hi folks,
>
> running into a pretty strange issue -- I have
Hi Nathan,
Thanks a lot for the detailed report, especially the information about
nonconsecutive part numbers. It's confirmed to be a race condition bug, and I
just filed https://issues.apache.org/jira/browse/SPARK-8406 to track this. We
will deliver a fix ASAP, and it will be included in 1.4.1.
I have finished training a MatrixFactorizationModel, and I want to load this
model in Spark Streaming.
I thought it would work, but actually it doesn't. I don't know why; who can
help me?
I wrote code like this:
val ssc = new StreamingContext ...
val bestModel = MatrixFactorizationModel.load(ssc.sparkC
When you say Storm, did you mean Storm with Trident or Storm?
My use case does not have simple transformation. There are complex events that
need to be generated by joining the incoming event stream.
Also, what do you mean by "No Back Pressure"?
On Wednesday, 17 June 2015 11:57 AM, Enno
Hi, can somebody suggest a way to reduce the number of tasks?
2015-06-15 18:26 GMT+02:00 Serega Sheypak :
> Hi, I'm running Spark SQL against a Cassandra table. I have 3 C* nodes; each
> of them has a Spark worker.
> The problem is that Spark runs 869 tasks to read 3 rows: select bar from
> foo.
>
Thanks Cheng. Nice find!
Let me know if there is anything we can do to help on this end with
contributing a fix or testing.
Side note - any ideas on the 1.4.1 ETA? There are a few bug fixes we need in
there.
Cheers,
Nathan
From: Cheng Lian
Date: Wednesday, 17 June 2015 6:25 pm
To: Nathan, "us
The logs tell you what caused the error: you cannot invoke RDD
transformations and actions inside other transformations. You have not done
this explicitly, but the implementation of
MatrixFactorizationModel.recommendProducts does; you can refer to
https://github.com/apache/spark/blob/master/mlli
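To illustrate, a sketch under the assumption that model is a loaded MatrixFactorizationModel and userIds is an RDD[Int] (both names are made up):
// This fails: recommendProducts touches the model's internal RDDs, so mapping
// over userIds nests RDD operations inside a transformation.
// userIds.map(u => model.recommendProducts(u, 10))
// One workaround is to drive the calls from the driver side instead:
val recs = userIds.collect().map(u => (u, model.recommendProducts(u, 10)))
// (Spark 1.4 adds recommendProductsForUsers, which does this for all users at once.)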
I guess both. In terms of syntax, I was comparing it with Trident.
If you are joining, Spark Streaming actually does offer windowed join out
of the box. We couldn't use this though as our event stream can grow
"out-of-sync", so we had to implement something on top of Storm. If your
event streams d
Is there any good sample code in Java for implementing and using a custom
actor-based receiver?
--
Thanks & Regards,
Anshu Shukla
Hi,
I downloaded the source from Downloads page and ran the make-distribution.sh
script.
# ./make-distribution.sh --tgz -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests
clean package
The script has “-x” set in the beginning.
++ /tmp/a/spark-1.4.0/build/mvn help:evaluate -Dexpression=project.ve
Hi everyone,
After copying the hive-site.xml from a CDH5 cluster, I can't seem to
connect to the hive metastore using spark-shell, here's a part of the stack
trace I get :
15/06/17 04:41:57 ERROR TSaslTransport: SASL negotiation failure
javax.security.sasl.SaslException: GSS initiate failed [Cause
Hi,
I can’t seem to find any documentation on this feature in 1.4.0?
Regards,
Liming
Here's one simple Spark example where I call RDD#count 2 times. The first
time it invokes 2 stages, but the second one needs only 1 stage. It seems
the first stage is cached. Is that true? Is there any flag to control whether
the intermediate stage is cached?
val data = sc.parallelize(1 to 10, 2).m
Hi, spark-sql estimated the input for a Cassandra table with 3 rows as 8 TB;
sometimes it's estimated as -167 B.
I run it on a laptop; I don't have 8 TB of space for the data.
Hey,
I noticed that my code spends hours with `generateTreeString` even though the
actual dag/dataframe execution takes seconds.
I’m running a query that grows exponentially in the number of iterations when
evaluated without caching,
but should be linear when caching previous results.
E.g.
r
Since SPARK-8406 is serious, we hope to ship it ASAP, possibly next
week, but I can't say it's a promise yet. However, you can cherry-pick
the commit as soon as the fix is merged into branch-1.4. Sorry for the
troubles!
Cheng
On 6/17/15 1:42 AM, Nathan McCarthy wrote:
Thanks Cheng. Nice find
1.4.0 resolves the problem.
The total number of classes loaded for an updateStateByKey over Int and String
types does not increase.
The total number of classes loaded for an updateStateByKey over case classes
does increase over time, but
the processing remains stable. Both memory consumption and CPU load remain
boun
Can you show some code for how you're doing the reads? Have you successfully
read other data from Cassandra (i.e. do you have a lot of experience with
this path and this particular table is causing issues, or are you trying to
figure out the right way to do a read)?
What version of Spark and Cassandra
>version
We are on DSE 4.7. (Cassandra 2.1) and spark 1.2.1
>cqlsh
select * from site_users
returns fast, subsecond, only 3 rows
>Can you show some code how you're doing the reads?
dse beeline
!connect ...
select * from site_users
--table has 3 rows, several columns in each row. Spark runs 769 t
I think
https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
might shed some light on the behaviour you’re seeing.
Mark
From: canan chen [mailto:ccn...@gmail.com]
Sent: June-17-15 5:57 AM
To: spark users
Subject: Intermedate stage will be cached automatically ?
Here's on
I am a student of telecommunications engineering, and this year I worked with
Spark. It is a world that I like, and I want to know whether there are jobs in
this area.
Thanks for all
Regards
My use case is below.
We are going to receive a lot of events as a stream (basically a Kafka stream),
and then we need to process and compute.
Consider that you have a phone contract with AT&T, and every call / SMS / data
usage you do is an event; then it needs to calculate your bill on a real-time
basis, so
Great discussion!!
One question about a comment: Also, you can do some processing with Kinesis.
If all you need to do is straight forward transformation and you are
reading from Kinesis to begin with, it might be an easier option to just do
the transformation in Kinesis
- Do you mean KCL application
It's not cached per se; for example, you will not see it in the Storage tab in
the UI. However, Spark has read the data and it's in memory right now, so the
next count call should be very fast.
Best
Ayan
On Wed, Jun 17, 2015 at 10:21 PM, Mark Tse wrote:
> I think
> https://spark.apache.org/docs/late
In that case I assume you need exactly once semantics. There's no
out-of-the-box way to do that in Spark. There is updateStateByKey, but it's
not practical with your use case as the state is too large (it'll try to
dump the entire intermediate state on every checkpoint, which would be
prohibitively
Hi Sparkers ,
https://dl.acm.org/citation.cfm?id=2742788
Recently Twitter released a paper on Heron as a replacement for Apache Storm,
and I would like to know whether Apache Spark currently suffers from the
same issues they have outlined.
Any input / thoughts will be helpful.
Thanks,
Ashish
As per my best understanding, Spark Streaming offers exactly-once processing.
Is this achieved only through updateStateByKey, or is there another way to
do the same?
Ashish
On Wed, Jun 17, 2015 at 8:48 AM, Enno Shioji wrote:
> In that case I assume you need exactly once semantics. There's no
> out
So Spark (not streaming) does offer exactly once. Spark Streaming, however,
can only provide exactly-once semantics *if the update operation is idempotent*.
updateStateByKey's update operation is idempotent, because it completely
replaces the previous state.
So as long as you use Spark streaming, you mu
PS just to elaborate on my first sentence, the reason Spark (not streaming)
can offer exactly once semantics is because its update operation is
idempotent. This is easy to do in a batch context because the input is
finite, but it's harder in streaming context.
On Wed, Jun 17, 2015 at 2:00 PM, Enno
Hi Ayan,
Admittedly I haven't done much with Kinesis, but if I'm not mistaken you
should be able to use their "processor" interface for that. In this
example, it's incrementing a counter:
https://github.com/awslabs/amazon-kinesis-data-visualization-sample/blob/master/src/main/java/com/amazonaws/se
A stream can also be processed in micro-batches, which is the main idea
behind Spark Streaming, so what is the difference?
Ashish
On Wed, Jun 17, 2015 at 9:04 AM, Enno Shioji wrote:
> PS just to elaborate on my first sentence, the reason Spark (not
> streaming) can offer exactly once sema
Yes, actually on the Storage UI there's no data cached. But the behavior
confuses me. If I call the cache method as follows, the behavior is the
same as without calling the cache method; why is that?
val data = sc.parallelize(1 to 10, 2).map(e=>(e%2,2)).reduceByKey(_ + _, 2)
data.cache()
println(dat
Cache is more general. ReduceByKey involves a shuffle step where the data
will be in memory and on disk (for what doesn't fit in memory). The
shuffle files will remain around until the end of the job. The blocks from
memory will be dropped if memory is needed for other things. This is an
optimisat
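A small sketch of the difference, assuming sc from spark-shell and using the same toy data as above:
val data = sc.parallelize(1 to 10, 2).map(e => (e % 2, 2)).reduceByKey(_ + _, 2)
data.count()   // runs both stages and writes shuffle files
data.count()   // the map-side stage is skipped because the shuffle output still exists
data.cache()
data.count()   // from now on the RDD's blocks also show up in the Storage tab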
Processing stuff in batch is not the same thing as being transactional. If
you look at Storm, it will e.g. skip tuples that were already applied to a
state to avoid counting stuff twice etc. Spark doesn't come with such
facility, so you could end up counting twice etc.
On Wed, Jun 17, 2015 at 2:
Any ideas what version of Spark is underneath?
i.e. is it 1.4? and is SparkR supported on Amazon EMR?
On Wed, Jun 17, 2015 at 12:06 AM, ayan guha wrote:
> That's great news. Can I assume spark on EMR supports kinesis to hbase
> pipeline?
> On 17 Jun 2015 05:29, "kamatsuoka" wrote:
>
>> Spark i
Hi,
I have a HiveContext which I am using in multiple threads to submit a
Spark SQL query using *sql* method.
I just wanted to know whether this method is thread-safe or not. Will all my
queries be submitted at the same time, independently of each other, or will
they be submitted sequentially, one after the o
It looks like it is a wrapper around
https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark
So basically adding an option -v,1.4.0.a should work.
https://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-spark-configure.html
2015-06-17 15:32 GMT+02:00 Hideyoshi Maeda :
>
Yes, it is thread safe. That’s how Spark SQL JDBC Server works.
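A minimal sketch of that pattern, assuming an existing HiveContext named hiveContext and two illustrative table names:
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

val queries = Seq("SELECT count(*) FROM table_a", "SELECT count(*) FROM table_b")
val futures = queries.map(q => Future { hiveContext.sql(q).collect() })  // one Spark job per query
val results = futures.map(f => Await.result(f, 10.minutes))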
Cheng Hao
From: V Dineshkumar [mailto:developer.dines...@gmail.com]
Sent: Wednesday, June 17, 2015 9:44 PM
To: user@spark.apache.org
Subject: Is HiveContext Thread Safe?
Hi,
I have a HiveContext which I am using in multiple thread
Yes, for now it is a wrapper around the old install-spark BA, but that will
change soon. The currently supported version in AMI 3.8.0 is 1.3.1, as 1.4.0
was released too late to include it in AMI 3.8.0. Spark 1.4.0 support is coming
soon though, of course. Unfortunately, though install-spark is
I've become accustomed to being able to use system properties to override
properties in the Hadoop Configuration objects. I just recently noticed
that when Spark creates the Hadoop Configuration in the SparkContext, it
cycles through any properties prefixed with spark.hadoop. and adds those
properti
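For example (a sketch; the Hadoop key used here is just illustrative):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("hadoop-conf-example")
  .set("spark.hadoop.fs.s3a.connection.maximum", "100")  // copied into the Hadoop conf minus the prefix
val sc = new SparkContext(conf)
println(sc.hadoopConfiguration.get("fs.s3a.connection.maximum"))  // prints 100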
So, here is some more input:
the problem could be in the spark-sql-thriftserver.
When I use the Spark console to submit the SQL query, it takes 10 seconds and
a reasonable number of tasks.
import com.datastax.spark.connector._;
val cc = new CassandraSQLContext(sc);
cc.sql("select su.user_id from appdata.site_u
Thanks for this. It's a KCL-based Kinesis application, but because it's just a
Java application we are thinking of using Spark on EMR or Storm for fault
tolerance and load balancing. Is that a correct approach?
On 17 Jun 2015 23:07, "Enno Shioji" wrote:
> Hi Ayan,
>
> Admittedly I haven't done much with
It seems you're hitting the self-join case; currently Spark SQL won't cache any
result/logical tree for further analysis or computation for a self-join. Since
the logical tree is huge, it's reasonable that generating its tree string
recursively takes a long time. And I also doubt the computation can finish wit
AFAIK KCL is *supposed* to provide fault tolerance and load balancing (plus
additionally, elastic scaling unlike Storm), Kinesis providing the
coordination. My understanding is that it's like a naked Storm worker
process that can consequently only do map.
I haven't really used it tho, so can't rea
@Enno
As per the latest version and documentation, Spark Streaming does offer
exactly-once semantics using the improved Kafka integration. Note: I have not
tested it yet.
Any feedback will be helpful if anyone has tried the same.
http://koeninger.github.io/kafka-exactly-once/#7
https://databricks.com/blog
What happens is that Spark opens the files in order to merge the schema.
Unfortunately, Spark assumes that the files are local and that access will be
fast, which makes this step extremely slow on S3.
If you know all the files use the same schema (e.g. it is the result of a
previous job)
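One way to sketch that, assuming a SQLContext named sqlContext and illustrative S3 paths (whether the discovery step is fully skipped depends on your Spark version):
// Read the schema from one representative file, then reuse it for the whole
// directory so Spark does not have to open every footer to merge schemas.
val knownSchema = sqlContext.read.parquet("s3n://my-bucket/output/part-r-00000.parquet").schema
val df = sqlContext.read.schema(knownSchema).parquet("s3n://my-bucket/output/")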
The thing is, even with that improvement, you still have to make updates
idempotent or transactional yourself. If you read
http://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics
that refers to the latest version, it says:
Semantics of output operations
Out
So to be clear, you're trying to use the recommendProducts method of
MatrixFactorizationModel? I don't see predictAll in 1.3.1
1.4.0 has a more efficient method to recommend products for all users (or
vice versa):
https://github.com/apache/spark/blob/v1.4.0/mllib/src/main/scala/org/apache/spark/ml
Thanks to both of you!
You solved the problem.
Thanks
Erik Stensrud
Sendt fra min iPhone
Den 16. jun. 2015 kl. 20.23 skrev Guru Medasani
mailto:gdm...@gmail.com>>:
Hi Esten,
Looks like your sqlContext is connected to a Hadoop/Spark cluster, but the file
path you specified is local?
mydf<-r
> Seems you're hitting the self-join, currently Spark SQL won't cache any
> result/logical tree for further analyzing or computing for self-join.
Other joins don’t suffer from this problem?
> Since the logical tree is huge, it's reasonable to take long time in
> generating its tree string recu
Actually the reverse.
Spark Streaming is really a micro-batch system where the smallest window is 1/2
a second (500 ms).
So for CEP, it's not really a good idea.
So in terms of options… Spark Streaming, Storm, Samza, Akka and others…
Storm is probably the easiest to pick up; Spark Streaming
How can I get more information regarding this exception?
On Wed, Jun 17, 2015 at 1:17 AM, Saiph Kappa wrote:
> Hi,
>
> I am running a simple spark streaming application on hadoop 2.7.0/YARN
> (master: yarn-client) with 2 executors in different machines. However,
> while the app is running, I can
Could it be composed, maybe? A general version and then a SQL version that
exploits the additional info/abilities available there and uses the general
version internally...
I assume the SQL version can benefit from the logical-phase optimization to
pick join details. Or is there more?
On Tue, Jun
Again, by Storm, you mean Storm Trident, correct?
On Wednesday, 17 June 2015 10:09 PM, Michael Segel
wrote:
Actually the reverse.
Spark Streaming is really a micro batch system where the smallest window is 1/2
a second (500ms). So for CEP, its not really a good idea.
So in terms o
Hi John,
Did you also set spark.sql.planner.externalSort to true? You probably will
not see executors lost with this conf. For now, maybe you can manually split
the query into two parts, one for skewed keys and one for other records.
Then, union the results of these two parts together.
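A rough sketch of the split-and-union idea, assuming two DataFrames facts and dims joined on a column "k" and a known hot key value (all names are made up):
val hot       = facts.filter(facts("k") === "hotKey")
val remainder = facts.filter(facts("k") !== "hotKey")
val joinedHot = hot.join(dims, hot("k") === dims("k"))
val joinedRem = remainder.join(dims, remainder("k") === dims("k"))
val result    = joinedHot.unionAll(joinedRem)   // same schema, so unionAll is fine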
Thanks,
So, the second attempt of those tasks that failed with the NPE can complete,
and the job eventually finishes?
On Mon, Jun 15, 2015 at 10:37 PM, Night Wolf wrote:
> Hey Yin,
>
> Thanks for the link to the JIRA. I'll add details to it. But I'm able to
> reproduce it, at least in the same shell session, every
Hello Shreesh,
That would be quite a challenge to understand.
A few things that I think should help estimate those numbers:
1) Understanding the cost of the individual transformations in the
application.
E.g. a flatMap can be more expensive in memory than a map.
2) The communication patter
Hi,
I am running this on Spark stand-alone mode. I find that when I examine the
web UI, a couple bugs arise:
1. There is a discrepancy between the number denoting the duration of the
application when I run the history server and the number given by the web UI
(default address is master:8080). I c
This documentation is only for writes to an external system, but all the
counting you do within your streaming app (e.g. if you use reduceByKeyAndWindow
to keep track of a running count) is exactly-once. When you write to a storage
system, no matter which streaming framework you use, you'll have
For 1)
In standalone mode, you can increase the worker's resource allocation in
their local conf/spark-env.sh with the following variables:
SPARK_WORKER_CORES,
SPARK_WORKER_MEMORY
At application submit time, you can tune the resources allocated to
executors with --executor-cores and
Also, still for 1), in conf/spark-defaults.conf, you can set the following
properties to tune the driver's resources:
spark.driver.cores
spark.driver.memory
Not sure if you can pass them at submit time, but it should be possible.
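The same knobs can also be expressed programmatically on a SparkConf (a sketch; the values are placeholders):
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.cores", "4")
  .set("spark.executor.memory", "8g")
  .set("spark.driver.cores", "2")     // driver settings mainly matter in cluster mode
  .set("spark.driver.memory", "4g")   // driver memory is usually more reliable to set at submit time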
I don't think this is the same issue as it works just fine in pyspark
v1.3.1.
Are you aware of any workaround? I was hoping to start testing one of my
apps in Spark 1.4 and I use the CSV exports as a safety valve to easily
debug my data flow.
-Don
On Sun, Jun 14, 2015 at 7:18 PM, Burak Yavuz w
Hi Matei,
Ah, can't get more accurate than from the horse's mouth... If you don't
mind helping me understand it correctly..
From what I understand, Storm Trident does the following (when used with
Kafka):
1) Sit on Kafka Spout and create batches
2) Assign global sequential ID to the batches
3)
The only thing which doesn't make much sense in Spark Streaming (and I am
not saying it is done better in Storm) is the iterative and "redundant"
shipping of essentially the same tasks (closures/lambdas/functions) to
the cluster nodes AND re-launching them there again and again
This is a l
The major difference is that in Spark Streaming, there's no *need* for a
TridentState for state inside your computation. All the stateful operations
(reduceByWindow, updateStateByKey, etc) automatically handle exactly-once
processing, keeping updates in order, etc. Also, you don't need to run a
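For reference, a minimal updateStateByKey sketch, assuming an existing StreamingContext ssc and a DStream[String] named words; the checkpoint directory is a placeholder:
ssc.checkpoint("/tmp/streaming-checkpoints")   // required for stateful operations
val counts = words.map(w => (w, 1)).updateStateByKey[Int] {
  (newValues: Seq[Int], state: Option[Int]) => Some(state.getOrElse(0) + newValues.sum)
}
counts.print()   // the state is replaced wholesale each batch, which keeps updates idempotent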
OK, what was wrong was that spark-env.sh did not have
HADOOP_CONF_DIR properly set to /etc/hadoop/conf/.
With that fixed, this issue is gone, but I can't seem to get Spark SQL
1.4.0 with Hive working on CDH 5.3 or 5.4 :
Using this command line :
IPYTHON=1 /.../spark-1.4.0-bin-hadoop2.4/bin/py
Is it possible to achieve serial batching with Spark Streaming?
Example:
I configure the Streaming Context for creating a batch every 3 seconds.
Processing of the batch #2 takes longer than 3 seconds and creates a
backlog of batches:
batch #1 takes 2s
batch #2 takes 10s
batch #3 takes 2s
batch
I am trying to run a hive query from Spark code using HiveContext object.
It was running fine earlier, but since Apache Sentry has been
installed, the process is failing with this exception:
*org.apache.hadoop.security.AccessControlException: Permission denied:
user=kakn, access=READ_EXECUT
Hi! I would like to know what the difference is between the following
transformations when they are executed right before writing an RDD to a file:
1. coalesce(1, shuffle = true)
2. coalesce(1, shuffle = false)
Code example:
val input = sc.textFile(inputFile)
val filtered = input.fi
To add more information beyond what Matei said and answer the original
question, here are other things to consider when comparing between Spark
Streaming and Storm.
* Unified programming model and semantics - Most occasions you have to
process the same data again in batch jobs. If you have two sep
The default behavior should be that batch X + 1 starts processing only
after batch X completes. If you are using Spark 1.4.0, could you show us a
screenshot of the streaming tab, especially the list of batches? And could
you also tell us if you are setting any SparkConf configurations?
On Wed, Jun
Hi Gajan,
Please subscribe to our user mailing list, which is the best place to get
your questions answered. We don't have weighted instance support, but
it should be easy to add and we plan to do it in the next release
(1.5). Thanks for asking!
Best,
Xiangrui
On Wed, Jun 17, 2015 at 2:33 PM, Gajan
So I've seen in the documentation that (after the overhead memory is
subtracted), the memory allocations of each executor are as follows (assume
default settings):
60% for cache
40% for tasks to process data
Reading about how Spark implements shuffling, I've also seen it say "20% of
executor mem
Hi there!
It seems like you have Read/Execute access permission (and no
update/insert/delete access). What operation are you performing?
Ajay
> On Jun 17, 2015, at 5:24 PM, nitinkak001 wrote:
>
> I am trying to run a hive query from Spark code using HiveContext object. It
> was running fine
What's the size of this table? Is the data skewed (so that speculation
is probably triggered)?
Cheng
On 6/15/15 10:37 PM, Night Wolf wrote:
Hey Yin,
Thanks for the link to the JIRA. I'll add details to it. But I'm able
to reproduce it, at least in the same shell session, every time I do a
w
Thanks for reporting this. Would you mind helping create a JIRA for this?
On 6/16/15 2:25 AM, patcharee wrote:
I found if I move the partitioned columns in schemaString and in Row
to the end of the sequence, then it works correctly...
On 16. juni 2015 11:14, patcharee wrote:
Hi,
I am using
>not being able to read from Kafka using multiple nodes
Kafka is plenty capable of doing this, by clustering together multiple
consumer instances into a consumer group.
If your topic is sufficiently partitioned, the consumer group can consume
the topic in a parallelized fashion.
If it isn't, you
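A sketch of that, assuming an existing StreamingContext ssc; the ZooKeeper quorum, group id, topic name and receiver count are placeholders:
import org.apache.spark.streaming.kafka.KafkaUtils

val numReceivers = 3   // receivers in the same group split the topic's partitions
val streams = (1 to numReceivers).map { _ =>
  KafkaUtils.createStream(ssc, "zk1:2181", "my-consumer-group", Map("events" -> 1))
}
val unified = ssc.union(streams)
unified.map(_._2).print()   // the stream yields (key, message) pairs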
Does increasing executor memory fix the memory problem?
How many columns does the schema contain? Parquet can be super memory
consuming when writing wide tables.
Cheng
On 6/15/15 5:48 AM, Bipin Nag wrote:
HI Davies,
I have tried recent 1.4 and 1.5-snapshot to 1) open the parquet and
save i
On Fri, May 22, 2015 at 6:15 AM, Hugo Ferreira wrote:
> Hi,
>
> I am currently experimenting with linear regression (SGD) (Spark + MLlib,
> ver. 1.2). At this point in time I need to fine-tune the hyper-parameters. I
> do this (for now) by an exhaustive grid search of the step size and the
> numbe
Please following the code examples from the user guide:
http://spark.apache.org/docs/latest/programming-guide.html#passing-functions-to-spark.
-Xiangrui
On Tue, May 26, 2015 at 12:34 AM, Yasemin Kaya wrote:
> Hi,
>
> In CF
>
> String path = "data/mllib/als/test.data";
> JavaRDD data = sc.textFile
That sounds like a bug. Could you create a JIRA and ping Yin Huai
(cc'ed). -Xiangrui
On Wed, May 27, 2015 at 12:57 AM, Justin Yip wrote:
> Hello,
>
> I am trying out 1.4.0 and notice there are some differences in behavior with
> Timestamp between 1.3.1 and 1.4.0.
>
> In 1.3.1, I can compare a Tim
We don't have R-like model summary in MLlib, but we plan to add some
in 1.5. Please watch https://issues.apache.org/jira/browse/SPARK-7674.
-Xiangrui
On Thu, May 28, 2015 at 3:47 PM, rafac wrote:
> I have a simple problem:
> i got mean number of people on one place by hour(time-series like), and
Try to grant read/execute access through Sentry.
On 18 Jun 2015 05:47, "Nitin kak" wrote:
> I am trying to run a hive query from Spark code using HiveContext object.
> It was running fine earlier but since the Apache Sentry has been set
> installed the process is failing with this exception :
>
>
Hi, here's how to get a parallel grid-search pipeline:
package org.apache.spark.ml.pipeline
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.sql._
class ParallelGridSearchPipeline extends Pipeline {
  override def fit(dataset: Data
Because we don't have random access to the records, sampling still needs
to go through the records sequentially. It does save some computation,
which is perhaps noticeable only if you have data cached in memory.
Different random seeds are used for trees. -Xiangrui
On Wed, Jun 3, 2015 at 4:40 PM, And
Hi,
I'm running Spark 1.4.0 on Mesos. I have been trying to read a file from a
MapR cluster but have not had much success with it. I tried 2 versions of
Apache Spark (with and without Hadoop).
I can get to the spark-shell in the with-Hadoop version, but still
can't access maprfs[2]. The without-Hadoop versio
That is a bug, which will be fixed by
https://github.com/apache/spark/pull/6622. I disabled Model.copy
because models usually don't have a default constructor, and hence
the default Params.copy implementation won't work. Unfortunately, due
to insufficient test coverage, StringIndexModel.copy is
Hi Hafiz,
As Ewan mentioned, the path is the path to the S3 files unloaded from
Redshift. This is a more scalable way to get a large amount of data
from Redshift than via JDBC. I'd recommend using the SQL API instead
of the Hadoop API (https://github.com/databricks/spark-redshift).
Best,
Xiangrui
There is no plan at this time. We haven't reached 100% coverage on
user-facing API in PySpark yet, which would have higher priority.
-Xiangrui
On Sun, Jun 7, 2015 at 1:42 AM, martingoodson wrote:
> Am I right in thinking that Python mllib does not contain the optimization
> module? Are there plan
I don't fully understand your question. Could you explain it in more
details? Thanks! -Xiangrui
On Mon, Jun 8, 2015 at 2:26 AM, Jean-Charles RISCH <
risch.jeanchar...@gmail.com> wrote:
> Hi,
>
> I am playing with Mllib (Spark 1.3.1) and my auto completion propositions
> don't correspond to the of
Yes. You can apply HashingTF on your input stream and then use
StreamingKMeans for training and prediction. -Xiangrui
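A minimal sketch of that combination, assuming an existing StreamingContext ssc; k, the feature dimension, and the socket source are placeholders:
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.feature.HashingTF

val lines = ssc.socketTextStream("localhost", 9999)
val tf = new HashingTF(1000)
val vectors = lines.map(line => tf.transform(line.split(" ").toSeq))

val model = new StreamingKMeans()
  .setK(5)
  .setDecayFactor(1.0)
  .setRandomCenters(1000, 0.0)   // dimension must match the HashingTF feature count

model.trainOn(vectors)           // update cluster centers as batches arrive
model.predictOn(vectors).print() // emit a cluster id per vector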
On Mon, Jun 8, 2015 at 11:05 AM, Ruslan Dautkhanov wrote:
> Hello,
>
> https://spark.apache.org/docs/latest/mllib-feature-extraction.html
> would Feature Extraction and Transforma
In 1.3, we added some model save/load support in Parquet format. You
can use Parquet's C++ library (https://github.com/Parquet/parquet-cpp)
to load the data back. -Xiangrui
On Wed, Jun 10, 2015 at 12:15 AM, Akhil Das wrote:
> Hope Swig and JNA might help for accessing c++ libraries from Java.
>
>
Hi folks,
I’m looking to deploy spark on YARN and I have read through the docs
(https://spark.apache.org/docs/latest/running-on-yarn.html). One question that
I still have is if there is an alternate means of including your own app jars
as opposed to the process in the “Adding Other Jars” sectio
This is implemented in MLlib:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala#L41.
-Xiangrui
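Usage looks roughly like this (a sketch, assuming sc from spark-shell; note that MLPairRDDFunctions is a developer API, so it may change between versions):
import org.apache.spark.mllib.rdd.MLPairRDDFunctions.fromPairRDD

val pairs = sc.parallelize(Seq(("a", 3), ("a", 9), ("a", 5), ("b", 1), ("b", 7)))
val top2 = pairs.topByKey(2)   // RDD[(String, Array[Int])], largest values first
top2.collect().foreach { case (k, vs) => println(k + " -> " + vs.mkString(",")) }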
On Wed, Jun 10, 2015 at 1:53 PM, erisa wrote:
> Hi,
>
> I am a Spark newbie, and trying to solve the same problem, and have
> implement
We don't have it in MLlib. The closest would be the ChiSqSelector,
which works for categorical data. -Xiangrui
On Thu, Jun 11, 2015 at 4:33 PM, Ruslan Dautkhanov wrote:
> What would be closest equivalent in MLLib to Oracle Data Miner's Attribute
> Importance mining function?
>
> http://docs.oracl
Hi,
I'm running Spark on Mesos and trying to read a file from a MapR cluster, but
have not had much success with that. I tried 2 versions of Apache Spark (with
and without Hadoop).
I can get to the spark-shell in the with-Hadoop version, but still can't
access maprfs. The without-Hadoop version bails out with
o
You should add spark-mllib_2.10 as a dependency instead of declaring
it as the artifactId. And always use the same version for spark-core
and spark-mllib. I saw you used 1.3.0 for spark-core but 1.4.0 for
spark-mllib, which is not guaranteed to work. If you set the scope to
"provided", mllib jar wo