Hi,
I have a situation where the elements output by each partition from
mapPartitions don't fit into RAM, even with the lowest number of rows in
the partition (there is a hard lower limit on this value). What's the best
way to address this problem? During the mapPartitions phase, is ther
I recently encountered similar network-related errors and was able to fix
them by applying the ethtool updates described here [
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-5085]
On Friday, April 29, 2016, Buntu Dev wrote:
> Just to provide more details, I have 200 blocks (parq
Hi,
I am running a PySpark app with thousands of cores (the partition count is a
small multiple of the number of cores) and the overall application
performance is fine. However, I noticed that, at the end of the job, PySpark
initiates clean-up procedures and, as part of this procedure, PySpark
executes a job shown
Hi,
We receive bursts of data with sequential ids and I would like to find the
range for each burst-window. What's the best way to find the "window"
ranges in Spark?
Input
---
1
2
3
4
6
7
8
100
101
102
500
700
701
702
703
704
Output (window start, window end)
---
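A minimal PySpark sketch of one way to do this (not from the thread; ids_rdd is a hypothetical RDD holding the integer IDs shown above): sort, zip with an index, and group by (id - position), since IDs in one burst share that difference.
    indexed = ids_rdd.sortBy(lambda x: x).zipWithIndex()                  # (id, position)
    keyed = indexed.map(lambda p: (p[0] - p[1], p[0]))                    # burst members share (id - position)
    windows = keyed.groupByKey().map(lambda kv: (min(kv[1]), max(kv[1])))
    print(windows.collect())  # for the input above: [(1, 4), (6, 8), (100, 102), (500, 500), (700, 704)]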
Hi,
I have a scenario where I need to maintain state that is local to a worker
and can change during the map operation. What's the best way to handle this?
incr = 0
def row_index():
    global incr
    incr += 1
    return incr
out_rdd = inp_rdd.map(lambda x: row_index()).collect()
"out_rdd" i
at 6:25 PM, Ted Yu wrote:
> Have you looked at this method ?
>
>* Zips this RDD with its element indices. The ordering is first based
> on the partition index
> ...
> def zipWithIndex(): RDD[(T, Long)] = withScope {
>
> On Wed, Jan 27, 2016 at 6:03 PM, Krishna wrote:
>
&
oring state in external
> NoSQL store such as hbase ?
>
> On Wed, Jan 27, 2016 at 6:37 PM, Krishna wrote:
>
>> Thanks; What I'm looking for is a way to see changes to the state of some
>> variable during map(..) phase.
>> I simplified the scenario in my example by m
gt;
> http://spark.apache.org/docs/latest/programming-guide.html#accumulators-a-nameaccumlinka
> ,
> it also tells you how to define your own for custom data types.
>
> On Wed, Jan 27, 2016 at 7:22 PM, Krishna > wrote:
> > mapPartitions(...) seems like a good candidate, since
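A minimal PySpark sketch of the two suggestions above, assuming inp_rdd is the RDD from the original question (the global-counter version only increments a per-worker copy of the variable):
    # zipWithIndex pairs each element with a stable index across partitions
    out = inp_rdd.zipWithIndex().map(lambda pair: pair[1]).collect()
    # an accumulator can count events, but its value is only reliably read on the driver
    counter = sc.accumulator(0)
    def tag(x):
        counter.add(1)
        return x
    inp_rdd.map(tag).count()   # run an action so the map executes
    print(counter.value)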
Hi,
What is the most efficient way to perform a group-by operation in Spark and
merge the rows into CSV?
Here is the current RDD
-----
ID  STATE
-----
1   TX
1   NY
1   FL
2   CA
2   OH
-----
This is the required output:
--
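A minimal PySpark sketch of one way to do this (not from the thread; the sample RDD below mirrors the data above):
    rdd = sc.parallelize([(1, "TX"), (1, "NY"), (1, "FL"), (2, "CA"), (2, "OH")])
    # reduceByKey concatenates per key with map-side combining, avoiding a full groupByKey
    merged = rdd.reduceByKey(lambda a, b: a + "," + b)
    print(merged.collect())    # [(1, 'TX,NY,FL'), (2, 'CA,OH')]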
Hi,
Are there any examples of using JdbcRDD in Java available?
It's not clear what the last argument is in this example (
https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/rdd/JdbcRDDSuite.scala
):
sc = new SparkContext("local", "test")
val rdd = new JdbcRDD(
sc,
()
Thanks Kidong. I'll try your approach.
On Tue, Nov 18, 2014 at 4:22 PM, mykidong wrote:
> I had also same problem to use JdbcRDD in java.
> For me, I have written a class in scala to get JdbcRDD, and I call this
> instance from java.
>
> for instance, JdbcRDDWrapper.scala like this:
>
> ...
>
>
Hi,
I am working on a Spark job. The count() alone takes 10 minutes. The question
is: how can I make it faster?
From the above image, what I understand is that 4,001 tasks are running in
parallel, out of 76,553 total tasks.
Here are the parameters that I am using for
You can predict and then zip the result with the points RDD to get roughly
the same structure as a LabeledPoint.
Cheers
On Thu, May 21, 2015 at 6:19 PM, anneywarlord
wrote:
> Hello,
>
> New to Spark. I wanted to know if it is possible to use a Labeled Point RDD
> in org.apache.spark.mllib.clustering.KMeans. After I cluster my d
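A minimal PySpark sketch of the predict-then-zip idea (names are illustrative; points is assumed to be an RDD of LabeledPoint):
    from pyspark.mllib.clustering import KMeans
    features = points.map(lambda lp: lp.features)
    model = KMeans.train(features, k=3, maxIterations=20)
    # zip each point with its assigned cluster id
    clustered = model.predict(features).zip(points)    # (clusterId, LabeledPoint)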
Hi, I am trying to write data produced by the Kafka command-line producer for
some topic. I am facing a problem and unable to proceed. Below is my code,
which I package as a jar and run through spark-submit. Am I doing something
wrong inside foreachRDD()? What is wrong with
SparkKafkaD
Interesting. Looking at the definitions, sql.functions.pow is defined only
for (col,col). Just as an experiment, create a column with value 2 and see
if that works.
Cheers
On Mon, Jun 29, 2015 at 1:34 PM, Bob Corsaro wrote:
> 1.4 and I did set the second parameter. The DSL works fine but trying
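A minimal PySpark sketch of the suggested experiment (column names are assumptions): pass a literal column as the exponent so that both arguments to pow are columns.
    from pyspark.sql import functions as F
    squared = df.withColumn("x_squared", F.pow(F.col("x"), F.lit(2)))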
- use .cast("...").alias('...') after the DataFrame is read.
- sql.functions.udf for any domain-specific conversions.
Cheers
On Wed, Jul 1, 2015 at 11:03 AM, Hafiz Mujadid
wrote:
> Hi experts!
>
>
> I am using spark-csv to load CSV data into a DataFrame. By default it makes
> type of each
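A minimal PySpark sketch of the cast/alias and udf suggestions (column names and types are assumptions):
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import IntegerType
    # cast the string columns read by spark-csv to the types you need
    typed = df.select(col("id").cast("int").alias("id"),
                      col("price").cast("double").alias("price"))
    # udf for a domain-specific conversion, e.g. "yes"/"no" -> 1/0
    yes_no = udf(lambda s: 1 if s == "yes" else 0, IntegerType())
    flagged = df.withColumn("active_flag", yes_no(col("active")))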
Error - ImportError: No module named Row
Cheers & enjoy the long weekend
Looks like reduceByKey() should work here.
Cheers
On Sat, Jul 11, 2015 at 11:02 AM, leonida.gianfagna <
leonida.gianfa...@gmail.com> wrote:
> Thanks a lot oubrik,
>
> I got your point, my consideration is that sum() should be already a
> built-in function for iterators in python.
> Anyway I trie
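A minimal PySpark sketch of the reduceByKey suggestion (words_rdd is a hypothetical RDD of words):
    from operator import add
    counts = words_rdd.map(lambda w: (w, 1)).reduceByKey(add)   # operator.add replaces a hand-written sum
    print(counts.take(10))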
Good question. It comes down to computational complexity, computational scale,
and data volume.
1. If you can store the data on a single server or a small cluster of DB
servers (say, MySQL), then HDFS/Spark might be overkill.
2. If you can run the computation/process the data on a single machine
Hi All,
I am new to this field. I want to implement a new ML algorithm using Spark
MLlib. What is the procedure?
--
Regards,
Ram Krishna KT
Hi All,
How do I add a new ML algorithm to Spark MLlib?
On Fri, Jun 10, 2016 at 12:50 PM, Ram Krishna
wrote:
> Hi All,
>
> I am new to this this field, I want to implement new ML algo using Spark
> MLlib. What is the procedure.
>
> --
> Regards,
> Ram Krishna KT
>
>
>
maller first project building new functionality in
> Spark as a good starting point rather than adding a new algorithm right
> away, since you learn a lot in the process of making your first
> contribution.
>
> On Friday, June 10, 2016, Ram Krishna wrote:
>
>> Hi All,
>>
Hi Robert,
According to the JIRA, the resolution is Won't Fix. The pull request was
closed as it did not merge cleanly with master.
(https://github.com/apache/spark/pull/3222)
On Tue, Jun 14, 2016 at 4:23 PM, Roberto Pagliari wrote:
> Is RBM being developed?
>
> This one is marked as resolved,
andler
not found - continuing with a stub.
Warning:scalac: Class org.jboss.netty.channel.group.ChannelGroup not found
- continuing with a stub.
Warning:scalac: Class com.google.common.collect.ImmutableMap not found -
continuing with a stub.
/Users/krishna/Experiment/spark/external/flume-sink/src/
Hi,
I'm using Spark 1.4.1 (HDP 2.3.2).
As per the spark-csv documentation (https://github.com/databricks/spark-csv),
I see that we can write to a csv file in compressed form using the 'codec'
option.
But, didn't see the support for 'codec' option to read a csv file.
Is there a way to read a compr
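A minimal PySpark sketch of reading a gzip-compressed CSV with spark-csv (path and options are illustrative); as the reply below notes, the codec is detected from the file extension:
    df = (sqlContext.read
          .format("com.databricks.spark.csv")
          .option("header", "true")
          .option("inferSchema", "true")
          .load("hdfs:///data/sample.csv.gz"))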
Thanks. It works.
On Thu, Jun 16, 2016 at 5:32 PM Hyukjin Kwon wrote:
> It will 'auto-detect' the compression codec by the file extension and then
> will decompress and read it correctly.
>
> Thanks!
>
> 2016-06-16 20:27 GMT+09:00 Vamsi Krishna :
>
>> Hi,
&
;, "SPARK_SUBMIT" -> "true",
"spark.driver.cores" -> "5", "spark.ui.enabled" -> "false",
"spark.driver.supervise" -> "true", "spark.app.name" -> "org.SomeClass",
"spark.jars" -> "file:
Hi,
I'm on HDP 2.3.2 cluster (Spark 1.4.1).
I have a Spark Streaming app which uses 'textFileStream' to stream and process
simple CSV files.
I see that old data files that have been processed are left in the data directory.
What is the right way to purge the old data files from the data directory on HDFS?
Tha
Hi all,
Just wanted to thank you all for the Dataset API - most of the time we see
only bugs on these lists ;o).
- Putting some context, this weekend I was updating the SQL chapters of
my book - it had all the ugliness of SchemaRDD,
registerTempTable, take(10).foreach(println)
and take
Hi Sir,
Please unsubscribe me
--
Regards,
Ram Krishna KT
)
using java
Can anyone suggest..?
Note: I need to use something other than \n because my data contains \n as
part of the column value.
Thanks & Regards
Radha krishna
o hdfs with the same line separator
(RS[\u001e])
Thanks & Regards
Radha krishna
Hi All,
Please check below for the code, input, and output. I think the output is
not correct; am I missing anything? Please guide.
Code
public class Test {
private static JavaSparkContext jsc = null;
private static SQLContext sqlContext = null;
private static Configuration hadoopConf = null;
pu
gt; sqlContext.createDataFrame(empRDD, Emp.class);
>>>> empDF.registerTempTable("EMP");
>>>>
>>>>sqlContext.sql("SELECT * FROM EMP e LEFT OUTER JOIN
>>>> DEPT d ON e.deptid
>>>> = d.deptid").show();
>>>>
>>>>
>>>>
>>>> //empDF.join(deptDF,empDF.col("deptid").equalTo(deptDF.col("deptid")),"leftouter").show();;
>>>>
>>>> }
>>>> catch(Exception e){
>>>> System.out.println(e);
>>>> }
>>>> }
>>>> public static Emp getInstance(String[] parts, Emp emp) throws
>>>> ParseException {
>>>> emp.setId(parts[0]);
>>>> emp.setName(parts[1]);
>>>> emp.setDeptid(parts[2]);
>>>>
>>>> return emp;
>>>> }
>>>> public static Dept getInstanceDept(String[] parts, Dept dept)
>>>> throws
>>>> ParseException {
>>>> dept.setDeptid(parts[0]);
>>>> dept.setDeptname(parts[1]);
>>>> return dept;
>>>> }
>>>> }
>>>>
>>>> Input
>>>> Emp
>>>> 1001 aba 10
>>>> 1002 abs 20
>>>> 1003 abd 10
>>>> 1004 abf 30
>>>> 1005 abg 10
>>>> 1006 abh 20
>>>> 1007 abj 10
>>>> 1008 abk 30
>>>> 1009 abl 20
>>>> 1010 abq 10
>>>>
>>>> Dept
>>>> 10 dev
>>>> 20 Test
>>>> 30 IT
>>>>
>>>> Output
>>>> +------+----+----+------+--------+
>>>> |deptid|  id|name|deptid|deptname|
>>>> +------+----+----+------+--------+
>>>> |    10|1001| aba|    10|     dev|
>>>> |    10|1003| abd|    10|     dev|
>>>> |    10|1005| abg|    10|     dev|
>>>> |    10|1007| abj|    10|     dev|
>>>> |    10|1010| abq|    10|     dev|
>>>> |    20|1002| abs|  null|    null|
>>>> |    20|1006| abh|  null|    null|
>>>> |    20|1009| abl|  null|    null|
>>>> |    30|1004| abf|  null|    null|
>>>> |    30|1008| abk|  null|    null|
>>>> +------+----+----+------+--------+
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Left-outer-Join-issue-using-programmatic-sql-joins-tp27295.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>> -
>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>
>>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
>
--
Thanks & Regards
Radha krishna
Hi Mich,
Here I have given just sample data.
I have some GBs of files in HDFS and am performing left outer joins on those
files; the final result I am going to store in a Vertica database table.
There are no duplicate columns in the target table, but for the non-matching
rows' columns I want to insert "
AS|
| 16|    |
| 13|  UK|
| 14|  US|
| 20|  As|
| 15|  IN|
| 19|  IR|
| 11|  PK|
+---+----+
I am expecting the output below. Any idea how to apply IS NOT NULL?
+---+----+
|_c0|code|
+---+----+
| 18|  AS|
| 13|  UK|
| 14|  US|
| 20|  As|
| 15|  IN|
| 19|  IR|
| 11|  PK|
+---+----+
Thanks & Regards
Radha krishna
OK, thank you. How do I achieve the requirement?
On Sun, Jul 10, 2016 at 8:44 PM, Sean Owen wrote:
> It doesn't look like you have a NULL field, You have a string-value
> field with an empty string.
>
> On Sun, Jul 10, 2016 at 3:19 PM, Radha krishna wrote:
> > Hi All,IS NOT
I want to apply a null comparison to a column in sqlContext.sql. Is there any
way to achieve this?
On Jul 10, 2016 8:55 PM, "Radha krishna" wrote:
> Ok thank you, how to achieve the requirement.
>
> On Sun, Jul 10, 2016 at 8:44 PM, Sean Owen wrote:
>
>> It doesn'
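A minimal PySpark SQL sketch of filtering out both NULLs and empty strings, per Sean's point that the field holds an empty string rather than NULL (the table name "codes" is an assumption):
    # assumes the DataFrame was registered as a temp table named 'codes'
    result = sqlContext.sql(
        "SELECT _c0, code FROM codes "
        "WHERE code IS NOT NULL AND trim(code) != ''")
    result.show()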
Thanks Nick. I also ran into this issue.
VG, one workaround is to drop the NaNs from the predictions (df.na.drop()) and
then use that dataset for the evaluator. In real life, you would probably
detect the NaNs and recommend the most popular items over some window.
HTH.
Cheers
On Sun, Jul 24, 2016 at 12:49 PM, Nick Pentreath
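A minimal PySpark sketch of the workaround (the DataFrame and column names are assumptions):
    from pyspark.ml.evaluation import RegressionEvaluator
    predictions = model.transform(test_df)
    evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                    predictionCol="prediction")
    rmse = evaluator.evaluate(predictions.na.drop())   # drop the rows where ALS produced NaN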
This intrigued me as well.
- Just to be sure, I downloaded the 1.6.2 code and recompiled.
- spark-shell and pyspark both show 1.6.2 as expected.
Cheers
On Mon, Jul 25, 2016 at 1:45 AM, Daniel Darabos <
daniel.dara...@lynxanalytics.com> wrote:
> Another possible explanation is that by accide
cified by query) value.
I've experimented with using (abusing) Spark Streaming, by streaming
queries and running these against the cached RDD. However, as I say I don't
think that this is an intended use-case of Streaming.
Cheers,
Krishna
more on the use case? It looks a little bit
> like an abuse of Spark in general . Interactive queries that are not
> suitable for in-memory batch processing might be better supported by ignite
> that has in-memory indexes, concept of hot, warm, cold data etc. or hive on
> tez+ll
Hi all,
Is there a method for reading from S3 without having to hard-code keys? The
only two ways I've found both require this:
1. Set conf in code e.g.:
sc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "")
sc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey",
"")
2. Set keys in URL, e.g.:
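One commonly used alternative (an assumption about the setup, not from the thread) is to keep the keys out of code entirely: run on EC2 instances with an IAM role, or pass the keys as Hadoop configuration at submit time. A minimal sketch:
    # Option A: with an IAM instance role and a recent enough s3a connector
    # (hadoop-aws), credentials are resolved automatically, so nothing is set in code:
    rdd = sc.textFile("s3a://my-bucket/path/to/data")
    # Option B: keep the keys in configuration passed at submit time, not in code:
    #   spark-submit --conf spark.hadoop.fs.s3a.access.key=... \
    #                --conf spark.hadoop.fs.s3a.secret.key=... app.py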
re is
no dstream level "unpersist"
setting "spark.streaming.unpersist" to true and
streamingContext.remember("duration") did not help. Still seeing out of
memory errors
Krishna
en see out of
memory error
regards
Krishna
On Thu, Feb 18, 2016 at 4:54 AM, Ted Yu wrote:
> bq. streamingContext.remember("duration") did not help
>
> Can you give a bit more detail on the above ?
> Did you mean the job encountered OOME later on ?
>
> Which Spark re
least until convergence)
But I am seeing the same centers for the entire duration - I ran the app for
several hours with a custom receiver.
Yes, I am using the latestModel to predict using "labeled" test data. But I
would also like to know where my centers are
regards
Krishna
On Fri, Feb 19, 201
, 6706.05424139]
and monitor. please let know if I missed something
Krishna
On Fri, Feb 19, 2016 at 10:59 AM, Bryan Cutler wrote:
> Can you share more of your code to reproduce this issue? The model should
> be updated with each batch, but can't tell what is happening from what y
Also, the cluster centroids I get in streaming mode (some with negative
values) do not make sense - if I use the same data and run batch
KMeans.train(sc.parallelize(parsedData), numClusters, numIterations),
the cluster centers are what you would expect.
Krishna
On Fri, Feb 19, 2016 at 12:49 PM
Thanks Guys. Yep, now I would install 1.5.1 over HDP 2.3, if that works.
Cheers
On Mon, Sep 28, 2015 at 9:47 AM, Ted Yu wrote:
> Krishna:
> If you want to query ORC files, see the following JIRA:
>
> [SPARK-10623] [SQL] Fixes ORC predicate push-down
>
> which is in the 1.5.1
A few points to consider:
a) SparkR gives the union of R-on-a-single-machine and the distributed
computing of Spark.
b) It also gives the ability to wrangle, in R, data that is in the Spark
ecosystem.
c) Coming to MLlib, the question is MLlib and R (not MLlib or R) -
depending on the scale, dat
Guys,
- We have HDP 2.3 installed just now. It comes with Spark 1.3.x. The
current wisdom is that it will support the 1.4.x train (which is good, need
DataFrame et al).
- What is the plan to support Spark 1.5.x ? Can we install 1.5.0 on HDP
2.3 ? Or will Spark 1.5.x support be in HD
Hi,
Looks like the test-dataset has different sizes for X & Y. Possible steps:
1. What is the test-data-size ?
- If it is 15,909, check the prediction variable vector - it is now
29,471, should be 15,909
- If you expect it to be 29,471, then the X Matrix is not right.
does the data split and the
datasets where they are allocated to.
Cheers
On Sun, Aug 21, 2016 at 4:37 PM, Krishna Sankar wrote:
> Hi,
> Looks like the test-dataset has different sizes for X & Y. Possible
> steps:
>
>1. What is the test-data-size ?
> - If i
Hello,
I am a master's student. Could someone please let me know how to set up my
dev environment to contribute to PySpark?
Questions I had were:
a) Should I use IntelliJ IDEA or PyCharm?
b) How do I test my changes?
Regards,
Krishna
Unsubscribe
://gist.github.com/krishnakalyan3/08f00f49a943e43600cbc6b21f307228
Could someone please advise on how to go about resolving this error?
Regards,
Krishna
is a relatively new addition to spark, I was
wondering if structured streaming is stable in production. We were also
evaluating Apache Beam with Flink.
Regards,
Krishna
.
>>>
>>>
>>>
>>> However, I’m unaware of any specific use of streaming with the Spark on
>>> Kubernetes integration right now. Would be curious to get feedback on the
>>> failover behavior right now.
>>>
>>>
>>>
>
Stephen,
Scala 2.11 worked fine for me. I made the dev change and then compiled. Not
using it in production, but I go back and forth between 2.10 & 2.11.
Cheers
On Wed, Jan 28, 2015 at 12:18 PM, Stephen Haberman <
stephen.haber...@gmail.com> wrote:
> Hey,
>
> I recently compiled Spark master against
- Divide and conquer with reduceByKey (like Ashish mentioned, each pair
being the key) would work - it looks like a "mapReduce with combiners"
problem. reduceByKey, aggregateByKey, and combineByKey all do map-side
combining; reduceByKey is the simplest when the merge function is the same
on both sides.
- Could we optimize this further by using combineByKey directly
1. The RMSE varies a little bit between the versions.
2. Partitioned the training,validation,test set like so:
- training = ratings_rdd_01.filter(lambda x: (x[3] % 10) < 6)
- validation = ratings_rdd_01.filter(lambda x: (x[3] % 10) >= 6 and
(x[3] % 10) < 8)
- test = ratin
what is a good number for a
recommendation engine ?
Cheers
On Tue, Feb 24, 2015 at 1:03 AM, Guillaume Charhon <
guilla...@databerries.com> wrote:
> I am using Spark 1.2.1.
>
> Thank you Krishna, I am getting almost the same results as you so it must
> be an error in the tut
Without knowing the data size, computation & storage requirements ... :
- Dual 6 or 8 core machines, 256 GB memory each, 12-15 TB per machine.
Probably 5-10 machines.
- Don't go for the most exotic machines; on the other hand, don't go for the
cheapest ones either.
- Find a sweet spot with your ve
Yep the command-option is gone. No big deal, just add the '%pylab
inline' command
as part of your notebook.
Cheers
On Fri, Mar 20, 2015 at 3:45 PM, cong yue wrote:
> Hello :
>
> I tried ipython notebook with the following command in my enviroment.
>
> PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVE
Hi,
Thanks for your response. I modified my code as per your suggestion, but
now I am getting a runtime error. Here's my code:
val df_1 = df.filter( df("event") === 0)
. select("country", "cnt")
val df_2 = df.filter( df("event") === 3)
. select("country", "cnt
. select("country", "cnt").as("b")
> val both = df_2.join(df_1, $"a.country" === $"b.country", "left_outer")
>
>
>
> On Tue, Mar 24, 2015 at 11:57 PM, S Krishna wrote:
>
>> Hi,
>>
>> Thanks for your
Thanks Olivier. Good work.
Interesting in more than one way - including training, benchmarking,
testing new releases et al.
One quick question - do you plan to make it available as an S3 bucket ?
Cheers
On Wed, Apr 15, 2015 at 5:58 PM, Olivier Chapelle
wrote:
> Dear Spark users,
>
> I would l
Do you have commons-csv-1.1-bin.jar in your path somewhere ? I had to
download and add this.
Cheers
On Wed, Apr 22, 2015 at 11:01 AM, Mohammed Omer
wrote:
> Afternoon all,
>
> I'm working with Scala 2.11.6, and Spark 1.3.1 built from source via:
>
> `mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipT
One reason could be that the keys are in a different region. You need to create
the keys in us-east-1 (North Virginia).
Cheers
On Wed, Jun 4, 2014 at 7:45 AM, Sam Taylor Steyer
wrote:
> Hi,
>
> I am trying to launch an EC2 cluster from spark using the following
> command:
>
> ./spark-ec2 -k HackerPa
curity group rules must specify protocols
> explicitly.7ff92687-b95a-4a39-94cb-e2d00a6928fd
>
> This sounds like it could have to do with the access settings of the
> security group, but I don't know how to change. Any advice would be much
> appreciated!
>
> Sam
>
> ---
Shahab,
Interesting question. A couple of points (based on the information from
your e-mail):
1. One can support the use case in Spark as a set of transformations on
a work-in-progress RDD over a span of time, with the final transformation
outputting to a processed RDD.
- Spark Streaming would be a
Project->Properties->Java Build Path->Add External Jars
Add the /spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar
Cheers
On Sun, Jun 8, 2014 at 8:06 AM, Carter wrote:
> Hi All,
>
> I just downloaded the Scala IDE for Eclipse. After I created a Spark
> project
> and clicked "Run
Yep, it gives tons of errors. I was able to make it work with sudo. Looks
like ownership issue.
Cheers
On Tue, Jun 10, 2014 at 6:29 PM, zhen wrote:
> I created a Spark 1.0 cluster on EC2 using the provided scripts. However, I
> do not seem to be able to start the history server on the master n
Hi,
Would appreciate insights and wisdom on a problem we are working on:
1. Context:
- Given a csv file like:
- d1,c1,a1
- d1,c1,a2
- d1,c2,a1
- d1,c1,a1
- d2,c1,a3
- d2,c2,a1
- d3,c1,a1
- d3,c3,a1
- d3,c2,a1
- d3,c3,a2
Answered one of my questions (#5) : val pairs = new PairRDDFunctions()
works fine locally. Now I can do groupByKey et al. Am not sure if it is
scalable for millions of records & memory efficient.
Cheers
On Fri, Jun 13, 2014 at 8:52 PM, Krishna Sankar wrote:
> Hi,
>Would appreciat
And got the first cut:
val res = pairs.groupByKey().map((x) => (x._1, x._2.size, x._2.toSet.size))
gives the total & unique.
The question : is it scalable & efficient ? Would appreciate insights.
Cheers
On Fri, Jun 13, 2014 at 10:51 PM, Krishna Sankar
wrote:
> Answ
Ian,
Yep, HLL is an appropriate mechanism. The countApproxDistinctByKey is a
wrapper around the
com.clearspring.analytics.stream.cardinality.HyperLogLogPlus.
Cheers
On Sun, Jun 15, 2014 at 4:50 PM, Ian O'Connell wrote:
> Depending on your requirements when doing hourly metrics calculating
>
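A minimal PySpark sketch of exact per-key total and distinct counts that avoids groupByKey (an alternative to the Scala snippet earlier in the thread; pairs is assumed to be the same (key, value) RDD):
    from operator import add
    totals = pairs.mapValues(lambda v: 1).reduceByKey(add)              # total count per key
    uniques = pairs.distinct().mapValues(lambda v: 1).reduceByKey(add)  # distinct count per key
    result = totals.join(uniques)                                       # (key, (total, unique))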
Mahesh,
- One direction could be : create a parquet schema, convert & save the
records to hdfs.
- This might help
https://github.com/massie/spark-parquet-example/blob/master/src/main/scala/com/zenfractal/SparkParquetExample.scala
Cheers
On Tue, Jun 17, 2014 at 12:52 PM, maheshtwc
Santhosh,
All the nodes should have access to the jar file and so the classpath
should be on all the nodes. Usually it is better to rsync the conf
directory to all nodes rather than editing them separately.
Cheers
On Tue, Jun 17, 2014 at 9:26 PM, santhoma wrote:
> Hi,
>
> This is about spar
Hi,
- I have seen similar behavior before. As far as I can tell, the root
cause is the out of memory error - verified this by monitoring the memory.
- I had a 30 GB file and was running on a single machine with 16GB.
So I knew it would fail.
- But instead of raising an exce
- After loading large RDDs that are > 60-70% of the total memory, (k,v)
operations like finding uniques/distinct, GroupByKey and SetOperations
would be network bound.
- A multi-stage Map-Reduce DAG should be a good test. When we tried this
for Hadoop, we used examples from Genomics.
Konstantin,
1. You need to install the Hadoop RPMs on all nodes. If it is Hadoop 2,
the nodes would have HDFS & YARN.
2. Then you need to install Spark on all nodes. I haven't had experience
with HDP, but the tech preview might have installed Spark as well.
3. In the end, one should
Couldn't find any reference to CDH in pom.xml - neither profiles nor the
hadoop.version. Am also wondering how the CDH-compatible artifact was
compiled.
Cheers
On Mon, Jul 7, 2014 at 8:07 PM, Srikrishna S
wrote:
> Hi All,
>
> Does anyone know what the command line arguments to mvn are to generate
> the
Abel,
I rsync the spark-1.0.1 directory to all the nodes. Then whenever the
configuration changes, rsync the conf directory.
Cheers
On Wed, Jul 9, 2014 at 2:06 PM, Abel Coronado Iruegas <
acoronadoirue...@gmail.com> wrote:
> Hi everybody
>
> We have hortonworks cluster with many nodes, we wa
Nick,
AFAIK, you can compile with yarn=true and still run Spark in standalone
cluster mode.
Cheers
On Wed, Jul 9, 2014 at 9:27 AM, Nick R. Katsipoulakis
wrote:
> Hello,
>
> I am currently learning Apache Spark and I want to see how it integrates
> with an existing Hadoop Cluster.
>
> My cu
I rsync the spark-1.0.1 directory to all the nodes. Yep, one needs Spark in
all the nodes irrespective of Hadoop/YARN.
Cheers
On Tue, Jul 8, 2014 at 6:24 PM, Robert James wrote:
> I have a Spark app which runs well on local master. I'm now ready to
> put it on a cluster. What needs to be ins
One vector to check is the HBase libraries in the --jars as in :
spark-submit --class --master --jars
hbase-client-0.98.3-hadoop2.jar,commons-csv-1.0-SNAPSHOT.jar,hbase-common-0.98.3-hadoop2.jar,hbase-hadoop2-compat-0.98.3-hadoop2.jar,hbase-it-0.98.3-hadoop2.jar,hbase-protocol-0.98.3-hadoop2.jar,
not if you have other configurations that might have conflicted with
>>> those.
>>>
>>> Could you try the following, remove anything that is spark specific
>>> leaving only hbase related codes. uber jar it and run it just like any
>>> other simple java
Probably you have - if not, try a very simple app in the docker container
and make sure it works. Sometimes resource contention/allocation can get in
the way. This happened to me in the YARN container.
Also try single worker thread.
Cheers
On Sat, Jul 19, 2014 at 2:39 PM, boci wrote:
> Hi guys
- IMHO, #2 is preferred as it could work in any environment (Mesos,
Standalone et al). While Spark needs HDFS (as any decent distributed
system would), YARN is not required at all - Mesos is a lot better.
- Also, managing the app with an appropriate bootstrap/deployment framework
is more flexi
I specified as follows:
spark.eventLog.dir /mapr/spark_io
We use mapr fs for sharing files. I did not provide an ip address or port
number - just the directory name on the shared filesystem.
On Aug 29, 2014 8:28 AM, "Brad Miller" wrote:
> How did you specify the HDFS path? When i put
>
> spark
Setting spark.mesos.coarse=true helped reduce the time on the mesos cluster
from 17 min to around 6 min. The scheduler delay per task reduced from 40
ms to around 10 ms.
thanks
On Mon, Sep 22, 2014 at 12:36 PM, Xiangrui Meng wrote:
> 1) MLlib 1.1 should be faster than 1.0 in general. What's th
Guys,
- Need help in terms of the interesting features coming up in MLlib 1.2.
- I have a 2 Part, ~3 hr hands-on tutorial at the Big Data Tech Con
- "The Hitchhiker's Guide to Machine Learning with Python & Apache
Spark"[2]
- At minimum, it would be good to take the last 30 mi
Thanks Xiangrui. Appreciate the insights.
I have uploaded the initial version of my presentation at
http://goo.gl/1nBD8N
Cheers
On Mon, Sep 29, 2014 at 12:17 AM, Xiangrui Meng wrote:
> Hi Krishna,
>
> Some planned features for MLlib 1.2 can be found via Spark JIRA:
> http://bi
Guys,
Obviously I am doing something wrong. Maybe 4 points are too small a
dataset.
Can you help me figure out why the following doesn't work?
a) This works :
data = [
LabeledPoint(0.0, [0.0]),
LabeledPoint(10.0, [10.0]),
LabeledPoint(20.0, [20.0]),
LabeledPoint(30.0, [30.0])
]
be 0.1 or 0.01?
>
> Best,
> Burak
>
> - Original Message -
> From: "Krishna Sankar"
> To: user@spark.apache.org
> Sent: Wednesday, October 1, 2014 12:43:20 PM
> Subject: MLlib Linear Regression Mismatch
>
> Guys,
>Obviously I am doing some
Hi,
I am sure you can use the -Pspark-ganglia-lgpl switch to enable Ganglia.
This step only adds support for Hadoop, YARN, Hive et al in the Spark
executable. No need to run it if one is not using them.
Cheers
On Thu, Oct 2, 2014 at 12:29 PM, danilopds wrote:
> Hi tsingfu,
>
> I want to see me
Well done guys. MapReduce sort at that time was a good feat and Spark now
has raised the bar with the ability to sort a PB.
Like some of the folks in the list, a summary of what worked (and didn't)
as well as the monitoring practices would be good.
Cheers
P.S: What are you folks planning next ?
O