It really shouldn’t; if anything, running as superuser should ALLOW you to bind
to ports 0, 1, etc.
It seems very strange that it should even be trying to bind to these ports -
maybe a JVM issue?
I wonder if the old Apple JVM implementations could have used some different
native libraries for cor
Hi All,
I have one master & 2 workers on my local machine. I wrote the following
Java program to count the number of lines in the README.md file (I am using a
Maven project to do this):
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.Spa
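A minimal Scala sketch of the line-count job described above (the original program is Java); the standalone master URL and the README.md path are assumptions, not taken from the post:

import org.apache.spark.{SparkConf, SparkContext}

object LineCount {
  def main(args: Array[String]): Unit = {
    // Assumed local standalone master; adjust to your own master URL.
    val conf = new SparkConf().setAppName("LineCount").setMaster("spark://localhost:7077")
    val sc = new SparkContext(conf)
    // Assumed path: README.md in the working directory.
    val lines = sc.textFile("README.md")
    println(s"Line count: ${lines.count()}")
    sc.stop()
  }
}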
Is there a graph partitioning implementation (e.g. spectral) that can be used in
Spark?
I guess it should at least be a Java/Scala lib,
maybe even tuned to work with GraphX.
It should just work with these steps. You don't need to configure much. As
mentioned, some settings on your machine are overriding default spark
settings.
Even running as super-user should not be a problem. It works just fine as
super-user as well.
Can you tell us what version of Java you are usi
Hi,
I have a spark-streaming application that basically keeps track of a
string->string dictionary.
So I have messages coming in with updates, like:
"A"->"B"
And I need to update the dictionary.
This seems like a simple use case for the updateStateByKey method.
However, my issue is that when t
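A minimal sketch of the updateStateByKey approach being described, assuming the input is a DStream[(String, String)] of key -> new-value updates and the state is simply the latest value per key; a checkpoint directory must also be set on the StreamingContext (ssc.checkpoint(...)):

import org.apache.spark.streaming.dstream.DStream

def trackDictionary(updates: DStream[(String, String)]): DStream[(String, String)] =
  updates.updateStateByKey[String] { (newValues: Seq[String], current: Option[String]) =>
    // Keep the most recent update for this key, or the existing value if none arrived.
    newValues.lastOption.orElse(current)
  }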
I doubt that's the problem. That's how HDFS lists a directory.
Output of a few more commands below.
$ hadoop fs -ls /tmp/
Found 7 items
drwxrwxrwx - hdfs supergroup 0 2016-03-10 11:09
/tmp/.cloudera_health_monitoring_canary_files
-rw-r--r-- 3 ndsuser1 supergroup 447873024 2016-
+1 Paul. Both have some pros and cons.
Hope this helps.
Avro:
Pros:
1) Plays nicely with other tools, third-party or otherwise, and covers the cases
where you specifically need a data type Avro has, like binary; gladly that list
is shrinking all the time (yay, nested types in Impala).
2) Good for event data that cha
Hi,
I am doing a few basic operations like map -> reduceByKey -> filter, which is
very similar to word count, and I'm saving the result where the count >
threshold.
Currently the batch window is every 10s, but I would like to save the results
to Redshift at a lower frequency instead of every
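One way to write less often than the 10 s batch interval, roughly in line with the question above: re-window the counts onto a longer slide before saving. The 60 s window/slide and the placeholder sink are assumptions, not from the post:

import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

def saveEveryMinute(counts: DStream[(String, Long)], threshold: Long): Unit =
  counts
    .reduceByKeyAndWindow((a: Long, b: Long) => a + b, Seconds(60), Seconds(60)) // aggregate six 10 s batches, emit once a minute
    .filter { case (_, count) => count > threshold }
    .foreachRDD { rdd =>
      // Placeholder: replace with the actual Redshift write (e.g. JDBC or the spark-redshift package).
      println(s"Would write ${rdd.count()} rows to Redshift")
    }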
Hi
I'm also having this issue and cannot get the tasks to work inside Mesos.
In my case, the spark-submit command is the following.
$SPARK_HOME/bin/spark-submit \
--class com.mycompany.SparkStarter \
--master mesos://mesos-dispatcher:7077 \
--name SparkStarterJob \
--driver-memory 1G \
--exe
Still, I think this information is not enough to explain the reason.
1. Does your YARN cluster have enough resources to start all 10 executors?
2. Would you please try the latest version, 1.6.0 or the master branch, to see
if this is a bug that has already been fixed?
3. You could add
"log4j.logger.org.apache.spark.Ex
Sorry, the last configuration is also --conf
spark.dynamicAllocation.cachedExecutorIdleTimeout=60s; the "--conf" was lost
when I copied it into the mail.
-- Forwarded message --
From: Jy Chen
Date: 2016-03-10 10:09 GMT+08:00
Subject: Re: Dynamic allocation doesn't work on YARN
To: Saisai Sha
Hi,
My Spark version is 1.5.1 with Hadoop 2.5.0-cdh5.2.0. These are my
configurations of dynamic allocation:
--master yarn-client --conf spark.dynamicAllocation.enabled=true --conf
spark.shuffle.service.enabled=true --conf
spark.dynamicAllocation.minExecutors=0 --conf
spark.dynamicAllocation.initia
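For reference, the same settings expressed through SparkConf instead of --conf flags; a minimal sketch that mirrors the flags quoted in this thread (the external shuffle service still has to be running on each NodeManager):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("yarn-client")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "0")
  .set("spark.dynamicAllocation.cachedExecutorIdleTimeout", "60s")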
bq. Question is how to get maven repository
As you may have noted, the version has SNAPSHOT in it.
Please check out the latest code from the master branch and build it yourself.
The 2.0 release is still a few months away - though a backport of the
hbase-spark module should come in the 1.3 release.
On Wed, Mar 9, 2016 at 9
bq. epoch2numUDF = udf(foo, FloatType())
Is it possible that the return value from foo is not a FloatType?
On Wed, Mar 9, 2016 at 3:09 PM, Andy Davidson wrote:
> I need to convert time stamps into a format I can use with matplotlib
> plot_date(). epoch2num() works fine if I use it in my driver how e
I need to convert time stamps into a format I can use with matplotlib
plot_date(). epoch2num() works fine if I use it in my driver, however I get
a numpy constructor error if I use it in a UDF.
Any idea what the problem is?
Thanks
Andy
P.S. I am using Python 3 and Spark 1.6.
from pyspark.sql.functio
bq. drwxr-xr-x - tomcat7 supergroup 0 2016-03-09 23:16 /tmp/swg
If I read the above line correctly, the size of the file was 0.
On Wed, Mar 9, 2016 at 10:00 AM, srimugunthan dhandapani <
srimugunthan.dhandap...@gmail.com> wrote:
> Hi all
> I am working in cloudera CDH5.6 and version
We ended up implementing custom Hadoop InputFormats and RecordReaders by
extending FileInputFormat / RecordReader, and using sc.newAPIHadoopFile to
read it as an RDD.
On Wed, Mar 9, 2016 at 9:15 AM Ruslan Dautkhanov
wrote:
> We have a huge binary file in a custom serialization format (e.g. heade
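A rough sketch of how the pieces described above fit together; CustomBinaryInputFormat is a hypothetical class you would write by extending FileInputFormat[LongWritable, BytesWritable] with a matching RecordReader that parses the header and its items:

import org.apache.hadoop.io.{BytesWritable, LongWritable}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def readCustomBinary(sc: SparkContext, path: String): RDD[Array[Byte]] =
  sc.newAPIHadoopFile[LongWritable, BytesWritable, CustomBinaryInputFormat](path)
    .map { case (_, bytes) => bytes.copyBytes() } // copy, because Hadoop reuses Writable instances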
bq. there is a varying number of items for that record
If the combination of items is very large, using a case class would be
tedious.
On Wed, Mar 9, 2016 at 9:57 AM, Saurabh Bajaj
wrote:
> You can load that binary up as a String RDD, then map over that RDD and
> convert each row to your case cla
That’s very strange. I just un-set my SPARK_HOME env param, downloaded a fresh
1.6.0 tarball,
unzipped it to a local dir (~/Downloads), and it ran just fine - the driver port
is some randomly generated large number.
So SPARK_HOME is definitely not needed to run this.
Aida, you are not running thi
Hi Jakob,
Tried running the command env | grep SPARK; nothing comes back.
Tried env | grep Spark, which matches the directory I created for Spark once I
downloaded the tgz file; it comes back with PWD=/Users/aidatefera/Spark.
Tried running ./bin/spark-shell; it comes back with the same error as below, i.e.
could
Sorry, I had a typo in my previous message:
> try running just "/bin/spark-shell"
please remove the leading slash (/)
On Wed, Mar 9, 2016 at 1:39 PM, Aida Tefera wrote:
> Hi there, tried echo $SPARK_HOME but nothing comes back so I guess I need to
> set it. How would I do that?
>
> Thanks
>
> Sent
As Tristan mentioned, it looks as though Spark is trying to bind on
port 0 and then 1 (which is not allowed). Could it be that some
environment variables from your previous installation attempts are
polluting your configuration?
What does running "env | grep SPARK" show you?
Also, try running just
Hi Tristan, my apologies, I meant to write Spark and not SCALA
I feel a bit lost at the moment...
Perhaps I have missed steps that are implicit to more experienced people.
Apart from downloading Spark and then following Jakob's steps:
1. curl http://apache.arvixe.com/spark/spark-1.6.0/spark-1.6.
SPARK_HOME and SCALA_HOME are different. I was just wondering whether Spark is
looking for the config files in a different dir from where you’re running it.
If you have not set SPARK_HOME, it should look in the current directory for the
/conf dir.
The defaults should be relatively safe, I’ve be
I don't think I set the SCALA_HOME environment variable
Also, I'm unsure whether or not launching the scripts defaults to a
single machine (localhost)
Sent from my iPhone
> On 9 Mar 2016, at 19:59, Tristan Nixon wrote:
>
> Also, do you have the SPARK_HOME environment variable set in you
Also, do you have the SPARK_HOME environment variable set in your shell, and if
so what is it set to?
> On Mar 9, 2016, at 1:53 PM, Tristan Nixon wrote:
>
> There should be a /conf sub-directory wherever you installed spark, which
> contains several configuration files.
> I believe that the tw
There should be a /conf sub-directory wherever you installed spark, which
contains several configuration files.
I believe that the two that you should look at are
spark-defaults.conf
spark-env.sh
> On Mar 9, 2016, at 1:45 PM, Aida Tefera wrote:
>
> Hi Tristan, thanks for your message
>
> When
hive 1.6.0 in embedded mode doesn't connect to the metastore --
https://issues.apache.org/jira/browse/SPARK-9686
https://forums.databricks.com/questions/6512/spark-160-not-able-to-connect-to-hive-metastore.html
On Wed, Mar 9, 2016 at 10:48 AM, Suniti Singh
wrote:
> Hi,
>
> I am able to reproduce this
Oh yeah, I already added it after your earlier message; have a look.
On Wed, Mar 9, 2016 at 7:45 PM, Mohammed Guller wrote:
> My book on Spark was recently published. I would like to request it to be
> added to the Books section on Spark's website.
>
>
>
> Here are the details about the book.
>
>
Hi Tristan, thanks for your message
When I look at the spark-defaults.conf.template it shows a spark
example (spark://master:7077) where the port is 7077.
When you say look to the conf scripts, what do you mean?
Sent from my iPhone
> On 9 Mar 2016, at 19:32, Tristan Nixon wrote:
>
> Yeah, accor
My book on Spark was recently published. I would like to request it to be added
to the Books section on Spark's website.
Here are the details about the book.
Title: Big Data Analytics with Spark
Author: Mohammed Guller
Link:
www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/
Yeah, to be clear, I'm talking about having only one constructor for a
direct stream, that will give you a stream of ConsumerRecord.
Different needs for topic subscription, starting offsets, etc could be
handled by calling appropriate methods after construction but before
starting the stream.
On
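For context, a sketch of the Spark 1.6-era constructor without the messageHandler argument that the thread refers to; the broker address and topic name are placeholders:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

def buildStream(ssc: StreamingContext) = {
  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // placeholder broker
  val topics = Set("events")                                      // placeholder topic
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topics)
}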
Yeah, according to the standalone documentation
http://spark.apache.org/docs/latest/spark-standalone.html
the default port should be 7077, which means that something must be overriding
this on your installation - look to the conf scripts!
> On Mar 9, 2016, at 1:26 PM, Tristan Nixon wrote:
>
>
Looks like it’s trying to bind on port 0, then 1.
Often the low-numbered ports are restricted to system processes and
“established” servers (web, ssh, etc.) and
so user programs are prevented from binding on them. The default should be to
run on a high-numbered port like 8080 or such.
What do yo
I'd probably prefer to keep it the way it is, unless it's becoming more
like the function without the messageHandler argument.
Right now I have code like this, but I wish it were more similar looking:
if (parsed.partitions.isEmpty()) {
JavaPairInputDStream kvstream = KafkaUtils
Hi Ted and Saurabh,
If I use --conf arguments with pyspark I am able to connect. Any idea how I
can set these values programmatically? (I work on a notebook server and cannot
easily reconfigure the server
This works
extraPkgs="--packages com.databricks:spark-csv_2.11:1.3.0 \
--package
Hi Jakob,
Thanks for your suggestion. I downloaded a pre-built version with Hadoop and
followed your steps.
I posted the result on the forum thread; not sure if you can see it?
I was just wondering whether this means it has been successfully installed as
there are a number of warning/error mess
You can also package an alternative log4j config in your jar files
> On Mar 9, 2016, at 12:20 PM, Ashic Mahtab wrote:
>
> Found it.
>
> You can pass in the jvm parameter log4j.configuration. The following works:
>
> -Dlog4j.configuration=file:path/to/log4j.properties
>
> It doesn't work with
Hi,
I am able to reproduce this error only when using spark 1.6.0 and hive
1.6.0. The hive-site.xml is in the classpath but somehow spark rejects the
classpath search for hive-site.xml and starts using the default metastore,
Derby.
16/03/09 10:37:52 INFO MetaStoreDirectSql: Using direct SQL, underl
Hi everyone, thanks for all your support
I went with your suggestion Cody/Jakob and downloaded a pre-built version
with Hadoop this time and I think I am finally making some progress :)
ukdrfs01:spark-1.6.0-bin-hadoop2.6 aidatefera$ ./bin/spark-shell --master
local[2]
log4j:WARN No appenders cou
You can also use --files, which doesn't require the file scheme.
On Wed, Mar 9, 2016 at 11:20 AM Ashic Mahtab wrote:
> Found it.
>
> You can pass in the jvm parameter log4j.configuration. The following works:
>
> -Dlog4j.configuration=file:path/to/log4j.properties
>
> It doesn't work without the
Found it.
You can pass in the jvm parameter log4j.configuration. The following works:
-Dlog4j.configuration=file:path/to/log4j.properties
It doesn't work without the file: prefix though. Tested in 1.6.0.
Cheers, Ashic.
From: as...@live.com
To: user@spark.apache.org
Subject: Specify log4j propertie
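A hedged example of wiring this into spark-submit; the class name, jar and paths are placeholders, and --files (as noted above) ships the file to the executors:

spark-submit \
  --class com.example.MyApp \
  --driver-java-options "-Dlog4j.configuration=file:/path/to/log4j.properties" \
  --files /path/to/log4j.properties \
  my-app.jar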
Could you wrap the ZipInputStream in a List, since a subtype of
TraversableOnce[?] is required?
case (name, content) => List(new ZipInputStream(content.open))
Xinh
On Wed, Mar 9, 2016 at 7:07 AM, Benjamin Kim wrote:
> Hi Sabarish,
>
> I found a similar posting online where I should use the S3
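Putting the pieces of this thread together, a minimal sketch that reads zip archives with sc.binaryFiles and flatMaps over them, assuming each archive contains a single CSV file (the path is a placeholder):

import java.util.zip.ZipInputStream
import scala.io.Source
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def readZippedCsv(sc: SparkContext, path: String): RDD[String] =
  sc.binaryFiles(path).flatMap { case (name, content) =>
    val zis = new ZipInputStream(content.open())
    zis.getNextEntry // position at the single file inside the archive
    Source.fromInputStream(zis).getLines()
  }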
Hi all
I am working in Cloudera CDH 5.6 and the version of Spark is 1.5.0-cdh5.6.0.
I have a strange problem: Spark Streaming works on a directory in the local
filesystem but doesn't work for HDFS.
My spark streaming program:
package com.oreilly.learningsparkexamples.java;
import java.util.concurren
Hello, is it possible to provide a log4j properties file when submitting jobs to
a cluster? I know that by default Spark looks for a log4j.properties file in
the conf directory. I'm looking for a way to specify a different
log4j.properties file (external to the application) without pointing to a
You can load that binary up as a String RDD, then map over that RDD and
convert each row to your case class representing the data. In the map stage
you could also map the input string into an RDD of JSON values and use the
following function to convert it into a DF
http://spark.apache.org/docs/late
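A small sketch of the case-class route described above; Record, the tab delimiter and the field layout are illustrative stand-ins for the custom format, not part of the original advice:

import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, SQLContext}

case class Record(id: Long, items: Seq[String]) // hypothetical shape of one row

def toDataFrame(sc: SparkContext, path: String): DataFrame = {
  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._
  sc.textFile(path)
    .map { line =>
      val fields = line.split('\t') // assumed delimiter
      Record(fields(0).toLong, fields.drop(1))
    }
    .toDF()
}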
You might want to avoid that unionAll(), which seems to be repeated over
1000 times. Could you do a collect() in each iteration, and collect your
results in a local Array instead of a DataFrame? How many rows are returned
in "temp1"?
Xinh
On Tue, Mar 8, 2016 at 10:00 PM, Angel Angel
wrote:
> He
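A sketch of that suggestion, assuming each iteration's result ("temp1") is small enough to collect to the driver; the list of per-iteration DataFrames is a placeholder:

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.{DataFrame, Row}

def gatherResults(perIterationResults: Seq[DataFrame]): Array[Row] = {
  val results = ArrayBuffer.empty[Row]
  for (df <- perIterationResults) {
    results ++= df.collect() // collect each small result locally instead of chaining unionAll
  }
  results.toArray
}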
I tried with Avro input, something like /data/txn_*/*, and it works for me.
On Wed, Mar 9, 2016 at 12:12 PM, Ted Yu wrote:
> Koert:
> I meant org.apache.hadoop.mapred.FileInputFormat doesn't support multi
> level wildcard.
>
> Cheers
>
> On Wed, Mar 9, 2016 at 8:22 AM, Koert Kuipers wrote:
>
>> i
We have a huge binary file in a custom serialization format (e.g. header
tells the length of the record, then there is a varying number of items for
that record). This is produced by an old C++ application.
What would be the best approach to deserialize it into a Hive table or a Spark
RDD?
Format is kn
I am trying to integrate Spark Streaming with HBase. I am calling the following
APIs to connect to HBase:
HConnection hbaseConnection = HConnectionManager.createConnection(conf);
hBaseTable = hbaseConnection.getTable(hbaseTable);
Since I cannot get the connection and broadcast the connection each API
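A common workaround for the problem being described: since the HBase connection is not serializable and cannot be broadcast, open it per partition inside foreachRDD/foreachPartition. The table name, column family and qualifier below are placeholders:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HConnectionManager, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.streaming.dstream.DStream

def writeToHBase(records: DStream[(String, String)], tableName: String): Unit =
  records.foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      val conf = HBaseConfiguration.create() // built on the executor, not shipped from the driver
      val connection = HConnectionManager.createConnection(conf)
      val table = connection.getTable(tableName)
      partition.foreach { case (rowKey, value) =>
        val put = new Put(Bytes.toBytes(rowKey))
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
        table.put(put)
      }
      table.close()
      connection.close()
    }
  }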
Koert:
I meant org.apache.hadoop.mapred.FileInputFormat doesn't support multi
level wildcard.
Cheers
On Wed, Mar 9, 2016 at 8:22 AM, Koert Kuipers wrote:
> i use multi level wildcard with hadoop fs -ls, which is the exact same
> glob function call
>
> On Wed, Mar 9, 2016 at 9:24 AM, Ted Yu wro
Hi,
We're having a similar issue. We have a standalone cluster running 1.5.2
with Hive working fine having dropped hive-site.xml into the conf folder.
We've just updated to 1.6.0, using the same configuration. Now when
starting a spark-shell we get the following:
java.lang.RuntimeException: java.
Hello,
I'd like to know if this architecture is correct or not. We are studying
Spark as our ETL engine; we have a UI designer for the graph, which gives us
a model that we want to translate into the corresponding Spark executions.
That brings us to Akka FSM, using the same SparkContext for all actors; we
My mistake, it's not Akka FSM, it's Akka Flow Graphs.
On Wed, Mar 9, 2016 at 1:46 PM, Andrés Ivaldi wrote:
> Hello,
>
> I'd like to know if this architecture is correct or not. We are studying
> Spark as our ETL engine, we have a UI designer for the graph, this give us
> a model that we want to tra
Hi
Batch interval is 5min. I actually managed to fix the issue by turning off
dynamic allocation and the external shuffle service.
This seems to have helped, and now the scheduling delay is between 0-5 ms and
the processing time is about 2.8 min, which is lower than my batch interval.
I also noticed tha
I agree with Ted's assessment.
The Big Data space is getting crowded with an amazing array of tools and
utilities, some disappearing like meteors. Hadoop is definitely a keeper. So
are Hive and Spark. Hive is the most stable Data Warehouse on Big Data and
Spark is offering an array of impressive tools.
I use multi-level wildcards with hadoop fs -ls, which is the exact same glob
function call.
On Wed, Mar 9, 2016 at 9:24 AM, Ted Yu wrote:
> Hadoop glob pattern doesn't support multi level wildcard.
>
> Thanks
>
> On Mar 9, 2016, at 6:15 AM, Koert Kuipers wrote:
>
> if its based on HadoopFsRelatio
Hi,
Unless I've misunderstood what you want to achieve, you could use:
sqlContext.read.json(sc.textFile("/mnt/views-p/base/2016/01/*/*-xyz.json"))
Regards,
Christophe.
On 09/03/16 15:24, Ted Yu wrote:
Hadoop glob pattern doesn't support multi level wildcard.
Thanks
On Mar 9, 2016, at 6:15 AM,
Hi guys,
Thanks for responding.
Re SPARK_CLASSPATH (Daoyuan): I think you are right. We tried it, and that’s
what the warning we got said.
Re SparkConf (Daoyuan): We need the custom jar in the driver code, so I don’t
know how that would work.
Re EMR -u (Sonal): The documentation says that thi
bq. it is kind of columnar NoSQL database.
The storage format in HBase is not columnar.
I would suggest you build upon what you already know (Spark and Hive) and
expand on that. Also, if your work uses Big Data technologies, those would
be the first to consider getting to know better.
On Wed, Ma
Hi Sabarish,
I found a similar posting online where I should use the S3 listKeys.
http://stackoverflow.com/questions/24029873/how-to-read-multiple-text-files-into-a-single-rdd.
Is this what you were thinking?
And, your assumption is correct. The zipped CSV file contains only a single
file. I f
Hi Gurus,
I am relatively new to Big Data and know some about Spark and Hive.
I was wondering whether I need to pick up skills on HBase as well. I am not sure
how it works, but I know that it is a kind of columnar NoSQL database.
I know it is good to know something new in Big Data space. Just wondering if
Could someone verify this for me?
On 8 March 2016 at 14:06, diplomatic Guru wrote:
> Hello all,
>
> I'm using Random Forest for my machine learning (batch), I would like to
> use online prediction using Streaming job. However, the document only
> states linear algorithm for regression job. Coul
In GraphX, partitioning relates to edges, not to vertices - vertices are
partitioned however the RDD that was used to create the graph was
partitioned.
-
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-acti
Hadoop glob pattern doesn't support multi level wildcard.
Thanks
> On Mar 9, 2016, at 6:15 AM, Koert Kuipers wrote:
>
> if its based on HadoopFsRelation shouldn't it support it? HadoopFsRelation
> handles globs
>
>> On Wed, Mar 9, 2016 at 8:56 AM, Ted Yu wrote:
>> This is currently not supp
If it's based on HadoopFsRelation, shouldn't it support it? HadoopFsRelation
handles globs.
On Wed, Mar 9, 2016 at 8:56 AM, Ted Yu wrote:
> This is currently not supported.
>
> On Mar 9, 2016, at 4:38 AM, Jakub Liska wrote:
>
> Hey,
>
> is something like this possible?
>
> sqlContext.read.json("/m
This is currently not supported.
> On Mar 9, 2016, at 4:38 AM, Jakub Liska wrote:
>
> Hey,
>
> is something like this possible?
>
> sqlContext.read.json("/mnt/views-p/base/2016/01/*/*-xyz.json")
>
> I switched to DataFrames because my source files changed from TSV to JSON
> but now I'm not
I am not Spark committer.
So I cannot be the shepherd :-)
> On Mar 9, 2016, at 2:27 AM, James Hammerton wrote:
>
> Hi Ted,
>
> Finally got round to creating this:
> https://issues.apache.org/jira/browse/SPARK-13773
>
> I hope you don't mind me selecting you as the shepherd for this ticket.
Hi Daniel
The bottleneck in Spark ML is most likely (a) the fact that the weight
vector itself is dense, and (b) the related communication via the driver. A
tree aggregation mechanism is used for computing gradient sums (see
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apac
Hey,
is something like this possible?
sqlContext.read.json("/mnt/views-p/base/2016/01/*/*-xyz.json")
I switched to DataFrames because my source files changed from TSV to JSON
but now I'm not able to load the files as I did before. I get this error if
I try that :
https://github.com/apache/spark
Hi,
What’s your batch interval? If the processing time is constantly bigger
than your batch interval, it is totally normal that your scheduling delay
goes up.
2016-03-08 23:28 GMT+01:00 jleaniz :
> Hi,
>
> I have a streaming application that reads batches from Flume, does some
> transformatio
> On 8 Mar 2016, at 18:06, Aida wrote:
>
> Detected Maven Version: 3.0.3 is not in the allowed range 3.3.3.
I'd look at that error message and fix it
On 8 Mar 2016, at 16:34, Eddie Esquivel <eduardo.esqui...@gmail.com> wrote:
Hello All,
In the Spark documentation under "Hardware Requirements" it very clearly states:
We recommend having 4-8 disks per node, configured without RAID (just as
separate mount points)
My question is why not
Hi Gerhard,
I just stumbled upon some documentation on EMR - link below. It seems there is
a -u option to add jars in S3 to your classpath; have you tried that?
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-spark-configure.html
Best Regards,
Sonal
Founder, Nube Technologie
Hi Ted,
Finally got round to creating this:
https://issues.apache.org/jira/browse/SPARK-13773
I hope you don't mind me selecting you as the shepherd for this ticket.
Regards,
James
On 7 March 2016 at 17:50, James Hammerton wrote:
> Hi Ted,
>
> Thanks for getting back - I realised my mistake
No, the site itself (the part not under docs/) is in the ASF SVN repo.
A PR wouldn't help. Just request here or dev@ with details.
On Wed, Mar 9, 2016 at 7:30 AM, Jan Štěrba wrote:
> You could try creating a pull-request on github.
>
> -Jan
> --
> Jan Sterba
> https://twitter.com/honzasterba | ht
Thanks for the reply.
I am now trying to configure yarn.web-proxy.address according to
https://issues.apache.org/jira/browse/SPARK-5837, but cannot start the
standalone web proxy server.
I am using CDH 5.0.1 and below is the error log:
sbin/yarn-daemon.sh: line 44:
/opt/cloudera/parcels/CDH/lib/
Would you please send out your dynamic allocation configuration so we
can understand the situation better?
On Wed, Mar 9, 2016 at 4:29 PM, Jy Chen wrote:
> Hello everyone:
>
> I'm trying the dynamic allocation in Spark on YARN. I have followed
> configuration steps and started the shuffle service.
>
> Now it c
Hi,
I have a non-secure Hadoop 2.7.2 cluster on EC2 with Spark 1.5.2.
I am submitting my Spark Scala script through a shell script using an Oozie
workflow.
I am submitting the job as the hdfs user, but it is running as user "yarn", so
all the output gets stored under the user/yarn directory only.
When I
Hi Andrew,
Thanks for the reply. I run SparkPi in a cluster with 1 master + 2 slaves based
on YARN. I did not specify the deploy mode, so I think it should be in client
mode. I checked the console log and did not find the EventLoggingListener
keyword. It seems the spark-defaults.conf is not passed correctly.
The
Oozie may be able to do this for you and integrate with Spark.
> On 09 Mar 2016, at 06:03, Benjamin Kim wrote:
>
> I am wondering if anyone can help.
>
> Our company stores zipped CSV files in S3, which has been a big headache from
> the start. I was wondering if anyone has created a way to i
Hello everyone:
I'm trying the dynamic allocation in Spark on YARN. I have followed
configuration steps and started the shuffle service.
Now it can request executors when the workload is heavy, but it cannot
remove executors. I tried opening the spark shell and not running any command;
no executor is
Spark will skip a stage if it has already been computed by another job. That
means the common parent RDD of each job only needs to be computed once. But it
is still multiple sequential jobs, not concurrent jobs.
On Wed, Mar 9, 2016 at 3:29 PM, Jan Štěrba wrote:
> Hi Andy,
>
> its nice to see that we are not
You can use S3's listKeys API and do a diff between consecutive listKeys to
identify what's new.
Are there multiple files in each zip? Single-file archives are processed
just like text, as long as the compression is one of the supported formats.
Regards
Sab
On Wed, Mar 9, 2016 at 10:33 AM, Benjami