Looks like https://issues.apache.org/jira/browse/SPARK-1800 is not merged
into master?
I cannot find spark.sql.hints.broadcastTables in the latest master, but it's in
the following patch.
https://github.com/apache/spark/commit/76ca4341036b95f71763f631049fdae033990ab5
Jianshi
On Mon, Sep 29, 2014
Thank you, this seems to be the way to go, but unfortunately, when I try to
use sc.wholeTextFiles() on a file stored on Amazon S3, I get the
following error:
14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to process : 1
14/10/08 06:09:50 INFO input.FileInputFormat: To
Yes, I meant batch interval. Thanks for clarifying.
Cheers,
Michael
On Oct 7, 2014, at 11:14 PM, jayant [via Apache Spark User List]
wrote:
> Hi Michael,
>
> I think you mean batch interval rather than windowing. It can be
> helpful for cases when you do not want to process very smal
Hi Michael,
I think you mean batch interval rather than windowing. It can be
helpful for cases when you do not want to process very small batch sizes.
HDFS sink in Flume has the concept of rolling files based on time, number
of events or size.
https://flume.apache.org/FlumeUserGuide.html#hd
Liquan, yes, for full outer join, one hash table on both sides is more
efficient.
For the left/right outer join, it looks like one hash table should be enough.
From: Liquan Pei [mailto:liquan...@gmail.com]
Sent: September 30, 2014 18:34
To: Haopu Wang
Cc: d...@s
How did you run your program? I don't see from your earlier post that
you ever asked for more executors.
On Wed, Oct 8, 2014 at 4:29 AM, Tao Xiao wrote:
> I found the reason why reading HBase is too slow. Although each
> regionserver serves multiple regions for the table I'm reading, the number
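For reference, a hedged sketch of explicitly asking for more executors on YARN at submit time (every value below is a placeholder, not taken from this thread):
./bin/spark-submit --master yarn-client \
  --num-executors 20 --executor-cores 4 --executor-memory 4g \
  --class your.Main your-app.jar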
Abraham is correct. spark-shell is for typing Spark code into. You
can't give it a .scala file as an argument. spark-submit is for
running a packaged Spark application. You also can't give it a .scala
file. You need to compile a .jar file with your application. This is
true for Spark Streaming too.
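A hedged sketch of that workflow (the project layout, class name, and master URL are placeholders):
sbt package                  # or mvn package; produces a jar under target/
./bin/spark-submit --class Try1 --master local[2] target/scala-2.10/try1_2.10-0.1.jar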
Hello,
I was interested in testing Parquet V2 with Spark SQL, but noticed after some
investigation that the parquet writer that Spark SQL uses is fixed at V1 here:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala#L350.
An
Hi, we are using spark 1.1.0 streaming and we are hitting this same issue.
Basically from the job output I saw the following things happen in sequence.
948 14/10/07 18:09:59 INFO storage.BlockManagerInfo: Added
input-0-1412705397200 in memory on ip-10-4-62-85.ec2.internal:59230 (size:
5.3 MB, fr
Hi Andrew,
The use case I have in mind is batch data serialization to HDFS, where sizing
files to a certain HDFS block size is desired. In my particular use case, I
want to process 10GB batches of data at a time. I'm not sure this is a sensible
use case for spark streaming, and I was trying to
Hi Meethu,
I believe you may be hitting a regression in
https://issues.apache.org/jira/browse/SPARK-3633
If you are able, could you please try running a patched version of Spark
1.1.0 that has commit 4fde28c reverted and see if the errors go away?
Posting your results on that bug would be useful
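If it helps, one hedged way to produce such a build from a Spark source checkout (the build profiles are only an example; use whatever matches your Hadoop version):
git checkout v1.1.0
git revert 4fde28c          # may require resolving conflicts
mvn -Pyarn -Phadoop-2.4 -DskipTests clean package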
You will need to restart your Mesos workers to pick up the new limits as
well.
On Tue, Oct 7, 2014 at 4:02 PM, Sunny Khatri wrote:
> @SK:
> Make sure ulimit has taken effect as Todd mentioned. You can verify via
> ulimit -a. Also make sure you have proper kernel parameters set in
> /etc/sysctl.c
I found the reason why reading HBase is too slow. Although each
regionserver serves multiple regions for the table I'm reading, the number
of Spark workers allocated by Yarn is too low. Actually, I could see that
the table has dozens of regions spread over about 20 regionservers, but
only two Spar
Here is the hive-site.xml
<property>
  <name>hive.metastore.local</name>
  <value>false</value>
  <description>controls whether to connect to remote metastore server
  or open a new metastore server in Hive Client JVM</description>
</property>
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://*:2513</value>
  <description>Remote location of the metastore server</description>
</property>
<property>
  <name>hive.metastore.wareh
Hi,
On Wed, Oct 8, 2014 at 4:50 AM, Josh J wrote:
>
> I have a source which fluctuates in the frequency of streaming tuples. I
> would like to process certain batch counts, rather than batch window
> durations. Is it possible to either
>
> 1) define batch window sizes
>
Cf.
http://apache-spark-u
Jianneng,
On Wed, Oct 8, 2014 at 8:44 AM, Jianneng Li wrote:
>
> I understand that Spark Streaming uses micro-batches to implement
> streaming, while traditional streaming systems use the record-at-a-time
> processing model. The performance benefit of the former is throughput, and
> the latter is
Can we please set the Reply-To header to the mailing list address...
-- Forwarded message --
From: Tobias Pfeiffer
Date: Wed, Oct 8, 2014 at 11:16 AM
Subject: Re: SparkStreaming program does not start
To: spr
Hi,
On Wed, Oct 8, 2014 at 7:47 AM, spr wrote:
> I'm probably doin
Hi
I am new to Spark and trying to develop an application that loads data from
Hive. Here is my setup:
* Spark-1.1.0 (built using -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0
-Phive)
* Executing Spark-shell on a box with 16 GB RAM
* 4 Cores Single Processor
* OpenCSV library (SerDe)
* Hive table has
So it seems that the classpath issue has been resolved. The
instantiation failure should be related to your hive-site.xml. Would you
mind creating a public gist of your hive-site.xml?
On 10/8/14 4:34 AM, Li HM wrote:
Thanks Cheng.
Here is the error message after a fresh build.
$ mvn -Pyarn
Hi,
I am using the fold(zeroValue)(t1, t2) on the RDD & I noticed that it runs
in parallel on all the partitions & then aggregates the results from the
partitions. My data object is not aggregate-able & I was wondering if
there's any way to run the fold sequentially. [I am looking to do a foldLeft
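One hedged workaround, assuming partial results can be pulled to the driver: rdd.toLocalIterator fetches one partition at a time, so a plain Scala foldLeft over it runs sequentially in partition order. A minimal sketch (zeroValue and combine are placeholders for your own types):
val result = rdd.toLocalIterator.foldLeft(zeroValue)((acc, x) => combine(acc, x))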
I have not played around with spark-shell much (especially for spark
streaming), but was just suggesting that spark-submit logs could possibly
tell you what's going on, and yes, for that you would need to create a jar.
I am not even sure that you can give a .scala file to spark-shell
Usage: ./bin/sp
Hello,
I understand that Spark Streaming uses micro-batches to implement
streaming, while traditional streaming systems use the record-at-a-time
processing model. The performance benefit of the former is throughput, and
the latter is latency. I'm wondering what it would take to implement
record-at
Hi
I think I found a bug in the iPython notebook integration. I am not sure how
to report it
I am running spark-1.1.0-bin-hadoop2.4 on an AWS ec2 cluster. I start the
cluster using the launch script provided by spark
I start iPython notebook on my cluster master as follows and use an ssh
tunnel
Never mind... my bad... made a typo.
looks good.
Thanks,
On Tue, Oct 7, 2014 at 3:57 PM, Abraham Jacob wrote:
> Thanks Sean,
>
> Sorry in my earlier question I meant to type CDH5.1.3 not CDH5.1.2
>
> I presume it's included in spark-streaming_2.10-1.0.0-cdh5.1.3
>
> But for some reason eclipse
|| Try using spark-submit instead of spark-shell
Two questions:
- What does spark-submit do differently from spark-shell that makes you
think that may be the cause of my difficulty?
- When I try spark-submit it complains about "Error: Cannot load main class
from JAR: file:/Users/spr/.../try1.scal
@SK:
Make sure ulimit has taken effect as Todd mentioned. You can verify via
ulimit -a. Also make sure you have proper kernel parameters set in
/etc/sysctl.conf (MacOSX)
On Tue, Oct 7, 2014 at 3:57 PM, Lisonbee, Todd
wrote:
>
> Are you sure the new ulimit has taken effect?
>
> How many cores are
Try using spark-submit instead of spark-shell
On Tue, Oct 7, 2014 at 3:47 PM, spr wrote:
> I'm probably doing something obviously wrong, but I'm not seeing it.
>
> I have the program below (in a file try1.scala), which is similar but not
> identical to the examples.
>
> import org.apache.spark
Are you sure the new ulimit has taken effect?
How many cores are you using? How many reducers?
"In general if a node in your cluster has C assigned cores and you run
a job with X reducers then Spark will open C*X files in parallel and
start writing. Shuffle consolidat
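For reference, a hedged sketch of turning on shuffle file consolidation in Spark 1.x, in addition to raising the OS open-files limit:
import org.apache.spark.{SparkConf, SparkContext}

// enable shuffle file consolidation so fewer shuffle files are opened per core
val conf = new SparkConf().set("spark.shuffle.consolidateFiles", "true")
val sc = new SparkContext(conf)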
Thanks Sean,
Sorry in my earlier question I meant to type CDH5.1.3 not CDH5.1.2
I presume it's included in spark-streaming_2.10-1.0.0-cdh5.1.3
But for some reason eclipse complains that import
org.apache.spark.streaming.kafka cannot be resolved, even though I have
included the spark-streaming_2.
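For what it's worth, the Kafka receiver ships as a separate artifact from spark-streaming. A hedged sbt sketch (the Cloudera repository URL and the CDH version string are assumptions on my part):
resolvers += "cloudera-repos" at "https://repository.cloudera.com/artifactory/cloudera-repos/"
libraryDependencies += "org.apache.spark" % "spark-streaming-kafka_2.10" % "1.0.0-cdh5.1.3"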
I'm probably doing something obviously wrong, but I'm not seeing it.
I have the program below (in a file try1.scala), which is similar but not
identical to the examples.
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
println("P
Is it possible to store Spark shuffle files on Tachyon?
Yes, it is the entire Spark distribution.
On Oct 7, 2014 11:36 PM, "Abraham Jacob" wrote:
> Hi All,
>
> Does anyone know if CDH5.1.2 packages the spark streaming kafka connector
> under the spark externals project?
>
>
>
> --
> ~
>
Hi All,
Does anyone know if CDH5.1.2 packages the spark streaming kafka connector
under the spark externals project?
--
~
Did you test different regularization parameters and step sizes? In
the combination that works, I don't see "A + D". Did you test that
combination? Is there any linear dependency between A's columns and
D's columns? -Xiangrui
On Tue, Oct 7, 2014 at 1:56 PM, Sameer Tilak wrote:
> BTW, one detail:
Hi Steve, what Spark version are you running?
2014-10-07 14:45 GMT-07:00 Steve Lewis :
> java.lang.NullPointerException
> at java.nio.ByteBuffer.wrap(ByteBuffer.java:392)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
> at org.apache.spark.scheduler.Task.run(Task.scala:54
java.lang.NullPointerException
at java.nio.ByteBuffer.wrap(ByteBuffer.java:392)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
at
java.util.concurren
- We set ulimit to 50. But I still get the same "too many open files"
warning.
- I tried setting consolidateFiles to True, but that did not help either.
I am using a Mesos cluster. Does Mesos have any limit on the number of
open files?
thanks
BTW, one detail:
When the number of iterations is 100, all weights are zero or below and the indices
are only from set A.
When the number of iterations is 150, I see 30+ non-zero weights (when sorted by
weight) and indices are distributed across all sets. However, MSE is high (5.xxx)
and the result does no
Hi All,
I have the following classes of features:
class A: 15000 features
class B: 170 features
class C: 900 features
class D: 6000 features
I use linear regression (over sparse data). I get excellent results with low
RMSE (~0.06) for the following combinations of classes:
1. A + B + C
2. B + C + D
3. A
Thanks Cheng.
Here is the error message after a fresh build.
$ mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0 -Phive -DskipTests
clean package
[INFO]
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM ...
Hi,
I have a source which fluctuates in the frequency of streaming tuples. I
would like to process certain batch counts, rather than batch window
durations. Is it possible to either
1) define batch window sizes
or
2) dynamically adjust the duration of the sliding window?
Thanks,
Josh
You can create a new Configuration object in something like a
mapPartitions method and use that. It will pick up local Hadoop
configuration from the node, but presumably the Spark workers and HDFS
data nodes are colocated in this case, so the machines have the
correct Hadoop config locally.
On Tue
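A minimal sketch of that approach (rdd, the per-key path, and the read logic are hypothetical):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// The Configuration is built inside the task, so it picks up the node-local
// Hadoop configuration rather than one shipped from the driver.
val results = rdd.mapPartitions { iter =>
  val conf = new Configuration()
  val fs = FileSystem.get(conf)
  iter.map { key =>
    val in = fs.open(new Path("/data/" + key))   // hypothetical per-key path
    try {
      // ... read whatever this key needs ...
      key
    } finally {
      in.close()
    }
  }
}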
It looks like you are directly computing the SVM decision function in
both cases:
val predictions2 = m_users_double.map { point =>
  point.zip(weights).map(a => a._1 * a._2).sum + intercept
}.cache()
clf.decision_function(T)
This does not give you +1/-1 in SVMs (well... not for most points,
which wi
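If the goal is class labels rather than raw margins, a hedged follow-up is to threshold the raw scores at 0 (MLlib's SVMModel.predict does this internally and returns 0/1 labels):
// sketch: predictions2 holds raw decision values (w·x + intercept)
val predictedLabels = predictions2.map(score => if (score > 0) 1.0 else 0.0)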
The issue is that you're using SQLContext instead of HiveContext. SQLContext
implements a smaller subset of the SQL language and so you're getting a SQL
parse error because it doesn't support the syntax you have. Look at how you'd
write this in HiveQL, and then try doing that with HiveContext.
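A minimal sketch of the switch (the table name and query are placeholders):
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)   // sc is your existing SparkContext
// HiveContext's sql() parses HiveQL by default, which covers more syntax than SQLContext
val results = hiveContext.sql("SELECT key, count(*) FROM my_table GROUP BY key")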
Not familiar with the scikit-learn SVM implementation (and I assume you are
using LinearSVC). To figure out an optimal decision boundary based on the
scores obtained, you can use an ROC curve, varying your thresholds.
On Tue, Oct 7, 2014 at 12:08 AM, Adamantios Corais <
adamantios.cor...@gmail.com> wrote:
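On the Spark side, MLlib already has a helper for that; a hedged sketch (scoreAndLabels is a placeholder RDD[(Double, Double)] of (raw score, true label) pairs):
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val roc = metrics.roc()                 // (false positive rate, true positive rate) per threshold
println("AUC = " + metrics.areaUnderROC())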
Try using s3n:// instead of s3 (for the credential configuration as well).
On Tue, Oct 7, 2014 at 9:51 AM, Sunny Khatri wrote:
> Not sure if it's supposed to work. Can you try newAPIHadoopFile() passing
> in the required configuration object.
>
> On Tue, Oct 7, 2014 at 4:20 AM, Tomer Benyamini
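A hedged sketch of the s3n variant, shown in Scala (bucket, path, and credentials are placeholders; the Java API is analogous):
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")
val lines = sc.textFile("s3n://your-bucket/some/path")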
I am porting a Hadoop job to Spark - One issue is that the workers need to
read files from hdfs reading a different file based on the key or in some
cases reading an object that is expensive to serialize.
This is easy if the worker has access to the JavaSparkContext (I am
working in Java) but thi
Thanks for the input Soumitra.
On Tue, Oct 7, 2014 at 10:24 AM, Soumitra Kumar
wrote:
> Currently I am not doing anything, if anything change start from scratch.
>
> In general I doubt there are many options to account for schema changes.
> If you are reading files using impala, then it may allo
Currently I am not doing anything; if anything changes, I start from scratch.
In general I doubt there are many options to account for schema changes. If you
are reading files using Impala, it may still work if the schema changes are
append-only. Otherwise existing Parquet files have to be migrated
Thanks for the info, Soumitra; it's a good start for me.
Just wanted to know how you are managing schema changes/evolution, as
parquetSchema is provided to setSchema in the above sample code.
On Tue, Oct 7, 2014 at 10:09 AM, Soumitra Kumar
wrote:
> I have used it to write Parquet files as:
>
> va
Reading the Spark Streaming Programming Guide I found a couple of
interesting points. First of all, while talking about receivers, it says:
*"If the number of cores allocated to the application is less than or equal
to the number of input DStreams / receivers, then the system will receive
data, bu
I have used it to write Parquet files as:
val job = new Job
val conf = job.getConfiguration
conf.set(ParquetOutputFormat.COMPRESSION, CompressionCodecName.SNAPPY.name())
ExampleOutputFormat.setSchema(job, MessageTypeParser.parseMessageType(parquetSchema))
rdd.saveAsNewAPIHadoopFile(rddToFile
Not sure if it's supposed to work. Can you try newAPIHadoopFile() passing
in the required configuration object.
On Tue, Oct 7, 2014 at 4:20 AM, Tomer Benyamini wrote:
> Hello,
>
> I'm trying to read from s3 using a simple spark java app:
>
> -
>
> SparkConf sparkConf = new Sp
After a bit of looking around, I found saveAsNewAPIHadoopFile could be used
to specify the ParquetOutputFormat. Has anyone used it to convert JSON to
Parquet format? Any pointers are welcome, thanks!
To find the root cause, I installed Hive 0.12 separately and tried the exact same
test through the Hive CLI and it *passed*. So it looks like it is a problem with
Spark SQL.
Has anybody else faced this issue (Hive-parquet table schema change)??
Should I create JIRA ticket for this?
Maybe sc.wholeTextFiles() is what you want; you can get the whole text
and parse it yourself.
On Tue, Oct 7, 2014 at 1:06 AM, wrote:
> Hi,
>
> I have already unsuccessfully asked a quite similar question on Stack Overflow,
> specifically here:
> http://stackoverflow.com/questions/26202978/spark-an
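A minimal sketch of that suggestion (the path and the parse function are placeholders):
// wholeTextFiles yields (filePath, fileContent) pairs, one record per file
val files = sc.wholeTextFiles("hdfs:///path/to/dump/")
val parsed = files.map { case (path, content) => parseArticle(content) }   // parseArticle is hypothetical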
Hi All
I have one master and one worker on AWS (Amazon Web Services) and am running
the Spark map-reduce code provided at the link
https://spark.apache.org/examples.html
We are using Spark version 1.0.2
Word Count
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split("
Hi, in fact, the same problem happens when I try several joins together:
SELECT *
FROM sales INNER JOIN magasin ON sales.STO_KEY = magasin.STO_KEY
INNER JOIN eans ON (sales.BARC_KEY = eans.BARC_KEY and magasin.FORM_KEY =
eans.FORM_KEY)
py4j.protocol.Py4JJavaError: An error occurred while callin
Hi, the same problem happens when I try several joins together, such as
'SELECT * FROM sales INNER JOIN magasin ON sales.STO_KEY = magasin.STO_KEY
INNER JOIN eans ON (sales.BARC_KEY = eans.BARC_KEY and magasin.FORM_KEY =
eans.FORM_KEY)'
The error information is as follow:
py4j.protocol.Py4JJavaEr
The build command should be correct. What exact error did you encounter
when trying Spark 1.1 + Hive 0.12 + Hadoop 2.5.0?
On 10/7/14 2:21 PM, Li HM wrote:
Thanks for the reply.
Please refer to my another post entitled "How to make ./bin/spark-sql
work with hive". It has all the error/excepti
Hi David,
Thanks for the reply and the effort you put into explaining the concepts.
Thanks for the example. It worked.
Hi again,
Thank you for your suggestion :)
I've tried to implement this method but I'm stuck trying to union the
payload before creating the graph.
Below is a really simplified snippet of what has worked so far.
//Reading the articles given in json format
val articles = sqlContext.jsonFile(pa
Hi all,
My code was working fine in Spark 1.0.2, but after upgrading to 1.1.0, it's
throwing exceptions and tasks are getting failed.
The code contains some map and filter transformations followed by groupByKey
(reduceByKey in another code ). What I could find out is that the code works
fine un
Hello,
I'm trying to read from s3 using a simple spark java app:
-
SparkConf sparkConf = new SparkConf().setAppName("TestApp");
sparkConf.setMaster("local");
JavaSparkContext sc = new JavaSparkContext(sparkConf);
sc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "XX");
I have found related PRs in the parquet-mr project:
https://github.com/Parquet/parquet-mr/issues/324, however using that version
of the bundle doesn't solve the issue. The issue seems to be related to
package scoping across separate class loaders. I am busy looking for a
workaround.
Try to set --total-executor-cores to limit how many total cores it can use.
Thanks & Regards,
Meethu M
On Thursday, 2 October 2014 2:39 AM, Akshat Aranya wrote:
I guess one way to do so would be to run >1 worker per node, like say, instead
of running 1 worker and giving it 8 cores, you c
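For reference, a hedged sketch of the equivalent configuration property (the value 8 is just a placeholder):
import org.apache.spark.SparkConf

// spark.cores.max is the property behind --total-executor-cores
val conf = new SparkConf().set("spark.cores.max", "8")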
Hi,
I have already unsuccessfully asked a quite similar question on Stack Overflow,
specifically here:
http://stackoverflow.com/questions/26202978/spark-and-python-trying-to-parse-wikipedia-using-gensim.
I've also tried some workarounds, but unsuccessfully; the workaround
problem can be f
Hi Landon
I had a problem very similar to yours, where we had to process around 5
million relatively small files on NFS. After trying various options, we did
something similar to what Matei suggested.
1) take the original path and find the subdirectories under that path and
then parallelize the
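A rough sketch of that first step (the base path and the per-file parse are hypothetical):
import java.io.File

// list the subdirectories, parallelize them, and let each task read its own files
val subDirs = new File("/mnt/nfs/input").listFiles.filter(_.isDirectory).map(_.getPath)
val records = sc.parallelize(subDirs, subDirs.length)
  .flatMap(dir => new File(dir).listFiles.iterator.map(f => parse(f)))   // parse is hypothetical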
The file itself is for now just the Wikipedia dump, which can be downloaded from
here:
http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2.
It's basically one big .xml that I need to parse in a way to have title + text
on one line of the data. For this I currently use
gens
Hi All,
Continuing on this discussion... Is there a good reason why the def of
"saveAsNewAPIHadoopFiles" in
org/apache/spark/streaming/api/java/JavaPairDStream.scala
is defined like this -
def saveAsNewAPIHadoopFiles(
    prefix: String,
    suffix: String,
    keyClass: Class[_],
    val
Hi
Shark supported both the HiveServer1 and HiveServer2 thrift interfaces
(using $ bin/shark -service sharkserver[1 or 2]).
SparkSQL seems to support only HiveServer2. I was wondering what is involved
to add support for HiveServer1. Is this something straightforward to do that
I can embark on mys
Well, apparently, the above Python set-up is wrong. Please consider the
following set-up, which DOES use a 'linear' kernel... And the question remains
the same: how to interpret Spark results (or why Spark results are NOT
bounded between -1 and 1)?
On Mon, Oct 6, 2014 at 8:35 PM, Sunny Khatri wrote: