Consider the following 2 scenarios:
*Scenario #1*
val pagecounts = sc.textFile("data/pagecounts")
pagecounts.checkpoint
pagecounts.count
*Scenario #2*
val pagecounts = sc.textFile("data/pagecounts")
pagecounts.count
The total time shown in the Spark shell Application UI was different for both
scenarios.
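For reference, a minimal sketch of what Scenario #1 needs in order to checkpoint at all (the directory below is just an example path, and the cache() call is optional): a checkpoint directory has to be set first, and the checkpoint data is written by re-running the lineage after the first action, so some extra time in Scenario #1 over Scenario #2 is expected.
  sc.setCheckpointDir("hdfs:///tmp/checkpoints")  // example path; required before checkpoint()
  val pagecounts = sc.textFile("data/pagecounts")
  pagecounts.cache()       // avoids recomputing the lineage when the checkpoint is written
  pagecounts.checkpoint()
  pagecounts.count()       // first action: runs the count, then writes the checkpoint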
From: Ratika Prasad
Sent: Monday, October 05, 2015 2:39 PM
To: u...@spark.apache.org
Cc: Ameeta Jayarajan
Subject: Spark error while running in spark mode
Hi,
When we run our Spark component in cluster mode as below, we get the following
error
./bin/spark-submit --class
com.coupons.stream.pr
The missing artifacts are uploaded now. Things should propagate in the next
24 hours. If there are still issues past that point, ping this thread. Thanks!
- Patrick
On Mon, Oct 5, 2015 at 2:41 PM, Nicholas Chammas wrote:
> Thanks for looking into this Josh.
>
> On Mon, Oct 5, 2015 at 5:39 PM Josh Rosen wrote:
I meant to say just copy everything to a local hdfs, and then don't use
caching ...
On Mon, Oct 5, 2015 at 4:52 PM, Jegan wrote:
> I am sorry, I didn't understand it completely. Are you suggesting to copy
> the files from S3 to HDFS? Actually, that is what I am doing. I am reading
> the files using Spark and persisting them locally.
Hi Michael,
Thanks for pointing me to the branch. What are the build instructions for
building the hive 1.2.1 release branch for Spark 1.5?
Weide
On Mon, Oct 5, 2015 at 12:06 PM, Michael Armbrust
wrote:
> I think this is the most up to date branch (used in Spark 1.5):
> https://github.com/pwendell/hive/tree/release-1.2.1-spark
I am sorry, I didn't understand it completely. Are you suggesting to copy
the files from S3 to HDFS? Actually, that is what I am doing. I am reading
the files using Spark and persisting them locally.
Or did you actually mean to ask the producer to write the files directly to
HDFS instead of S3? I am
You can write the data to local hdfs (or local disk) and just load it from
there.
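A rough sketch of that suggestion (the bucket and paths below are placeholders): copy the S3 data into HDFS once, then point every subsequent job at the HDFS copy.
  // one-time copy from S3 into the cluster's HDFS (names are made up)
  val fromS3 = sc.textFile("s3n://my-bucket/input/")
  fromS3.saveAsTextFile("hdfs:///data/input-copy")
  // later jobs read the local copy instead of going back to S3
  val data = sc.textFile("hdfs:///data/input-copy")
  data.count()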
On Mon, Oct 5, 2015 at 4:37 PM, Jegan wrote:
> Thanks for your suggestion Ted.
>
> Unfortunately, at this point in time I cannot go beyond 1000 partitions. I
> am writing this data to BigQuery, and it has a limit of
Thanks for your suggestion, Ted.
Unfortunately, at this point in time I cannot go beyond 1000 partitions. I
am writing this data to BigQuery, and it has a limit of 1000 jobs per day
per table (they have some limits on this). I currently create 1 load job
per partition. Is there any other work-around?
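For illustration, a sketch of one possible decoupling (the numbers, names, and paths are made up, and grouping several files into one load job is an assumption about the BigQuery side): use enough Spark partitions that no single output file gets near 2GB, then build each load job from a group of the written files so the job count stays under the daily quota.
  // repartition finely so no single part file approaches the 2GB limit
  val repartitioned = exportRdd.repartition(4000)   // exportRdd and 4000 are placeholders
  repartitioned.saveAsTextFile("hdfs:///staging/bigquery-export")
  // then build each BigQuery load job from several of the written part-* files,
  // so partitions and load jobs are no longer one-to-one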
That sounds fine to me; we already do the filtering, so populating that
field would be pretty simple.
On Sun, Sep 27, 2015 at 2:08 PM Michael Armbrust
wrote:
> We have to try and maintain binary compatibility here, so probably the
> easiest thing to do here would be to add a method to the class.
Could you tell us a way to reproduce this failure? Reading from JSON or Parquet?
On Mon, Oct 5, 2015 at 4:28 AM, Eugene Morozov
wrote:
> Hi,
>
> We're building our own framework on top of Spark, and we give users a pretty
> complex schema to work with. That requires us to build dataframes by
>
As a workaround, can you set the number of partitions higher in the
sc.textFile method?
Cheers
On Mon, Oct 5, 2015 at 3:31 PM, Jegan wrote:
> Hi All,
>
> I am facing the below exception when the size of the file being read in a
> partition is above 2GB. This is apparently because of Java's limitation on
> memory-mapped files: it supports mapping only files of up to 2GB.
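Concretely, that is just the optional second argument to sc.textFile (the path and the 500 below are illustrative); asking for more input partitions keeps each one well under the 2GB limit.
  // request at least 500 input partitions instead of the default split count
  val lines = sc.textFile("hdfs:///data/large-input", 500)
  lines.count()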
Hi All,
I am facing the below exception when the size of the file being read in a
partition is above 2GB. This is apparently because of Java's limitation on
memory-mapped files: it supports mapping only files of up to 2GB.
Caused by: java.lang.IllegalArgumentException: Size exceeds
Integer.MAX_VALUE
at s
What happens when a whole node running your " per node streaming engine
(built-in checkpoint and recovery)" fails? Can its checkpoint and recovery
mechanism handle whole node failure? Can you recover from the checkpoint on
a different node?
Spark and Spark Streaming were designed with the idea th
If RDDs from the same DStream are not guaranteed to run on the same worker,
then the question becomes:
is it possible to specify an unlimited duration in ssc to have a continuous
stream (as opposed to a discretized one)?
Say we have a per-node streaming engine (with built-in checkpoint and recovery)
we'd like to integ
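For reference, the discretized model is visible right in the API; a minimal skeleton (the 1-second interval, the app name, and the paths are placeholders): the StreamingContext constructor always takes a batch duration, so there is no unlimited setting, and recovery on a different node is driven by the checkpoint directory rather than by the batch interval.
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf().setAppName("streaming-example")  // placeholder app name
  val ssc = new StreamingContext(conf, Seconds(1))            // a batch interval is mandatory
  ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")         // enables recovery on another node
  // ... define DStream sources and operations here ...
  ssc.start()
  ssc.awaitTermination()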
Thanks for looking into this Josh.
On Mon, Oct 5, 2015 at 5:39 PM Josh Rosen wrote:
> I'm working on a fix for this right now. I'm planning to re-run a modified
> copy of the release packaging scripts which will emit only the missing
> artifacts (so we won't upload new artifacts with different SHAs for the builds which *did* succeed).
I'm working on a fix for this right now. I'm planning to re-run a modified
copy of the release packaging scripts which will emit only the missing
artifacts (so we won't upload new artifacts with different SHAs for the
builds which *did* succeed).
I expect to have this finished in the next day or so.
Hi all,
I have a process that takes only 40 seconds in local mode, while the same process
in stand-alone mode, with the node used for local mode as the only available
node, takes forever: RDD actions hang.
I could only "sort this out" by turning speculation on, so the same task
hanging is
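For anyone hitting the same thing, "turning speculation on" here is just a configuration flag; a minimal sketch (the app name is a placeholder):
  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("example")             // placeholder
    .set("spark.speculation", "true")  // re-launches straggling tasks on other executors
  val sc = new SparkContext(conf)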
I think this is the most up to date branch (used in Spark 1.5):
https://github.com/pwendell/hive/tree/release-1.2.1-spark
On Mon, Oct 5, 2015 at 1:03 PM, weoccc wrote:
> Hi,
>
> I would like to know the location of the Spark Hive GitHub repository that the
> Spark build depends on. I was told it used to be here,
> https://github.com/pwendell/hive, but it seems it is no longer there.
Hi,
I would like to know the location of the Spark Hive GitHub repository that the
Spark build depends on. I was told it used to be here,
https://github.com/pwendell/hive, but it seems it is no longer there.
Thanks a lot,
Weide
Thanks Yin, I'll put together a JIRA and a PR tomorrow.
Ewan
-- Original message--
From: Yin Huai
Date: Mon, 5 Oct 2015 17:39
To: Ewan Leith;
Cc: dev@spark.apache.org;
Subject: Re: Dataframe nested schema inference from Json without type conflicts
Hello Ewan,
Adding a JSON-specific option makes sense.
Hello Ewan,
Adding a JSON-specific option makes sense. Can you open a JIRA for this?
Also, sending out a PR will be great. For JSONRelation, I think we can pass
all user-specific options to it (see
org.apache.spark.sql.execution.datasources.json.DefaultSource's
createRelation) just like what we do
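From the user side, I'd imagine such an option being passed like any other data source option; a sketch with a purely made-up option name, since nothing like it exists yet:
  // "treatConflictingTypesAsString" is a hypothetical name used only for illustration
  val df = sqlContext.read
    .format("json")
    .option("treatConflictingTypesAsString", "true")
    .load("hdfs:///data/events.json")   // placeholder path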
I've done some digging today and, as a quick and ugly fix, altering the case
statement of the JSON inferField function in InferSchema.scala
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala
to have
case VALUE_ST
Blaž said:
Also missing is
http://s3.amazonaws.com/spark-related-packages/spark-1.5.1-bin-hadoop1.tgz
which breaks spark-ec2 script.
This is the package I am referring to in my original email.
Nick said:
It appears that almost every version of Spark up to and including 1.5.0 has
included a -bin
Hi,
We're building our own framework on top of Spark, and we give users a pretty
complex schema to work with. That requires us to build dataframes by
ourselves: we transform business objects into rows and struct types and use
these two to create a dataframe.
Everything was fine until I started to
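For context, this is the kind of construction I mean (the schema and values below are a made-up minimal example):
  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

  // business objects flattened into Rows by hand
  val rows = sc.parallelize(Seq(Row("alice", 42L), Row("bob", 7L)))
  // the matching schema, also built by hand
  val schema = StructType(Seq(
    StructField("name", StringType, nullable = true),
    StructField("count", LongType, nullable = true)))
  val df = sqlContext.createDataFrame(rows, schema)
  df.show()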
Actions trigger jobs. A job is made up of stages. A stage is made up of
tasks. Executor threads execute tasks.
Does that answer your question?
On Mon, Oct 5, 2015 at 12:52 PM, Guna Prasaad wrote:
> What is the difference between a task and a job in spark and
> spark-streaming?
>
> Regards,
> Guna
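A tiny illustration of that hierarchy (the input path is a placeholder):
  val lines = sc.textFile("hdfs:///data/input")                            // lazy: no job yet
  val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)  // still lazy
  counts.count()  // the action triggers one job; the shuffle splits it into two stages,
                  // and each stage runs one task per partition on the executors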
What is the difference between a task and a job in spark and
spark-streaming?
Regards,
Guna
Also missing is
http://s3.amazonaws.com/spark-related-packages/spark-1.5.1-bin-hadoop1.tgz
which breaks spark-ec2 script.
On Mon, Oct 5, 2015 at 5:20 AM, Ted Yu wrote:
> hadoop1 package for Scala 2.10 wasn't in RC1 either:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/