Like you, there are lots of people coming from the MapReduce world and trying to
understand the internals of Spark. I hope the following helps in some way.
End users only have the concept of a Job: I want to run a word count
job on this one big file; that is the job I want to run. How many stage
Take a look at this JIRA: https://issues.apache.org/jira/browse/SPARK-6910
Yong
> Date: Mon, 1 Jun 2015 12:26:16 -0700
> From: oke...@gmail.com
> To: user@spark.apache.org
> Subject: SparkSQL's performance gets degraded depending on number of
> partitions of Hive tables..is it normal?
>
>
>
Yes. Hive UDFs and DISTRIBUTE BY are both supported by Spark SQL.
If you are using Spark 1.4, you can try the Hive analytic window functions
(https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics), most
of which are already supported in Spark 1.4, so you don't need the
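For illustration, a minimal sketch of calling one of those window functions through a HiveContext in Spark 1.4 (the table and column names are placeholders, not from this thread):

// Hypothetical table sales(store, sale_day, amount); RANK() is one of the Hive
// window functions that Spark SQL 1.4 exposes through the HiveContext.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("""
  SELECT store, sale_day, amount,
         RANK() OVER (PARTITION BY store ORDER BY amount DESC) AS rnk
  FROM sales
""").show()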
Hi, Spark Users:
I have a problem where Spark cannot recognize the string type in the
Parquet schema generated by Hive.
Version of all components:
Spark 1.3.1, Hive 0.12.0, Parquet 1.3.2
I generated a detailed low-level table in Parquet format using MapReduce Java
code. This table can be read
l binary fields :)
Cheng
On 9/25/15 2:03 PM, java8964 wrote:
Hi, Spark Users:
I have a problem where Spark cannot recognize the
string type in the Parquet schema generated by Hive.
Version of
Your employee field is in fact an array of structs, not just a struct.
If you are using the HiveContext, then you can refer to it like the following:
select id from member where employee[0].name = 'employee0'
The employee[0] is pointing to the 1st element of the array.
If you want to query all the elements in the
I don't think you can do that in Standalone mode before 1.5.
The best you can do is to have multiple workers per box. One worker can and will
only start one executor before Spark 1.5.
What you can do is set "SPARK_WORKER_INSTANCES", which controls how many
worker instances you can start per box.
You have 2 options:
Option 1:
Use lateral view explode, as you did below. But if you want to remove the
duplicate, then use distinct after that.
For example:
col1, col2, ArrayOf(Struct)
After explode:
col1, col2, employee0
col1, col2, employee1
col1, col2, employee0
Then select distinct col1, col2 f
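A sketch of option 1 as HiveQL run through Spark SQL (table and column names are placeholders; LATERAL VIEW needs the HiveContext):

// Explode the array of structs into one row per element, then de-duplicate the
// flattened rows with DISTINCT.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("""
  SELECT DISTINCT col1, col2, e.name
  FROM my_table LATERAL VIEW explode(employees) exploded AS e
""")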
I am not sure about the original explanation of shuffle write.
In the word count example, the shuffle is needed, as Spark has to group by the
word (reduceByKey is more accurate here). Imagine that you have 2 mappers reading
the data; then each mapper will generate the (word, count) tuple output in
segment
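For reference, the word-count shape being described, as a generic sketch (the path is a placeholder):

// Each map task emits (word, 1) pairs; reduceByKey combines them map-side first and
// then shuffles, so all counts for the same word end up in one reduce task.
val counts = sc.textFile("hdfs:///data/big.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)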
"The 2000 bytes sent to the driver is the final output aggregated on the
reducers' end, and merged back to the driver." Which part of our word count
code takes care of this part? And yes, there are only 273 distinct words in the
text, so that's not a surprise.
Thanks again,
Hope to get a reply
For every shuffle write, it always writes to disk. What is the meaning
of these properties:
spark.shuffle.memoryFraction
spark.shuffle.spill
Thanks,
Kartik
On Fri, Oct 2, 2015 at 6:22 AM, java8964 wrote:
No problem.
From the mapper side, Spark is very similar to MapReduce;
You want to implement a custom InputFormat for your MPP, which can provide the
location preference information to Spark.
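A rough sketch of what such an InputFormat could look like (every name, the shard-to-host mapping, and the dummy record reader are hypothetical; a real implementation would talk to the MPP's client API):

import java.io.{DataInput, DataOutput}
import org.apache.hadoop.io.{Text, Writable}
import org.apache.hadoop.mapreduce._
import scala.collection.JavaConverters._

// One split per database shard; getLocations() is what Spark reads as the
// preferred host(s) for the corresponding partition.
class MppSplit(var host: String, var shardId: Int) extends InputSplit with Writable {
  def this() = this("", 0)                               // Hadoop requires a no-arg constructor
  override def getLength: Long = 0L                      // unknown size is acceptable
  override def getLocations: Array[String] = Array(host)
  override def write(out: DataOutput): Unit = { out.writeUTF(host); out.writeInt(shardId) }
  override def readFields(in: DataInput): Unit = { host = in.readUTF(); shardId = in.readInt() }
}

class MppInputFormat extends InputFormat[Text, Text] {
  override def getSplits(context: JobContext): java.util.List[InputSplit] = {
    // In reality, ask the MPP catalog which host owns which shard.
    val shardHosts = Seq("mpp-node-1", "mpp-node-2")
    shardHosts.zipWithIndex.map { case (h, i) => new MppSplit(h, i): InputSplit }.asJava
  }
  override def createRecordReader(split: InputSplit, ctx: TaskAttemptContext): RecordReader[Text, Text] =
    new RecordReader[Text, Text] {
      private val mppSplit = split.asInstanceOf[MppSplit]
      private var done = false
      override def initialize(s: InputSplit, c: TaskAttemptContext): Unit = ()  // open the DB connection here
      override def nextKeyValue(): Boolean = { val more = !done; done = true; more }  // emits one dummy row
      override def getCurrentKey: Text = new Text(mppSplit.host)
      override def getCurrentValue: Text = new Text(s"rows of shard ${mppSplit.shardId}")
      override def getProgress: Float = if (done) 1.0f else 0.0f
      override def close(): Unit = ()
    }
}

// Partitions of this RDD will prefer the hosts reported by getLocations().
val mppRdd = sc.newAPIHadoopRDD(sc.hadoopConfiguration,
  classOf[MppInputFormat], classOf[Text], classOf[Text])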
Yong
> Date: Mon, 5 Oct 2015 10:53:27 -0700
> From: vjan...@sankia.com
> To: user@spark.apache.org
> Subject: Building RDD for a Custom MPP Database
>
> Hi
> I have to build a
Hi, Sparkers:
In this case, I want to use Spark as an ETL engine to load the data from
Cassandra, and save it into HDFS.
Here is the environment information:
Spark 1.3.1, Cassandra 2.1, HDFS/Hadoop 2.2
I am using the Cassandra Spark Connector 1.3.x, with which I have no problem
querying the C*
is related: SPARK-10501
On Fri, Oct 9, 2015 at 7:28 AM, java8964 wrote:
Hi, Sparkers:
In this case, I want to use Spark as an ETL engine to load the data from
Cassandra, and save it into HDFS.
Here is the environment information:
Spark 1.3.1, Cassandra 2.1, HDFS/Hadoop 2.2
I am using the
My guess is to use the same UDAF (collect_set) as in Hive.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions(UDAF)
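As an illustration, a hedged sketch of calling that Hive UDAF from Spark SQL (table and column names are placeholders; Hive UDAFs need the HiveContext):

// Group rows by k and collect the distinct v values of each group into an array column.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("SELECT k, collect_set(v) AS vs FROM my_table GROUP BY k").show()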
Yong
From: sliznmail...@gmail.com
Date: Wed, 14 Oct 2015 02:45:48 +
Subject: Re: Spark DataFrame GroupBy into List
To: m
Hi, Sparkers:
I wonder if I can convert an RDD of my own Java class into a DataFrame in Spark
1.3.
Here is what I am trying to achieve: I want to load the data from Cassandra, and
store it into HDFS using either Avro or Parquet format. I want to test if I
can do this in Spark.
I am using Spark 1.3.1
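A minimal sketch of the conversion in Spark 1.3 (the case class and paths are made up; createDataFrame also accepts a JavaRDD plus a bean class for plain Java classes):

import org.apache.spark.sql.SQLContext

case class Person(id: Long, name: String)                 // placeholder for the real class
val sqlContext = new SQLContext(sc)
val rdd = sc.parallelize(Seq(Person(1L, "a"), Person(2L, "b")))
val df = sqlContext.createDataFrame(rdd)                   // schema inferred from the case class
df.saveAsParquetFile("hdfs:///tmp/person.parquet")         // Spark 1.3 API; 1.4+ uses df.write.parquet(...)
// With the spark-avro package, df.save("hdfs:///tmp/person.avro", "com.databricks.spark.avro") writes Avro.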
Not sure the window function can work for his case.
If you do a "sum() over (partition by)", that will return a total sum per
partition, instead of the cumulative sum wanted in this case.
I saw there is a "cume_dist", but no "cume_sum".
Do we really have a "cume_sum" in the Spark window functions, or a
My mistake. I didn't notice that "UNBOUNDED PRECEDING" is already supported.
So a cumulative sum should work then.
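For reference, a sketch of the running total with UNBOUNDED PRECEDING (hypothetical table txn(account, ts, amount); needs a HiveContext and Spark 1.4+):

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("""
  SELECT account, ts, amount,
         SUM(amount) OVER (PARTITION BY account ORDER BY ts
                           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
  FROM txn
""")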
Thanks
Yong
From: java8...@hotmail.com
To: mich...@databricks.com; deenar.toras...@gmail.com
CC: spanayo...@msn.com; user@spark.apache.org
Subject: RE: Spark SQL running totals
Date: Thu, 15
Maybe you need the Hive part?
Yong
Date: Mon, 26 Oct 2015 11:34:30 -0400
Subject: Problem with make-distribution.sh
From: yana.kadiy...@gmail.com
To: user@spark.apache.org
Hi folks,
The building-Spark instructions
(http://spark.apache.org/docs/latest/building-spark.html) suggest that
./make-distr
Won't you be able to use a CASE statement to generate a virtual column (like
partition_num), then use analytic SQL partitioned by this virtual column?
In this case, the full dataset will be just scanned once.
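One hedged reading of the suggestion (the table, the bucket boundaries, and the window function are all placeholders):

// The CASE expression derives the virtual partition column during the same single scan,
// and the analytic function is then partitioned by it (window functions need the HiveContext).
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("""
  SELECT id, amount, bucket,
         ROW_NUMBER() OVER (PARTITION BY bucket ORDER BY amount) AS rn
  FROM (SELECT id, amount,
               CASE WHEN amount < 100 THEN 'small' ELSE 'large' END AS bucket
        FROM events) t
""")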
Yong
Date: Thu, 29 Oct 2015 10:51:53 -0700
Subject: RDD's filter() or using 'where' conditi
bucket_id=2
..
Am I right?
Thanks
Anfernee
On Thu, Oct 29, 2015 at 11:07 AM, java8964 wrote:
Won't you be able to use a CASE statement to generate a virtual column (like
partition_num), then use analytic SQL partitioned by this virtual column?
In this case, the full dataset will be just scan
Any reason that Spark Cassandra connector won't work for you?
Yong
To: bryan.jeff...@gmail.com; user@spark.apache.org
From: bryan.jeff...@gmail.com
Subject: RE: Cassandra via SparkSQL/Hive JDBC
Date: Tue, 10 Nov 2015 22:42:13 -0500
Anyone have thoughts or a similar use-case for SparkSQL / Cassand
In my Spark application, I want to access the passed-in configuration, but it
doesn't work. How should I do that?
object myCode extends Logging {
// starting point of the application
def main(args: Array[String]): Unit = {
val sparkContext = new SparkContext()
val runtimeEnvironment = sp
prefix. So try something like
--conf spark.runtime.environment=passInValue .
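A small sketch of reading such a property back inside the application (the property name follows the example above; the default value is made up):

// Only settings whose names start with "spark." are forwarded by --conf,
// so the application reads them from the SparkConf attached to the SparkContext.
import org.apache.spark.SparkContext
val sparkContext = new SparkContext()
val runtimeEnvironment = sparkContext.getConf.get("spark.runtime.environment", "dev")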
Regards,
Varun
On Thu, Nov 12, 2015 at 9:51 PM, java8964 wrote:
In my Spark application, I want to access the passed-in configuration, but it
doesn't work. How should I do that?
object myCode extends Logging {
// s
Hi, I have one question related to Spark-Avro; not sure if this is the best
place to ask.
I have the following Scala case class, populated with data in the Spark
application, and I tried to save it in Avro format in HDFS:
case class Claim ( ..)
case class Coupon ( account_id: Long .
Hi, Spark users:
We currently are using Spark 1.2.2 + Hive 0.12 + Hadoop 2.2.0 on our production
cluster, which has 42 data/task nodes.
There is one dataset stored as Avro files of about 3 TB. Our business has a complex
query running against the dataset, which is stored in a nested structure with an Array of
St
...@databricks.com
Date: Fri, 7 Aug 2015 11:32:21 -0700
Subject: Re: Spark SQL query AVRO file
To: java8...@hotmail.com
CC: user@spark.apache.org
Have you considered trying Spark SQL's native support for avro data?
https://github.com/databricks/spark-avro
On Fri, Aug 7, 2015 at 11:30 AM, java8964
it
using HiveQL
CREATE TEMPORARY TABLE episodes
USING com.databricks.spark.avro
OPTIONS (path "src/test/resources/episodes.avro")
On Fri, Aug 7, 2015 at 11:42 AM, java8964 wrote:
Hi, Michael:
I am not sure how spark-avro can help in this case.
My understanding is that to use Spa
Currently we have an IBM BigInsight cluster with 1 namenode + 1 JobTracker + 42
data/task nodes, which runs BigInsight V3.0.0.2, corresponding to Hadoop
2.2.0 with MR1.
Since IBM BigInsight doesn't come with Spark, we built Spark 1.2.2 with
Hadoop 2.2.0 + Hive 0.12 ourselves, and dep
gured log4j)
I think your executors are thrashing or spilling to disk. check memory
metrics/swapping
On 11 August 2015 at 23:19, java8964 wrote:
Currently we have a IBM BigInsight cluster with 1 namenode + 1 JobTracker + 42
data/task nodes, which runs with BigInsight V3.0.0.2, corresponding wit
Hi, This email is sent to both dev and user list, just want to see if someone
familiar with Spark/Maven build procedure can provide any help.
I am building Spark 1.2.2 with the following command:
mvn -Phadoop-2.2 -Dhadoop.version=2.2.0 -Phive -Phive-0.12.0
The spark-assembly-1.2.2-hadoop2.2.0.jar
I still want to check if anyone can provide any help related to Spark 1.2.2
hanging on our production cluster when reading big HDFS data (7800 Avro
blocks), while it looks fine for small data (769 Avro blocks).
I enabled the debug level in the Spark log4j, and attached the log file in case it
helps.
I am comparing the Spark logs line by line between the hanging case (big
dataset) and the non-hanging case (small dataset).
In the hanging case, Spark's log looks identical to the non-hanging case for
reading the first block of data from HDFS.
But after that, starting from line 438 in the spark
l to serve
classes is not responsive. I'd try running outside of the repl and see if that
works.
sorry not a full diagnosis but maybe this'll help a bit.
On Tue, Aug 11, 2015 at 3:19 PM, java8964 wrote:
Currently we have a IBM BigInsight cluster with 1 namenode + 1 JobTracker + 42
da
From the log, it looks like the OS user who is running Spark cannot open any
more files.
Check your ulimit setting for that user:
ulimit -a
open files (-n) 65536
> Date: Tue, 18 Aug 2015 22:06:04 -0700
> From: swethakasire...@gmail.com
> To: user@spark.apache.org
> Subject: F
Hi, Sparkers:
After the first 2 weeks of Spark in our production cluster, being more familiar with
Spark, we are more confident about avoiding "Lost Executor" errors due to memory issues. So
far, most of our jobs won't fail or slow down due to a "Lost executor".
But sometimes, I observed that individual tasks failed
The closest information I can find online related to this error is
https://issues.apache.org/jira/browse/SPARK-3633
But it is quite different in our case. In our case, we never saw the "(Too many
open files)" error; the log just simply shows the 120 sec timeout.
I checked all the GC output from al
I believe "spark-shell -i scriptFile" is there. We also use it, at least in
Spark 1.3.1.
"dse spark" will just wrap "spark-shell" command, underline it is just invoking
"spark-shell".
I don't know too much about the original problem though.
Yong
Date: Fri, 21 Aug 2015 18:19:49 +0800
Subject: Re:
What version of Spark are you using, or which one comes with DSE 4.7?
We just cannot reproduce it in Spark.
yzhang@localhost>$ more test.spark
val pairs = sc.makeRDD(Seq((0,1),(0,2),(1,20),(1,30),(2,40)))
pairs.reduceByKey((x,y) => x + y).collect
yzhang@localhost>$ ~/spark/bin/spark-shell --master local -i t
In the test job I am running in Spark 1.3.1 on our stage cluster, I can see the
following information on the application's stage page:

Metric      Min     25th percentile   Median    75th percentile   Max
Duration    0 ms    1.1 min           1.5 min   1.7 min           3.4 min
GC Time     11 s    16 s              21 s      25 s              54 s

From the GC output log, I can see it is abou
Was your Spark built with Hive?
I met the same problem before, because the hive-exec jar in Maven itself
includes "protobuf" classes, which will be included in the Spark jar.
Yong
Date: Tue, 25 Aug 2015 12:39:46 -0700
Subject: Re: Protobuf error when streaming from Kafka
From: lcas...@gmail.com
T
Hi, On our production environment, we have a unique problems related to Spark
SQL, and I wonder if anyone can give me some idea what is the best way to
handle this.
Our production Hadoop cluster is IBM BigInsight Version 3, which comes with
Hadoop 2.2.0 and Hive 0.12.
Right now, we build spark 1
ssue? Do I need to build spark from source
code?
On Tue, Aug 25, 2015 at 1:06 PM, Cassa L wrote:
I downloaded below binary version of spark.
spark-1.4.1-bin-cdh4
On Tue, Aug 25, 2015 at 1:03 PM, java8964 wrote:
Did your spark build with Hive?
I met the same problem before because the hive-e
What version of Hive are you using? And did you compile against the right version
of Hive when you compiled Spark?
BTW, spark-avro works great in our experience, but still, some non-tech people
just want to use a SQL shell in Spark, like the Hive CLI.
Yong
From: mich...@databricks.com
Date: Wed, 2
f this issue might be because of querying across different schema versions
of the data?
Thanks,
Giri
On Thu, Aug 27, 2015 at 5:39 AM, java8964 wrote:
What version of Hive are you using? And did you compile against the right version
of Hive when you compiled Spark?
BTY, spark-avro works great for our ex
Or would RDD.max() and RDD.min() not work for you?
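What that suggestion amounts to, as a tiny sketch on made-up numbers:

val values = sc.parallelize(Seq(3.0, 7.5, 1.2))
val (lowest, highest) = (values.min(), values.max())   // 1.2 and 7.5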
Yong
Subject: Re: Calculating Min and Max Values using Spark Transformations?
To: as...@wso2.com
CC: user@spark.apache.org
From: jfc...@us.ibm.com
Date: Fri, 28 Aug 2015 09:28:43 -0700
If you already loaded csv data into a dataframe, why not registe
There are several possibilities here.
1) Keep in mind that 7 GB of data will need way more than 7 GB of heap, as deserialized
Java objects need much more space than the raw data. A rough rule is to multiply by 6 to
8 times, so 7 GB of data needs about 50 GB of heap space.
2) You should monitor the Spark UI, to check how many record
For text files, this merge works fine, but for binary formats like "ORC",
"Parquet" or "Avro", I am not sure this will work.
These kinds of formats in fact are not append-able, as they write the detailed
data information either in the head or in the tail part of the file.
You have to use the format-specific A
It is a bad idea to cross a major version change of protobuf, as it most likely
won't work.
But if you really want to give it a try, set "user classpath first", so the
protobuf 3 coming with your jar will be used.
The setting depends on your deployment mode; check this for the parameter:
https:/
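To the best of my knowledge, the relevant settings for Spark 1.3+ are the ones below (the older experimental name was spark.files.userClassPathFirst); sketched here on a SparkConf:

// Ask Spark to prefer classes from the user's jar (e.g. protobuf 3) over its own copies.
val conf = new org.apache.spark.SparkConf()
  .set("spark.driver.userClassPathFirst", "true")
  .set("spark.executor.userClassPathFirst", "true")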
package all my custom
classes and their dependencies, including protobuf 3. The problem is how to
configure spark-shell to use my uber jar first.
java8964 -- I appreciate the link and will try the configuration. Looks
promising. However, the "user classpath first" attribute does not apply
When you saw this error, did any executor die due to whatever error?
Did you check to see if any executor restarted during your job?
It is hard to help you with just the stack trace. You need to tell us the whole
picture of what happens when your jobs are running.
Yong
From: qhz...@apache.org
Date: Tue, 15 Sep 20
ch the block, and after
several retries, the executor just dies with such an error. And for your
question, I did not see any executor restart during the job. PS: the
operator I am using during that stage is rdd.glom().mapPartitions().
On Tue, Sep 15, 2015 at 11:44 PM, java8964 wrote:
When you saw
er: key already cancelled ?
sun.nio.ch.selectionkeyi...@3011c7c9
java.nio.channels.CancelledKeyException
at
org.apache.spark.network.nio.ConnectionManager.run(ConnectionManager.scala:461)
at
org.apache.spark.network.nio.ConnectionManager$$anon$7.run(ConnectionManager.scala:193)
java8964 wrote on Wed, Sep 16, 2015 at 8
Or at least tell us how many partitions you are using.
Yong
> Date: Tue, 22 Sep 2015 02:06:15 -0700
> From: belevts...@gmail.com
> To: user@spark.apache.org
> Subject: Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables
>
> Could it be that your data is skewed? Do you have variable-len
Your performance problem sounds like it is in the driver, which is trying to
broadcast 10k files all by itself, and that becomes the bottleneck.
What you want is just to transfer the data from Avro format, per file, to another
format. In MR, most likely each mapper processes one file, and you utilized the
w
oductname),''),lower(regexp_replace(regexp_replace(substr(productcategory,2,length(productcategory)-2),'\"',''),\",\",'
') inputlist from landing where dt='2015-9' and userid != '' and userid is
not null and userid
That is interesting.
I don't have any Mesos experience, but just want to know the reason why it does
so.
Yong
> Date: Wed, 23 Sep 2015 15:53:54 -0700
> Subject: Debugging too many files open exception issue in Spark shuffle
> From: dbt...@dbtsai.com
> To: user@spark.apache.org
>
> Hi,
>
> Recen
select userid from landing
where dt='2015-9' and userid != '' and userid is not null and userid is not
NULL and pagetype = 'productDetail' group by userid
""".stripMargin)
@java8964
I tried with sql.shuffle.partitions = 1 but no luck. I
t;,'
') inputlist from landing where
dt='${dateUtil.getYear}-${dateUtil.getMonth}' and day >= '${day}' and userid !=
'' and userid is not null and userid is not NULL and pagetype = 'productDetail'
group by userid
""".stripMa
I am evaluating Spark for our production usage. Our production cluster is
Hadoop 2.2.0 without Yarn. So I want to test Spark with Standalone deployment
running with Hadoop.
What I have in mind is to test a very complex Hive query, which joins 6
tables, with lots of nested structures and explo
Finally I gave up after too many failed retries.
From the log on the worker side, it looks like it failed with a JVM OOM, as below:
15/02/05 17:02:03 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception
in thread Thread[Driver Heartbeater,5,main]
java.lang.OutOfMemoryError: Java heap s
pass in a level of parallelism
as a second parameter to a suitable operation in your code.
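For example, both of these accept an explicit level of parallelism (the path and the numbers are placeholders):

val lines  = sc.textFile("hdfs:///data/input", 200)    // minimum number of partitions
val counts = lines.flatMap(_.split("\\s+"))
  .map(w => (w, 1))
  .reduceByKey(_ + _, 200)                             // number of reduce-side partitions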
Deb
On Thu, Feb 5, 2015 at 1:03 PM, java8964 wrote:
I am evaluating Spark for our production usage. Our production cluster is
Hadoop 2.2.0 without Yarn. So I want to test Spark with Standalone deployment
Hi, I have some questions about how Spark runs jobs concurrently.
For example, suppose I set up Spark on one standalone test box, which has 24 cores
and 64 GB of memory. I set the Worker memory to 48 GB and the Executor memory to 4 GB,
and use spark-shell to run some jobs. Here is something confusing
utor is using. I
> suppose in theory you could write a function that starts its own
> threads too, but that's not generally a good idea or necessary.
>
> Did you read the docs on the site?
> http://spark.apache.org/docs/latest/cluster-overview.html
> http://spark.apache.org/
Hi,
I am using Spark 1.2.0 with Hadoop 2.2. Now I have 2 csv files, each with 8
fields. I know that the first field in both files is an ID. I want to find all
the IDs that exist in the first file but NOT in the 2nd file.
I came up with the following code in spark-shell.
case class origAsLeft
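For reference, a hedged sketch of the same idea without the case classes (the CSV layout is assumed; subtractByKey is an alternative to the left outer join):

// Key each file by its first field (the ID), then keep the IDs from file1 whose
// left outer join against file2 found no match (the right side is None, not null).
val left  = sc.textFile("hdfs:///data/file1.csv").map(_.split(",")).map(f => (f(0), f))
val right = sc.textFile("hdfs:///data/file2.csv").map(_.split(",")).map(f => (f(0), f))
val onlyInFirst = left.leftOuterJoin(right)
  .filter { case (_, (_, rightSide)) => rightSide.isEmpty }
  .keys
// Shorter equivalent: left.subtractByKey(right).keys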
OK. I think I have to use "None" instead of null; then it works. Still switching
over from Java.
I can also just use the field name, as I assumed.
Great experience.
From: java8...@hotmail.com
To: user@spark.apache.org
Subject: spark left outer join with java.lang.UnsupportedOperationException:
empty
Hi, Sparkers:
I just happened to search Google for something related to Spark's
RangePartitioner, and found an old thread on this email list here:
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-and-Partition-td991.html
I followed the code example mentioned in that email thread
Hi,
I am new to Spark, and I am trying to test Spark SQL performance vs Hive. I
set up a standalone box with 24 cores and 64 GB of memory.
We have one SQL query in mind to test. Here is the basic setup on this one box
for the SQL we are trying to run:
1) Dataset 1: a 6.6 GB Avro file with Snappy compre
Can someone share some ideas about how to tune the GC time?
Thanks
From: java8...@hotmail.com
To: user@spark.apache.org
Subject: Spark performance tuning
Date: Fri, 20 Feb 2015 16:04:23 -0500
Hi,
I am new to Spark, and I am trying to test the Spark SQL performance vs Hive. I
setup a standalo
Hi, Sparkers:
I come from the Hadoop MapReducer world, and try to understand some internal
information of spark. From the web and this list, I keep seeing people talking
about increase the parallelism if you get the OOM error. I tried to read
document as much as possible to understand the RDD pa
Can anyone share any thoughts related to my questions?
Thanks
From: java8...@hotmail.com
To: user@spark.apache.org
Subject: Help me understand the partition, parallelism in Spark
Date: Wed, 25 Feb 2015 21:58:55 -0500
Hi, Sparkers:
I come from the Hadoop MapReducer world, and try to understand
kely to
hit an OOM. But there are many other possible sources of OOM, so this is
definitely not the *only* solution.
Sorry I can't comment in particular about Spark SQL -- hopefully somebody more
knowledgeable can comment on that.
On Wed, Feb 25, 2015 at 8:58 PM, java8964 wrote:
Hi, S
Hi, Currently most of the data in our production is stored as Avro + Snappy. I want
to test the benefits if we store the data in Parquet format. I changed our
ETL to generate Parquet format instead of Avro, and want to test a simple
SQL query in Spark SQL, to verify the benefits from Parquet.
I g
This is a Java problem, not really a Spark one.
From this page:
http://stackoverflow.com/questions/18520972/converting-java-file-url-to-file-path-platform-independent-including-u
you can see that using java.nio.* on JDK 7 will fix this issue. But the Path
class in Hadoop uses java.io.* instead o
sc.textFile(…)?
Ningjun
From: java8964 [mailto:java8...@hotmail.com]
Sent: Monday, March 09, 2015 5:33 PM
To: Wang, Ningjun (LNG-NPV); user@spark.apache.org
Subject: RE: sc.textFile() on windows cannot access UNC path
This is a Java problem, not really a Spark one.
From this p
Or another option is to use "Scala-IDE", which is built on top of Eclipse,
instead of pure Eclipse, so Scala comes with it.
Yong
> From: so...@cloudera.com
> Date: Tue, 10 Mar 2015 18:40:44 +
> Subject: Re: Compilation error
> To: mohitanch...@gmail.com
> CC: t...@databricks.com; user@spark.a
You need to include the Hadoop native library in your spark-shell/spark-sql,
assuming your Hadoop native library includes the native Snappy library:
spark-sql --driver-library-path point_to_your_hadoop_native_library
In spark-sql, you can just use any command as if you were in the Hive CLI.
Yong
Date: Wed,
RangePartitioner?
At least for join, you can implement your own partitioner, to utilize the
sorted data.
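A hedged sketch of what such a partitioner could look like (the key type and range boundaries are made up); if both inputs are partitioned with the same partitioner before the join, Spark joins the co-partitioned data without reshuffling it:

import org.apache.spark.Partitioner

// Route each integer key to the range it falls into, mirroring how the data was sorted.
class KeyRangePartitioner(boundaries: Array[Int]) extends Partitioner {
  override def numPartitions: Int = boundaries.length + 1
  override def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[Int]
    val idx = boundaries.indexWhere(k <= _)
    if (idx < 0) boundaries.length else idx
  }
}

val byRange = new KeyRangePartitioner(Array(100, 200, 300))
val a = sc.parallelize(Seq((42, "a"), (250, "b"))).partitionBy(byRange)
val b = sc.parallelize(Seq((42, "x"), (999, "y"))).partitionBy(byRange)
val joined = a.join(b)   // co-partitioned inputs, so the join avoids another shuffle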
Just my 2 cents.
Date: Wed, 11 Mar 2015 17:38:04 -0400
Subject: can spark take advantage of ordered data?
From: jcove...@gmail.com
To: User@spark.apache.org
Hello all,
I am wondering if spark
Hi, I am new to Spark. I am trying to understand the memory benefits of using
KryoSerializer.
I have a one-box standalone test environment, with 24 cores and 24 GB of
memory. I installed Hadoop 2.2 plus Spark 1.2.0.
I put one text file of about 1.2 GB into HDFS. Here are the settings in the
spark-en
Here is what I think:
mapPartitions is for a specialized map that is called only once for each
partition. The entire content of the respective partition is available as a
sequential stream of values via the input argument (Iterator[T]). The
combined result iterators are automatically converted into a new RDD.
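A small illustration of that contract on made-up data:

// The function runs once per partition, receives an Iterator over that partition's
// elements, and returns an Iterator; per-partition setup (e.g. a connection) goes inside.
val rdd  = sc.parallelize(1 to 10, 2)
val sums = rdd.mapPartitions(iter => Iterator(iter.sum))
sums.collect()   // one partial sum per partition: Array(15, 40)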
I read the Spark code a little bit, trying to understand my own question.
It looks like the difference is really between
org.apache.spark.serializer.JavaSerializer and
org.apache.spark.serializer.KryoSerializer, both having the method named
writeObject.
In my test case, for each line of my text f
jects with no circular references. I think
that will improve the performance a little, though I dunno how much.
It might be worth running your experiments again with slightly more complicated
objects and see what you observe.
Imran
On Thu, Mar 19, 2015 at 12:57 PM, java8964 wrote:
I read th
Do you think it is the ulimit for the user running Spark on your nodes?
Can you run "ulimit -a" under the user who is running spark on the executor
node? Does the result make sense for the data you are trying to process?
Yong
From: szheng.c...@gmail.com
To: user@spark.apache.org
Subject: com.esotericsof
The files sound too small to be 2 blocks in HDFS.
Did you set the defaultParallelism to be 3 in your spark?
Yong
Subject: Re: 2 input paths generate 3 partitions
From: zzh...@hortonworks.com
To: rvern...@gmail.com
CC: user@spark.apache.org
Date: Fri, 27 Mar 2015 23:15:38 +
Hi Rares,
T
I think the jar file has to be local. HDFS is not supported for this yet in Spark.
See this answer:
http://stackoverflow.com/questions/28739729/spark-submit-not-working-when-application-jar-is-in-hdfs
> Date: Sun, 29 Mar 2015 22:34:46 -0700
> From: n.e.trav...@gmail.com
> To: user@spark.apache.org
> Sub
You can use HiveContext instead of SQLContext; it should support all of
HiveQL, including LATERAL VIEW explode.
SQLContext does not support that yet.
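A minimal sketch of the switch (table and column names are placeholders):

// HiveQL-only constructs such as LATERAL VIEW explode go through the HiveContext.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("SELECT id, e FROM member LATERAL VIEW explode(employee) t AS e")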
BTW, nice coding format in the email.
Yong
Date: Tue, 31 Mar 2015 18:18:19 -0400
Subject: Re: SparkSql - java.util.NoSuchElementException:
It is hard to say what the reason could be without more detailed information. If you
provide some more information, maybe people here can help you better.
1) What is your worker's memory setting? It looks like your nodes have
128 GB of physical memory each, but what do you specify for the worker's heap
I think implementing your own InputFormat and using SparkContext.hadoopFile()
is the best option for your case.
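As a side note (not the custom-InputFormat route itself): for fixed-length binary records, SparkContext already has a helper, sketched here with made-up values:

// Splits the file into 1024-byte records and returns them as an RDD[Array[Byte]].
val records = sc.binaryRecords("hdfs:///data/big.bin", 1024)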
Yong
From: kvi...@vt.edu
Date: Thu, 2 Apr 2015 17:31:30 -0400
Subject: Re: Reading a large file (binary) into RDD
To: freeman.jer...@gmail.com
CC: user@spark.apache.org
The file has a
Hmm, I just tested my own Spark 1.3.0 build. I have the same problem, but I
cannot reproduce it on Spark 1.2.1
If we check the code change below:
Spark 1.3 branch:
https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala
vs
Spark
I tried to check out Spark SQL 1.3.0. I installed it, following the
online documentation here:
http://spark.apache.org/docs/latest/sql-programming-guide.html
In the example, it shows something like this:
// Select everybody, but increment the age by 1
df.select("name", df("age") + 1).show()
//
The import command was already run.
Forgot to mention, the rest of the examples related to "df" all work; just this
one caused a problem.
Thanks
Yong
Date: Fri, 3 Apr 2015 10:36:45 +0800
From: fightf...@163.com
To: java8...@hotmail.com; user@spark.apache.org
Subject: Re: Cannot run the example in the Spa
rhttp://polyglotprogramming.com
On Thu, Apr 2, 2015 at 6:53 PM, java8964 wrote:
I think implementing your own InputFormat and using SparkContext.hadoopFile()
is the best option for your case.
Yong
From: kvi...@vt.edu
Date: Thu, 2 Apr 2015 17:31:30 -0400
Subject: Re: Reading a large file (binary) into R
cartesian is an expensive operation. If you have M records in locations, then
locations.cartesian(locations) will generate an MxM result. If locations is a big
RDD, it is hard to do locations.cartesian(locations) efficiently.
Yong
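A tiny illustration of that blow-up:

val locations = sc.parallelize(1 to 1000)
locations.cartesian(locations).count()   // 1,000,000 pairs from 1,000 records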
> Date: Tue, 7 Apr 2015 10:04:12 -0700
> From: mas.ha...@gmail.c
It is hard to guess why OOM happens without knowing your application's logic
and the data size.
Without knowing that, I can only guess based on some common experience:
1) Increase "spark.default.parallelism"
2) Increase your executor memory; maybe 6g is just not enough
3) Your environment is kind
If you are using the Spark Standalone deployment, make sure you set the
worker memory (SPARK_WORKER_MEMORY) over 20G, and that you do have 20G of physical memory.
Yong
> Date: Tue, 7 Apr 2015 20:58:42 -0700
> From: li...@adobe.com
> To: user@spark.apache.org
> Subject: EC2 spark-submit --executor-memory
>
> Dear Spark team,
>
>
Spark uses the Hadoop TextInputFormat to read the file. Since Hadoop almost
only supports Linux, UTF-8 is the only encoding supported, as it is the
one used on Linux.
If you have data in another encoding, you may want to vote for this JIRA:
https://issues.apache.org/jira/browse/MAPREDUCE-232
Yong
Can you try a local-file URI, something like this:
hiveContext.hql("SET hive.metastore.warehouse.dir=file:///home/spark/hive/warehouse")
Yong
> Date: Thu, 9 Apr 2015 04:59:00 -0700
> From: inv...@gmail.com
> To: user@spark.apache.org
> Subject: SQL can't not create Hive database
>
> H
If it is really due to data skew, would the hanging task have a much bigger Shuffle
Write Size in this case?
In this case, the shuffle write size for that task is 0, and the rest of the I/O of
this task is not much larger than that of the quickly finished tasks; is that normal?
I am also interested in this case, as fro
Really not an expert here, but try the following ideas:
1) I assume you are using YARN; then this blog is very good on resource
tuning:
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
2) If 12G is a hard limit in this case, then you have no option but to lower yo