One thing to note, if you are using Mesos, is that the version of Mesos
changed from 0.21 to 1.0.0. So taking a newer Spark might push you into
larger infrastructure upgrades.
On Fri, Sep 22, 2017 at 2:39 PM, Gokula Krishnan D
wrote:
> Hello All,
>
> Currently our Batch ETL Jobs are in Spark 1.6.
Hi folks, I have created a table in the following manner:
CREATE EXTERNAL TABLE IF NOT EXISTS rum_beacon_partition (
list of columns
)
COMMENT 'User Information'
PARTITIONED BY (account_id String,
product String,
group_id String,
year String,
month String,
day String)
STORED AS
Hi folks, trying to run the Spark 2.1.0 Thrift Server against an HSQLDB file
and it seems to... hang.
I am starting the Thrift Server with:
sbin/start-thriftserver.sh --driver-class-path ./conf/hsqldb-2.3.4.jar
(completely local setup)
hive-site.xml is like this:
hive.metastore.warehouse.d
Hi folks, I have an ETL pipeline that drops a file every 1/2 hour. When
Spark reads these files, I end up with 315K tasks for a dataframe reading a
few days' worth of data.
I know that with a regular Spark job, I can use coalesce to come to a lower
number of tasks. Is there a way to tell HiveThriftserver
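For a plain DataFrame job, the coalesce approach mentioned above would look roughly like this (a minimal sketch; the path and partition count are made-up placeholders, and sqlContext is assumed to exist):
// Sketch: collapse many small input splits into fewer partitions before heavy work.
val df = sqlContext.read.parquet("/data/etl/*")
val fewer = df.coalesce(200)        // coalesce avoids a full shuffle; repartition(200) forces one
fewer.registerTempTable("etl_data") // illustrative table name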
e were 2 different versions of Derby and ensuring
> the metastore and Spark used the same version of Derby made the problem go
> away.
>
> Deenar
>
> On 6 January 2016 at 02:55, Yana Kadiyska wrote:
>
>> Deenar, I have not resolved this issue. Why do you think it's fr
ly it is down to different versions of Derby in the classpath, but
> I am unsure where the other version is coming from. The setup worked
> perfectly with Spark 1.3.1.
>
> Deenar
>
> On 20 December 2015 at 04:41, Deenar Toraskar
> wrote:
>
>> Hi Yana/All
>>
>
et, can you confirm that? If it falls into this
> category, you can probably set
> "spark.sql.thriftServer.incrementalCollect" to false;
>
>
>
> Hao
>
>
>
> *From:* Yana Kadiyska [mailto:yana.kadiy...@gmail.com]
> *Sent:* Friday, November 13, 2015 8:30 AM
> *To
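For reference, the property named in the quoted reply above can be applied like this (a minimal sketch; it can equally go into conf/spark-defaults.conf as a key/value pair):
// Sketch: apply the Thrift Server collect setting suggested in the reply above.
sqlContext.setConf("spark.sql.thriftServer.incrementalCollect", "false")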
Hi folks, I'm starting a HiveServer2 from a HiveContext
(HiveThriftServer2.startWithContext(hiveContext)) and then connecting to
it via beeline. On the server side, I see the below error, which I think
is related to https://issues.apache.org/jira/browse/HIVE-6468
But I'd like to know:
1. why I
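For context, a minimal sketch of the startWithContext path described above (the app name is illustrative):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

// Sketch: embed a HiveServer2/Thrift endpoint in a running Spark application,
// then connect to it with beeline as described above.
val sc = new SparkContext(new SparkConf().setAppName("embedded-thriftserver"))
val hiveContext = new HiveContext(sc)
HiveThriftServer2.startWithContext(hiveContext)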
subtract is not the issue. Spark is lazy, so a lot of times you'd have many,
many lines of code which do not in fact run until you perform some action (in
your case, subtract). As you can see from the stacktrace, the NPE is from
Joda, which is used in the partitioner (I'm suspecting in Cassandra). But the
I then try to run spark-shell thusly:
bin/spark-shell --driver-class-path
/home/yana/db-derby-10.12.1.1-bin/lib/derbyclient.jar
and I get an ugly stack trace like so...
Caused by: java.lang.NoClassDefFoundError: Could not initialize class
org.apache.derby.jdbc.EmbeddedDriver
at
> +-----------+-------------+-------+-----+
> |customer_id|          uri|browser|epoch|
> +-----------+-------------+-------+-----+
> |        999|http://foobar|firefox| 1234|
> |        888|http://foobar|     ie|12343|
> +-----------+-------------+-------+-----+
>
> Cheers
>
> On Fri, Oct 30, 2015 at 12:11 PM, Ya
Hi folks,
I have a need to "append" two dataframes -- I was hoping to use unionAll,
but it seems that this operation treats the underlying dataframes as a
sequence of columns, rather than a map.
In particular, my problem is that the columns in the two DFs are not in the
same order -- notice that my c
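One workaround is to force one DF into the other's column order before the union -- a minimal sketch, assuming both DFs share the same column names (df1/df2 are placeholder names):
import org.apache.spark.sql.functions.col

// Sketch: unionAll matches columns by position, so align df2's columns to df1's order first.
val aligned = df2.select(df1.columns.map(col): _*)
val appended = df1.unionAll(aligned)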
For this issue in particular ( ERROR XSDB6: Another instance of Derby may
have already booted the database /spark/spark-1.4.1/metastore_db) -- I
think it depends on where you start your application and HiveThriftserver
from. I've run into a similar issue running a driver app first, which would
crea
.10.jar
>> -rw-r--r-- hbase/hadoop339666 2015-10-26 09:52
>> spark-1.6.0-SNAPSHOT-bin-custom-spark/lib/datanucleus-api-jdo-3.2.6.jar
>> -rw-r--r-- hbase/hadoop 1809447 2015-10-26 09:52
>> spark-1.6.0-SNAPSHOT-bin-custom-spark/lib/datanucleus-rdbms-3.2.9.jar
>>
>>
In 1.4, ./make-distribution.sh produces a .tgz file in the root directory (the
same directory that make-distribution.sh is in).
On Mon, Oct 26, 2015 at 8:46 AM, Kayode Odeyemi wrote:
> Hi,
>
> The ./make_distribution task completed. However, I can't seem to locate the
> .tar.gz file.
>
> Where does Spark
thank you so much! You are correct. This is the second time I've made this
mistake :(
On Mon, Oct 26, 2015 at 11:36 AM, java8964 wrote:
> Maybe you need the Hive part?
>
> Yong
>
> --
> Date: Mon, 26 Oct 2015 11:34:30 -0400
> Subject: Problem with make-distribution.sh
Hi folks,
the "Building Spark" instructions (
http://spark.apache.org/docs/latest/building-spark.html) suggest that
./make-distribution.sh --name custom-spark --tgz -Phadoop-2.4 -Pyarn
should produce a distribution similar to the ones found on the "Downloads"
page.
I noticed that the tgz I built u
h. We have kind of partitioned our data on the basis of batch_ids folder.
>
> How did you get around it?
>
> Thanks for help. :)
>
> On Sat, Oct 10, 2015 at 7:55 AM, Yana Kadiyska
> wrote:
>
>> can you show the output of df.printSchema? Just a guess but I think I ran
>> i
"Job has not accepted resources" is a well-known error message -- you can
search the Internet. 2 common causes come to mind:
1) you already have an application connected to the master -- by default a
driver will grab all resources so unless that application disconnects,
nothing else is allowed to c
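A common way to address the first cause is to cap what each application takes from the master -- a minimal sketch (the values are arbitrary):
// Sketch: limit how many cores and how much memory this application grabs,
// so other drivers can also be allocated resources.
val conf = new org.apache.spark.SparkConf()
  .setAppName("polite-app")
  .set("spark.cores.max", "4")
  .set("spark.executor.memory", "2g")
val sc = new org.apache.spark.SparkContext(conf)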
can you show the output of df.printSchema? Just a guess but I think I ran
into something similar with a column that was part of a path in parquet.
E.g. we had an account_id in the parquet file data itself which was of type
string but we also named the files in the following manner
/somepath/account
o check to see if offsets 0 through 100 are still actually
> present in the kafka logs.
>
> On Tue, Sep 22, 2015 at 9:38 AM, Yana Kadiyska
> wrote:
>
>> Hi folks, I'm trying to write a simple Spark job that dumps out a Kafka
>> queue into HDFS. Being very new to Kafka, not
Hi folks, I'm trying to write a simple Spark job that dumps out a Kafka
queue into HDFS. Being very new to Kafka, I'm not sure if I'm messing something
up on that side... My hope is to read the messages presently in the queue
(or at least the first 100 for now)
Here is what I have:
Kafka side:
./bin/
Hopefully someone will give you a more direct answer but whenever I'm
having issues with log4j I always try -Dlog4j.debug=true. This will tell you
which log4j settings are getting picked up from where. I've spent countless
hours due to typos in the file, for example.
On Mon, Sep 7, 2015 at 11:47 AM
tions in parallel. It will run out of
> memory if (number of partitions) times (Parquet block size) is greater than
> the available memory. You can try to decrease the number of partitions. And
> could you share the value of "parquet.block.size" and your available memory?
>
Hi folks, I have a strange issue. Trying to read a 7G file and do fairly
simple stuff with it:
I can read the file/do simple operations on it. However, I'd prefer to
increase the number of partitions in preparation for more memory-intensive
operations (I'm happy to wait, I just need the job to com
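For what it's worth, a minimal sketch of asking for more partitions (the path and counts are made-up placeholders):
// Sketch: request more splits up front, or repartition after the read.
val raw = sc.textFile("/data/big-file.txt", minPartitions = 200)
val wider = raw.repartition(400)  // full shuffle, but a smaller per-task memory footprint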
If memory serves me correctly in 1.3.1 at least there was a problem with
when the driver was added -- the right classloader wasn't picking it up.
You can try searching the archives, but the issue is similar to these
threads:
http://stackoverflow.com/questions/30940566/connecting-from-spark-pyspark-
The PermGen space error is controlled with the MaxPermSize parameter. I run
with this in my pom, I think copied pretty literally from Spark's own
tests... I don't know what the sbt equivalent is but you should be able to
pass it...possibly via SBT_OPTS?
org.scalatest
sca
I'm having trouble with refreshTable, I suspect because I'm using it
incorrectly.
I am doing the following:
1. Create DF from parquet path with wildcards, e.g. /foo/bar/*.parquet
2. use registerTempTable to register my dataframe
3. A new file is dropped under /foo/bar/
4. Call hiveContext.refres
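For context, the flow described above looks roughly like this (a minimal sketch; the table name is illustrative):
// Sketch: register a DF read from a wildcard path, then ask for a metadata refresh
// after new files land under the same path.
val df = hiveContext.read.parquet("/foo/bar/*.parquet")
df.registerTempTable("mytable")
// ... new file dropped under /foo/bar/ ...
hiveContext.refreshTable("mytable")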
When I run spark-shell, I run with "--master
> mesos://cluster-1:5050" parameter which is the same with "spark-submit".
> Confused here.
>
>
>
> 2015-07-22 20:01 GMT-05:00 Yana Kadiyska :
>
>> Is it complaining about "collect" or "toMap"
Hi folks, having trouble expressing IN and COLLECT_SET on a dataframe. In
other words, I'd like to figure out how to write the following query:
"select collect_set(b),a from mytable where c in (1,2,3) group by a"
I've started with
someDF
.where( -- not sure what to do for c here ---
.groupBy($
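A possible DataFrame translation is sketched below -- this assumes a Spark version where isin and collect_set exist as DataFrame functions (roughly 1.5+/1.6+); otherwise passing the raw SQL string to sqlContext.sql is the fallback. Column names follow the query above:
import org.apache.spark.sql.functions.{col, collect_set}

// Sketch of: select collect_set(b), a from mytable where c in (1,2,3) group by a
val result = someDF
  .where(col("c").isin(1, 2, 3))
  .groupBy(col("a"))
  .agg(collect_set(col("b")))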
Is it complaining about "collect" or "toMap"? In either case this error is
usually indicative of an old version -- any chance you have an old
installation of Spark somehow? Or Scala? You can try running spark-submit
with --verbose. Also, when you say it runs with spark-shell, do you run
spark shell
Have you tried to examine what clean_cols contains -- I'm suspicious of this
part: mkString(", ").
Try this:
val clean_cols : Seq[String] = df.columns...
if you get a type error you need to work on clean_cols (I suspect yours is
of type String at the moment and presents itself to Spark as a single
col
Hi, could someone point me to the recommended way of using
countApproxDistinctByKey with DataFrames?
I know I can map to a pair RDD, but I'm wondering if there is a simpler
method? If someone knows whether this operation is expressible in SQL, that
information would be most appreciated as well.
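One option that stays in the DataFrame API is the approxCountDistinct aggregate function -- a minimal sketch (column names are made up):
import org.apache.spark.sql.functions.approxCountDistinct

// Sketch: approximate distinct "value" count per "key" without dropping to a pair RDD.
val approx = df.groupBy("key").agg(approxCountDistinct("value"))
approx.show()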
Have you seen this SO thread:
http://stackoverflow.com/questions/13471519/running-daemon-with-exec-maven-plugin
This seems to be more related to the plugin than Spark, looking at the
stack trace
On Tue, Jul 14, 2015 at 8:11 AM, Hafsa Asif
wrote:
> I m still looking forward for the answer. I
Oh, this is very interesting -- can you explain about your dependencies --
I'm running Tomcat 7 and ended up using spark-assembly from WEB-INF/lib and
removing the javax/servlet package from it... but it's a pain in the neck.
If I'm reading your first message correctly you use hadoop common and
sp
t: Row): Validator = {
>
> var check1: Boolean = if (input.getDouble(shortsale_in_pos) >
> 140.0) true else false
>
> if (check1) this else Nomatch
>
> }
>
> }
>
>
>
> Saif
>
>
>
> *From:* Yana Kadiyska [mailto:yana.kadiy...@gmail.co
It's a bit hard to tell from the snippets of code, but it's likely related
to the fact that when you serialize instances, the enclosing class, if any,
also gets serialized, as well as any other place where fields used in the
closure come from... e.g. check this discussion:
http://stackoverflow.com/ques
Have you seen https://issues.apache.org/jira/browse/SPARK-6910? I opened
https://issues.apache.org/jira/browse/SPARK-6984, which I think is related
to this as well. There are a bunch of issues attached to it but basically
yes, Spark interactions with a large metastore are bad...very bad if your
me
| 32| 0|
| 1| 32| 1|
+---+---+---+
On Thu, Jul 9, 2015 at 11:54 AM, ayan guha wrote:
> Can you please post result of show()?
> On 10 Jul 2015 01:00, "Yana Kadiyska" wrote:
>
>> Hi folks, I just re-wrote a query from using UNION ALL to use "with
>> ro
Hi folks, I just re-wrote a query from using UNION ALL to use "with rollup"
and I'm seeing some unexpected behavior. I'll open a JIRA if needed but
wanted to check if this is user error. Here is my code:
case class KeyValue(key: Int, value: String)
val df = sc.parallelize(1 to 50).map(i=>KeyValue(
Hi folks, suffering from a pretty strange issue:
Is there a way to tell what object is being successfully
serialized/deserialized? I have a maven-installed jar that works well when
fat jarred within another, but shows the following stack when marked as
provided and copied to the runtime classpath.
r replying
Original message From: Akhil Das
Date:07/01/2015 2:27 AM (GMT-05:00)
To: Yana Kadiyska Cc:
user@spark.apache.org Subject: Re: Difference between
spark-defaults.conf and SparkConf.set
.addJar works for me when I run it as a stand-alone application (without
using sp
I wonder if this could be a side effect of SPARK-3928. Does ending the path
with *.parquet work?
Original message From: Exie
Date:06/30/2015 9:20 PM (GMT-05:00)
To: user@spark.apache.org Subject: 1.4.0
So I was delighted with Spark 1.3.1 using Parquet 1.6.0 which would
"par
Hi folks, running into a pretty strange issue:
I'm setting
spark.executor.extraClassPath
spark.driver.extraClassPath
to point to some external JARs. If I set them in spark-defaults.conf
everything works perfectly.
However, if I remove spark-defaults.conf and just create a SparkConf and
call
.set(
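For reference, the programmatic route being attempted looks roughly like this (a minimal sketch; the jar path is a placeholder). Note that spark.driver.extraClassPath generally has to be known before the driver JVM starts (spark-defaults.conf or --driver-class-path), which is one reason the SparkConf route can behave differently:
// Sketch: the same settings as in spark-defaults.conf, set on a SparkConf instead.
val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.extraClassPath", "/opt/jars/external.jar")
  .set("spark.driver.extraClassPath", "/opt/jars/external.jar")
val sc = new org.apache.spark.SparkContext(conf)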
Pass that debug string to your executor like this: --conf
spark.executor.extraJavaOptions="-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=7761".
When your executor is launched it will send debug information on port 7761.
When you attach the Eclipse debugger, you need to have the
IP
try?
>
> Thanks
> Best Regards
>
> On Wed, Jun 24, 2015 at 12:07 AM, Yana Kadiyska
> wrote:
>
>> Hi folks, I have been using Spark against an external Metastore service
>> which runs Hive with CDH 4.6
>>
>> In Spark 1.2, I was able to successfully connect
I can't tell immediately, but you might be able to get more info with the
hint provided here:
http://stackoverflow.com/questions/27980781/spark-task-not-serializable-with-simple-accumulator
(short version, set -Dsun.io.serialization.extendedDebugInfo=true)
Also, unless you're simplifying your exam
Hi folks, I have been using Spark against an external Metastore service
which runs Hive with CDH 4.6
In Spark 1.2, I was able to successfully connect by building with the
following:
./make-distribution.sh --tgz -Dhadoop.version=2.0.0-mr1-cdh4.2.0
-Phive-thriftserver -Phive-0.12.0
I see that in S
Can you show some code for how you're doing the reads? Have you successfully
read other stuff from Cassandra (i.e. do you have a lot of experience with
this path and this particular table is causing issues, or are you trying to
figure out the right way to do a read)?
What version of Spark and Cassandra
When all else fails, look at the source ;)
Looks like createJDBCTable is deprecated, but otherwise goes to the same
implementation as insertIntoJDBC...
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala
You can also look at DataFrameWriter in t
Hi folks,
running into a pretty strange issue -- I have a ClassNotFound exception
from a closure?! My code looks like this:
val jRdd1 = table.map(cassRow=>{
val lst = List(cassRow.get[Option[Any]](0),cassRow.get[Option[Any]](1))
Row.fromSeq(lst)
})
println(s"This one worked .
John, I took the liberty of reopening because I have sufficient JIRA
permissions (not sure if you do). It would be good if you can add relevant
comments/investigations there.
On Thu, Jun 11, 2015 at 8:34 AM, John Omernik wrote:
> Hey all, from my other post on Spark 1.3.1 issues, I think we foun
; ... 28 more
>>
>>
>>
>> The Spark Cassandra Connector is trying to use a method, which does not
>> exists. That means your assembly jar has the wrong version of the library
>> that SCC is trying to use. Welcome to jar hell!
>>
>>
>>
my code from datastax spark-cassandra-connector
> <https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector-demos/simple-demos/src/main/java/com/datastax/spark/connector/demo/JavaApiDemo.java>
> .
>
> Thanx alot.
> yasemin
>
> 2015-06
160
>>
>> I check the port with "nc -z localhost 9160; echo $?" and it returns "0". I
>> think it is closed -- should I open this port?
>>
>> 2015-06-09 16:55 GMT+03:00 Yana Kadiyska :
>>
>>> Is your cassandra installation actually listeni
# port for Thrift to listen for clients on
rpc_port: 9160
On Tue, Jun 9, 2015 at 7:36 AM, Yasemin Kaya wrote:
> I couldn't find any solution. I can write but I can't read from Cassandra.
>
> 2015-06-09 8:52 GMT+03:00 Yasemin Kaya :
>
>> Thanks alot Mohammed, Gerard
Yes, whatever you put for listen_address in cassandra.yaml. Also, you
should try to connect to your Cassandra cluster via bin/cqlsh to make sure
you have connectivity before you try to make a connection via Spark.
On Mon, Jun 8, 2015 at 4:43 AM, Yasemin Kaya wrote:
> Hi,
> I run my project on
n compile my app to run this without -Dconfig.file=alt_
> reference1.conf?
>
> 2015-06-02 15:43 GMT+02:00 Yana Kadiyska :
>
>> This looks like your app is not finding your Typesafe config. The config
>> should usually be placed in a particular folder under your app to be seen
>>
This looks like your app is not finding your Typesafe config. The config
should usually be placed in a particular folder under your app to be seen
correctly. If it's in a non-standard location you can
pass -Dconfig.file=alt_reference1.conf to java to tell it where to look.
If this is a config that b
Like this... sqlContext should be a HiveContext instance:
case class KeyValue(key: Int, value: String)
val df=sc.parallelize(1 to 50).map(i=>KeyValue(i, i.toString)).toDF
df.registerTempTable("table")
sqlContext.sql("select percentile(key,0.5) from table").show()
On Tue, Jun 2, 2015 at 8:07 AM,
1. Yes, if two tasks depend on each other they can't be parallelized.
2. Imagine something like a web application driver. You only get to have one
SparkContext but now you want to run many concurrent jobs. They have nothing
to do with each other; no reason to keep them sequential.
Hope this helps
-
are you able to connect to your Cassandra installation via
cassandra_home$ ./bin/cqlsh
This exception generally means that your Cassandra instance is not
reachable/accessible
On Fri, May 29, 2015 at 6:11 AM, Antonio Giambanco
wrote:
> Hi all,
> I have in a single server installed spark 1.3.1 a
> 2015/05/27
> 12:44:03 steve FINISHED 6 s
> app-20150527123822- TestRunner: power-iteration-clustering 8 512.0 MB
> 2015/05/27
> 12:38:22 steve FINISHED 6 s
>
>
>
> 2015-05-27 11:42 GMT-07:00 Stephen Boesch :
>
> Thanks Yana,
>>
>>My current ex
I have not run into this particular issue but I'm not using the latest bits
in production. However, testing your theory should be easy -- MySQL is just a
database, so you should be able to use a regular mysql client and see how
many connections are active. You can then compare to the maximum allowed
co
Hi folks, for those of you working with Cassandra, wondering if anyone has
been successful processing a mix of Cassandra and hdfs data. I have a
dataset which is stored partially in HDFS and partially in Cassandra
(schema is the same in both places)
I am trying to do the following:
val dfHDFS = s
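The rough shape of this is sketched below, assuming the DataFrame read support in the spark-cassandra-connector; the keyspace, table, and path names are made up:
// Sketch: read the same schema from both stores and union the results.
val dfHDFS = sqlContext.read.parquet("/data/events/")
val dfCass = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "myks", "table" -> "events"))
  .load()
val all = dfHDFS.unionAll(dfCass)  // assumes identical column order in both sources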
Todd, I don't have any answers for you...other than the file is actually
named spark-defaults.conf (not sure if you made a typo in the email or
misnamed the file...). Do any other options from that file get read?
I also wanted to ask if you built the spark-cassandra-connector-assembly-1.3
.0-SNAPS
There is an open JIRA on Spark not pushing predicates to the metastore. I have
a large dataset with many partitions, but doing anything with it is very
slow... But I am surprised Spark 1.2 worked for you: it has this problem...
Original message From: Andrew Otto
Date:05/22/2015 3:5
I have not seen this error but have seen another user have weird parser
issues before:
http://mail-archives.us.apache.org/mod_mbox/spark-user/201501.mbox/%3ccag6lhyed_no6qrutwsxeenrbqjuuzvqtbpxwx4z-gndqoj3...@mail.gmail.com%3E
I would attach a debugger and see what is going on -- if I'm looking a
I'm afraid you misunderstand the purpose of hive-site.xml. It configures
access to the Hive metastore. You can read more here:
http://www.hadoopmaterial.com/2013/11/metastore.html.
So the MySQL DB in hive-site.xml would be used to store hive-specific data
such as schema info, partition info, etc.
But if I'm reading his email correctly he's saying that:
1. The master and slave are on the same box (so network hiccups are an
unlikely culprit)
2. The failures are intermittent -- i.e. the program works for a while, then
the worker gets disassociated...
Is it possible that the master restarted? We used to h
Just a guess, but are you using HiveContext in one case vs SqlContext in another?
You don't show a stacktrace, but this looks like a parser error... which would make
me guess a different context or a different Spark version on the cluster you are
submitting to...
Sent on the new Sprint Network from my Sa
er all (still if I go to spark-shell and try to print out the SQL
>> settings that I put in hive-site.xml, it does not print them).
>>
>>
>> On Fri, May 15, 2015 at 7:22 PM, Yana Kadiyska
>> wrote:
>>
>>> My point was more to how to verify that propert
59:33 INFO HiveMetaStore: Added admin role in metastore
> 15/05/15 17:59:34 INFO HiveMetaStore: Added public role in metastore
> 15/05/15 17:59:34 INFO HiveMetaStore: No user is added in admin role,
> since config is empty
> 15/05/15 17:59:35 INFO SessionState: No Tez session required at this
>
Thanks Sean, with the added permissions I do now have this extra option.
On Fri, May 15, 2015 at 11:20 AM, Sean Owen wrote:
> (I made you a Contributor in JIRA -- your yahoo-related account of the
> two -- so maybe that will let you do so.)
>
> On Fri, May 15, 2015 at 4:19 PM, Y
Hi, two questions
1. Can regular JIRA users reopen bugs -- I can open a new issue but it does
not appear that I can reopen issues. What is the proper protocol to follow
if we discover regressions?
2. I believe SPARK-4412 regressed in Spark 1.3.1, according to this SO
thread possibly even in 1.3.0
This should work. Which version of Spark are you using? Here is what I do
-- make sure hive-site.xml is in the conf directory of the machine you're
using the driver from. Now let's run spark-shell from that machine:
scala> val hc= new org.apache.spark.sql.hive.HiveContext(sc)
hc: org.apache.spark.
Hi folks, I'm trying to use automatic partition discovery as described here:
https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html
/data/year=2014/file.parquet
/data/year=2015/file.parquet
…
SELECT * FROM table WHERE year = 2015
I have an official 1.3.1 CDH4 bui
2015 at 8:43 AM, yana wrote:
> I believe this is a regression. Does not work for me either. There is a
> Jira on parquet wildcards which is resolved, I'll see about getting it
> reopened
>
>
> Sent on the new Sprint Network from my Samsung Galaxy S®4.
>
>
> ---
I believe this is a regression. Does not work for me either. There is a Jira on
parquet wildcards which is resolved, I'll see about getting it reopened
Sent on the new Sprint Network from my Samsung Galaxy S®4.
Original message From: Vaxuki
Date:05/07/2015 7:38 AM (GMT-05:0
Hi folks, we have been using a JDBC connection to Spark's Thrift Server
so far and using JDBC prepared statements to escape potentially malicious
user input.
I am trying to port our code directly to HiveContext now (i.e. eliminate
the use of Thrift Server) and I am not quite sure how to genera
You can try pulling the jar with wget and using it with --jars with spark-shell.
I used 1.0.3 with Spark 1.3.0 but with a different version of Scala. From the
stack trace it looks like spark-shell is just not seeing the csv jar...
Sent on the new Sprint Network from my Samsung Galaxy S®4.
-
Yes. The fair scheduler only helps concurrency within an application. With
multiple shells you'd either need something like YARN/Mesos or careful math on
resources, as you said.
Sent on the new Sprint Network from my Samsung Galaxy S®4.
Original message From: Arun Patel
Date:04/2
Hi Sparkers,
hoping for insight here:
running a simple describe mytable here where mytable is a partitioned Hive
table.
Spark produces the following times:
Query 1 of 1, Rows read: 50, Elapsed time (seconds) - Total: 73.02,
SQL query: 72.831, Reading results: 0.189
Whereas Hive over the same
Hi Spark users,
Trying to upgrade to Spark 1.2 and running into the following:
seeing some very slow queries, and wondering if someone can point me in the
right direction for debugging. My Spark UI shows a job with duration 15s
(see attached screenshot), which would be great, but client-side measurem
Hi folks, I am noticing a pesky and persistent warning in my logs (this is
from Spark 1.2.1):
15/04/08 15:23:05 WARN ShellBasedUnixGroupsMapping: got exception
trying to get groups for user anonymous
org.apache.hadoop.util.Shell$ExitCodeException: id: anonymous: No such user
at org.apach
If you're going to do it this way, I would output dayOfdate.substring(0,7),
i.e. the month part, and instead of weatherCond, you can use
(month,(minDeg,maxDeg,meanDeg)) -- i.e. a PairRDD. So weathersRDD:
RDD[(String,(Double,Double,Double))]. Then use a reduceByKey as shown in
multiple Spark examples..Y
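A minimal sketch of that PairRDD shape (the data and names are illustrative; the mean is carried as a sum plus a count so it can be combined correctly):
// Sketch: per-day records -> monthly (min, max, mean of the daily means).
val daily = sc.parallelize(Seq(
  ("2015-04-01", (1.0, 9.0, 5.0)),
  ("2015-04-02", (2.0, 8.0, 6.0))))
val byMonth = daily.map { case (dayOfdate, (minDeg, maxDeg, meanDeg)) =>
  (dayOfdate.substring(0, 7), (minDeg, maxDeg, meanDeg, 1L)) }
val monthly = byMonth
  .reduceByKey { case ((mn1, mx1, s1, n1), (mn2, mx2, s2, n2)) =>
    (math.min(mn1, mn2), math.max(mx1, mx2), s1 + s2, n1 + n2) }
  .mapValues { case (mn, mx, s, n) => (mn, mx, s / n) }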
Hi folks, currently have a DF that has a factor variable -- say gender.
I am hoping to use the RandomForest algorithm on this data and it appears
that this needs to be converted to RDD[LabeledPoint] first -- i.e. all
features need to be double-encoded.
I see https://issues.apache.org/jira/browse/S
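In the meantime, a minimal manual-encoding sketch (column names and category values are made up; this only shows the double-encoding idea, and assumes a Spark where Row.getAs by field name is available):
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Sketch: map the factor level to a double, then build LabeledPoint(label, features).
val genderIndex = Map("male" -> 0.0, "female" -> 1.0)
val labeled = df.map { row =>
  val label = row.getAs[Double]("label")
  val gender = genderIndex(row.getAs[String]("gender"))
  val age = row.getAs[Double]("age")
  LabeledPoint(label, Vectors.dense(gender, age))
}
When training RandomForest, the encoded column can then be declared categorical via categoricalFeaturesInfo (feature index -> number of categories).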
Hi folks, having some seemingly noob issues with the dataframe API.
I have a DF which came from the csv package.
1. What would be an easy way to cast a column to a given type -- my DF
columns are all typed as strings coming from a csv. I see a schema getter
but not setter on DF
2. I am trying to
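For question 1, a common approach is withColumn plus cast -- a minimal sketch (the column name and target type are placeholders):
// Sketch: overwrite the string column with a typed version of itself.
val typed = df.withColumn("price", df("price").cast("double"))
typed.printSchema()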
You might also want to see if the Windows Task Scheduler helps with that. I
have not used it with Windows 2008 R2, but it generally does allow you to
schedule a bat file to run on startup.
On Wed, Mar 11, 2015 at 10:16 AM, Wang, Ningjun (LNG-NPV) <
ningjun.w...@lexisnexis.com> wrote:
>
> Thanks for the suggestio
I think the problem is that you are using an alias in the HAVING clause. I am
not able to try just now, but see if HAVING count(*) > 2 works (i.e. don't use cnt).
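In other words, something along these lines (a minimal sketch; table and column names are placeholders):
// Sketch: repeat the aggregate in HAVING instead of referencing the cnt alias.
sqlContext.sql(
  "select key, count(*) as cnt from mytable group by key having count(*) > 2"
).show()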
Sent on the new Sprint Network from my Samsung Galaxy S®4.
Original message From: shahab
Date:03/04/2015 7:22 AM (G
he current
directory, it is under /user/hive. In my case this directory was owned by
'root' and no one else had write permissions. Changing the permissions works
if you need to get unblocked quickly... But it does seem like a bug to me...
On Fri, Feb 27, 2015 at 11:21 AM, sandeep vura
wrot
I think you're mixing two things: the docs say "When *not* configured by
the hive-site.xml, the context automatically creates metastore_db and
warehouse in the current directory.". AFAIK if you want a local metastore,
you don't put hive-site.xml anywhere. You only need the file if you're
going to p
Yong, for the 200 tasks in stage 2 and 3 -- this actually comes from the
shuffle setting: spark.sql.shuffle.partitions
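For reference, a minimal sketch of adjusting that setting (the value is arbitrary):
// Sketch: lower the number of post-shuffle partitions (the default of 200 is where
// the 200 tasks in those stages come from).
sqlContext.setConf("spark.sql.shuffle.partitions", "64")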
On Thu, Feb 26, 2015 at 5:51 PM, java8964 wrote:
> Imran, thanks for your explaining about the parallelism. That is very
> helpful.
>
> In my test case, I am only use one box cl
Imran, I have also observed the phenomenon of reducing the cores helping
with OOM. I wanted to ask this (hopefully without straying off topic): we
can specify the number of cores and the executor memory. But we don't get
to specify _how_ the cores are spread among executors.
Is it possible that wi
Can someone confirm if they can run UDFs in GROUP BY in Spark 1.2?
I have two builds running -- one from a custom build from early December
(commit 4259ca8dd12) which works fine, and Spark1.2-RC2.
On the latter I get:
jdbc:hive2://XXX.208:10001> select
from_unixtime(epoch,'-MM-dd-HH'),count(
rt my application, I can see that it is using that
scheduler in the UI -- go to master UI and click on your application. Then
you should see this (note the scheduling mode is shown as Fair):
On Wed, Feb 25, 2015 at 4:06 AM, Harika Matha
wrote:
> Hi Yana,
>
> I tried running the program
Shark used to have a shark.map.tasks variable. Is there an equivalent for
Spark SQL?
We are trying a scenario with heavily partitioned Hive tables. We end up
with a UnionRDD with a lot of partitions underneath and hence too many
tasks:
https://github.com/apache/spark/blob/master/sql/hive/src/main/sc
It's hard to tell. I have not run this on EC2 but this worked for me:
The only thing that I can think of is that the scheduling mode is set to
- Scheduling Mode: FAIR
val pool: ExecutorService = Executors.newFixedThreadPool(poolSize)
while_loop to get curr_job
pool.execute(new ReportJ
memory.
I thought deleting the checkpoints in a checkpointed application is the
last thing that you want to do (as you lose all state). Seems a bit harsh
to have to do this just to increase the amount of memory?
On Mon, Feb 23, 2015 at 11:12 PM, Tathagata Das wrote:
> Hey Yana,
>
> I
Hi all,
I had a streaming application and midway through things decided to up the
executor memory. I spent a long time launching like this:
~/spark-1.2.0-bin-cdh4/bin/spark-submit --class StreamingTest
--executor-memory 2G --master...
and observing that the executor memory is still at the old 512 setting