I am storing the output of mapPartitions in a ListBuffer and exposing its
iterator as the output. The output is a list of Long tuples (Tuple2). When I
check the size of the object using Spark's SizeEstimator.estimate method, it
comes out to 80 bytes per record/tuple object (calculating this by "size o
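For reference, a minimal sketch of how the measurement can be done (assuming org.apache.spark.util.SizeEstimator is accessible in your Spark build; the buffer contents below are made up):

import org.apache.spark.util.SizeEstimator
import scala.collection.mutable.ListBuffer

// Estimate one (Long, Long) tuple on its own, then the average cost per
// record once the tuples sit inside a ListBuffer.
val one: (Long, Long) = (1L, 2L)
println("single tuple: " + SizeEstimator.estimate(one) + " bytes")

val buffer = ListBuffer.fill(1000)((1L, 2L))
println("per record in buffer: " + SizeEstimator.estimate(buffer) / buffer.size + " bytes")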
I need to add all the jars in hive/lib to my Spark job's executor classpath.
I tried this
spark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/hive/lib
and
spark.executor.extraClassPath=/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/hive/lib/*
but it does not ad
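For what it's worth, this is the minimal shape of what I am trying to set (paths are from my CDH parcel layout and may differ on other clusters; as far as I understand, extraLibraryPath is only for native libraries, so the jars have to go on extraClassPath with the /* wildcard):

import org.apache.spark.SparkConf

// Hive jars on both driver and executor JVM classpaths; the trailing /* is
// expanded to the jars in that directory by the JVM classpath handling.
val conf = new SparkConf()
  .setAppName("HiveJarsOnClasspath")
  .set("spark.executor.extraClassPath",
       "/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/hive/lib/*")
  .set("spark.driver.extraClassPath",
       "/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/hive/lib/*")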
I am running Hive queries using HiveContext from my Spark code. No matter
which query I run or how much data it is, it always generates 31
partitions. Does anybody know the reason? Is there a predefined/configurable
setting for it? I essentially need more partitions.
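In case it helps to see what I mean, here is a minimal sketch of the workaround I am considering (Spark 1.2 HiveContext/SchemaRDD API; the table name is a placeholder). My guess is that the 31 comes from the input splits of the underlying files, so I would have to repartition explicitly:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)   // sc: existing SparkContext
val result = hiveContext.sql("SELECT * FROM some_db.some_table")   // placeholder table
println("partitions before: " + result.partitions.length)          // always 31 in my case
val wider = result.repartition(200)                                 // forces a shuffle into 200 partitions
println("partitions after: " + wider.partitions.length)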
I am using this code snippet to exec
My mapPartitions code, as given below, outputs one record for each input
record, so the output has the same number of records as the input. I am
loading the output data into a ListBuffer object. This object is turning out
to be too large for memory, leading to an OutOfMemory exception.
To be more clea
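To illustrate the change I think is needed: instead of materialising a ListBuffer inside mapPartitions, return a lazily mapped iterator, so a whole partition never has to be held in memory at once. A minimal sketch (process and the RDD element type are placeholders):

// Hypothetical per-record transformation.
def process(rec: (Long, Long)): (Long, Long) = rec

// input: RDD[(Long, Long)] (placeholder). The returned iterator is lazy, so
// records are produced only as the downstream consumer pulls them.
val output = input.mapPartitions { iter =>
  iter.map(process)
}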
CDH version: 5.3
Spark Version: 1.2
I was trying to execute a Hive query from Spark code (using the HiveContext
class). It was working fine until we installed Apache Sentry. Now it's
giving me a read permission exception.
/org.apache.hadoop.security.AccessControlException: Permission denied:
user=kakn
Hey, I have exactly this question. Did you get an answer to it?
I am trying to run a Hive query from Spark code using a HiveContext object.
It was running fine earlier, but since Apache Sentry was installed, the
process has been failing with this exception:
/org.apache.hadoop.security.AccessControlException: Permission denied:
user=kakn, access=READ_EXECUT
I am running Hive queries from HiveContext, for which we need a
hive-site.xml.
Is it possible to replace it with hive-config.xml? I tried, but it does not
work. I just want a confirmation.
I have a list of Cloudera jars which I need to provide in the --jars clause,
mainly for the HiveContext functionality I am using. However, many of these
jars have version numbers as part of their names. This is an issue because
the names might change when I do a Cloudera upgrade.
Just a note here,
Any Ideas on this?
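One workaround I am considering, so the version numbers never appear in my launch script: resolve the jars at runtime from the unversioned CDH parcel symlink and register them on the SparkContext. A rough sketch (assuming /opt/cloudera/parcels/CDH points at the active parcel, which is how our parcels are laid out):

import java.io.File

val hiveLib = new File("/opt/cloudera/parcels/CDH/lib/hive/lib")
Option(hiveLib.listFiles()).getOrElse(Array.empty[File])
  .filter(_.getName.endsWith(".jar"))
  .foreach(jar => sc.addJar(jar.getAbsolutePath))   // sc: existing SparkContext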
I am wondering whether HiveContext connects to HiveServer2 or works through
the Hive CLI. The reason I am asking is that Cloudera has deprecated the
Hive CLI.
If the connection is through HiveServer2, is there a way to specify user
credentials?
I am trying to run a Hive query from Spark code through HiveContext. Does
anybody know what these exceptions mean? I have no clue.
LogType: stderr
LogLength: 3345
Log Contents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
[jar:file:/opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3
Is yarn-standalone mode deprecated in Spark now? The reason I am asking is
that while I can find it in the 0.9.0 documentation
(https://spark.apache.org/docs/0.9.0/running-on-yarn.html), I am not able to
find it in 1.2.0.
I am using this mode to run Spark jobs from Oozie as a Java action.
Remov
I am trying to run a Hive query from Spark using HiveContext. Here is the
code
/ val conf = new SparkConf().setAppName("HiveSparkIntegrationTest")
conf.set("spark.executor.extraClassPath",
"/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib");
conf.set("spark.driver.ext
I am also starting to work on this one. Did you get any solution to this
issue?
I want to run a Hive query inside Spark and use the RDDs generated from it
inside Spark. I read in the documentation:
"/Hive support is enabled by adding the -Phive and -Phive-thriftserver flags
to Spark’s build. This command builds a new assembly jar that includes Hive.
Note that this Hive assemb
I am trying to implement counters in Spark, and I guess accumulators are the
way to do it.
My goal is to update a counter in a map function and access/reset it in the
driver code. However, the /println/ statement at the end still yields the
value 0 (it should be 9). Am I doing something wrong?
def main(arg
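Here is the minimal shape of what I think it should look like, in case I am missing a step; my current suspicion is that no action runs before the println, and transformations alone never update the accumulator:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("AccumulatorSketch"))
val counter = sc.accumulator(0)   // Int accumulator

val mapped = sc.parallelize(1 to 9).map { x =>
  counter += 1   // updated on the executors
  x * 2
}

mapped.count()                           // an action must run before the value is usable
println("counter = " + counter.value)    // should print 9 here, not 0
counter.setValue(0)                      // driver-side reset (Accumulable#setValue)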
I am trying to debug my mapPartitions function. Here is the code. I am
trying to log in two ways, using log.info() or println(). I am running in
yarn-cluster mode. While I can see the logs from the driver code, I am not
able to see logs from the map/mapPartitions functions in the Application Tracking
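For reference, this is the shape of the logging I am attempting. As far as I understand, in yarn-cluster mode these statements end up in each executor's container stdout/stderr (retrievable with yarn logs -applicationId <app id>), not in the driver log, which may be why I am not seeing them. The logger is obtained inside the partition function so nothing non-serializable is captured by the closure:

import org.apache.log4j.Logger

// rdd is a placeholder for my input RDD.
val result = rdd.mapPartitions { iter =>
  val log = Logger.getLogger("MyMapPartitions")
  iter.map { rec =>
    log.info("processing record: " + rec)   // goes to the executor's container log
    rec
  }
}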
I want to get the schema details of a Hive table into Spark. Specifically, I
want to get the column name and type information. Is it possible to do this,
e.g. using JavaSchemaRDD or something else?
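Two things I plan to try from Scala, in case they are useful to anyone with the same question (table name is a placeholder; I have not settled on either yet):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)   // sc: existing SparkContext

// 1. DESCRIBE returns column name / type / comment rows.
hiveContext.sql("DESCRIBE some_db.some_table").collect().foreach(println)

// 2. The schema of an empty projection carries the same name/type information.
val schema = hiveContext.sql("SELECT * FROM some_db.some_table LIMIT 0").schema
schema.fields.foreach(f => println(f.name + " : " + f.dataType))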
Just to add, I am using Spark 1.1.0.
I am trying to implement secondary sort in Spark, as we do in map-reduce.
Here is my data (tab separated, without the header c1, c2, c3):
c1  c2  c3
1 2 4
1 3 6
2 4 7
2 6 8
3 5 5
3 1 8
3 2 0
To do secondary sort, I crea
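The approach I am experimenting with is sketched below: key each row by (c1, c2), partition on c1 only, and let Spark sort each partition by the full key. This assumes Spark 1.2+, where repartitionAndSortWithinPartitions is available; the partition count of 10 and the input path are placeholders.

import org.apache.spark.{Partitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair/ordered RDD implicits (pre-1.3)

// Partition on the first field of the composite key only.
class FirstFieldPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key match {
    case (c1: Int, _) => (c1.hashCode % numPartitions + numPartitions) % numPartitions
    case _            => 0
  }
}

val sc = new SparkContext(new SparkConf().setAppName("SecondarySortSketch"))
val rows = sc.textFile("hdfs:///path/to/input")           // placeholder path
  .map(_.split("\t"))
  .map(f => ((f(0).toInt, f(1).toInt), f(2).toInt))       // key = (c1, c2), value = c3

// Rows land in a partition based on c1, and each partition is sorted by (c1, c2).
val secondarySorted = rows.repartitionAndSortWithinPartitions(new FirstFieldPartitioner(10))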
Mine was not really a moving average problem. It was more like partitioning
on some keys, sorting (on different keys), and then running a sliding
window through each partition. I reverted back to map-reduce for that (I
needed secondary sort, which is not very mature in Spark right now).
But, as far
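For reference, the Spark shape of that per-partition sliding window step would be roughly the sketch below, assuming the RDD has already been partitioned on the grouping keys and sorted within each partition (records modelled as plain Strings here); the actual comparison is just a placeholder that pairs the first and last record of each window of 3:

import org.apache.spark.rdd.RDD

def slidingCompare(sortedPartitions: RDD[String]): RDD[(String, String)] =
  sortedPartitions.mapPartitions { iter =>
    iter.sliding(3).withPartial(false)    // overlapping windows of 3 records
        .map(w => (w.head, w.last))       // placeholder for the real matching logic
  }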
I am trying to run the connected components algorithm in Spark. The graph
has roughly 28M edges and 3.2M vertices. Here is the code I am using:
/val inputFile =
"/user/hive/warehouse/spark_poc.db/window_compare_output_text/00_0"
val conf = new SparkConf().setAppName("ConnectedComponentsTest")
I am trying to run connected components on a graph generated by reading an
edge file. It runs for a long time (3-4 hrs) and then eventually fails. I
can't find any error in the log file. The file I am testing it on has 27M
rows (edges). Is there something obviously wrong with the code?
I tested the s
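In case a comparison helps, here is the minimal GraphX shape I am measuring against (assuming the input is a whitespace-separated edge list of numeric vertex ids; the path is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphLoader

val sc = new SparkContext(new SparkConf().setAppName("ConnectedComponentsTest"))
val graph = GraphLoader.edgeListFile(sc, "hdfs:///path/to/edge_list")
val components = graph.connectedComponents().vertices   // (vertexId, componentId) pairs
println(components.count())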
I need to sort my RDD partitions, but a whole partition might not fit
into memory, so I cannot use the Collections.sort() method. Does Spark
support partition sorting by virtue of its framework? I am working on
version 1.1.0.
I looked up a similar unanswered question:
/http://apache-spark-user
So, this means that I can create a table and insert data into it with
dynamic partitioning, and those partitions would be inherited by the RDDs.
Is this available in Spark 1.1.0?
If not, is there a way to partition the data in a file based on some
attributes of the rows in the data (without hardcoding the number
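The fallback I have in mind, if the Hive partitioning is not inherited, is sketched below: key each row by the relevant attribute and hash-partition explicitly (the key extraction, the warehouse path, and the partition count of 200 are placeholders):

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._   // pair RDD implicits (pre-1.3)

def partitionKey(line: String): String = line.split("\t")(0)   // placeholder: first column

val rows = sc.textFile("/user/hive/warehouse/some_db.db/some_table")   // sc: existing SparkContext
val partitioned = rows
  .map(line => (partitionKey(line), line))
  .partitionBy(new HashPartitioner(200))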
Any suggestions guys??
Would the RDD resulting from the query below be partitioned on GEO_REGION,
GEO_COUNTRY? I ran some tests (using mapPartitions on the resulting RDD) and
it seems that there are always 50 partitions generated, while there should
be around 1000.
/"SELECT * FROM spark_poc.DISTRIBUTE BY GEO_REGION, GEO_COUN
I am working on running the following Hive query from Spark.
/"SELECT * FROM spark_poc. DISTRIBUTE BY GEO_REGION, GEO_COUNTRY
SORT BY IP_ADDRESS, COOKIE_ID"/
Ran into /java.lang.NoSuchMethodError:
com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;/
(complete stack tr
I am trying to run the Hive query below on YARN. I am using Cloudera 5.1.
What can I do to make this work?
/"SELECT * FROM table_name DISTRIBUTE BY GEO_REGION, GEO_COUNTRY SORT BY
IP_ADDRESS, COOKIE_ID";/
Below is the stack trace:
Exception in thread "Thread-4" java.lang.reflect.InvocationTarget
I am running a simple RDD filter command and getting the exception below.
What does it mean? Here is the full stack trace (and the code below it):
com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0,
required: 133
at com.esotericsoftware.kryo.io.Output.require(Output.java:138)
at
com.esotericsoftw
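The first thing I plan to try is raising the Kryo buffer limits, since the exception comes from Kryo's Output running out of room while serialising a record. A sketch, assuming the Spark 1.x property names (the values in MB are arbitrary):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.mb", "8")        // initial per-task buffer
  .set("spark.kryoserializer.buffer.max.mb", "512")  // ceiling the buffer can grow to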
Any ideas, guys?
I have been trying to find some information online, without much luck so far.
I need to know the feasibility of the task below. I am thinking of this as
a MapReduce/Spark effort.
I need to run a distributed sliding-window comparison for digital data
matching on top of Hadoop. The data (a Hive table) will be partitioned and
distributed across the data nodes. Then the window comparis