Hi Luo,
Thanks for your reply. In fact I can use Hive via Spark on my Spark master
machine, but when I copy my Spark files to another machine and try to
access Hive via Spark I get the error "Unable to instantiate
org.apache.hadoop.hive.metastore.HiveMetaStoreClient". I have copied
hiv
Hi Shahab,
I’ve seen exceptions very similar to this (it also manifests as a negative
array size exception), and I believe it’s really a bug in Kryo.
See this thread:
http://mail-archives.us.apache.org/mod_mbox/spark-user/201502.mbox/%3ccag02ijuw3oqbi2t8acb5nlrvxso2xmas1qrqd_4fq1tgvvj...@mail.gmail
Hi ,
I am trying to use Spark SQL to join tables in different data sources - Hive
and Teradata. I can access the tables individually, but when I run a join query
I get a query exception.
The same query runs if all the tables exist in Teradata.
Any help would be appreciated.
I am running the foll
How did you configure your metastore?
Thanks,
Daoyuan
From: 鹰 [mailto:980548...@qq.com]
Sent: Tuesday, May 05, 2015 3:11 PM
To: luohui20001
Cc: user
Subject: Re: spark Unable to instantiate
org.apache.hadoop.hive.metastore.HiveMetaStoreClient
Hi Luo,
Thanks for your reply. In fact I can use Hive
my metastore is like this:
javax.jdo.option.ConnectionURL = jdbc:mysql://192.168.1.40:3306/hive
javax.jdo.option.ConnectionDriverName = com.mysql.jdbc.Driver (driver class name for a JDBC metastore)
javax.jdo.option.ConnectionUs
Have you configured SPARK_CLASSPATH with the MySQL jar in spark-env.sh? For
example: export SPARK_CLASSPATH+=:/path/to/mysql-connector-java-5.1.18-bin.jar
> On 5 May 2015, at 3:32, <980548...@qq.com> wrote:
>
> my metastore is like this
>
> javax.jdo.option.ConnectionURL
>
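For reference, a hedged sketch of an alternative to the SPARK_CLASSPATH suggestion above (the jar path is a placeholder): the executor classpath can be set through configuration, while the driver side is usually passed at submit time (e.g. --driver-class-path), since the driver JVM is already running when this code executes.
  import org.apache.spark.SparkConf

  // sketch: ship the MySQL connector JDBC driver to executors via configuration
  val conf = new SparkConf()
    .setAppName("hive-metastore-demo")
    .set("spark.executor.extraClassPath", "/path/to/mysql-connector-java-5.1.18-bin.jar")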
The error shows that the d_year column does not exist. You may need to modify
the query.
On 5 May 2015 17:20, "Ishwardeep Singh"
wrote:
> Hi ,
>
> I am trying to use sparkSQL to join tables in different data sources - hive
> and teradata. I can access the tables individually but when I run join
> query
>
Thanks Tristan for sharing this. Actually this happens when I am reading a
csv file of 3.5 GB.
best,
/Shahab
On Tue, May 5, 2015 at 9:15 AM, Tristan Blakers
wrote:
> Hi Shahab,
>
> I’ve seen exceptions very similar to this (it also manifests as negative
> array size exception), and I believe
Turned out that it was sufficient to do repartitionAndSortWithinPartitions
... so far so good ;)
On Tue, May 5, 2015 at 9:45 AM Marius Danciu
wrote:
> Hi Imran,
>
> Yes that's what MyPartitioner does. I do see (using traces from
> MyPartitioner) that the key is partitioned on partition 0 but the
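For readers following along, a minimal sketch of repartitionAndSortWithinPartitions with a partitioner (the key type and HashPartitioner here are illustrative assumptions, not Marius's actual code):
  import org.apache.spark.HashPartitioner

  // sketch: repartition a pair RDD and sort keys inside each partition in one pass
  val pairs  = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))
  val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(4))
  sorted.foreachPartition(it => it.foreach(println)) // each partition is key-sorted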
Hi ,
I am using Spark 1.3.0.
I was able to join a JSON file on HDFS registered as a TempTable with a table
in MySQL. On the same lines I tried to join a table in Hive with another table
in Teradata but I get a query parse exception.
Regards,
Ishwardeep
From: ankitjindal [via Apache Spark Use
Thanks jeanlyn, it works.
------------------ Original message ------------------
From: "jeanlyn";
Sent: Tuesday, May 5, 2015, 3:40 PM
To: 鹰 <980548...@qq.com>;
Cc: "Wang, Daoyuan"; "user";
Subject: Re: spark Unable to instantiate
org.apache.hadoop.hive.metastore.HiveMetaStoreClient
Have
I know these methods, but I need to create events using the timestamps in
the data tuples, meaning every new tuple is generated using the
timestamp in a CSV file. This will be useful to simulate the data rate
over time, just like real sensor data.
On Fri, May 1, 2015 at 2:52 PM, Juan Rodrí
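A rough sketch of one way to do this with a custom Spark Streaming receiver, assuming the timestamp (in milliseconds) is the first CSV column; the class name and file path are hypothetical:
  import scala.io.Source
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.receiver.Receiver

  // sketch: replay CSV lines, sleeping for the gap between consecutive timestamps
  class CsvReplayReceiver(path: String)
    extends Receiver[String](StorageLevel.MEMORY_ONLY) {

    def onStart(): Unit = {
      new Thread("csv-replay") {
        override def run(): Unit = {
          var lastTs: Option[Long] = None
          for (line <- Source.fromFile(path).getLines() if !isStopped()) {
            val ts = line.split(",")(0).toLong
            lastTs.foreach(prev => Thread.sleep(math.max(0L, ts - prev)))
            store(line)            // hand the tuple to Spark Streaming
            lastTs = Some(ts)
          }
        }
      }.start()
    }

    def onStop(): Unit = {} // the replay thread checks isStopped()
  }

  // usage: val stream = ssc.receiverStream(new CsvReplayReceiver("/path/to/data.csv"))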
Hi all,
I have declared the Spark context at the start of my program and then I want to
change its configuration at some later stage in my code, as written below:
val conf = new SparkConf().setAppName("Cassandra Demo")
var sc:SparkContext=new SparkContext(conf)
sc.getConf.set("spark.cassandra.connection.
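Worth noting, as a hedged sketch rather than a fix for this exact code: once a SparkConf has been handed to a SparkContext it is cloned, so later sc.getConf.set(...) calls do not affect the running context. Connector settings such as the Cassandra host generally have to be set before the context is created:
  import org.apache.spark.{SparkConf, SparkContext}

  // sketch: set everything on the conf first, then build the context once
  val conf = new SparkConf()
    .setAppName("Cassandra Demo")
    .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder value
  val sc = new SparkContext(conf)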
Hi,
how important is Java for Spark certification? Will learning only Python
and Scala not work?
Regards,
Gourav
Hi,
In Hive, I am using unix_timestamp() as 'update_on' to insert the current date into
the 'update_on' column of the table. Now I am converting it into Spark SQL. Please
suggest example code to insert the current date and time into a column of a table
using Spark SQL.
Cheers, Kiran.
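A hedged sketch of one way to do this; the table and column names are made up, and it assumes a HiveContext, where unix_timestamp() is available through HiveQL:
  import org.apache.spark.sql.hive.HiveContext

  val hc = new HiveContext(sc)
  // sketch: populate update_on with the current epoch seconds at insert time
  hc.sql(
    """INSERT INTO TABLE my_table
      |SELECT id, payload, unix_timestamp() AS update_on
      |FROM my_staging_table""".stripMargin)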
Hey there,
1.) I'm loading 2 Avro files that have slightly different schemas
df1 = sqlc.load(file1, "com.databricks.spark.avro")
df2 = sqlc.load(file2, "com.databricks.spark.avro")
2.) I want to unionAll them
nfd = df1.unionAll(df2)
3.) Getting the following error
--
I too have a similar question.
My understanding is that since Spark is written in Scala, having done it in Scala
will be OK for certification.
It would be great if someone who has done the certification could confirm.
Thanks,
Kartik
On May 5, 2015 5:57 AM, "Gourav Sengupta" wrote:
> Hi,
>
> how important is JAVA for Spark certif
There are questions in all three languages.
2015-05-05 3:49 GMT-07:00 Kartik Mehta :
> I too have similar question.
>
> My understanding is since Spark written in scala, having done in Scala
> will be ok for certification.
>
> If someone who has done certification can confirm.
>
> Thanks,
>
> Kar
Hey,
What is the recommended way to create literal columns in Java?
Scala has the `lit` function from `org.apache.spark.sql.functions`.
Can it be called from Java as well?
Cheers jan
Hi, Spark experts:
I did rdd.coalesce(numPartitions).saveAsSequenceFile("dir") in my code, which
generates the RDDs in streamed batches. It generates numPartitions files as
expected with names dir/part-x. However, the first couple of files (e.g.,
part-0, part-1) have many times o
And how important is it to have a production environment?
On 5 May 2015 20:51, "Stephen Boesch" wrote:
> There are questions in all three languages.
>
> 2015-05-05 3:49 GMT-07:00 Kartik Mehta :
>
>> I too have similar question.
>>
>> My understanding is since Spark written in scala, having done in Sca
I'm trying to execute the "Hello World" example with Spark + Kafka (
https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/DirectKafkaWordCount.scala)
with createDirectStream and I get this error.
java.lang.NoSuchMethodError:
kafka.mes
Hi Ankit,
printSchema() works fine for all the tables.
hiveStoreSalesDF.printSchema()
root
|-- store_sales.ss_sold_date_sk: integer (nullable = true)
|-- store_sales.ss_sold_time_sk: integer (nullable = true)
|-- store_sales.ss_item_sk: integer (nullable = true)
|-- store_sales.ss_customer_sk: in
56 MB / 26 MB is a very small size; do you observe data skew? More precisely, are
there many records with the same chrname / name? And can you also double-check the
JVM settings for the executor process?
From: luohui20...@sina.com [mailto:luohui20...@sina.com]
Sent: Tuesday, May 5, 2015 7:50 PM
To: Cheng, H
I suggest you try date_dim.d_year in the query.
On Tue, May 5, 2015 at 10:47 PM, Ishwardeep Singh <
ishwardeep.si...@impetus.co.in> wrote:
> Hi Ankit,
>
>
>
> printSchema() works fine for all the tables.
>
>
>
> hiveStoreSalesDF.printSchema()
>
> root
>
> |-- store_sales.ss_sold_date_sk:
Hi
We are using PySpark 1.3 and the input is text files located on HDFS.
File structure:
file1.txt
file2.txt
file1.txt
file2.txt
...
Question:
1) What is the way to provide, as input for a PySpark job, multiple
files
Sorry, I had a duplicate Kafka dependency with another, older version in
another pom.xml.
2015-05-05 14:46 GMT+02:00 Guillermo Ortiz :
> I'm tryting to execute the "Hello World" example with Spark + Kafka (
> https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache
I want to read from many topics in Kafka and know which topic each message
is coming from (topic1, topic2 and so on).
val kafkaParams = Map[String, String]("metadata.broker.list" ->
"myKafka:9092")
val topics = Set("EntryLog", "presOpManager")
val directKafkaStream = KafkaUtils.createDirectStream[Str
Production - not a whole lot of companies have implemented Spark in
production, so though it is good to have, it is not a must.
If you are on LinkedIn, a group of folks including myself are preparing for
Spark certification, learning in group makes learning easy and fun.
Kartik
On May 5, 2015 7:31 AM, "
I might join in to this conversation with an ask. Would someone point me to
a decent exercise that would approximate the level of this exam (from
above)? Thanks!
On Tue, May 5, 2015 at 3:37 PM Kartik Mehta
wrote:
> Production - not whole lot of companies have implemented Spark in
> production an
Make sure to read
https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
The directStream / KafkaRDD has a 1 : 1 relationship between kafka
topic/partition and spark partition. So a given spark partition only has
messages from 1 kafka topic. You can tell what topic that is using
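A sketch of the approach described above (the stream and type parameters follow the earlier createDirectStream snippet; treat this as an outline rather than tested code):
  import org.apache.spark.streaming.kafka.HasOffsetRanges

  // sketch: one Spark partition maps to one Kafka topic/partition, so the
  // partition's OffsetRange tells us which topic every record came from
  val withTopic = directKafkaStream.transform { rdd =>
    val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    rdd.mapPartitionsWithIndex { (i, iter) =>
      val topic = ranges(i).topic
      iter.map { case (_, value) => (topic, value) }
    }
  }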
Hello guys
Q1: How does Spark determine the number of partitions when reading a Parquet
file?
val df = sqlContext.parquetFile(path)
Is it some way related to the number of Parquet row groups in my input?
Q2: How can I reduce this number of partitions? Doing this:
df.rdd.coalesce(200).count
f
Very interested @Kartik/Zoltan. Please let me know how to connect on LI
On Tue, May 5, 2015 at 11:47 PM, Zoltán Zvara
wrote:
> I might join in to this conversation with an ask. Would someone point me
> to a decent exercise that would approximate the level of this exam (from
> above)? Thanks!
>
>
Can you give your entire spark-submit command? Are you missing
"--executor-cores "? Also, if you intend to use all 6 nodes, you
also need "--num-executors 6".
On Mon, May 4, 2015 at 2:07 AM, Xi Shen wrote:
> Hi,
>
> I have two small RDD, each has about 600 records. In my code, I did
>
> val rdd
Hi,
I think all the required materials for reference are mentioned here:
http://www.oreilly.com/data/sparkcert.html?cmp=ex-strata-na-lp-na_apache_spark_certification
My question was regarding the proficiency level required for Java. There
are detailed examples and code mentioned for JAVA, Python
You might be interested in https://issues.apache.org/jira/browse/SPARK-6593
and the discussion around the PRs.
This is probably more complicated than what you are looking for, but you
could copy the code for HadoopReliableRDD in the PR into your own code and
use it, without having to wait for the
Hi Eric.
Q1:
When I read Parquet files, I've observed that Spark generates as many
partitions as there are Parquet files in the path.
Q2:
To reduce the number of partitions you can use rdd.repartition(x), x =>
number of partitions. Depending on your case, repartition could be a heavy task.
Regards.
Miguel
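A small sketch contrasting the two options (the numbers are illustrative):
  val df = sqlContext.parquetFile(path)       // as in Eric's question above
  val fewer    = df.rdd.coalesce(200)         // no full shuffle, partitions may be uneven
  val balanced = df.rdd.repartition(200)      // full shuffle, evenly sized partitions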
Gerard is totally correct -- to expand a little more, I think what you want
to do is a solrInputDocumentJavaRDD.foreachPartition, instead of
solrInputDocumentJavaRDD.foreach:
solrInputDocumentJavaRDD.foreachPartition(
    new VoidFunction<Iterator<SolrInputDocument>>() {
      @Override
      public void call(Iterator<SolrInputDocument> docItr) {
Are you setting a really large max buffer size for kryo?
Was this fixed by https://issues.apache.org/jira/browse/SPARK-6405 ?
If not, we should open up another issue to get a better warning in these
cases.
On Tue, May 5, 2015 at 2:47 AM, shahab wrote:
> Thanks Tristan for sharing this. Actuall
Hi,
I'm using persist with different storage levels, but I found no difference in
performance when I was using MEMORY_ONLY and DISK_ONLY. I think there might
be something wrong with my code... So where can I find the persisted RDDs
on disk so that I can make sure they were persisted indeed?
Thank y
Hi folks, we have been using a JDBC connection to Spark's Thrift Server
so far and using JDBC prepared statements to escape potentially malicious
user input.
I am trying to port our code directly to HiveContext now (i.e. eliminate
the use of Thrift Server) and I am not quite sure how to genera
Hi,
When we start a Spark job, it starts a new HTTP server for each job.
Is it possible to disable HTTP server for each job ?
Thanks
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Possible-to-disable-Spark-HTTP-server-tp22772.html
Sent from the Apache Sp
Hi.
I have a Spark application where I store the results into a table (with
HiveContext). Some of these columns allow nulls. In Scala, these columns are
represented as Option[Int] or Option[Double], depending on the data type.
For example:
val hc = new HiveContext(sc)
var col1: Option[Integer
SPARK-3490 introduced "spark.ui.enabled"
FYI
On Tue, May 5, 2015 at 8:41 AM, roy wrote:
> Hi,
>
> When we start spark job it start new HTTP server for each new job.
> Is it possible to disable HTTP server for each job ?
>
>
> Thanks
>
>
>
> --
> View this message in context:
> http://apache-s
Thanks for the info !
Shing
On Tuesday, 5 May 2015, 15:11, Imran Rashid wrote:
You might be interested in https://issues.apache.org/jira/browse/SPARK-6593
and the discussion around the PRs.
This is probably more complicated than what you are looking for, but you could
copy the cod
Hi all,
I'm not able to access the running Spark Streaming applications that I'm
submitting to the EC2 standalone cluster (Spark 1.3.1) via port 4040. The
problem is that I don't even see running applications in the master's web UI
(I do see running drivers). This is the command I use to submit
Let's say I am storing my data in HDFS with the folder structure and file
partitioning as per below:
/analytics/2015/05/02/partition-2015-05-02-13-50-
Note that a new file is created every 5 minutes.
As per my understanding, storing 5-minute files means we could not create an
RDD more granular than 5 minut
"As per my understanding, storing 5minutes file means we could not create
RDD more granular than 5minutes."
This depends on the file format. Many file formats are splittable (like
parquet), meaning that you can seek into various points of the file.
2015-05-05 12:45 GMT-04:00 Rendy Bambang Junior
Option only works when you are going from case classes. Just put null into
the Row, when you want the value to be null.
On Tue, May 5, 2015 at 9:00 AM, Masf wrote:
> Hi.
>
> I have a spark application where I store the results into table (with
> HiveContext). Some of these columns allow nulls.
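A minimal sketch of what Michael describes; the record shape and values are hypothetical:
  import org.apache.spark.sql.Row

  // sketch: hypothetical records with an optional score; a None becomes a plain
  // null in the Row, which ends up as NULL in the Hive table
  val records = sc.parallelize(Seq(("a", Some(1.5)), ("b", None: Option[Double])))
  val rows = records.map { case (id, score) =>
    Row(id, score.map(Double.box).orNull)
  }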
This should work from java too:
http://spark.apache.org/docs/1.3.1/api/java/index.html#org.apache.spark.sql.functions$
On Tue, May 5, 2015 at 4:15 AM, Jan-Paul Bultmann
wrote:
> Hey,
> What is the recommended way to create literal columns in java?
> Scala has the `lit` function from `org.apache
I have searched all replies to this question and not found an answer. I am
running standalone Spark 1.3.1 and Hortonworks' HDP 2.2 VM, side by side, on
the same machine, and trying to write the output of a wordcount program into HDFS
(it works fine writing to a local file, /tmp/wordcount). The only line I added to
t
Hi,
I have a DataFrame that represents my data looks like this:
+----------+------------+
| col_name | data_type  |
+----------+------------+
| obj_id   | string     |
| type     | string     |
| name
Hi All,
For a job I am running on Spark with a dataset of, say, 350,000 lines (not
big), I am finding that even though my cluster has a large number of cores
available (like 100 cores), the Spark system seems to stop after using just
4 cores, and after that the runtime is pretty much a straight line n
Hi,
do you have information on how many partitions/tasks the stage/job is
running? By default there is 1 core per task, and your number of concurrent
tasks may be limiting core utilization.
There are a few settings you could play with, assuming your issue is
related to the above:
spark.default.pa
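A hedged sketch of the kind of settings meant here (the values and path are illustrative, not a recommendation):
  import org.apache.spark.{SparkConf, SparkContext}

  // sketch: give Spark more partitions than cores so all 100 cores receive tasks
  val conf = new SparkConf()
    .setAppName("use-all-cores")
    .set("spark.default.parallelism", "200")
  val sc = new SparkContext(conf)
  val input = sc.textFile("hdfs:///path/to/input", 200)  // path is a placeholder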
Hi all,
I'm looking to implement a Multilabel classification algorithm but I am
surprised to find that there are not any in the spark-mllib core library. Am
I missing something? Would someone point me in the right direction?
Thanks!
Peter
--
View this message in context:
http://apache-spar
I downloaded the 1.3.1 source distribution and built on Windows (laptop 8.0 and
desktop 8.1)
Here’s what I’m running:
Desktop:
Spark Master (%SPARK_HOME%\bin\spark-class2.cmd
org.apache.spark.deploy.master.Master -h desktop --port 7077)
Spark Worker (%SPARK_HOME%\bin\spark-class2.cmd
org.apache
LogisticRegression in the MLlib package supports multilabel classification.
Sincerely,
DB Tsai
---
Blog: https://www.dbtsai.com
On Tue, May 5, 2015 at 1:13 PM, peterg wrote:
> Hi all,
>
> I'm looking to implement a Multilabel classification algor
Hello Friends:
Here's sample output from a SparkSQL query that works, just so you can
see the
underlying data structure; followed by one that fails.
>>> # Just so you can see the DataFrame structure ...
>>>
>>> resultsRDD = sqlCtx.sql("SELECT * FROM rides WHERE
trip_time_in_secs = 3780")
>>
Hi all,
I have a large RDD that I map a function over. Based on the nature of each
record in the input RDD, I will generate two types of data. I would like to
save each type into its own RDD. But I can't seem to find an efficient way
to do it. Any suggestions?
Many thanks.
Bill
--
Many thank
Have you looked at RDD#randomSplit() (as example) ?
Cheers
On Tue, May 5, 2015 at 2:42 PM, Bill Q wrote:
> Hi all,
> I have a large RDD that I map a function to it. Based on the nature of
> each record in the input RDD, I will generate two types of data. I would
> like to save each type into it
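If the split is by record type rather than random, a common alternative (sketched below with made-up data and paths) is to cache the mapped RDD and filter it twice, so the expensive map runs only once:
  // sketch: one expensive map, cached, then two cheap filters by record type
  val raw    = sc.parallelize(Seq("A,1", "B,2", "A,3"))
  val parsed = raw.map(_.split(",")).cache()
  val typeA  = parsed.filter(_(0) == "A")
  val typeB  = parsed.filter(_(0) != "A")
  typeA.saveAsTextFile("hdfs:///out/typeA")   // output paths are placeholders
  typeB.saveAsTextFile("hdfs:///out/typeB")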
Hi
I have a 1.0.0 cluster with multiple worker nodes that deploy a number of
external tasks through getRuntime().exec. Currently I have no control over how
many nodes get deployed for a given task. At times the scheduler evenly distributes
the executors among all nodes, and at other times it only us
Hi All,
When I try to run Spark SQL in standalone mode it appears to be missing
the Parquet jar; I have to pass it with --jars and that works:
sbin/start-thriftserver.sh --jars lib/parquet-hive-bundle-1.6.0.jar
--driver-memory 28g --master local[10]
Any ideas why? I downloaded the one pre buil
Hi all,
We will drop support for Java 6 starting with Spark 1.5, tentatively scheduled to
be released in Sep 2015. Spark 1.4, scheduled to be released in June 2015,
will be the last minor release that supports Java 6. That is to say:
Spark 1.4.x (~ Jun 2015): will work with Java 6, 7, 8.
Spark 1.5+ (~
What happens when you try to put files into your HDFS from the local filesystem?
It looks like an HDFS issue rather than a Spark thing.
On 6 May 2015 05:04, "Sudarshan" wrote:
>
> I have searched all replies to this question & not found an answer.
>
> I am running standalone Spark 1.3.1 and Hortonwork's
If you mean "multilabel" (predicting multiple label values), then MLlib
does not yet support that. You would need to predict each label separately.
If you mean "multiclass" (1 label taking >2 categorical values), then MLlib
supports it via LogisticRegression (as DB said), as well as DecisionTree
Also, if not already done, you may want to try repartitioning your data to 50
partitions.
On 6 May 2015 05:56, "Manu Kaul" wrote:
> Hi All,
> For a job I am running on Spark with a dataset of say 350,000 lines (not
> big), I am finding that even though my cluster has a large number of cores
> avail
Hi I am using Spark 1.3.1 to read an avro file stored on HDFS. The avro
file was created using Avro 1.7.7. Similar to the example mentioned in
http://www.infoobjects.com/spark-with-avro/
I am getting a NullPointerException on schema read. It could be an Avro
version mismatch. Has anybody had a simil
You are most probably right. I assumed others may have run into this.
When I try to put the files in there, it creates the directory structure with
the part-0 and part-1 files, but these files are of size 0 - no
content. The client error and the server logs have the error message shown
- which
Another thing - could it be a permission problem ?
It creates all the directory structure (in red)/tmp/wordcount/
_temporary/0/_temporary/attempt_201505051439_0001_m_01_3/part-1
so I am guessing not.
On Tue, May 5, 2015 at 7:27 PM, Sudarshan Murty wrote:
> You are most probably right
Thanks, I'm not aware of splittable file formats.
If that is the case, does the number of files affect Spark performance? Maybe
because of the overhead when opening files? And is that problem solved by having
big files in a splittable file format?
Any suggestions from your experience on how to organize data
You should check out Parquet.
If you can avoid 5-minute log files, you can have an hourly (or daily!) MR
job that compacts these. Another nice thing about Parquet is it has filter
push-down, so if you want a smaller range of time you can avoid
deserializing most of the other data.
On Tuesday, May 5
Thanks.
Since the join will be done on a regular basis over a short period of time (let's
say 20s), do you have any suggestions on how to make it faster?
I am thinking of partitioning the data set and caching it.
Rendy
On Apr 30, 2015 6:31 AM, "Tathagata Das" wrote:
> Have you taken a look at the join section in t
Try to add one more data node or set the min replication to 0. HDFS is trying
to replicate at least one more copy and is not able to find another DN to do
that.
On 6 May 2015 09:37, "Sudarshan Murty" wrote:
> Another thing - could it be a permission problem ?
> It creates all the directory structure (in
You need to add a select clause to at least one dataframe to give them the
same schema before you can union them (much like in SQL).
On Tue, May 5, 2015 at 3:24 AM, Wilhelm wrote:
> Hey there,
>
> 1.) I'm loading 2 avro files with that have slightly different schema
>
> df1 = sqlc.load(file1, "c
I'll add that simple type promotion is done automatically when the types
are compatible (i.e. Int -> Long).
On Tue, May 5, 2015 at 5:55 PM, Michael Armbrust
wrote:
> You need to add a select clause to at least one dataframe to give them the
> same schema before you can union them (much like in S
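In Scala, a minimal sketch of that select-to-align step (the column handling is generic; df1/df2 stand in for the frames from the question):
  import org.apache.spark.sql.functions.col

  // sketch: keep only the columns both frames share, in the same order, then union
  val common  = df1.columns.intersect(df2.columns).map(col(_))
  val unioned = df1.select(common: _*).unionAll(df2.select(common: _*))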
Hi all:
I've met an issue with MLlib. The post I made to the community seems to have gone to
the wrong place :( so I put it on Stack Overflow; for a well-formatted version with
details please see
http://stackoverflow.com/questions/30048344/spark-mllib-libsvm-isssues-with-data. Hope
someone could help 😢
I guess it's due to my data. Bu
Are you using Kryo or Java serialization? I found this post useful:
http://stackoverflow.com/questions/23962796/kryo-readobject-cause-nullpointerexception-with-arraylist
If using Kryo, you need to register the classes with Kryo, something like
this:
sc.registerKryoClasses(Array(
cla
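A completed sketch of the same idea (the class names are made up); note that registerKryoClasses is a method on SparkConf rather than on the SparkContext:
  import org.apache.spark.{SparkConf, SparkContext}

  case class MyAvroRecord(id: String, value: Double)   // placeholder class

  val conf = new SparkConf()
    .setAppName("kryo-avro")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .registerKryoClasses(Array(classOf[MyAvroRecord]))
  val sc = new SparkContext(conf)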
I am running Hive queries from HiveContext, for which we need a
hive-site.xml.
Is it possible to replace it with hive-config.xml? I tried but it does not
work. Just want a confirmation.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Possible-to-use-hive-confi
If you are interested in multilabel (not multiclass), you might want to take a
look at SPARK-7015 https://github.com/apache/spark/pull/5830/files. It is
supposed to perform one-versus-all transformation on classes, which is usually
how multilabel classifiers are built.
Alexander
From: Joseph B
I am getting the following error on my code:
Error:(164, 25) overloaded method constructor Strategy with alternatives:
(algo: org.apache.spark.mllib.tree.configuration.Algo.Algo,impurity:
org.apache.spark.mllib.tree.impurity.Impurity,maxDepth: Int,numClasses:
Int,maxBins: Int,categoricalFeaturesInfo:
java.ut
Can you give us a bit more information ?
Such as release of Spark you're using, version of Scala, etc.
Thanks
On Tue, May 5, 2015 at 6:37 PM, xweb wrote:
> I am getting on following code
> Error:(164, 25) *overloaded method constructor Strategy with alternatives:*
> (algo: org.apache.spark.ml
Hi all,
We're trying to implement Spark SQL on CDH 5.3.0 in cluster mode,
and we get this error whether using Java or Python;
Application application_1430482716098_0607 failed 2 times due to AM
Container for appattempt_1430482716098_0607_02 exited with exitCode: 10
due to: Exception from co
What Spark tarball are you using? You may want to try the one for hadoop
2.6 (the one for hadoop 2.4 may cause that issue, IIRC).
On Tue, May 5, 2015 at 6:54 PM, felicia wrote:
> Hi all,
>
> We're trying to implement SparkSQL on CDH5.3.0 with cluster mode,
> and we get this error either using ja
I am not using Kryo. I was using the regular sqlContext Avro files load to open.
The files load properly with the schema. The exception happens when I try to
read them. Will try the Kryo serializer and see if that helps.
On May 5, 2015 9:02 PM, "Todd Nist" wrote:
> Are you using Kryo or Java serialization? I f
I am using Spark 1.3.0 and Scala 2.10.
Thanks
On Tue, May 5, 2015 at 6:48 PM, Ted Yu wrote:
> Can you give us a bit more information ?
> Such as release of Spark you're using, version of Scala, etc.
>
> Thanks
>
> On Tue, May 5, 2015 at 6:37 PM, xweb wrote:
>
>> I am getting on following code
Thanks much for your help.
Here's what was happening ...
The HDP VM was running in VirtualBox and host was connected to the guest VM
in NAT mode. When I connected this in "Bridged Adapter" mode it worked !
On Tue, May 5, 2015 at 8:54 PM, ayan guha wrote:
> Try to add one more data node or make
Hi all,
I am planning to load data from Kafka to HDFS. Is it normal to use Spark
Streaming to load data from Kafka to HDFS? What are the concerns in doing this?
There is no processing to be done by Spark, only storing data from Kafka to HDFS
for storage and for further Spark processing.
Rendy
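For what it's worth, a minimal sketch of the Spark Streaming route (the broker, topic and output path are placeholders, and it assumes an existing StreamingContext ssc):
  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.kafka.KafkaUtils

  // sketch: no transformation, just dump raw Kafka messages into time-bucketed files
  val kafkaParams = Map("metadata.broker.list" -> "myKafka:9092")
  val topics      = Set("rawEvents")
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topics)
  stream.map(_._2).saveAsTextFiles("hdfs:///data/kafka/raw", "txt")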
Why not try https://github.com/linkedin/camus - Camus is a Kafka-to-HDFS
pipeline.
On Tue, May 5, 2015 at 11:13 PM, Rendy Bambang Junior <
rendy.b.jun...@gmail.com> wrote:
> Hi all,
>
> I am planning to load data from Kafka to HDFS. Is it normal to use spark
> streaming to load data from Kafka to HD
An update on status after I did some tests. I modified some other parameters and
found 2 parameters that may be relevant: spark_worker_instance and
spark.sql.shuffle.partitions. Before today I used the default settings of
spark_worker_instance and spark.sql.shuffle.partitions, whose values are 1 and
200. At that time, my
Hi Oleg,
For 1, RDD#union will help. You can iterate over the folders and union the obtained
RDDs along the way.
For 2, seems like it won’t work in a deterministic way according to this
discussion(http://stackoverflow.com/questions/24871044/in-spark-what-does-the-parameter-minpartitions-works-in-sparkcontex
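A small sketch of the union-over-folders idea, in Scala for illustration (the paths are placeholders); PySpark's rdd.union / sc.union work the same way:
  // sketch: read each folder as its own RDD, then fold them into one with union
  val folders  = Seq("hdfs:///data/day1", "hdfs:///data/day2", "hdfs:///data/day3")
  val combined = folders.map(sc.textFile(_)).reduce(_ union _)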
Here are some of the settings that I used for a closed environment:
.set("spark.blockManager.port","40010")
.set("spark.broadcast.port","40020")
.set("spark.driver.port","40030")
.set("spark.executor.port","40040")
.set("spark.fileserver.port","40050")
.set("spark.replClassSer
Did you set `--driver-memory` with spark-submit? -Xiangrui
On Mon, May 4, 2015 at 5:16 PM, Vinay Muttineni wrote:
> Hi, I am training a GMM with 10 gaussians on a 4 GB dataset(720,000 * 760).
> The spark (1.3.1) job is allocated 120 executors with 6GB each and the
> driver also has 6GB.
> Spark C