Hi,
I am interested in building an application that uses sliding windows not
based on the time when the item was received, but on either
* a timestamp embedded in the data, or
* a count (like: every 10 items, look at the last 100 items).
Also, I want to do this on stream data received from Kafka,
Hello,
I've successfully built a very simple Spark Streaming application in Java
that is based on the HdfsCount example in Scala at
https://github.com/apache/spark/blob/branch-1.1/examples/src/main/scala/org/apache/spark/examples/streaming/HdfsWordCount.scala
.
When I submit this application to m
Hi,
I know there are not so many conferences on Spark in Paris, so I just
wanted to let you know that Ippon will be holding one on Thursday next
week (11th of December):
http://blog.ippon.fr/2014/12/03/ippevent-spark-ou-comment-traiter-des-donnees-a-la-vitesse-de-leclair/
There will be 3 talk
Hi,
There have been some efforts to provide column-level
encryption/decryption on Hive tables:
https://issues.apache.org/jira/browse/HIVE-7934
Is there any plan to extend this functionality to Spark SQL as well?
Thanks,
Chirag
Yeah, the dot notation works. It works even for arrays. But I am not sure
if it can handle complex hierarchies.
On Mon Dec 08 2014 at 11:55:19 AM Cheng Lian wrote:
> You may access it via something like SELECT filterIp.element FROM tb,
> just like Hive. Or if you’re using Spark SQL DSL, you can
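A minimal sketch of the dot-notation access Cheng describes, assuming sqlContext is a SQLContext; only the table name (tb) and the filterIp field come from this thread, the JSON layout is made up:

// Register a tiny nested document as table "tb"
val tb = sqlContext.jsonRDD(sc.parallelize(Seq(
  """{"filterIp": ["10.0.0.1", "10.0.0.2"], "meta": {"host": "a"}}""")))
tb.registerTempTable("tb")

// Dot notation reaches into nested structs; Cheng's filterIp.element form
// is the analogous access for the array column.
sqlContext.sql("SELECT meta.host FROM tb").collect()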
Hello all,
I am working on a graph problem using vanilla Spark (not GraphX) and at some
point I would like to do a
self join on an edges RDD[(srcID, dstID, w)] on the dst key, in order to get
all pairs of incoming edges.
Since this is the performance bottleneck for my code, I was wondering if
the
I am trying to create (yet another) Spark-as-a-service tool that lets you
submit jobs via REST APIs. I think I have nearly gotten it to work barring a
few issues, some of which seem already fixed in 1.2.0 (like SPARK-2889), but
I have hit a road block with the following issue.
I have created a sim
Spark can do efficient joins if both RDDs have the same partitioner, so in
the case of a self join I would recommend creating an RDD that has an
explicit partitioner and has been cached.
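A minimal sketch of what this looks like; edges: RDD[(srcID, dstID, w)] and numParts are assumptions based on the thread, not code from it:

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._ // pair-RDD implicits in Spark 1.x

// Key the edge RDD by dstID, give it an explicit partitioner, and cache it.
val byDst = edges.map { case (src, dst, w) => (dst, (src, w)) }
  .partitionBy(new HashPartitioner(numParts))
  .cache()

// The self join now reuses the existing partitioning instead of reshuffling,
// giving all pairs of incoming edges per destination.
val incomingPairs = byDst.join(byDst)
  .filter { case (_, ((src1, _), (src2, _))) => src1 < src2 } // assumes numeric IDs; drops self-pairs/duplicates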
On Dec 8, 2014 8:52 AM, "Theodore Vasiloudis" <
theodoros.vasilou...@gmail.com> wrote:
> Hello all,
>
> I am working on
Hi Guys,
I used applySchema to store a set of nested dictionaries and lists in a
parquet file.
http://apache-spark-user-list.1001560.n3.nabble.com/Using-sparkSQL-to-convert-a-collection-of-python-dictionary-of-dictionaries-to-schma-RDD-td20228.html#a20461
It was successful and I could successf
I am relatively new to Spark. I am planning to use Spark Streaming for my
OLAP use case, but I would like to know how RDDs are shared between multiple
workers.
If I need to constantly compute some stats on the streaming data, presumably
shared state would have to be updated serially by different spar
Could you not use a groupByKey instead of the join? I mean something like
this:
val byDst = rdd.map { case (src, dst, w) => dst -> (src, w) }
byDst.groupByKey.map { case (dst, edges) =>
  for {
    (src1, w1) <- edges
    (src2, w2) <- edges
  } {
    ??? // Do something.
  }
  ??? // Return something.
}
As a temporary fix, it works when I convert field six to a list manually. That
is:
def generateRecords(line):
    # input : the row stored in parquet file
    # output : a python dictionary with all the key value pairs
    field1 = line.field1
    summary = {}
    summary['
Using groupByKey was our first approach, and, as noted in the docs, it is
highly inefficient due to the need to shuffle all the data. See
http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
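For reference, the pattern the linked page describes, sketched on an assumed pairs: RDD[(String, Int)]:

import org.apache.spark.SparkContext._ // pair-RDD implicits in Spark 1.x

// groupByKey ships every value across the network before anything is combined...
val sums1 = pairs.groupByKey().mapValues(_.sum)
// ...while reduceByKey combines values within each partition first, shuffling far less.
val sums2 = pairs.reduceByKey(_ + _)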
On Mon, Dec 8, 2014 at 3:47 PM, Daniel D
I do not see how you hope to generate all incoming edge pairs without
repartitioning the data by dstID. You need to perform this shuffle for
joining too. Otherwise two incoming edges could be in separate partitions
and never meet. Am I missing something?
On Mon, Dec 8, 2014 at 3:53 PM, Theodore Va
Hi,
I'm confused about the Stage times reported on the Spark UI (Spark 1.1.0)
for a Spark Streaming job. I'm hoping somebody can shine some light on it.
Let's do this with an example:
On the /stages page, stage #232 is reported to have lasted 18 seconds:
232: runJob at RDDFunctions.scala:23
Hi Julius,
You can add those external jars to Spark when creating the SparkContext
(sc.addJar("/path/to/the/jar")). If you are submitting the job using
spark-submit, then you can use the --jars option and get those jars shipped.
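A minimal sketch of both options; the paths and app name are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-app")
  .setJars(Seq("/path/to/external-dependency.jar")) // jars shipped with the application
val sc = new SparkContext(conf)

sc.addJar("/path/to/another.jar") // or add a jar after the context has been created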
Thanks
Best Regards
On Sun, Dec 7, 2014 at 11:05 PM, Julius K wrot
Check your worker logs for the exact reason; if you let the job run for 2
days, then most likely you ran out of disk space or something similar.
Looking at the worker logs will give you a clearer picture.
Thanks
Best Regards
On Mon, Dec 8, 2014 at 12:49 PM, Hafiz Mujadid
wrote:
> I am facing
You can set up and customize Nagios for all these requirements, or you can
use Ganglia if you are not looking for something with alerts (email etc.).
Thanks
Best Regards
On Mon, Dec 8, 2014 at 1:05 PM, Judy Nash
wrote:
> Hello,
>
>
>
> Are there ways we can programmatically get health status of m
How are you submitting the job? You need to create a jar of your code (sbt
package will give you one inside target/scala-*/projectname-*.jar) and then
use it while submitting. If you are not using spark-submit, then you can
simply add this jar to Spark by
sc.addJar("/path/to/target/scala*/projectnam
I went through complex hierarchical JSON structures, and Spark seems to fail
in querying them no matter what syntax is used.
Hope this helps,
Regards,
Alessandro
> On Dec 8, 2014, at 6:05 AM, Raghavendra Pandey
> wrote:
>
> Yeah, the dot notation works. It works even for arrays. But I am not
@Daniel
It's true that the first map in your code is needed, i.e. mapping so that
dstID is the new RDD key.
The self-join on the dstKey will then create all the pairs of incoming
edges (plus self-referential and duplicates that need to be filtered out).
@Koert
Are there any guidelines about setti
On Mon, Dec 8, 2014 at 5:26 PM, Theodore Vasiloudis <
theodoros.vasilou...@gmail.com> wrote:
> @Daniel
> It's true that the first map in your code is needed, i.e. mapping so that
> dstID is the new RDD key.
>
You wrote groupByKey is "highly inefficient due to the need to shuffle all
the data", bu
My thesis is related to big data mining, and I have a cluster in the
laboratory of my university. My task is to install Apache Spark on it and
use it for extraction purposes. Is there any understandable guidance on how
to do this?
On a rough note:
Step 1: Install Hadoop 2.x on all the machines in the cluster.
Step 2: Check that the Hadoop cluster is working.
Step 3: Set up Apache Spark as described on the documentation page for the
cluster.
Check the status of the cluster on the master UI.
As it is a data mining project, configure Hive too.
Y
Hi,
I am intending to save the streaming data from Kafka into Cassandra
using spark-streaming, but there seems to be a problem with this line:
javaFunctions(data).writerBuilder("testkeyspace", "test_table",
mapToRow(TestTable.class)).saveToCassandra();
I am getting 2 errors.
the code, the error-log and
This is fixed in 1.2. Also, in 1.2+ you can call row.asDict() to
convert the Row object into a dict.
On Mon, Dec 8, 2014 at 6:38 AM, sahanbull wrote:
> Hi Guys,
>
> I used applySchema to store a set of nested dictionaries and lists in a
> parquet file.
>
> http://apache-spark-user-list.1001560.n3
@Daniel
Not an expert either, I'm just going by what I see performance-wise
currently. Our groupByKey implementation was more than an order of
magnitude slower than using the self join and then reduceByKey.
FTA:
*"pairs on the same machine with the same key are combined (by using the
lambda funct
Hi,
https://github.com/databricks/spark-perf/tree/master/streaming-tests/src/main/scala/streaming/perf
contains some performance tests for streaming. There are examples of how to
generate synthetic files during the test in that repo, maybe you
can find some code snippets that you can use there.
Hello Everyone,
I'm brand new to spark and was wondering if there's a JDBC driver to access
spark-SQL directly. I'm running spark in standalone mode and don't have
hadoop in this environment.
--
*Best Regards/أطيب المنى,*
*Anas Mosaad*
I have a function which generates a Java object, and I want to explore
failures which only happen when processing large numbers of these objects.
The real code is reading a many-gigabyte file, but in the test code I can
generate similar objects programmatically. I could create a small list,
paralleli
You can use the Thrift server for this purpose and then test it with Beeline.
See the doc:
https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbc-server
From: Anas Mosaad [mailto:anas.mos...@incorta.com]
Sent: Monday, December 8, 2014 11:01 AM
To: user@spark.apache.org
Subje
This is by hive's design. From the Hive documentation:
The column change command will only modify Hive's metadata, and will not
> modify data. Users should make sure the actual data layout of the
> table/partition conforms with the metadata definition.
On Sat, Dec 6, 2014 at 8:28 PM, Jianshi H
Update:
The issue in my previous post was solved:
I had to change the sbt file name from .sbt to build.sbt.
-
Thanks!
-Caron
Hey Tobias,
Can you try using the YARN Fair Scheduler and set
yarn.scheduler.fair.continuous-scheduling-enabled to true?
-Sandy
On Sun, Dec 7, 2014 at 5:39 PM, Tobias Pfeiffer wrote:
> Hi,
>
> thanks for your responses!
>
> On Sat, Dec 6, 2014 at 4:22 AM, Sandy Ryza
> wrote:
>>
>> What versio
Hello Jianshi,
You meant you want to convert a Map to a Struct, right? We can extract some
useful functions from JsonRDD.scala, so others can access them.
Thanks,
Yin
On Mon, Dec 8, 2014 at 1:29 AM, Jianshi Huang
wrote:
> I checked the source code for inferSchema. Looks like this is exactly w
Hi,
I think you have the right idea. I would not even worry about flatMap.
val rdd = sc.parallelize(1 to 100, numSlices = 1000).map(x =>
generateRandomObject(x))
Then when you try to evaluate something on this RDD, it will happen
partition-by-partition. So 1000 random objects will be generate
OK, I have waded into implementing this and have gotten pretty far, but am now
hitting something I don't understand: a NoSuchMethodError.
The code looks like
[...]
val conf = new SparkConf().setAppName(appName)
//conf.set("fs.default.name", "file://");
val sc = new SparkContext(c
Hi All,
I was able to run LinearRegressionWithSGD for a larger dataset (> 2 GB, sparse).
I have now filtered the data and I am running regression on a subset of it (~
200 MB). I see this error, which is strange since it was running fine with the
superset data. Is this a formatting issue
You just need to use the latest master code, without any configuration,
to get the performance improvement from my PR.
Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
On Mon, Dec 8, 2014 at 7:53 AM
Looks good, but how do I say that in Java?
As far as I can see, sc.parallelize (in Java) has only one implementation,
which takes a List, requiring an in-memory representation.
On Mon, Dec 8, 2014 at 12:06 PM, Daniel Darabos <
daniel.dara...@lynxanalytics.com> wrote:
> Hi,
> I think you have the rig
You can call .schema on SchemaRDDs. For example:
results.schema.fields.map(_.name)
On Sun, Dec 7, 2014 at 11:36 PM, abhishek wrote:
> Hi,
>
> I have iplRDD, which is JSON, and I do the steps below and query through
> HiveContext. I get the results but without column headers. Is there a
> way
Hi,
I need to generate some flags based on certain columns and add them back to
the schemaRDD for further operations. Do I have to use a case class
(reflection or programmatically)? I am using parquet files, so the schema is
being automatically derived. This is a great feature; thanks to the Spark
developers,
Hi Jake,
The "toString" method should print the full model in versions 1.1.x.
The current master branch has a method "toDebugString" for
DecisionTreeModel which should print out all the node classes, and the
"toString" method has been updated to print the summary only, so there is a
slight change i
Hi,
Is there any way to auto-calculate the optimum learning rate or step size via
MLlib for SGD?
Thx
tri
At 2014-12-08 12:12:16 -0800, spr wrote:
> OK, have waded into implementing this and have gotten pretty far, but am now
> hitting something I don't understand, an NoSuchMethodError.
> [...]
> The (short) traceback looks like
>
> Exception in thread "main" java.lang.NoSuchMethodError:
> org.apache
Hi Bui,
Please use BFGS-based solvers... For BFGS you don't have to specify a step
size, since the line search will find a sufficient decrease each time...
For regularization you still have to do a grid search... it's not possible to
automate that, but on master you will find nice ways to automate grid
search.
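For example, a minimal sketch with the L-BFGS-based classifier that ships with MLlib; the RDD[LabeledPoint] input is an assumption, and for plain regression you would wire the mllib.optimization.LBFGS optimizer in yourself:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// No stepSize parameter here: the line search inside L-BFGS picks the step each iteration.
def train(training: RDD[LabeledPoint]) =
  new LogisticRegressionWithLBFGS()
    .setNumClasses(2)
    .run(training)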
I am trying to create a CSV file. I have managed to create the actual string
I want to output to a file, but when I try to write the file, I get the
following error:
/saveAsTextFile is not a member of String/
My string is perfect; when I call this line to actually save the file, I get
the above e
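For reference, a minimal sketch of the usual workaround, assuming csvString is the string mentioned above: saveAsTextFile is defined on RDDs, not on String, so the string has to be wrapped in an RDD first (or written with plain java.io if HDFS is not needed):

// Wrap the single string in a one-partition RDD so saveAsTextFile is available.
// The output path is a placeholder.
sc.parallelize(Seq(csvString), numSlices = 1)
  .saveAsTextFile("/output/path/my-csv")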
Running JavaAPISuite (master branch) on Linux, I got:
testGuavaOptional(org.apache.spark.JavaAPISuite) Time elapsed: 32.945 sec
<<< ERROR!
org.apache.spark.SparkException: Job aborted due to stage failure: Master
removed our application: FAILED
at org.apache.spark.scheduler.DAGScheduler.org
$apa
Yeah, spark has very little overhead per partition, so generally more
partitions is better.
On Mon, Dec 8, 2014 at 1:46 PM, Theodore Vasiloudis <
theodoros.vasilou...@gmail.com> wrote:
> @Daniel
>
> Not an expert either, I'm just going by what I see performance-wise
> currently. Our groupByKey im
Hi all,
I am running into an out of memory error while running ALS using MLLIB on a
reasonably small data set consisting of around 6 Million ratings.
The stack trace is below:
java.lang.OutOfMemoryError: Java heap space
at org.jblas.DoubleMatrix.<init>(DoubleMatrix.java:323)
at org.jbl
Assume I don't care about values which may be created in a later map. In
Scala I can say
val rdd = sc.parallelize(1 to 10, numSlices = 1000)
but in Java, JavaSparkContext can only parallelize a List, which is limited to
Integer.MAX_VALUE elements and required to exist in memory. The best I can
do
Hi -
Does anybody have any ideas how to dynamically allocate cores instead of
statically partitioning them among multiple applications? Thanks.
Mohammed
From: Mohammed Guller
Sent: Friday, December 5, 2014 11:26 PM
To: user@spark.apache.org
Subject: Fair scheduling accross applications in stand
Hi,
I'm wondering whether there is an efficient way to continuously append
new data to a registered Spark SQL table.
This is what I want: I want to make an ad-hoc query service for a
JSON-formatted system log. Certainly, the system log is continuously generated.
I will use spark
I'm using CDH 5.1.0 and Spark 1.0.0, and I'd like to write out data
as Snappy-compressed files but encountered a problem.
My code is as follows:
val InputTextFilePath = "hdfs://ec2.hadoop.com:8020/xt/text/new.txt"
val OutputTextFilePath = "hdfs://ec2.hadoop.com:8020/xt/compressedText/"
val
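The snippet above is cut off; for reference, a minimal sketch of one way to request Snappy output from saveAsTextFile, assuming the Snappy native libraries are available on the nodes:

import org.apache.hadoop.io.compress.SnappyCodec

// Pass the codec class explicitly when writing out the text file.
sc.textFile(InputTextFilePath)
  .saveAsTextFile(OutputTextFilePath, classOf[SnappyCodec])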
Steve, something like this will do I think: sc.parallelize(1 to 1000,
1000).flatMap(x => 1 to 100000)
The above will launch 1000 tasks (maps), with each task creating 10^5
numbers (a total of 100 million elements).
On Mon, Dec 8, 2014 at 6:17 PM, Steve Lewis wrote:
> assume I don't care about
I have a question, as the title says. The question link is
http://stackoverflow.com/questions/27370170/query-classification-using-apache-spark-mlib
Thanks,
Jin
Pretty straightforward: Using Scala, I have an RDD that represents a table
with four columns. What is the recommended way to convert the entire RDD to
one JSON object?
Hi,
On Tue, Dec 9, 2014 at 4:39 AM, Sandy Ryza wrote:
>
> Can you try using the YARN Fair Scheduler and set
> yarn.scheduler.fair.continuous-scheduling-enabled to true?
>
I'm using Cloudera 5.2.0 and my configuration says
yarn.resourcemanager.scheduler.class =
org.apache.hadoop.yarn.server.re
If you are using spark SQL in 1.2, you can use toJson to convert a
SchemaRDD to an RDD[String] that contains one JSON object per string value.
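A minimal sketch, assuming the Spark 1.2 SchemaRDD method is spelled toJSON and using a made-up four-column table:

// Hypothetical four-column case class standing in for the real table.
case class Row4(a: String, b: String, c: String, d: Int)

val schemaRDD = sqlContext.createSchemaRDD(sc.parallelize(Seq(Row4("x", "y", "z", 1))))
val json = schemaRDD.toJSON          // RDD[String], one JSON object per element
json.saveAsTextFile("/output/json")  // placeholder path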
Thanks,
Yin
On Mon, Dec 8, 2014 at 11:52 PM, YaoPau wrote:
> Pretty straightforward: Using Scala, I have an RDD that represents a table
> with four col
Hi Huai,
Exactly, I'll probably implement one using the new data source API when I
have time... I've found the utility functions in JsonRDD.
Jianshi
On Tue, Dec 9, 2014 at 3:41 AM, Yin Huai wrote:
> Hello Jianshi,
>
> You meant you want to convert a Map to a Struct, right? We can extract
> som
Ah... I see. Thanks for pointing it out.
Then it means we cannot mount an external table using customized column names.
Hmm...
Then the only option left is to use a subquery to add a bunch of column
aliases. I'll try it later.
Thanks,
Jianshi
On Tue, Dec 9, 2014 at 3:34 AM, Michael Armbrust
wrote:
You don't need to worry about locks as such, since one thread/worker is
responsible exclusively for one partition of the RDD. You can use the
Accumulator variables that Spark provides to get the state updates.
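A minimal sketch of that accumulator pattern; stream and the counted stat are assumptions:

// Workers add to the accumulator independently; only the driver reads the total.
val eventCount = sc.accumulator(0L, "events seen")

stream.foreachRDD { rdd =>
  rdd.foreach(_ => eventCount += 1)
  println(s"events so far: ${eventCount.value}") // .value is read on the driver
}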
On Mon Dec 08 2014 at 8:14:28 PM aditya.athalye
wrote:
> I am relatively new to Spark. I am pl
Hi Experts!
I want to save a DStream to HDFS only if it is not empty, i.e. it contains
some Kafka messages to be stored. What is an efficient way to do this?
var data = KafkaUtils.createStream[Array[Byte], Array[Byte],
DefaultDecoder, DefaultDecoder](ssc, params, topicMap,
Storag
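A minimal sketch of one common approach: check each batch RDD before writing, using take(1) to avoid a full count. The HDFS path is a placeholder, and you would normally map the raw Kafka bytes to something printable first:

data.foreachRDD { (rdd, time) =>
  // Only write batches that actually contain messages.
  if (rdd.take(1).nonEmpty) {
    rdd.saveAsTextFile(s"hdfs:///user/spark/kafka-batches/batch-${time.milliseconds}")
  }
}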
I am facing a somewhat confusing problem:
My spark app reads data from a database, calculates certain values and then
runs a shortest path Pregel operation on them. If I save the RDD to disk and
then read the information out again, my app runs between 30-50% faster than
keeping it in memory, plus
Hi all,
In my Spark application I load a CSV file and map the data to a Map
variable for later use on the driver node, then broadcast it. Everything
works fine until a java.io.FileNotFoundException occurs. The console log
information shows me the broadcast is unavailable. I googled this
You cannot pass the sc object (*val b = Utils.load(sc, ip_lib_path)*) inside
a map function, and that's why the serialization exception is popping up
(since sc is not serializable). You can try Tachyon's cache if you want to
persist the data in memory kind of forever.
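A minimal sketch of the pattern described here: do the load once on the driver and only ship the result, never sc itself. loadCsvAsMap is a hypothetical stand-in for Utils.load, and records is an assumed input RDD:

// Runs on the driver: build the lookup table once.
val lookup: Map[String, String] = loadCsvAsMap(ip_lib_path) // hypothetical helper, no sc needed
val lookupBc = sc.broadcast(lookup)

// Inside the map only the broadcast value is referenced, never the SparkContext.
val enriched = records.map { rec => (rec, lookupBc.value.getOrElse(rec, "unknown")) }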
Thanks
Best Regards
On Tue, De
Hi yuemeng,
Are you possibly running the Capacity Scheduler with the default resource
calculator?
-Sandy
On Sat, Dec 6, 2014 at 7:29 PM, yuemeng1 wrote:
> Hi, all
> When i running an app with this cmd: ./bin/spark-sql --master
> yarn-client --num-executors 2 --executor-cores 3, i noticed
Another thing to be aware of is that YARN will round up containers to the
nearest increment of yarn.scheduler.minimum-allocation-mb, which defaults
to 1024.
-Sandy
On Sat, Dec 6, 2014 at 3:48 PM, Denny Lee wrote:
> Got it - thanks!
>
> On Sat, Dec 6, 2014 at 14:56 Arun Ahuja wrote:
>
>> Hi Den
Hi All,
I am facing the following problem on Spark 1.2 RC1, where I get a TreeNode
exception (unresolved attributes):
https://issues.apache.org/jira/browse/SPARK-2063
To avoid this, I do something following :-
val newSchemaRDD = sqlContext.applySchema(existingSchemaRDD,
existingSchemaRDD.schema)
It
I am hitting the same issue. Any solution?
On Wed, Nov 12, 2014 at 2:54 PM, tridib wrote:
> Hi Friends,
> I am trying to save a json file to parquet. I got error "Unsupported
> datatype TimestampType".
> Is not parquet support date? Which parquet version does spark uses? Is
> there
> any work around?
An RDD is just a wrapper around a Scala collection. Maybe you can use the
.collect() method to get the Scala collection type; you can then transform
it to a JSON object using Scala methods.
Hi all,
I was running HiveFromSpark on yarn-cluster. When I got the Hive select's
result SchemaRDD and tried to run `collect()` on it, the application got
stuck, and I don't know what's wrong with it. Here is my code:
val sqlStat = s"SELECT * FROM $TABLE_NAME"
val result = hiveContext.hql(sqlStat)