In #1, the class HTable cannot be serialized.
You also need to check your self-defined function getUserActions and make sure
it is a member of a class that implements the Serializable interface.
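A minimal sketch of that setup, assuming a hypothetical table name ("user_actions"), hypothetical row keys, and a placeholder getUserActions body:

  import org.apache.hadoop.hbase.HBaseConfiguration
  import org.apache.hadoop.hbase.client.{Get, HTable, Result}

  // Functions called inside closures live on a Serializable holder.
  object UserActions extends Serializable {
    def getUserActions(result: Result): Seq[String] = Seq.empty  // placeholder parsing
  }

  val rowKeys = sc.parallelize(Seq("row1", "row2").map(_.getBytes("UTF-8")))
  val actions = rowKeys.mapPartitions { keys =>
    // The non-serializable HTable is created here, on the executor, per partition,
    // so it never has to be shipped inside the task closure.
    val table = new HTable(HBaseConfiguration.create(), "user_actions")
    val results = keys.map(key => UserActions.getUserActions(table.get(new Get(key)))).toList
    table.close()
    results.iterator
  }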
Sent from my iPad
> On Dec 12, 2014, at 4:35 PM, yangliuyu wrote:
>
> The scenario is using HTable instance to scan
Hi Helena and All,
I have an example JSON file in the format below. My use case is to read the
"NAME" attribute.
When I execute it, I get the following exception:
*"Exception in thread "main"
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved
attributes: 'NAME, tree:Project ['NAME] Subquery device"
I'll add that there is an experimental method that allows you to start the
JDBC server with an existing HiveContext (which might have registered
temporary tables).
https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServ
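Roughly, that looks like the following; the method name is taken from the linked file, so treat the exact call as an assumption, and the temp-table registration is just a hypothetical example:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.hive.HiveContext
  import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

  val sc = new SparkContext(new SparkConf().setAppName("thrift-with-temp-tables"))
  val hiveContext = new HiveContext(sc)

  // Register any temporary tables first...
  // hiveContext.jsonFile("hdfs:///path/devices.json").registerTempTable("device")

  // ...then expose that same context through the JDBC/ODBC server.
  HiveThriftServer2.startWithContext(hiveContext)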
I'm happy to discuss what it would take to make sure we can propagate this
information correctly. Please open a JIRA (and mention me in it).
Regarding including it in 1.2.1, it depends on how invasive the change ends
up being, but it is certainly possible.
On Thu, Dec 11, 2014 at 3:55 AM, nitin
Pay attention to your JSON file format and try changing it as follows.
Each record should be represented as a single-line JSON string:
{"NAME" : "Device 1",
"GROUP" : "1",
"SITE" : "qqq",
"DIRECTION" : "East",
}
{"NAME" : "Device 2",
"GROUP" : "2",
"SITE" : "sss",
"DIRECTION" : "
Why doesn't it work? I guess it's the same as with "\n".
2014-12-13 12:56 GMT+01:00 Guillermo Ortiz :
> I got it, thanks. A silly question: why, if I do
> out.write("hello " + System.currentTimeMillis() + "\n"); it doesn't
> detect anything, but if I do
> out.println("hello " + System.current
Are you using a buffered PrintWriter? That's probably a different flushing
behaviour. Try doing out.flush() after out.write(...) and you will get
the same result.
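For illustration, something like this (the file names are arbitrary):

  import java.io.{BufferedWriter, FileWriter, PrintWriter}

  // A PrintWriter over a BufferedWriter keeps write() output in its buffer until flush().
  val buffered = new PrintWriter(new BufferedWriter(new FileWriter("out.txt")))
  buffered.write("hello " + System.currentTimeMillis() + "\n")  // still in the buffer
  buffered.flush()                                              // now visible on disk

  // A PrintWriter created with autoFlush = true flushes on every println().
  val autoFlushing = new PrintWriter(new FileWriter("out2.txt"), true)
  autoFlushing.println("hello " + System.currentTimeMillis())   // flushed immediately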
This is Spark unrelated btw.
-kr, Gerard.
Hi Sourabh,
have a look at https://issues.apache.org/jira/browse/SPARK-1406, I am
looking into exporting models in PMML using JPMML.
Regards,
Vincenzo
Thanks.
2014-12-14 12:20 GMT+01:00 Gerard Maas :
> Are you using a bufferedPrintWriter? that's probably a different flushing
> behaviour. Try doing out.flush() after out.write(...) and you will have the
> same result.
>
> This is Spark unrelated btw.
>
> -kr, Gerard.
--
The reason is the following code:
val numStreams = numShards
val kinesisStreams = (0 until numStreams).map { i =>
  KinesisUtils.createStream(ssc, streamName, endpointUrl, kinesisCheckpointInterval,
    InitialPositionInStream.LATEST, StorageLevel.MEMORY_AND_DISK_2)
}
In the above co
Thank you Yanbo
Regards,
Rajesh
On Sun, Dec 14, 2014 at 3:15 PM, Yanbo wrote:
>
> Pay attention to your JSON file, try to change it like following.
> Each record represent as a JSON string.
>
> {"NAME" : "Device 1",
> "GROUP" : "1",
> "SITE" : "qqq",
> "DIRECTION" : "East"
Hi,
My environment is: standalone Spark 1.1.1 on Windows 8.1 Pro.
The following case works fine:
>>> a = [1,2,3,4,5,6,7,8,9]
>>> b = []
>>> for x in range(10):
... b.append(a)
...
>>> rdd1 = sc.parallelize(b)
>>> rdd1.first()
[1, 2, 3, 4, 5, 6, 7, 8, 9]
The following case does not work.
Can somebody shed light on MLlib vs MADlib?
Which is better for machine learning? And are there any specific use-case
scenarios in which MLlib or MADlib will shine?
Regards,
Venkat Ankam
I have a large set of files within HDFS on which I would like to do a group-by
statement, à la
val table = sc.textFile("hdfs://")
val tabs = table.map(_.split("\t"))
I'm trying to do something similar to
tabs.map(c => (c._(167), c._(110), c._(200))
where I create a new RDD that only has
but that isn't
Hi,
I don't get what the problem is. That map to selected columns looks like
the way to go given the context. What's not working?
Kr, Gerard
On Dec 14, 2014 5:17 PM, "Denny Lee" wrote:
> I have a large of files within HDFS that I would like to do a group by
> statement ala
>
> val table = sc.te
Getting a bunch of syntax errors. Let me get back with the full statement
and error later today. Thanks for verifying my thinking wasn't out in left
field.
On Sun, Dec 14, 2014 at 08:56 Gerard Maas wrote:
> Hi,
>
> I don't get what the problem is. That map to selected columns looks like
> the way
MADlib (http://madlib.net/) was designed to bring large-scale ML techniques to
a relational database, primarily PostgreSQL. MLlib assumes the data exists in
some Spark-compatible data format.
I would suggest you pick the library that matches your data platform first.
DISCLAIMER: I am the origi
Hey,
I am doing an experiment with Spark Streaming that consists of moving data
from Kafka to S3 locations while partitioning by date. I have already
looked into LinkedIn's Camus and Pinterest's Secor, and while both are
workable solutions, it just feels that Spark Streaming should be able to be
on par with
How much executor memory are you setting for the JVM? What about the driver
JVM memory?
Also check the Windows Event Log for out-of-memory errors for either of the
two JVMs above.
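For reference, these settings live under spark.executor.memory and spark.driver.memory; a rough sketch with arbitrary values (note that spark.driver.memory only takes effect if set before the driver JVM starts, e.g. via spark-defaults.conf or the spark-submit command line):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("memory-settings")
    .set("spark.executor.memory", "2g")  // per-executor JVM heap
    .set("spark.driver.memory", "2g")    // driver JVM heap; see note above
  val sc = new SparkContext(conf)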
On Dec 14, 2014 6:04 AM, "genesis fatum" wrote:
> Hi,
>
> My environment is: standalone spark 1.1.1 on windows 8.1 pro.
>
Hi Jean-Pascal,
At Virdata we do a similar thing to 'bucketize' our data to different
keyspaces in Cassandra.
The basic construction would be to filter the DStream (or the underlying
RDD) for each key and then apply the usual storage operations on that new
data set.
Given that, in your case, you
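The basic construction described above could look roughly like this (a hedged sketch; the bucket type and the HDFS sink are stand-ins, in our case the sink would be a Cassandra keyspace write):

  import org.apache.spark.streaming.dstream.DStream

  def saveBucket(stream: DStream[(String, String)], bucket: String): Unit =
    stream
      .filter { case (b, _) => b == bucket }      // keep only this bucket's records
      .foreachRDD { (rdd, time) =>
        rdd.map(_._2).saveAsTextFile(s"hdfs:///buckets/$bucket/${time.milliseconds}")
      }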
Ah! That sounds very much like what I need. A very basic question (most
likely): why is "rdd.cache()" critical? Isn't it already true that in Spark
Streaming DStreams are cached in memory anyway?
Also, any experience with minutes-long batch intervals?
Thanks for the quick answer!
On Sun, Dec 14, 20
I got this error when I click "Track URL: ApplicationMaster" when I run a
Spark job in YARN cluster mode. I found this JIRA,
https://issues.apache.org/jira/browse/YARN-800, but I could not get this
problem fixed. I'm running CDH 5.1.0 with both HDFS and RM HA enabled. Does
anybody have a similar is
I haven't done anything other than performance tuning on Spark Streaming for
the past weeks. rdd.cache makes a huge difference; it is a must in this case,
where you want to iterate over the same RDD several times.
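A minimal sketch of that pattern (the DStream and its contents are placeholders):

  import org.apache.spark.streaming.dstream.DStream

  def process(lines: DStream[String]): Unit =
    lines.foreachRDD { rdd =>
      rdd.cache()                    // materialize once
      val total   = rdd.count()      // first action scans the input
      val preview = rdd.take(10)     // second action reuses the cached blocks
      println(s"$total records, e.g. ${preview.mkString(", ")}")
      rdd.unpersist()
    }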
Intuitively, I also thought that all data was in memory already, so it
wouldn't make a diff
>
> BTW, I cannot use SparkSQL / case right now because my table has 200
> columns (and I'm on Scala 2.10.3)
>
You can still apply the schema programmatically:
http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
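In rough outline, following the linked guide (the column names, types, and input path below are placeholders for the 200-column case):

  import org.apache.spark.sql._

  val schemaString = "col1 col2 col3"   // in practice, all 200 column names
  val schema = StructType(
    schemaString.split(" ").map(name => StructField(name, StringType, true)))

  val rowRDD = sc.textFile("hdfs:///path/table.tsv")
    .map(_.split("\t"))
    .map(cols => Row(cols(0), cols(1), cols(2)))

  val sqlContext = new SQLContext(sc)
  val schemaRDD = sqlContext.applySchema(rowRDD, schema)
  schemaRDD.registerTempTable("wide_table")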
For many operations, Spark SQL will just pass the data through without
looking at it. Caching, in contrast, has to process the data so that we
can build up compressed column buffers. So the schema is mismatched in
both cases, but only the caching case shows it.
Based on the exception, it looks m
Yes - that works great! Sorry for implying I couldn't. Was just more
flummoxed that I couldn't make the Scala call work on its own. Will
continue to debug ;-)
On Sun, Dec 14, 2014 at 11:39 Michael Armbrust
wrote:
> BTW, I cannot use SparkSQL / case right now because my table has 200
>> columns (a
hello all,
we at tresata wrote a library to provide batch integration between
spark and kafka (distributed write of an rdd to kafka, distributed read of an
rdd from kafka). our main use cases are (in lambda architecture jargon):
* periodic appends to the immutable master dataset on hdfs from kafka using
Denny, I am not sure what exception you're observing but I've had luck with
2 things:
val table = sc.textFile("hdfs://")
You can try calling table.first here and you'll see the first line of the
file.
You can also do val debug = table.first.split("\t") which would give you an
array and you ca
Thanks for the info Brian.
I am trying to compare the performance difference between "Pivotal
HAWQ/Greenplum with MADlib" and "HDFS with MLlib".
Do you think Spark MLlib will perform better because of its in-memory,
caching, and iterative processing capabilities?
I need to perform large-scale text analy
I don't have any solid performance numbers, no. Let's start with some questions
* Do you have to do any feature extraction before you start the routine? E.g.
NLP, NER or tokenization? Have you already vectorized?
* Which routine(s) do you wish to use? Things like k-means do very well in a
rela
Hi,
On Fri, Dec 12, 2014 at 7:01 AM, ryaminal wrote:
>
> Now our solution is to make a very simple YARN application which executes
> as its command "spark-submit --master yarn-cluster
> s3n://application/jar.jar
> ...". This seemed so simple and elegant, but it has some weird issues. We
> get "N
Nathan,
On Fri, Dec 12, 2014 at 3:11 PM, Nathan Kronenfeld <
nkronenf...@oculusinfo.com> wrote:
>
> I can see how to do it if can express the added values in SQL - just run
> "SELECT *,valueCalculation AS newColumnName FROM table"
>
> I've been searching all over for how to do this if my added val
Hi all,
I am trying to run a Spark job with Play Framework + Spark master/worker on
one Mac.
When the job ran, I encountered a java.lang.ClassNotFoundException.
Could you tell me how to solve it?
Here is my code on GitHub:
https://github.com/TomoyaIgarashi/spark_cluster_sample
* Environments:
Mac 10.9.5
J
I think it could be done like this (see the sketch below):
1. use mapPartitions to randomly drop some partitions
2. drop some elements randomly (within the selected partitions)
3. calculate the gradient step for the selected elements
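A hedged sketch of steps 1-3 (the fractions, seed, and element type are arbitrary placeholders):

  import scala.util.Random
  import org.apache.spark.rdd.RDD

  def sampleForStep(data: RDD[Array[Double]],
                    partitionFraction: Double,
                    elementFraction: Double,
                    seed: Long): RDD[Array[Double]] =
    data.mapPartitionsWithIndex { (idx, iter) =>
      val rng = new Random(seed + idx)
      if (rng.nextDouble() > partitionFraction) Iterator.empty   // 1. drop this partition
      else iter.filter(_ => rng.nextDouble() < elementFraction)  // 2. drop some elements
    }
  // 3. the gradient step is then computed on the returned RDD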
I don't think fixed step is needed, but fixed step could be done:
1. zipWithIndex
2. create ShuffleRDD base
Hi spark experts
Are there any Python APIs for Spark Streaming? I didn't find them in the
Spark Streaming programming guide:
http://spark.apache.org/docs/latest/streaming-programming-guide.html
Xiaoyong
AFAIK, this will be a new feature in version 1.2; you can check out the master
branch or the 1.2 branch to give it a try.
Thanks
Jerry
From: Xiaoyong Zhu [mailto:xiaoy...@microsoft.com]
Sent: Monday, December 15, 2014 10:53 AM
To: user@spark.apache.org
Subject: Spark Streaming Python APIs?
Hi spark ex
Cool thanks!
Xiaoyong
From: Shao, Saisai [mailto:saisai.s...@intel.com]
Sent: Monday, December 15, 2014 10:57 AM
To: Xiaoyong Zhu
Cc: user@spark.apache.org
Subject: RE: Spark Streaming Python APIs?
AFAIK, this will be a new feature in version 1.2, you can check out the master
branch or 1.2 bran
Btw, I have seen the Python-related docs in the 1.2 docs here:
http://people.apache.org/~pwendell/spark-1.2.0-rc2-docs/streaming-programming-guide.html
Xiaoyong
From: Xiaoyong Zhu [mailto:xiaoy...@microsoft.com]
Sent: Monday, December 15, 2014 10:58 AM
To: Shao, Saisai
Cc: user@spark.apache.org
Sub
Hi Xiangrui,
The block size limit was encountered even with a reduced number of item
blocks, as you had expected. I'm wondering if I could try the new
implementation as a standalone library against a 1.1 deployment. Does it
have dependencies on any core APIs in the current master?
Thanks,
Bharath
Oh, just figured it out:
tabs.map(c => Array(c(167), c(110), c(200)))
Thanks for all of the advice, eh?!
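Putting the thread together, a minimal sketch (the path and column indices are assumptions):

  import org.apache.spark.SparkContext._   // pair-RDD functions such as groupByKey (Spark 1.1/1.2)

  val table = sc.textFile("hdfs:///path/to/data")
  val tabs  = table.map(_.split("\t"))
  val slim  = tabs.map(c => (c(167), (c(110), c(200))))   // key by column 167
  val grouped = slim.groupByKey()
  grouped.take(5).foreach(println)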
On Sun Dec 14 2014 at 1:14:00 PM Yana Kadiyska
wrote:
> Denny, I am not sure what exception you're observing but I've had luck
> with 2 things:
>
> val table = sc.textFile("hdfs://")
Thanks TD & Francois for the explanation & documentation. I'm curious whether
we have any performance benchmark with & without the WAL for
spark-streaming-kafka.
Also, in spark-streaming-kafka (as Kafka provides a way to acknowledge logs),
on top of the WAL can we modify KafkaUtils to acknowledge the offsets onl
I am working on a Spark MLlib test (decision tree) and I used the IRIS
data from the CRAN R package.
The original IRIS data is not in a good format for Spark MLlib, so I changed
the data format (the data format and the features' locations).
When I ran the sample Spark MLlib code for DT, I got the error below:
Ho
Hi,
It is not trivial work to acknowledge the offsets when an RDD is fully
processed. From my understanding, only modifying KafkaUtils is not enough to
meet your requirement; you would need to add metadata management for each
block/RDD, and track them on both the executor and driver sides, and