+1. It sounds awesome!
Kiran Kumar Dusi wrote on Thu, 21 Mar 2024 at 14:16:
> +1
>
> On Thu, 21 Mar 2024 at 7:46 AM, Farshid Ashouri <
> farsheed.asho...@gmail.com> wrote:
>
>> +1
>>
>> On Mon, 18 Mar 2024, 11:00 Mich Talebzadeh,
>> wrote:
>>
>>> Some of you may be aware that Databricks community Home | Dat
>>
>>> Did you check if mapreduce.fileoutputcommitter.algorithm.version 2 is
>>> supported on GCS? IIRC it wasn't, but you could check with GCP support
>>>
>>>
>>> On Mon, Jul 17, 2023 at 3:54 PM Dipayan Dev
>>> wrote:
>
You can try increasing fs.gs.batch.threads and fs.gs.max.requests.per.batch.
The definitions for these flags are available here -
https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md
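For illustration, a minimal sketch of how those connector flags can be passed through Spark's Hadoop configuration with the spark.hadoop. prefix (the values here are only assumptions to tune for your workload):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("gcs-batch-tuning")
  .config("spark.hadoop.fs.gs.batch.threads", "30")          // assumed value
  .config("spark.hadoop.fs.gs.max.requests.per.batch", "30") // assumed value
  .getOrCreate()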
On Mon, 17 Jul 2023 at 14:59, Dipayan Dev wrote:
> No, I am using Spark 2.4
I think I found the issue: Hive metastore 2.3.6 doesn't have the necessary
support. After upgrading to Hive 3.1.2 I was able to run the SELECT query.
On Sun, 20 Dec 2020 at 12:00, Jay wrote:
> Thanks Matt.
>
> I have set the two configs in my sparkConfig as bel
"SELECT * FROM $tableName")
res2: org.apache.spark.sql.DataFrame = [col1: int]
But when I try to do .show() it returns an error
scala> spark.sql(s"SELECT * FROM $tableName").show()
org.apache.spark.sql.AnalysisException: Table does not support reads:
default.tblname_3;
Hi All -
I have currently set up a Spark 3.0.1 cluster with Delta version 0.7.0 which
is connected to an external Hive metastore.
I run the below set of commands:
val tableName = "tblname_2"
spark.sql(s"CREATE TABLE $tableName(col1 INTEGER) USING delta
options(path='GCS_PATH')")
*20/12/19 17:30:
I am trying to integrate Spark with Hive on an HDInsight Spark cluster.
I copied hive-site.xml into the spark/conf directory. In addition, I added Hive
metastore properties like the JDBC connection info in Ambari as well. But still the
databases and tables created using spark-sql are not visible in Hive. Cha
Are you running this in local mode or cluster mode? If you are running in
cluster mode, have you ensured that numpy is present on all nodes?
On Tue 5 Jun, 2018, 2:43 AM @Nandan@,
wrote:
> Hi ,
> I am getting error :-
>
> --
I might have missed it, but can you tell if the OOM is happening in the driver
or an executor? Also, it would be good if you could post the actual exception.
On Tue 5 Jun, 2018, 1:55 PM Nicolas Paris, wrote:
> IMO your json cannot be read in parallel at all then spark only offers
> you
> to play again w
The partitionBy clause is used to create Hive-style folders so that you can point
a Hive partitioned table at the data.
What are you using partitionBy for? What is the use case?
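For context, a minimal sketch of the layout partitionBy produces, assuming a hypothetical event_date column and output path; each distinct value becomes a sub-directory a partitioned Hive table can point at:

df.write
  .partitionBy("event_date")      // creates .../event_date=2018-06-04/ style folders
  .parquet("/warehouse/events")   // hypothetical output path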
On Mon 4 Jun, 2018, 4:59 PM purna pradeep, wrote:
> im reading below json in spark
>
> {"bucket": "B01", "action
Can you tell us what version of Spark you are using and if Dynamic
Allocation is enabled?
Also, how are the files being read? Is it a single read of all files using
a file-matching regex, or are you running different threads in the same
pyspark job?
On Mon 4 Jun, 2018, 1:27 PM Shuporno Choudhu
Benjamin,
The append mode will append the "new" data to the existing data without removing
the duplicates. You would need to overwrite the file every time if you need
unique values.
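A minimal sketch of that, assuming hypothetical DataFrames and paths (the deduplicated result is written to a fresh location, since overwriting the very path you are still reading from needs care):

val combined = existingDf.union(newDf).dropDuplicates()
combined.write.mode("overwrite").parquet("/data/unique_snapshot") // hypothetical path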
Thanks,
Jayadeep
On Fri, Jun 1, 2018 at 9:31 PM Benjamin Kim wrote:
> I have a situation where I trying to add only new r
The code that you see in GitHub is for version 2.1. For versions below that,
the default number of partitions for binary files is set to 2, which you can
change by using the minPartitions value. I am not sure how the minPartitions
parameter will work starting with 2.1 because, as you said, the field is completely
ignored
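A minimal sketch of the pre-2.1 usage (the path and partition count are assumptions; as noted above, the hint may be ignored on later releases):

val blobs = sc.binaryFiles("/data/blobs", minPartitions = 64) // hypothetical path
println(blobs.getNumPartitions)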
Hi,
The problem has been solved simply by updating the Scala SDK version from the
incompatible 2.10.x to the correct version 2.11.x.
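For reference, a build.sbt sketch (the versions are assumptions): keeping scalaVersion in line with the Scala suffix of the Spark artifacts avoids the 2.10.x / 2.11.x mismatch.

scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.0",
  "org.apache.spark" %% "spark-sql"  % "2.0.0"
)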
From: Michael Jay
Sent: Tuesday, August 9, 2016 10:11:12 PM
To: user@spark.apache.org
Subject: spark 2.0 in intellij
Dear all,
I am
Dear all,
I am Newbie to Spark. Currently I am trying to import the source code of Spark
2.0 as a Module to an existing client project.
I have imported Spark-core, Spark-sql and Spark-catalyst as maven dependencies
in this client project.
During compilation errors as missing SqlBaseParser.java
arkContext.<init>(JavaSparkContext.scala:58)
--
jay vyas
:675)
I'm not too worried about this - but it seems like it might be nice if
we could specify a user name as part of Spark's context or as part of
an external parameter, rather than having to
use the Java-based user/group extractor.
--
jay vyas
Thank you, that helps a lot.
On Mon, Feb 22, 2016 at 6:01 PM, Takeshi Yamamuro
wrote:
> You're correct, reduceByKey is just an example.
>
> On Tue, Feb 23, 2016 at 10:57 AM, Jay Luan wrote:
>
>> Could you elaborate on how this would work?
>>
>> So from wh
Could you elaborate on how this would work?
So from what I can tell, this maps a key to a tuple which always has a 0 as
the second element. From there the hash widely changes because we now hash
something like ((1,4), 0) and ((1,3), 0). Thus mapping this would create
more even partitions. Why redu
Thanks for the reply, I'd like to export the decision splits for each tree
out to an external file which is read elsewhere not using spark. As far as
I know, saving a model to a path will save a bunch of binary files which
can be loaded back into spark. Is this correct?
On Feb 10, 2016 7:21 PM, "Mo
determine the root
cause, as I cannot replicate this issue.
Thanks for your help.
From: Ted Yu <yuzhih...@gmail.com>
Date: Friday, February 5, 2016 at 5:40 PM
To: Jay Shipper <shipper_...@bah.com>
Cc: "user@spark.apache.org"
could
confirm they’re having this issue with Spark 1.6.0. Ideally, we should also
have some simple proof of concept that can be posted with the bug.
From: Ted Yu <yuzhih...@gmail.com>
Date: Wednesday, February 3, 2016 at 3:57 PM
To: Jay Shipper <shipper_...@bah.com>
One quick update on this: The NPE is not happening with Spark 1.5.2, so this
problem seems specific to Spark 1.6.0.
From: Jay Shipper <shipper_...@bah.com>
Date: Wednesday, February 3, 2016 at 12:06 PM
To: "user@spark.apache.org"
Right, I could already tell that from the stack trace and looking at Spark’s
code. What I’m trying to determine is why that’s coming back as null now, just
from upgrading Spark to 1.6.0.
From: Ted Yu <yuzhih...@gmail.com>
Date: Wednesday, February 3, 2016 at 12:04 PM
To: Jay S
match what Spark 1.6.0 uses (2.6.0-cdh5.7.0-SNAPSHOT).
Thanks,
Jay
Ah, thank you so much, this is perfect
On Fri, Nov 20, 2015 at 3:48 PM, Ali Tajeldin EDU
wrote:
> You can try to use an Accumulator (
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.Accumulator)
> to keep count in map1. Note that the final count may be higher than th
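A minimal sketch of the accumulator approach quoted above (Spark 1.x API; the transformation is a placeholder, and retried tasks can make the count higher than the true one, as noted):

val counted = sc.accumulator(0L, "map1 count")
val mapped = rdd.map { x =>
  counted += 1L
  x // hypothetical transformation would go here
}
mapped.count()                                // force evaluation
println(s"elements seen: ${counted.value}")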
The answer is that my table was not serialized by Kryo, but I started the
spark-sql shell with Kryo, so the data could not be deserialized.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/insert-overwrite-table-phonesall-in-spark-sql-resulted-in-java-io-StreamCorr
urantees that you'll have a producer and a consumer, so that you
don't get a starvation scenario.
On Wed, Aug 12, 2015 at 7:31 PM, Mohit Anchlia
wrote:
> Is there a way to run spark streaming methods in standalone eclipse
> environment to test out the functionality?
>
--
jay vyas
In general, the simplest way is to use the DynamoDB Java API as is, call it
inside a map(), and use the asynchronous put() DynamoDB API call.
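For illustration, a minimal sketch under assumptions (AWS SDK v1; resultRdd is a hypothetical RDD[(String, String)]; the table name and attributes are made up). foreachPartition is used instead of a plain map() so one client is reused per partition:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBAsyncClient
import com.amazonaws.services.dynamodbv2.model.{AttributeValue, PutItemRequest}
import scala.collection.JavaConverters._

resultRdd.foreachPartition { rows =>
  val dynamo = new AmazonDynamoDBAsyncClient() // default credential chain
  rows.foreach { case (id: String, value: String) =>
    val item = Map(
      "id"    -> new AttributeValue().withS(id),
      "value" -> new AttributeValue().withS(value)
    ).asJava
    // Asynchronous put; a real job would track the returned futures before
    // letting the partition finish.
    dynamo.putItemAsync(new PutItemRequest("results", item)) // hypothetical table
  }
}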
> On Aug 7, 2015, at 9:08 AM, Yasemin Kaya wrote:
>
> Hi,
>
> Is there a way using DynamoDB in spark application? I have to persist my
> res
On Tue, Jul 21, 2015 at 11:11 PM, Dogtail Ray wrote:
> Hi,
>
> I have modified some Hadoop code, and want to build Spark with the
> modified version of Hadoop. Do I need to change the compilation dependency
> files? How to then? Great thanks!
>
--
jay vyas
My spark-sql command:
spark-sql --driver-memory 2g --master spark://hadoop04.xx.xx.com:8241 --conf
spark.driver.cores=20 --conf spark.cores.max=20 --conf
spark.executor.memory=2g --conf spark.driver.memory=2g --conf
spark.akka.frameSize=500 --conf spark.eventLog.enabled=true --conf
spark.eventLog.
unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
--
jay vyas
/latest/api/scala/org/apache/spark/streaming/StreamingContext.html>.
> The information on consumed offset can be recovered from the checkpoint.
>
> On Tue, May 19, 2015 at 2:38 PM, Bill Jay
> wrote:
>
>> If a Spark streaming job stops at 12:01 and I resume the job at 12:02.
ade code.
>
> On Tue, May 19, 2015 at 12:42 PM, Bill Jay
> wrote:
>
>> Hi all,
>>
>> I am currently using Spark streaming to consume and save logs every hour
>> in our production pipeline. The current setting is to run a crontab job to
>> check every minute
Hi all,
I am currently using Spark streaming to consume and save logs every hour in
our production pipeline. The current setting is to run a crontab job to
check every minute whether the job is still there and if not resubmit a
Spark streaming job. I am currently using the direct approach for Kafk
Hi all,
I am reading the docs of receiver-based Kafka consumer. The last parameters
of KafkaUtils.createStream is per topic number of Kafka partitions to
consume. My question is, does the number of partitions for topic in this
parameter need to match the number of partitions in Kafka.
For example
each batch.
On Thu, Apr 30, 2015 at 11:15 AM, Cody Koeninger wrote:
> Did you use lsof to see what files were opened during the job?
>
> On Thu, Apr 30, 2015 at 1:05 PM, Bill Jay
> wrote:
>
>> The data ingestion is in outermost portion in foreachRDD block. Although
>>
t's
> impossible, but I'd think we need some evidence before speculating this has
> anything to do with it.
>
>
> On Wed, Apr 29, 2015 at 6:50 PM, Bill Jay
> wrote:
>
>> This function is called in foreachRDD. I think it should be running in
>> the executors. I
> On Wed, Apr 29, 2015 at 2:30 PM, Ted Yu wrote:
>
>> Maybe add statement.close() in finally block ?
>>
>> Streaming / Kafka experts may have better insight.
>>
>> On Wed, Apr 29, 2015 at 2:25 PM, Bill Jay
>> wrote:
>>
>>> Thanks for the suggestion.
rity/limits.conf*
> Cheers
>
> On Wed, Apr 29, 2015 at 2:07 PM, Bill Jay
> wrote:
>
>> Hi all,
>>
>> I am using the direct approach to receive real-time data from Kafka in
>> the following link:
>>
>> https://spark.apache.org/docs/1.3.0/stream
Hi all,
I am using the direct approach to receive real-time data from Kafka in the
following link:
https://spark.apache.org/docs/1.3.0/streaming-kafka-integration.html
My code follows the word count direct example:
https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/
>>>> import org.apache.spark.streaming.StreamingContext._
>>>>> import org.apache.spark.streaming.dstream.DStream
>>>>> import org.apache.spark.streaming.Duration
>>>>> import org.apache.spark.streaming.Seconds
>>>>> val ssc = new StreamingContext( sc, Seconds(1))
>>>>> val lines = ssc.socketTextStream("hostname",)
>>>>> lines.print()
>>>>> ssc.start()
>>>>> ssc.awaitTermination()
>>>>>
>>>>> Jobs are getting created when I see webUI but nothing gets printed on
>>>>> console.
>>>>>
>>>>> I have started a nc script on hostname port and can see messages
>>>>> typed on this port from another console.
>>>>>
>>>>>
>>>>>
>>>>> Please let me know If I am doing something wrong.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
--
jay vyas
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib_2.11</artifactId>
  <version>1.3.0</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.11</artifactId>
  <version>1.3.0</version>
</dependency>
I am using Scala version 2.11.2.
Could it be that spark-1.3.0-bin-hadoop2.4.tgz requires a different version
of Scala?
Thanks,
Jay
On Apr 9, 2015, at 4:38 PM, Xiang
ster local[8] ALSNew.jar /input_path
The stack trace is exactly same.
Thanks,
Jay
On Apr 8, 2015, at 10:47 AM, Jay Katukuri wrote:
> some additional context:
>
> Since, I am using features of spark 1.3.0, I have downloaded spark 1.3.0 and
> used spark-submit from there.
>
-submit from my downloaded
spark-1.30.
On Apr 6, 2015, at 1:37 PM, Jay Katukuri wrote:
> Here is the command that I have used :
>
> spark-submit —class packagename.ALSNew --num-executors 100 --master yarn
> ALSNew.jar -jar spark-sql_2.11-1.3.0.jar hdfs://input_path
>
> Btw
s.html
>
> -Xiangrui
>
> On Mon, Apr 6, 2015 at 12:27 PM, Jay Katukuri wrote:
>> Hi,
>>
>> Here is the stack trace:
>>
>>
>> Exception in thread "main" java.lang.NoSuchMethodError:
>> scala.reflect.api.JavaUniverse.runtimeMirror(L
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Thanks,
Jay
On Apr 6, 2015, at 12:24 PM, Xiangrui Meng wrote:
> Please attach the full stack trace. -Xiangrui
>
> On Mon, Apr 6, 2015 at 12:
val ratings = purchase.map ( line =>
line.split(',') match { case Array(user, item, rate) =>
(user.toInt, item.toInt, rate.toFloat)
}).toDF()
Any help is appreciated !
I have tried passing the spark-sql jar using the -jar spark-sql_2.11-1.3.0.jar
Thanks,
Jay
On
Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
--
jay vyas
Just the same as Spark was disrupting the Hadoop ecosystem by changing the
assumption that "you can't rely on memory in distributed analytics"... now
maybe we are challenging the assumption that "big data analytics need to be
distributed"?
I've been asking the same question lately and seen similarly t
actory.<init>(PoolingHttpClientConnectionManager.java:494)
> at
> org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:149)
> at
> org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:138)
> at
> org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:114)
>
>
--
jay vyas
- https://wiki.apache.org/incubator/IgniteProposal has, I think, been updated
recently and has a good comparison.
- Although GridGain has been around since the Spark days, Apache Ignite is
quite new and just getting started, I think, so
- you will probably want to reach out to the developers for
ons/spark-streaming.html
> .
>
> With a kafka receiver that pulls data from a single kafka partition of a
> kafka topic, are individual messages in the microbatch in same the order as
> kafka partition? Are successive microbatches originating from a kafka
> partition executed in order?
Ah, never mind, I just saw
http://spark.apache.org/docs/1.2.0/sql-programming-guide.html (language
integrated queries) which looks quite similar to what I was thinking
about. I'll give that a whirl...
On Wed, Feb 11, 2015 at 7:40 PM, jay vyas
wrote:
> Hi spark. is there anything in t
}).join(ProductMetaData).by(product,meta=>product.id=meta.id). toSchemaRDD ?
I know the above snippet is totally wacky but, you get the idea :)
--
jay vyas
hive dates just for
dealing with time stamps.
What's the simplest and cleanest way to map non-Spark time values into
SparkSQL-friendly time values?
- One option could be a custom SparkSQL type, I guess?
- Any plan to have native Spark SQL support for Joda Time or (yikes)
java.util.Calendar?
--
jay vyas
efore performing any operations on it. That
way, the EdgeRDDImpl class doesn't have to use the default partitioner.
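A minimal sketch of that, assuming `graph` is the Graph from the thread:

import org.apache.spark.graphx.PartitionStrategy

// Partition the graph explicitly before running algorithms, so the edges
// are not left on the default partitioner.
val partitionedGraph = graph.partitionBy(PartitionStrategy.EdgePartition2D)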
Hope this helps!
Jay
On Tue Feb 03 2015 at 12:35:14 AM NicolasC wrote:
> On 01/29/2015 08:31 PM, Ankur Dave wrote:
> > Thanks for the reminder. I just crea
Just curious, is this set to be merged at some point?
On Thu Jan 22 2015 at 4:34:46 PM Ankur Dave wrote:
> At 2015-01-22 02:06:37 -0800, NicolasC wrote:
> > I try to execute a simple program that runs the ShortestPaths algorithm
> > (org.apache.spark.graphx.lib.ShortestPaths) on a small grid gr
It's a very valid idea indeed, but... it's a tricky subject, since the entire
ASF is run on mailing lists; hence there are so many different but equally
sound ways of looking at this idea, which conflict with one another.
> On Jan 21, 2015, at 7:03 AM, btiernay wrote:
>
> I think this is a re
I find importing a working SBT project into IntelliJ is the way to go.
How did you load the project into intellij?
> On Jan 13, 2015, at 4:45 PM, Enno Shioji wrote:
>
> Had the same issue. I can't remember what the issue was but this works:
>
> libraryDependencies ++= {
> val sparkVers
Hi Enno. It might be worthwhile to cross-post this on dev@hadoop... Obviously a
simple Spark way to test this would be to change the URI to write to hdfs:// or
maybe file://, and confirm that the extra slash goes away.
- if it's indeed a jets3t issue we should add a new unit test for
Found a problem in the spark-shell, but can't confirm that it's related to
open issues on Spark's JIRA page. I was wondering if anyone could help
identify if this is an issue or if it's already being addressed.
Test: (in spark-shell)
case class Person(name: String, age: Int)
val peopleList = Lis
this (
https://github.com/apache/spark/pull/1588) was closed.
Is this something I just have to live with when using the REPL? Or is this
covered by something bigger that's being addressed?
Thanks in advance
-Jay
awaiting processing or does it just process them?
>
> Asim
>
--
jay vyas
https://github.com/jayunit100/SparkStreamingCassandraDemo
On this note, I've built a framework which is mostly "pure" so that functional
unit tests can be run composing mock data for Twitter statuses, with just
regular junit... That might be relevant also.
I think at some point we should come
Here's an example of a Cassandra ETL that you can follow, which should exit on
its own. I'm using it as a blueprint for revolving Spark Streaming apps on top
of.
For me, I kill the streaming app with System.exit after a sufficient amount of
data is collected.
That seems to work for most any scena
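Concretely, a minimal sketch of that shutdown pattern (the stream and threshold are hypothetical; ssc.stop() is used here rather than System.exit so the shutdown is orderly):

val seen = ssc.sparkContext.accumulator(0L, "records seen")

stream.foreachRDD { rdd =>
  seen += rdd.count() // track how much data has arrived
}

ssc.start()
// Poll from the driver thread and stop once enough has been collected.
while (seen.value < 100000L) Thread.sleep(10000)
ssc.stop(stopSparkContext = true, stopGracefully = true)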
I'm trying to implement a graph algorithm that does a form of path
searching. Once a certain criteria is met on any path in the graph, I
wanted to halt the rest of the iterations. But I can't see how to do that
with the Pregel API, since any vertex isn't able to know the state of other
arbitrary
hout issues.
>
> Regularly checking 'scheduling delay' and 'total delay' on the Streaming
> tab in the UI is a must. (And soon we will have that on the metrics report
> as well!! :-) )
>
> -kr, Gerard.
>
>
>
> On Thu, Nov 27, 2014 at 8:14 AM, Bill Ja
your use case, the cleanest way to solve this, is by
> asking Spark Streaming "remember" stuff for longer, by using
> streamingContext.remember(). This will ensure that Spark
> Streaming will keep around all the stuff for at least that duration.
> Hope this helps.
>
> TD
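A minimal sketch of the remember() suggestion quoted above (the duration is an assumption): keep generated RDDs around longer than the default cleanup window.

import org.apache.spark.streaming.Minutes

ssc.remember(Minutes(10))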
Just to add one more point: if Spark Streaming knows when an RDD will not be
used any more, I believe Spark will not try to retrieve data it will not
use any more. However, in practice, I often encounter the "cannot
compute split" error. Based on my understanding, this is because Spark cleared
ou
and when to use spark submit to execute python
> scripts/module
> Bonus points if one can point an example library and how to run it :)
> Thanks
>
--
jay vyas
> https://github.com/dibbhatt/kafka-spark-consumer
>
> Regards,
> Dibyendu
>
> On Sun, Nov 23, 2014 at 2:13 AM, Bill Jay
> wrote:
>
>> Hi all,
>>
>> I am using Spark to consume from Kafka. However, after the job has run
>> for several hours, I saw t
Hi all,
I am using Spark to consume from Kafka. However, after the job has run for
several hours, I saw the following failure of an executor:
kafka.common.ConsumerRebalanceFailedException:
group-1416624735998_ip-172-31-5-242.ec2.internal-1416648124230-547d2c31
can't rebalance after 4 retries
This seems pretty standard: your IntelliJ classpath isn't matched to the
correct one that is used in the spark shell.
Are you using the SBT plugin? If not, how are you putting deps into IntelliJ?
> On Nov 20, 2014, at 7:35 PM, Sanjay Subramanian
> wrote:
>
> hey guys
>
> I am at AmpCamp 2014
Hi all,
I am running a Spark Streaming job. It was able to produce the correct
results up to some point in time. Later on, the job was still running but producing
no results. I checked the Spark Streaming UI and found that 4 tasks of a
stage failed.
The error messages showed that "Job aborted due to stage
sure there’s the a parameter in KafkaUtils.createStream you can specify
> the spark parallelism, also what is the exception stacks.
>
>
>
> Thanks
>
> Jerry
>
>
>
> *From:* Bill Jay [mailto:bill.jaypeter...@gmail.com]
> *Sent:* Tuesday, November 18, 2
Thu, Nov 13, 2014 at 5:00 AM, Helena Edelson wrote:
> I encounter no issues with streaming from kafka to spark in 1.1.0. Do you
> perhaps have a version conflict?
>
> Helena
> On Nov 13, 2014 12:55 AM, "Jay Vyas" wrote:
>
>> Yup , very important that n>1 f
at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
--
jay vyas
the periodic CPU spike - I had a reduceByKey, so was it
> doing that only after all the batch data was in?
>
> Thanks
>
--
jay vyas
should be local[n], n > 1 for
> local mode. Beside there’s a Kafka wordcount example in Spark Streaming
> example, you can try that. I’ve tested with latest master, it’s OK.
>
> Thanks
> Jerry
>
> From: Tobias Pfeiffer [mailto:t...@preferred.jp]
> Sent: Thursday,
Streaming
> example, you can try that. I’ve tested with latest master, it’s OK.
>
>
>
> Thanks
>
> Jerry
>
>
>
> *From:* Tobias Pfeiffer [mailto:t...@preferred.jp]
> *Sent:* Thursday, November 13, 2014 8:45 AM
> *To:* Bill Jay
> *Cc:* u...@spark.incubator.apache.
Hi all,
I have a Spark streaming job which constantly receives messages from Kafka.
I was using Spark 1.0.2 and the job had been running for a month. However,
now that I am using Spark 1.1.0, the Spark streaming job cannot
receive any messages from Kafka. I have not made any change to the co
Hi Spark. I have a set of text files that are dependencies of my app.
They are less than 2 MB in total size.
What's the idiom for packaging text file dependencies for a Spark-based jar
file? Class resources in packages? Or distributing them separately?
Hi all,
I have a Spark streaming job that consumes data from Kafka and performs
some simple operations on the data. This job is run in an EMR cluster with
10 nodes. The batch size I use is 1 minute and it takes around 10 seconds
to generate the results that are inserted into a MySQL database. Howeve
A use case would be helpful.
Batches of RDDs from streams are going to have a temporal ordering in terms of
when they are processed in a typical application..., but maybe you could
shuffle the way batch iterations work
> On Nov 3, 2014, at 11:59 AM, Josh J wrote:
>
> When I'm outputting the
;>> On Oct 20, 2014, at 3:07 AM, Gerard Maas
>>>>>> wrote:
>>>>>>
>>>>>> Pinging TD -- I'm sure you know :-)
>>>>>>
>>>>>> -kr, Gerard.
>>>>>>
>>>>>> On Fri, Oct 17, 2014 at 11:20 PM, Gerard Maas
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> We have been implementing several Spark Streaming jobs that are
>>>>>>> basically processing data and inserting it into Cassandra, sorting it
>>>>>>> among
>>>>>>> different keyspaces.
>>>>>>>
>>>>>>> We've been following the pattern:
>>>>>>>
>>>>>>> dstream.foreachRDD(rdd =>
>>>>>>> val records = rdd.map(elem => record(elem))
>>>>>>> targets.foreach(target => records.filter{record =>
>>>>>>> isTarget(target,record)}.writeToCassandra(target,table))
>>>>>>> )
>>>>>>>
>>>>>>> I've been wondering whether there would be a performance difference
>>>>>>> in transforming the dstream instead of transforming the RDD within the
>>>>>>> dstream with regards to how the transformations get scheduled.
>>>>>>>
>>>>>>> Instead of the RDD-centric computation, I could transform the
>>>>>>> dstream until the last step, where I need an rdd to store.
>>>>>>> For example, the previous transformation could be written as:
>>>>>>>
>>>>>>> val recordStream = dstream.map(elem => record(elem))
>>>>>>> targets.foreach{target => recordStream.filter(record =>
>>>>>>> isTarget(target,record)).foreachRDD(_.writeToCassandra(target,table))}
>>>>>>>
>>>>>>> Would be a difference in execution and/or performance? What would
>>>>>>> be the preferred way to do this?
>>>>>>>
>>>>>>> Bonus question: Is there a better (more performant) way to sort the
>>>>>>> data in different "buckets" instead of filtering the data collection
>>>>>>> times
>>>>>>> the #buckets?
>>>>>>>
>>>>>>> thanks, Gerard.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
--
jay vyas
t; To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
--
jay vyas
On Tue, Oct 21, 2014 at 11:02 AM, jay vyas
wrote:
> Hi Spark! I found out why my RDDs weren't coming through in my spark
> stream.
>
> It turns out that onStart() needs to return, it seems - i.e. you
> need to launch the worker part of your
> start process
Hi Spark! I found out why my RDDs weren't coming through in my spark
stream.
It turns out that onStart() needs to return, it seems - i.e. you
need to launch the worker part of your
start process in a thread. For example
def onStartMock():Unit ={
val future = new Thread(new
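The snippet above is cut off in the archive; a minimal sketch of the same idea (the receive() loop is a hypothetical stand-in for real data ingestion):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class MockReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
  override def onStart(): Unit = {
    new Thread("mock-receiver") {
      override def run(): Unit = receive() // worker runs on its own thread
    }.start()                              // onStart returns right away
  }
  override def onStop(): Unit = { /* signal receive() to finish */ }

  private def receive(): Unit = {
    while (!isStopped()) {
      store("mock record") // stand-in for real data
      Thread.sleep(100)
    }
  }
}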
ert a JavaRDD
>> into
>> > an iterator or iterable over then entire data set without using collect
>> or
>> > holding all data in memory.
>> >In many problems where it is desirable to parallelize intermediate
>> steps
>> > but use a single process for handling the final result this could be
>> very
>> > useful.
>>
>
>
>
> --
> Steven M. Lewis PhD
> 4221 105th Ave NE
> Kirkland, WA 98033
> 206-384-1340 (cell)
> Skype lordjoe_com
>
>
--
jay vyas
Hi Spark!
I don't quite understand the semantics of RDDs in a streaming context
very well yet.
Are there any examples in the docs of how to implement custom InputDStreams,
with corresponding Receivers?
I've hacked together a custom stream, which is being opened and is
consuming data internal
> Andy
>
>
> import sys
> from operator import add
>
> from pyspark import SparkContext
>
> # only stand alone jobs should create a SparkContext
> sc = SparkContext(appName="pyStreamingSparkRDDPipe")
>
> data = [1, 2, 3, 4, 5]
> rdd = sc.parallelize(data)
>
> def echo(data):
> print "python recieved: %s" % (data) # output winds up in the shell
> console in my cluster (ie. The machine I launched pyspark from)
>
> rdd.foreach(echo)
> print "we are done"
>
>
>
--
jay vyas
bug and learn.
>
> thanks
>
> sanjay
>
>
>
>
> --
> Daniel Siegmann, Software Developer
> Velos
> Accelerating Machine Learning
>
> 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
> E: daniel.siegm...@velos.io W: www.velos.io
>
>
>
--
jay vyas
is clear in the doc. You should use what Burak
> suggested:
>
> val predictions = model.predict(data.map(x => (x.user, x.product)))
>
> Best,
> Xiangrui
>
> On Thu, Aug 7, 2014 at 1:20 PM, Burak Yavuz wrote:
> > Hi Jay,
> >
> > I've had the same problem you
sting for recommendation models?" Leave it nice and general...
Thanks in advance. Sorry for the long ramble.
Jay
> at java.lang.Class.forName(Class.java:270)
> at
> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:820)
> at
> org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:89)
> ... 64 more
>
>
> On Sun, Aug 3, 2014 at 6:04 PM, Rahul Bhojwani <
> rahulbhojwani2...@gmail.com> wrote:
>
>> Hi,
>>
>> I used to run spark scripts on local machine. Now i am porting my codes
>> to EMR and i am facing lots of problem.
>>
>> The main one now is that the spark script which is running properly on my
>> local machine is giving error when run on Amazon EMR Cluster.
>> Here is the error:
>>
>>
>>
>>
>>
>> What can be the possible reason?
>> Thanks in advance
>> --
>>
>> Rahul K Bhojwani
>> about.me/rahul_bhojwani
>> <http://about.me/rahul_bhojwani>
>>
>>
>
>
>
> --
>
> Rahul K Bhojwani
> about.me/rahul_bhojwani
> <http://about.me/rahul_bhojwani>
>
>
>
--
jay vyas
r JUnit
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Unit-Testing-JUnit-with-Spark-tp10861.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
--
jay vyas