GraphFrames is back with v0.9.2! pip install graphframes-py :)

2025-08-01 Thread Russell Jurney
If you use GraphFrames, please let us know! We want your help testing new versions before the release. Got questions or concerns? Let us know what you think! Find us on Discord in #graphframes on GraphGeeks <https://discord.com/channels/1162999022819225631/1326257052368113674>, or join

Motif finding tutorial

2025-03-24 Thread Russell Jurney
If you've never used network motifs to explore and evaluate a graph or relational dataset, there is a new tutorial out for GraphFrames that demonstrates the process thoroughly: https://graphframes.github.io/graphframes/docs/_site/motif-tutorial.html Thanks, Russell Jurney | rjur...@graphl
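The tutorial walks through GraphFrames motif finding, where patterns like "(a)-[e]->(b); (b)-[e2]->(c)" are matched against the graph. As a rough sketch of what a two-hop motif match computes, here is a pure-Python version over a toy edge list (illustrative data, not the GraphFrames API or the tutorial's dataset):

```python
# Pure-Python sketch of the idea behind a two-edge motif match:
# find every (a, b, c) such that edges a->b and b->c both exist.
edges = [("u1", "u2"), ("u2", "u3"), ("u2", "u4"), ("u3", "u1")]

def find_two_hop(edges):
    """Return all (a, b, c) with edges a->b and b->c."""
    out_edges = {}
    for src, dst in edges:
        out_edges.setdefault(src, []).append(dst)
    matches = []
    for a, b in edges:
        for c in out_edges.get(b, []):
            matches.append((a, b, c))
    return matches

print(find_two_hop(edges))
# [('u1', 'u2', 'u3'), ('u1', 'u2', 'u4'), ('u2', 'u3', 'u1'), ('u3', 'u1', 'u2')]
```

In GraphFrames the equivalent query is expressed declaratively, e.g. g.find("(a)-[]->(b); (b)-[]->(c)"), and returns a DataFrame of matches.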

GraphFrames Hackathon - NOW :)

2025-02-20 Thread Russell Jurney
Just a reminder - the GraphFrames hackathon is starting now, 8AM CET and runs until 5PM PST. https://meet.google.com/zom-xudb-xzf Thanks, Russell Jurney | rjur...@graphlet.ai | graphlet.ai | Graphlet AI Blog <https://blog.graphlet.ai/> | LinkedIn <https://linkedin.com/in/russ

Re: GraphFrames Hackathon on Friday, February 21

2025-02-01 Thread Russell Jurney
YouTube Live Streams: https://www.youtube.com/user/holdenkarau > Pronouns: she/her > > > On Sat, Feb 1, 2025 at 4:59 PM Russell Jurney > wrote: > >> Please forgive this double email, but of course I forgot the meeting url: >> https://meet.google.com/zom-xudb-xzf >

Re: GraphFrames Hackathon on Friday, February 21

2025-02-01 Thread Russell Jurney
Please forgive this double email, but of course I forgot the meeting url: https://meet.google.com/zom-xudb-xzf See you there on the 21st between 8AM CET and 5PM PST :) Russell On Sat, Feb 1, 2025 at 4:44 PM Russell Jurney wrote: > You are being contacted because you are on the Spark user

GraphFrames Hackathon on Friday, February 21

2025-02-01 Thread Russell Jurney
s/issues?q=is%3Aissue%20state%3Aopen%20label%3A%22help%20wanted%22>. If you have an idea, please file an issue! <https://github.com/graphframes/graphframes/issues/new?template=Blank+issue> We look forward to seeing you there! Thanks, Russell Jurney | rjur...@graphlet.ai | graphlet.ai | G

Re: Drop Python 2 support from GraphFrames?

2025-02-01 Thread Russell Jurney
+1 long overdue. Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR. View my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

Re: Drop Python 2 support from GraphFrames?

2025-01-31 Thread Russell Jurney
s://www.youtube.com/user/holdenkarau > Pronouns: she/her > > > On Fri, Jan 31, 2025 at 5:10 PM Russell Jurney > wrote: > >> So... including the Spark user list for a broader perspective on Python 2 >> PySpark users. >> >> I want to remove Python 2 support from G

Drop Python 2 support from GraphFrames?

2025-01-31 Thread Russell Jurney
phframes/issues/490>. Do people really use PySpark 3 [or soon 4] in Python 2? Is this a thing, or is this a reference to Python 2 left over from GraphFrames' birth in 2016? Thanks, Russell Jurney | rjur...@graphlet.ai | graphlet.ai | Graphlet AI Blog <https://blog.graphlet.ai/> | LinkedIn &l

Help choose a GraphFrames logo

2025-01-15 Thread Russell Jurney
GraphFrames needs a logo, so I created a 99designs contest to create one. There are six finalists. Please vote for the one you like the most :) https://99designs.com/contests/poll/c00e5edaf5 Thanks, Russell Jurney | rjur...@graphlet.ai | graphlet.ai | Graphlet AI Blog <https://blog.graphlet

Re: GraphFrames' ConnectedComponentSuite test 'two components and two dangling vertices' fails with OutOfMemoryError: Java heap space

2025-01-14 Thread Russell Jurney
TestSparkContext.scala from 4 to 10, and the > "checkpoint interval" test ran perfectly without throwing an OOM error. > Why? No idea, but it worked. > > > > El lun, 13 ene 2025 a las 16:45, Russell Jurney () > escribió: > >> Merged, thanks guys! >> >

Re: GraphFrames' ConnectedComponentSuite test 'two components and two dangling vertices' fails with OutOfMemoryError: Java heap space

2025-01-13 Thread Russell Jurney
:10 skrev Ángel : > >> Hi Russell, >> >> I've just got the OOM error during Test 13. I'm running it from IntelliJ >> on Windows with Java 11. >> >> [image: image.png] >> I'll look into it over the course of the next week. >> >>

GraphFrames' ConnectedComponentSuite test 'two components and two dangling vertices' fails with OutOfMemoryError: Java heap space

2025-01-11 Thread Russell Jurney
GraphFrames. Hackathon announced next week :) - GraphFrames Mailing List <https://groups.google.com/g/graphframes/>: ask questions about GraphFrames on our Google Group - #graphframes Discord Channel on GraphGeeks <https://discord.com/channels/1162999022819225631/1326257052
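The failing suite exercises connected components at scale; the test name describes the expected answer on a small graph. A pure-Python union-find sketch of the same computation on toy data (the "dangling" vertices appear in the vertex list but touch no edge; GraphFrames labels each component, typically by its smallest vertex id, which this sketch assumes):

```python
# Pure-Python union-find sketch of connected components.
# Vertices 5 and 6 are dangling: listed, but in no edge.
vertices = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (3, 4)]

def connected_components(vertices, edges):
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # smaller id labels the component
    return {v: find(v) for v in vertices}

print(connected_components(vertices, edges))
# {1: 1, 2: 1, 3: 3, 4: 3, 5: 5, 6: 6}  -> two components, two singletons
```

Two components ({1,2} and {3,4}) plus two single-vertex components for the dangling vertices, matching the test's name.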

Re: LLM based data pre-processing

2025-01-03 Thread Russell Jurney
even with a custom model, if your using 3rd party inference > or even just trying to keep your GPUs warm in general the co-location may > not be as important. > > On Fri, Jan 3, 2025 at 9:01 AM Russell Jurney > wrote: > >> Thanks! The first link is old, here is a more rec

Re: LLM based data pre-processing

2025-01-03 Thread Russell Jurney
Thanks! The first link is old, here is a more recent one: 1) https://python.langchain.com/docs/integrations/providers/spark/#spark-sql-individual-tools Russell On Fri, Jan 3, 2025 at 8:50 AM Gurunandan wrote: > HI Mayur, > Please evaluate Langchain's Spark Dataframe Agent for your use case. >

Re: LLM based data pre-processing

2025-01-03 Thread Russell Jurney
I don't have an answer, but I have the very same questions and am eagerly awaiting a solid response :) Russell On Fri, Jan 3, 2025 at 5:07 AM Mayur Dattatray Bhosale wrote: > Hi team, > > We are planning to use Spark for pre-processing the ML training data given > the data is 500+ TBs. > > One

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-11-18 Thread Russell Jurney
t; > I have made the change suggested in jira and was able to run the tests > after building. > Opened up a PR <https://github.com/apache/spark/pull/48871>. > Can you review it? > > Regards > Awadhesh > > On Mon, Nov 18, 2024 at 1:57 PM Russell Jurney > wrote: >

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-11-18 Thread Russell Jurney
; relevant to GraphX. While the other three JIRAs mention GraphX in their > descriptions, they appear to be more related to the build or the REPL > rather than GraphX itself. > > Thanks, > > Xiao > > > > > > > On Nov 16, 2024 at 5:39:27 PM, Russell Jurney >

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-11-12 Thread Russell Jurney
GraphX and vice versa, which is a good place to learn. If you have any bug suggestions, please let me know. Russ On Tue, Nov 12, 2024 at 12:58 PM Russell Jurney wrote: > That is unfortunate. I saw someone volunteer to review my PRs. I thought > there was a holdout? > > On Tue, Nov 12,

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-11-12 Thread Russell Jurney
ere trying to get maintainers the > deprecation of GraphX passed suddenly in the middle of that discussion. > > El mar, 12 nov 2024, 21:47, Russell Jurney > escribió: > >> I guess you missed where Reynold Xin suggested we instead bring >> GraphFrames into Spark and others ag

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-11-12 Thread Russell Jurney
ng" against deprecating GraphX because it seemed not to have had any maintainers for quite some time. > Maybe I got it wrong. > > El mar, 12 nov 2024, 19:12, Russell Jurney > escribió: > >> Not sure what you mean? GraphX is the core Apache Spark technology >> underneath G

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-11-12 Thread Russell Jurney
st. Thanks, Russell Jurney On Tue, Nov 12, 2024 at 6:48 AM Ángel wrote: > But the goal wasn't to fix bugs in GraphX? What has that to do with > graphframes? > > El mar, 12 nov 2024, 12:58, Russell Jurney > escribió: > >> I started working on GraphFrames this

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-11-12 Thread Russell Jurney
ussell On Wed, Oct 16, 2024 at 6:53 PM Russell Jurney wrote: > For starters I created a ticket. I'm going to work on the project a bit > and then name a date and time. > > https://github.com/graphframes/graphframes/issues/460 > > On Tue, Oct 15, 2024 at 7:48 PM Ángel

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-10-16 Thread Russell Jurney
st > and distribute the tasks among us. We can also share the knowledge we gain > from resolving them. > btw, what happened to the (great) hackathon idea? any date/s in mind? > > El mié, 16 oct 2024 a las 3:53, Russell Jurney () > escribió: > >> I've never used Visual

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-10-07 Thread Russell Jurney
thealthinsurance.com/?q=hk_email> >> Books (Learning Spark, High Performance Spark, etc.): >> https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9> >> YouTube Live Streams: https://www.youtube.com/user/holdenkarau >> Pronouns: she/her >> >> >> On Mon

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-10-07 Thread Russell Jurney
n.to/2MaRAG9 <https://amzn.to/2MaRAG9> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau > Pronouns: she/her > > > On Mon, Oct 7, 2024 at 5:02 PM Russell Jurney > wrote: > >> I’ll look for a bug to fix. If GraphX is outside of Spark, Spark would >

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-10-07 Thread Russell Jurney
t; > YouTube Live Streams: https://www.youtube.com/user/holdenkarau > Pronouns: she/her > > > On Mon, Oct 7, 2024 at 3:25 PM Russell Jurney > wrote: > >> I volunteer to maintain GraphX to keep GraphFrames a viable project. I >> don’t have a clear view on whether i

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-10-07 Thread Russell Jurney
; *Date: *Sunday, October 6, 2024 at 06:22 > *To: *Ángel > *Cc: *Russell Jurney , Mich Talebzadeh < > mich.talebza...@gmail.com>, Spark dev list , user > @spark > *Subject: *Re: [DISCUSS] Deprecate GraphX OR Find new maintainers > interested in GraphX OR leave it as is? >

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-10-05 Thread Russell Jurney
A lot of people like me use GraphFrames for its connected components implementation and its motif matching feature. I am willing to work on it to keep it alive. They did a 0.8.3 release not too long ago. Please keep GraphX alive. On Sat, Oct 5, 2024 at 3:44 PM Mich Talebzadeh wrote: > I added th

Re: Terabytes data processing via Glue

2024-06-03 Thread Russell Jurney
You could use either Glue or Spark for your job. Use what you’re more comfortable with. Thanks, Russell Jurney @rjurney <http://twitter.com/rjurney> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB <http://facebook.com/jurney> datasyndrome.com On Sun, Jun

Re: OOM concern

2024-05-28 Thread Russell Jurney
<https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun >>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)". >>> >>> >>> On Tue, 28 May 2024 at 09:04, Perez wrote: >>> >>>> Thank you everyone for your response. >>&

Re: OOM concern

2024-05-27 Thread Russell Jurney
docs using pyspark.sql.DataFrame.repartition(n) at the start of your job. Thanks, Russell Jurney @rjurney <http://twitter.com/rjurney> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB <http://facebook.com/jurney> datasyndrome.com On Mon, May 27, 2024 at 9:
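The advice in this reply is to repartition at the start of the job so no single task holds too much data. Conceptually, pyspark.sql.DataFrame.repartition(n) without column arguments distributes rows round-robin across n partitions; a pure-Python sketch of that distribution (toy rows, not the Spark implementation):

```python
# Pure-Python sketch of round-robin partitioning, the scheme behind
# DataFrame.repartition(n) when no partitioning columns are given.
def round_robin_partition(rows, n):
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)  # deal rows out evenly, like cards
    return parts

rows = [("user_%d" % i, i) for i in range(1000)]
parts = round_robin_partition(rows, 8)

print([len(p) for p in parts])   # [125, 125, 125, 125, 125, 125, 125, 125]
```

In the actual job this is just df = df.repartition(n) before the heavy transformations; picking n large enough keeps each partition small enough to fit in executor memory.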

Re: [GraphFrames Spark Package]: Why is there not a distribution for Spark 3.3?

2024-03-15 Thread Russell Jurney
There is an implementation for Spark 3, but GraphFrames isn't released often enough to match every point version. It supports Spark 3.4. Try it - it will probably work. https://spark-packages.org/package/graphframes/graphframes Thanks, Russell Jurney @rjurney <http://twitter.com

Re: read a binary file and save in another location

2023-03-09 Thread Russell Jurney
Yeah, that's the right answer! Thanks, Russell Jurney @rjurney <http://twitter.com/rjurney> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB <http://facebook.com/jurney> datasyndrome.com Book a time on Calendly <https://calendly.com/rjurney_perso

Re: read a binary file and save in another location

2023-03-09 Thread Russell Jurney
DataFrameWriter isn't documented <https://spark.apache.org/docs/3.1.3/api/java/org/apache/spark/sql/DataFrameWriter.html#format-java.lang.String-> . Russell Jurney @rjurney <http://twitter.com/rjurney> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB <h

Re: [New Project] sparksql-ml : Distributed Machine Learning using SparkSQL.

2023-02-27 Thread Russell Jurney
I think it is awesome. Brilliant interface that is missing from Spark. Would you integrate with something like MLFlow? Thanks, Russell Jurney @rjurney <http://twitter.com/rjurney> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB <http://facebook.com/jurney>

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-24 Thread Russell Jurney
Oliver, just curious: did you get a clean error message when you broke it out into separate statements? Thanks, Russell Jurney @rjurney <http://twitter.com/rjurney> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB <http://facebook.com/jurney> datasyndrome.

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Russell Jurney
Usually, the solution to these problems is to do less per line, break it out and perform each minute operation as a field, then combine those into a final answer. Can you do that here? Thanks, Russell Jurney @rjurney <http://twitter.com/rjurney> russell.jur...@gmail.com LI <http://linked
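The advice here, sketched in pure Python: compute each term as its own intermediate "column", then combine them in a final step, rather than one giant expression. (In PySpark the combine step could use F.greatest over the intermediate columns; the field names and arithmetic below are illustrative.)

```python
# Break the computation into named intermediate terms, then take the max.
rows = [{"a": 1, "b": 5}, {"a": 7, "b": 2}]

def with_max_term(row):
    term1 = row["a"] * 2        # first intermediate term (illustrative)
    term2 = row["b"] + 1        # second intermediate term (illustrative)
    row = dict(row, term1=term1, term2=term2)
    row["max_term"] = max(term1, term2)   # the final, simple combine step
    return row

out = [with_max_term(r) for r in rows]
print([r["max_term"] for r in out])   # [6, 14]
```

When something breaks, each intermediate term is inspectable on its own, which is the point of the advice: the error surfaces at the small step that caused it.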

Re: Check if shuffle is caused for repartitioned pyspark dataframes

2022-12-23 Thread Russell Jurney
n >> > 2. Using base dataframes itself (without explicit repartitioning) to >> perform join+aggregatio >> > >> > I have a StackOverflow post with more details regarding the same: >> > https://stackoverflow.com/q/74771971/14741697 >> > >> > T

Re: Pipelined execution in Spark (???)

2022-09-07 Thread Russell Jurney
jobs would benefit (a lot) from it? >> >> Thanks, >> >> --- Sungwoo >> >> On Thu, Sep 8, 2022 at 1:47 AM Russell Jurney >> wrote: >> >>> I don't think Spark can do this with its current architecture. It has to >>> wait for the step to be done, s

Re: Pipelined execution in Spark (???)

2022-09-07 Thread Russell Jurney
Oops, it has been a long time since Russell labored on Hadoop; speculative execution isn't the right term - that is something else. Cascading has a declarative interface so you can plan more, whereas Spark is more imperative. Point remains :) On Wed, Sep 7, 2022 at 3:56 PM Russell Jurney wrote: >

Re: Pipelined execution in Spark (???)

2022-09-07 Thread Russell Jurney
of Spark jobs would benefit (a lot) from it? > > Thanks, > > --- Sungwoo > > On Thu, Sep 8, 2022 at 1:47 AM Russell Jurney > wrote: > >> I don't think Spark can do this with its current architecture. It has to >> wait for the step to be done, speculative

Re: Pipelined execution in Spark (???)

2022-09-07 Thread Russell Jurney
I don't think Spark can do this with its current architecture. It has to wait for the step to be done, speculative execution isn't possible. Others probably know more about why that is. Thanks, Russell Jurney @rjurney <http://twitter.com/rjurney> russell.jur...@gmail.com LI <

Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Russell Jurney
YOU know what you're talking about and aren't hacking a solution. You are my new friend :) Thank you, this is incredibly helpful! Thanks, Russell Jurney @rjurney <http://twitter.com/rjurney> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB <h

Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Russell Jurney
consistent, measurable and valid results! :) Russell Jurney On Thu, Aug 25, 2022 at 10:00 AM Sean Owen wrote: > It's important to realize that while pandas UDFs and pandas on Spark are > both related to pandas, they are not themselves directly related. The first > lets you use pandas wit

Re: Can't load a RandomForestClassificationModel in Spark job

2017-02-16 Thread Russell Jurney
/ch08/make_predictions_streaming.py I had to create a pyspark.sql.Row in a map operation in an RDD before I call spark.createDataFrame. Check out lines 92-138. Not sure if this helps, but I thought I'd give it a try ;) --- Russell Jurney @rjurney <http://twitter.com/rjurney> russell.jur...@gm

Re: Spark / Elasticsearch Error: Maybe ES was overloaded? How to throttle down Spark as it writes to ES

2017-01-18 Thread Russell Jurney
o much use. > > check out these settings, maybe they are of some help: > es.batch.size.bytes > es.batch.size.entries > es.http.timeout > es.batch.write.retry.count > es.batch.write.retry.wait > > > On Tue, Jan 17, 2017 at 10:13 PM, Russell Jurney > wrote: > > How can I thrott
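The reply names the elasticsearch-hadoop settings that throttle bulk writes. A sketch of passing them as write options; the values here are assumptions to tune for your cluster, only the keys come from the thread:

```python
# elasticsearch-hadoop throttling settings from the reply; the VALUES are
# illustrative guesses, not recommendations. In PySpark these would be passed
# via .option(k, v) on a DataFrame write to "org.elasticsearch.spark.sql".
es_throttle_opts = {
    "es.batch.size.entries": "500",      # docs per bulk request
    "es.batch.size.bytes": "1mb",        # bytes per bulk request
    "es.batch.write.retry.count": "10",  # retries when ES rejects a bulk
    "es.batch.write.retry.wait": "30s",  # backoff between retries
    "es.http.timeout": "2m",             # HTTP request timeout
}

print(sorted(es_throttle_opts))
```

Smaller batches and longer retry waits slow Spark's write rate down to what the Elasticsearch cluster can absorb.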

Spark / Elasticsearch Error: Maybe ES was overloaded? How to throttle down Spark as it writes to ES

2017-01-17 Thread Russell Jurney
https://discuss.elastic.co/t/spark-elasticsearch-exception-maybe-es-was-overloaded/71932 Thanks! -- Russell Jurney twitter.com/rjurney russell.jur...@gmail.com relato.io

In PySpark ML, how can I interpret the SparseVector returned by a pyspark.ml.classification.RandomForestClassificationModel.featureImportances ?

2016-12-21 Thread Russell Jurney
Jupyter Notebook on Github here <https://github.com/rjurney/Agile_Data_Code_2/blob/master/ch09/Debugging%20Prediction%20Problems.ipynb>, skip to the end. Stack Overflow post: http://stackoverflow.com/questions/41273893/in-pyspark-ml-how-can-i-interpret-the-sparsevector-returned-by-a-pyspark-m
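The answer to the question in the title is that featureImportances is a SparseVector whose indices line up with the positions of the input features, so interpreting it means joining indices back to feature names. A pure-Python sketch with stand-in data (the feature names and numbers are illustrative, not from the notebook):

```python
# A SparseVector exposes .indices and .values; plain lists stand in here.
# Each index refers to a position in the model's input feature vector.
feature_names = ["dep_delay", "distance", "carrier", "day_of_week"]
indices, values = [0, 2], [0.83, 0.17]   # stand-in for the SparseVector

importances = {feature_names[i]: v for i, v in zip(indices, values)}
ranked = sorted(importances.items(), key=lambda kv: -kv[1])
print(ranked)   # [('dep_delay', 0.83), ('carrier', 0.17)]
```

Features absent from the indices got zero importance; the ranked pairs are the human-readable interpretation.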

Re: What is the deployment model for Spark Streaming? A specific example.

2016-12-17 Thread Russell Jurney
Anyone? This is for a book, so I need to figure this out. On Fri, Dec 16, 2016 at 12:53 AM Russell Jurney wrote: > I have created a PySpark Streaming application that uses Spark ML to > classify flight delays into three categories: on-time, slightly late, very > late. After an h

What is the deployment model for Spark Streaming? A specific example.

2016-12-16 Thread Russell Jurney
to run the app, maybe that is the problem? ssc.start() ssc.awaitTermination() What is the actual deployment model for Spark Streaming? All I know to do right now is to restart the PID. I'm new to Spark, and the docs don't really explain this (that I can see). Thanks! -- Russell J

Re: Spark ML DataFrame API - need cosine similarity, how to convert to RDD Vectors?

2016-11-16 Thread Russell Jurney
t convert the dataframe to > an rdd using something like this: > > df > .toJavaRDD() > .map(row -> (SparseVector)row.getAs(row.fieldIndex("columnName"))); > > On Tue, Nov 15, 2016 at 1:06 PM, Russell Jurney > wrote: > >> I have two dataframes

Spark ML DataFrame API - need cosine similarity, how to convert to RDD Vectors?

2016-11-15 Thread Russell Jurney
, but haven't found anything. Thanks! -- Russell Jurney twitter.com/rjurney russell.jur...@gmail.com relato.io
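Once the vectors are pulled out of the DataFrame (e.g. via its .rdd, as the reply suggests), cosine similarity itself is straightforward. A pure-Python sketch over sparse vectors represented as index->value dicts (toy data; not the Spark ML API):

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {index: value} dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())  # shared indices only
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

a = {0: 1.0, 3: 2.0}
b = {0: 2.0, 1: 1.0, 3: 4.0}
print(round(cosine(a, b), 4))   # 0.9759
```

Iterating only over the shared indices is what makes this cheap for sparse data, since the dot product skips every position where either vector is zero.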

Parquet compression jars not found - both snappy and lzo - PySpark 2.0.0

2016-09-27 Thread Russell Jurney
is here: https://gist.github.com/rjurney/6783d19397cf3b4b88af3603d6e256bd -- Russell Jurney twitter.com/rjurney russell.jur...@gmail.com relato.io

Re: Automating lengthy command to pyspark with configuration?

2016-08-29 Thread Russell Jurney
I've got most of it working through spark.jars On Sunday, August 28, 2016, ayan guha wrote: > Best to create alias and place in your bashrc > On 29 Aug 2016 08:30, "Russell Jurney" > wrote: > >> In order to use PySpark with MongoDB and ElasticSearch, I currentl

Automating lengthy command to pyspark with configuration?

2016-08-28 Thread Russell Jurney
ngthy additions to pyspark? Thanks! -- Russell Jurney twitter.com/rjurney russell.jur...@gmail.com relato.io

Re: --packages configuration equivalent item name?

2016-04-05 Thread Russell Jurney
Thanks! These aren't in the docs, I will make a JIRA to add them. On Monday, April 4, 2016, Saisai Shao wrote: > spark.jars.ivy, spark.jars.packages, spark.jars.excludes is the > configurations you can use. > > Thanks > Saisai > > On Sun, Apr 3, 2016 at 1:59 A
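The configuration keys Saisai names map the spark-submit flags to properties that can be set programmatically or in spark-defaults.conf. A sketch (the package coordinate and paths are illustrative, not from the thread):

```python
# Configuration equivalents of the spark-submit flags, per the reply.
# Values are illustrative; only the keys come from the thread.
conf = {
    "spark.jars.packages": "graphframes:graphframes:0.8.3-spark3.5-s_2.12",  # --packages
    "spark.jars.ivy": "/tmp/.ivy2",                                          # --ivy cache dir
    "spark.jars.excludes": "org.slf4j:slf4j-log4j12",                        # --exclude-packages
}

print(sorted(conf))
```

Each pair would be applied with SparkSession.builder.config(key, value), or placed one per line in spark-defaults.conf, removing the need for a lengthy command line.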

Re: --packages configuration equivalent item name?

2016-04-02 Thread Russell Jurney
gt; > export PYSPARK_DRIVER_PYTHON=python3 > > IPYTHON_OPTS=notebook $SPARK_ROOT/bin/pyspark $extraPkgs --conf > spark.cassandra.connection.host= > ec2-54-153-102-232.us-west-1.compute.amazonaws.com $* > > > > From: Russell Jurney > Date: Sunday, March 27, 2016 at

What is the most efficient way to do a sorted reduce in PySpark?

2016-04-02 Thread Russell Jurney
ter? Thanks! Stack Overflow: http://stackoverflow.com/questions/36376369/what-is-the-most-efficient-way-to-do-a-sorted-reduce-in-pyspark Gist: https://gist.github.com/rjurney/af27f70c76dc6c6ae05c465271331ade -- Russell Jurney twitter.com/rjurney russell.jur...@gmail.com relato.io
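One common shape of a "sorted reduce" is: group (key, value) pairs by key with cheap appends, then sort each group once at the end, rather than keeping every partial result sorted during the reduce. A pure-Python sketch with toy data (illustrative; the Stack Overflow post has the PySpark-specific details):

```python
from collections import defaultdict

pairs = [("SFO", 3), ("JFK", 1), ("SFO", 1), ("JFK", 2), ("SFO", 2)]

groups = defaultdict(list)
for k, v in pairs:           # the "reduce": O(1) appends, no sorting yet
    groups[k].append(v)

result = {k: sorted(vs) for k, vs in groups.items()}   # sort once per key
print(result)   # {'SFO': [1, 2, 3], 'JFK': [1, 2]}
```

In PySpark the same shape is groupByKey().mapValues(sorted): one sort per key after the shuffle, instead of repeated merges of sorted partial lists.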

Re: PySpark saving to MongoDB: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)

2016-03-30 Thread Russell Jurney
Actually, I can imagine a one or two line fix for this bug: call row.asDict() inside a wrapper for DataFrame.rdd. Probably deluding myself this could be so easily resolved? :) On Wed, Mar 30, 2016 at 6:10 PM, Russell Jurney wrote: > Thanks to some excellent work by Luke Lovett, we h

Re: PySpark saving to MongoDB: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)

2016-03-30 Thread Russell Jurney
as saving to a database is a pretty common thing to do from PySpark, and lots of analysis must be happening in DataFrames in PySpark? Anyway, the workaround for this bug is easy, cast the rows as dicts: my_dataframe = my_dataframe.map(lambda row: row.asDict()) On Mon, Mar 28, 2016 at 8:08 PM,
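The workaround from the thread, sketched in pure Python: cast each row to a plain dict before handing the RDD to the output format, since the Row objects don't serialize cleanly. A namedtuple stands in for pyspark.sql.Row here (in real PySpark, Row has an asDict() method and the line is df.rdd.map(lambda row: row.asDict())):

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row; field names are illustrative.
Row = namedtuple("Row", ["year", "month"])

def as_dict(row):
    """Stand-in for Row.asDict(): turn the row into a plain dict."""
    return dict(row._asdict())

rows = [Row(2015, 1), Row(2015, 2)]
dicts = [as_dict(r) for r in rows]
print(dicts)   # [{'year': 2015, 'month': 1}, {'year': 2015, 'month': 2}]
```

Plain dicts pickle to structures the MongoDB/Elasticsearch Hadoop output formats can consume, which is why the one-line cast sidesteps the ClassDict error.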

Re: PySpark saving to MongoDB: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)

2016-03-28 Thread Russell Jurney
btw, they can't be saved to BSON either. This seems a generic issue, can anyone else reproduce this? On Mon, Mar 28, 2016 at 8:02 PM, Russell Jurney wrote: > I created a JIRA: https://issues.apache.org/jira/browse/SPARK-14229 > > On Mon, Mar 28, 2016 at 7:43 PM, Russell J

Re: PySpark saving to MongoDB: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)

2016-03-28 Thread Russell Jurney
I created a JIRA: https://issues.apache.org/jira/browse/SPARK-14229 On Mon, Mar 28, 2016 at 7:43 PM, Russell Jurney wrote: > Ted, I am using the .rdd method, see above, but for some reason these RDDs > can't be saved to MongoDB or ElasticSearch. > > I think this is a bug in PyS

Re: PySpark saving to MongoDB: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)

2016-03-28 Thread Russell Jurney
= { > > On Mon, Mar 28, 2016 at 6:30 PM, Russell Jurney > wrote: > >> Ok, I'm also unable to save to Elasticsearch using a dataframe's RDD. >> This seems related to DataFrames. Is there a way to convert a DataFrame's >> RDD to a 'normal' RDD? &g

Re: PySpark saving to MongoDB: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)

2016-03-28 Thread Russell Jurney
Ok, I'm also unable to save to Elasticsearch using a dataframe's RDD. This seems related to DataFrames. Is there a way to convert a DataFrame's RDD to a 'normal' RDD? On Mon, Mar 28, 2016 at 6:20 PM, Russell Jurney wrote: > I filed a JIRA <https://jira.mongod

PySpark saving to MongoDB: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)

2016-03-28 Thread Russell Jurney
sk.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ... 1 more -- Russell Jurney twitter.com/rjurney russell.jur...@gmail.com relato.io

Re: DataFrame --> JSON objects, instead of un-named array of fields

2016-03-28 Thread Russell Jurney
To answer my own question, DataFrame.toJSON() does this, so there is no need to map and json.dump(): on_time_dataframe.toJSON().saveAsTextFile('../data/On_Time_On_Time_Performance_2015.jsonl') Thanks! On Mon, Mar 28, 2016 at 12:54 PM, Russell Jurney wrote: > In PySpark, given a
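What DataFrame.toJSON() produces, sketched in pure Python: one JSON object per record with field names attached, instead of an unnamed array of values. Toy fields and records, illustrative only:

```python
import json

fields = ["year", "month"]
records = [[2015, 1], [2015, 2]]

# Pair each value with its field name, then serialize one object per record.
jsonl = [json.dumps(dict(zip(fields, rec))) for rec in records]
print(jsonl[0])   # {"year": 2015, "month": 1}
```

Written one object per line, this is the JSON Lines format the saveAsTextFile call in the thread emits.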

DataFrame --> JSON objects, instead of un-named array of fields

2016-03-28 Thread Russell Jurney
null, null, 0, null, null, null, null, "", null, null, null, null, null, null, "", "", null, null, null, null, null, null, "", "", null, null, null, null, null, "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""] What I actually want is JSON objects, with a field name for each field: {"year": "2015", "month": 1, ...} How can I achieve this in PySpark? Thanks! -- Russell Jurney twitter.com/rjurney russell.jur...@gmail.com relato.io

--packages configuration equivalent item name?

2016-03-27 Thread Russell Jurney
n.html If there is no way to do this, please let me know so I can make a JIRA for this feature. Thanks! -- Russell Jurney twitter.com/rjurney russell.jur...@gmail.com relato.io

Re: Spark JDBC connection - data writing success or failure cases

2016-02-19 Thread Russell Jurney
-- Russell Jurney twitter.com/rjurney russell.jur...@gmail.com relato.io

Re: GraphX can show graph?

2016-01-29 Thread Russell Jurney
I could load and >>> compute few graph statistics. However, I am not sure whether it is possible >>> to create ad show graph (for visualization purpose) using GraphX. Any >>> pointer to tutorial or information connected to this will be really helpful >>> >>> Thanks and regards >>> Bala >>> >> >> > -- Russell Jurney twitter.com/rjurney russell.jur...@gmail.com relato.io

Re: processing large dataset

2015-01-22 Thread Russell Jurney
for >> processing terabytes of data and is there a way to make this >> configuration easier and more transparent? >> >> Thanks. >> -- Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com

Re: PySpark joins fail - please help

2014-10-17 Thread Russell Jurney
There was a bug in the devices line: dh.index('id') should have been x[dh.index('id')]. ᐧ On Fri, Oct 17, 2014 at 5:52 PM, Russell Jurney wrote: > Is that not exactly what I've done in j3/j4? The keys are identical > strings.The k is the same, the value in b

Re: PySpark joins fail - please help

2014-10-17 Thread Russell Jurney
> join() can only work with RDD of pairs (key, value), such as > > rdd1: (k, v1) > rdd2: (k, v2) > > rdd1.join(rdd2) will be (k, (v1, v2)) > > Spark SQL will be more useful for you, see > http://spark.apache.org/docs/1.1.0/sql-programming-guide.html > > Davies > > > On Fri,
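Davies' point, sketched in pure Python: RDD.join requires both sides to be (key, value) pairs, and each match produces (key, (v1, v2)). Toy device/event data, illustrative only:

```python
# Pure-Python sketch of RDD join semantics over (key, value) pairs.
def join(rdd1, rdd2):
    right = {}
    for k, v in rdd2:
        right.setdefault(k, []).append(v)   # index the right side by key
    # One output pair per matching (v1, v2) combination, keyed by k.
    return [(k, (v1, v2)) for k, v1 in rdd1 for v2 in right.get(k, [])]

devices = [("id1", "iPhone"), ("id2", "Pixel")]
events = [("id1", "click"), ("id1", "view"), ("id3", "click")]

print(join(devices, events))
# [('id1', ('iPhone', 'click')), ('id1', ('iPhone', 'view'))]
```

Keys appearing on only one side ("id2", "id3") drop out, as in an inner join; a key with multiple matches multiplies out, one output pair per combination.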

PySpark joins fail - please help

2014-10-17 Thread Russell Jurney
https://gist.github.com/rjurney/fd5c0110fe7eb686afc9 Any way I try to join my data fails. I can't figure out what I'm doing wrong. -- Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com ᐧ

Re: Update on Pig on Spark initiative

2014-08-27 Thread Russell Jurney
alytics), Anish Haldiya (Sigmoid Analytics), Aniket > Mokashi (Google), Greg Owen (DataBricks), Amit Kumar Behera (Sigmoid > Analytics), Mahesh Kalakoti (Sigmoid Analytics) > > Not to mention Spark & Pig communities. > > Regards > Mayur Rustagi > Ph: +1 (760) 203 3257

Re: hadoopRDD stalls reading entire directory

2014-06-02 Thread Russell Jurney
should definitely list your > processes. > > What does hivecluster2:8080 look like? My guess is it says there are 2 > applications registered, and one has taken all the executors. There must be > two applications running, as those are the only things that keep open those > 4040/40

Re: hadoopRDD stalls reading entire directory

2014-06-02 Thread Russell Jurney
, Russell Jurney wrote: > Looks like just worker and master processes are running: > > [hivedata@hivecluster2 ~]$ jps > > 10425 Jps > > [hivedata@hivecluster2 ~]$ ps aux|grep spark > > hivedata 10424 0.0 0.0 103248 820 pts/3S+ 10:05 0:00 grep spark > >

Re: hadoopRDD stalls reading entire directory

2014-06-02 Thread Russell Jurney
y started, and what resource allocations they have. > > > On Sun, Jun 1, 2014 at 6:47 PM, Russell Jurney > wrote: > >> Thanks again. Run results here: >> https://gist.github.com/rjurney/dc0efae486ba7d55b7d5 >> >> This time I get a port already in use exception on

Re: hadoopRDD stalls reading entire directory

2014-06-01 Thread Russell Jurney
setJars("avro.jar", ...) > val sc = new SparkContext(conf) > > > On Sun, Jun 1, 2014 at 2:32 PM, Russell Jurney > wrote: > >> Followup question: the docs to make a new SparkContext require that I >> know where $SPARK_HOME is. However, I have no idea. Any idea where t

Re: hadoopRDD stalls reading entire directory

2014-06-01 Thread Russell Jurney
aightforward ways of > producing assembly jars. > > > On Sat, May 31, 2014 at 11:23 PM, Russell Jurney > wrote: > >> Thanks for the fast reply. >> >> I am running CDH 4.4 with the Cloudera Parcel of Spark 0.9.0, in >> standalone mode. >> >&g

Re: hadoopRDD stalls reading entire directory

2014-05-31 Thread Russell Jurney
n Sat, May 31, 2014 at 9:37 PM, Russell Jurney > wrote: > > Now I get this: > > scala> rdd.first > > 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at > :41 > > 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 4 (first at > :41) wit

hadoopRDD stalls reading entire directory

2014-05-31 Thread Russell Jurney
UI to ensure that workers are registered and have sufficient memory 14/05/31 17:03:32 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory .. And never finishes. What should I do? --