Parallelism for glue pyspark jobs

2025-04-15 Thread Perez
Hi everyone, I am facing an issue; the problem is explained in detail in the SO post below. Any suggestions would be appreciated. https://stackoverflow.com/questions/79574599/unable-to-configure-the-exact-number-of-dpus-for-the-glue-pyspark-job Thanks
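A minimal sketch of pinning Glue capacity explicitly in the job definition, assuming the boto3 Glue client; the job name, role ARN, script path and worker count below are placeholders, not values from the thread:

import boto3

glue = boto3.client("glue", region_name="us-east-1")
glue.create_job(
    Name="my-pyspark-job",                                  # placeholder
    Role="arn:aws:iam::123456789012:role/GlueJobRole",      # placeholder
    Command={"Name": "glueetl",
             "ScriptLocation": "s3://my-bucket/scripts/job.py",
             "PythonVersion": "3"},
    GlueVersion="4.0",
    WorkerType="G.1X",        # one G.1X worker corresponds to 1 DPU (4 vCPU, 16 GB)
    NumberOfWorkers=10,       # so this job runs on ~10 DPUs
)

With WorkerType/NumberOfWorkers you specify capacity as workers rather than through MaxCapacity; the same two fields can also be passed to start_job_run to override a single run.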

Re: AWS Glue PySpark Job

2025-01-04 Thread Perez
Hi Team, I would appreciate any help with this: https://stackoverflow.com/questions/79324390/aws-glue-pyspark-job-is-not-ending/79324917#79324917

AWS Glue PySpark Job

2025-01-03 Thread Perez
Hi Team, I need your help understanding the problem described here: https://stackoverflow.com/questions/79324390/aws-glue-pyspark-job-is-not-ending/79324917#79324917

Data is not displayed in the readable format

2024-10-24 Thread Perez
Hi Team, it would be really helpful if I could get any help on this: https://stackoverflow.com/questions/79121611/unable-to-get-the-postgres-data-in-the-right-format-via-kafka-jdbc-source-conne Any hints would be appreciated. Thanks,
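A hedged sketch of a Kafka Connect JDBC source configuration posted through the Connect REST API. A common cause of "unreadable" Postgres values with this connector is NUMERIC/DECIMAL columns serialized as bytes, which "numeric.mapping": "best_fit" maps to plain int/double instead. Connector name, URL, credentials and table are placeholders:

import requests

connector = {
    "name": "pg-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:postgresql://localhost:5432/mydb",
        "connection.user": "user",
        "connection.password": "secret",
        "table.whitelist": "public.my_table",
        "mode": "incrementing",
        "incrementing.column.name": "id",
        "numeric.mapping": "best_fit",
        "value.converter": "org.apache.kafka.connect.json.JsonConverter",
        "value.converter.schemas.enable": "false",
    },
}
requests.post("http://localhost:8083/connectors", json=connector, timeout=30)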

Unable to format data

2024-10-23 Thread Perez
Hi Team, I need your help here: https://stackoverflow.com/questions/79118843/unable-to-format-the-kafka-topic-data-via-pyspark Thanks
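A minimal sketch of parsing Kafka topic data in PySpark, assuming the spark-sql-kafka package is on the classpath; the broker address, topic name and message schema below are placeholders. The key step is casting the binary Kafka value to string and parsing it with from_json:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("kafka-format").getOrCreate()

raw = (spark.read.format("kafka")                       # batch read; use readStream for streaming
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "my_topic")
       .option("startingOffsets", "earliest")
       .load())

schema = StructType([                                   # placeholder schema
    StructField("id", LongType()),
    StructField("name", StringType()),
])

parsed = (raw
          .select(from_json(col("value").cast("string"), schema).alias("v"))
          .select("v.*"))
parsed.show(truncate=False)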

Re: dynamically infer json data not working as expected

2024-08-08 Thread Perez
Also, I checked your code, but it will give the same result even if I do sampling, because the schema of the "data" attribute is not fixed. Any suggestions?

Re: dynamically infer json data not working as expected

2024-08-08 Thread Perez
Hi Mich, thanks a lot for your answer, but there is one more scenario. The schema of the "data" attribute inside the "steps" column is not fixed: for some records it is a struct, and for others it is an array of objects, so in the end Spark treats it as a string because the inference gets confused.
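A hedged sketch of one way to cope with a field that is sometimes a struct and sometimes an array of structs: normalize the raw JSON with plain Python so the field is always a list, then let Spark infer the schema. The input path and top-level "data" field are placeholders; in the thread the field is nested inside a "steps" column, so real code would walk into that level first:

import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
raw = spark.read.text("s3://my-bucket/input/")     # assumes one JSON document per line

def normalize(line):
    rec = json.loads(line)
    d = rec.get("data")
    if isinstance(d, dict):                        # struct case -> wrap so it is always an array
        rec["data"] = [d]
    return json.dumps(rec)

normalized = raw.rdd.map(lambda r: normalize(r.value))
df = spark.read.json(normalized)                   # inference now sees a consistent array type
df.printSchema()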

Re: dynamically infer json data not working as expected

2024-08-05 Thread Perez
https://stackoverflow.com/questions/78835509/dynamically-infer-schema-of-json-data-using-pyspark Any help would be appreciated. Thanks,

dynamically infer json data not working as expected

2024-08-05 Thread Perez
Hello everyone, I have described my problem in this SO post: https://stackoverflow.com/questions/78835509/dynamically-infer-schema-of-json-data-using-pyspark

AWS Glue and Python

2024-06-26 Thread Perez
Hi Team, I am facing an issue here: https://stackoverflow.com/questions/78673228/unable-to-read-text-file-in-glue-job TIA

Re: Unable to load MongoDB atlas data via PySpark because of BsonString error

2024-06-09 Thread Perez
Hi Team, any help in this matter would be greatly appreciated. TIA

Unable to load MongoDB atlas data via PySpark because of BsonString error

2024-06-08 Thread Perez
Hi Team, this is the problem: https://stackoverflow.com/questions/78593858/unable-to-load-mongodb-atlas-data-via-pyspark-jdbc-in-glue I can't go ahead with the StructType approach since my input records are huge, and if the underlying attributes are added or removed my code might fail. I can't change ...
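A hedged sketch of reading MongoDB Atlas through a Glue DynamicFrame rather than a hand-written StructType; DynamicFrames resolve inconsistent field types per record (as choice types), which is the usual reason to prefer them when documents vary. The URI, database, collection and credentials are placeholders, and the exact option names follow the Glue MongoDB connection documentation:

mongo_config = {
    "uri": "mongodb+srv://cluster0.example.mongodb.net",   # placeholder Atlas URI
    "database": "mydb",
    "collection": "mycollection",
    "username": "user",
    "password": "secret",
    "ssl": "true",
    "ssl.domain_match": "false",
}

dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="mongodb",
    connection_options=mongo_config,
)
dyf.printSchema()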

Re: Do we need partitioning while loading data from JDBC sources?

2024-06-06 Thread Perez
Also, can I start my lower bound from 1, or is it an index? On Thu, Jun 6, 2024 at 8:42 PM Perez wrote: > Thanks again Mich. That gives a clear picture, but I have a couple more doubts: 1) I know that there will be multiple threads executed with 10 seg...
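A minimal sketch of a partitioned JDBC read; the URL, table, credentials and column names are placeholders. Note that lowerBound and upperBound are values of partitionColumn (not row indexes): they only set the stride of the generated WHERE clauses, and rows outside the range are still read, just by the edge partitions.

df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/mydb")
      .option("dbtable", "public.my_table")
      .option("user", "user")
      .option("password", "secret")
      .option("partitionColumn", "id")      # must be numeric, date, or timestamp
      .option("lowerBound", "1")            # minimum id value; starting at 1 is fine if ids start at 1
      .option("upperBound", "10000000")     # maximum id value
      .option("numPartitions", "10")        # number of parallel JDBC connections
      .load())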

Re: Do we need partitioning while loading data from JDBC sources?

2024-06-06 Thread Perez

Re: Do we need partitioning while loading data from JDBC sources?

2024-06-06 Thread Perez

Re: Terabytes data processing via Glue

2024-06-05 Thread Perez

Do we need partitioning while loading data from JDBC sources?

2024-06-05 Thread Perez
Hello experts, I was wondering whether I could leverage the approach below to speed up loading the data in Spark.

def extract_data_from_mongodb(mongo_config):
    df = glueContext.create_dynamic_frame.from_options(
        connection_type="mongodb",
        connection_options=mongo_config
    )
    return df

m...
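For the JDBC side of the question, a hedged sketch of the Glue-native way to parallelize a JDBC read: the hashfield/hashpartitions connection options split the table across tasks. Option names follow the Glue JDBC connection documentation; the URL, table and credentials are placeholders:

jdbc_config = {
    "url": "jdbc:postgresql://host:5432/mydb",
    "dbtable": "public.my_table",
    "user": "user",
    "password": "secret",
    "hashfield": "id",        # column used to split the read
    "hashpartitions": "10",   # number of parallel JDBC connections
}

dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="postgresql",
    connection_options=jdbc_config,
)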

Re: Terabytes data processing via Glue

2024-06-02 Thread Perez
Hello, can I get some suggestions?

Terabytes data processing via Glue

2024-06-01 Thread Perez
Hi Team, I am planning to load and process around 2 TB of historical data, and for that purpose I was planning to go with Glue. Is it OK to use Glue if I calculate the DPUs I need correctly, or should I go with EMR? This will be a one-time activity. TIA
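A back-of-the-envelope sizing sketch. The per-DPU throughput figure below is purely an assumption that would have to be measured on a sample of the real data; the only hard facts used are that a G.1X worker corresponds to 1 DPU (4 vCPU, 16 GB):

data_gb = 2 * 1024            # ~2 TB historical load
gb_per_dpu_hour = 20.0        # ASSUMED effective throughput per DPU-hour; workload dependent
target_hours = 5.0            # acceptable wall-clock time for a one-time job

dpus_needed = data_gb / (gb_per_dpu_hour * target_hours)
g1x_workers = max(2, round(dpus_needed))   # one G.1X worker ~ 1 DPU
print(f"~{dpus_needed:.0f} DPUs -> {g1x_workers} G.1X workers")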

Tox and Pyspark

2024-05-28 Thread Perez
Hi Team, I need help with this https://stackoverflow.com/questions/78547676/tox-with-pyspark

Re: OOM concern

2024-05-28 Thread Perez

Re: OOM concern

2024-05-28 Thread Perez

Re: OOM concern

2024-05-27 Thread Perez

OOM concern

2024-05-27 Thread Perez
Hi Team, I want to extract data from a DB and just dump it into S3; I don't have to perform any transformations on the data yet. My data size would be ~100 GB (historical load). Choosing the right number of DPUs (Glue jobs) should solve this problem, right? Or should I move to EMR? I don't feel the need t...

Re: [PySpark Structured Streaming] How to tune .repartition(N) ?

2023-10-05 Thread Perez
You can try the OPTIMIZE command in Delta Lake; it merges small files and should help. It also depends on the file format: if you are working with Parquet, small files should still not cause any issues. P.
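A hedged sketch of running that compaction, assuming a recent Delta Lake release (the Python OPTIMIZE API appeared around Delta 2.0; older versions only have the SQL form) and a SparkSession already configured for Delta. The table path is a placeholder:

from delta.tables import DeltaTable

dt = DeltaTable.forPath(spark, "s3://my-bucket/delta/events")   # placeholder path
dt.optimize().executeCompaction()                               # merges small files into larger ones

# SQL equivalent:
spark.sql("OPTIMIZE delta.`s3://my-bucket/delta/events`")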

Re: cache OS memory and spark usage of it

2018-04-10 Thread Jose Raul Perez Rodriguez
That was helpful. So the OS needs to feel some pressure from applications requesting memory before it frees some of the memory cache? Under exactly which circumstances does the OS free that memory and give it to the applications requesting it? I mean, if the total memory is 16 GB and 10 GB are used for OS cache, ...

Re: Kafka 0.10 & Spark Streaming 2.0.2

2016-12-02 Thread Gabriel Perez
I had it set up with three nodes, a master and two slaves. Is there anything that would tell me it was in local mode? I also added the --deploy-mode cluster flag and saw the same results. Thanks, Gabe

Re: Kafka 0.10 & Spark Streaming 2.0.2

2016-12-02 Thread Gabriel Perez
Quoting Jacek Laskowski: > Hi, can you post a screenshot of the Executors and Streaming tabs? Jacek

Re: Kafka 0.10 & Spark Streaming 2.0.2

2016-12-02 Thread Gabriel Perez
Quoting Jacek Laskowski: > Hi, how many partitions does the topic have? How do you check how many executors read from the topic? Jacek

Re: VectorAssembler handling null values

2016-04-20 Thread Andres Perez
... at 1:38 AM, Nick Pentreath wrote: > Could you provide an example of what your input data looks like? Supporting missing values in a sparse result vector makes sense.

VectorAssembler handling null values

2016-04-19 Thread Andres Perez
Hi everyone. org.apache.spark.ml.feature.VectorAssembler currently cannot handle null values. This presents a problem for us, as we wish to run a decision tree classifier on sometimes-sparse data. Is there a particular reason VectorAssembler is implemented this way, and can anyone recommend the b...
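A minimal sketch of the usual workaround: drop or fill the nulls before assembling, since VectorAssembler itself cannot handle them. Column names are placeholders. (Newer Spark releases also add a handleInvalid parameter on VectorAssembler, but that did not exist in the 1.6/2.0 era of this thread.)

from pyspark.ml.feature import VectorAssembler

# fill numeric nulls with a sentinel value (or use df.na.drop() to discard those rows)
filled = df.na.fill({"f1": 0.0, "f2": 0.0})

assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
features = assembler.transform(filled)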

Re: Serializers problems maping RDDs to objects again

2015-11-06 Thread Iker Perez de Albeniz
I have seen that the problem is in the Geohash class, which cannot be pickled, but in groupByKey I use another custom class and there is no problem... 2015-11-06 13:44 GMT+01:00 Iker Perez de Albeniz: > Hi All, I am new to this list. Before sending this mail I searched the arch...

Serializers problems maping RDDs to objects again

2015-11-06 Thread Iker Perez de Albeniz
...dpickle.py", line 107, in dump
File "/usr/lib/python2.7/pickle.py", line 224, in dump
    self.save(obj)
File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj)  # Call unbound method with explicit self
File "/usr/lib/python2.7/pickle.py", line 562, in save_tuple
    save(element)
File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj)  # Call unbound method with explicit self
File "/home/iker/Workspace//spark-1.4.1-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 199, in save_function
File "/home/iker/Workspace//spark-1.4.1-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 236, in save_function_tuple

I do not know if the problem is that I am not understanding how Spark works, or something else, but I do not see how to make it work and continue map/filter/reducing the data in several concatenated steps. Regards, Iker Perez de Albeniz
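A hedged sketch of the usual fix when a helper class cannot be pickled into a task closure: construct it inside mapPartitions on the executors instead of capturing a driver-side instance. The "geohash" module and the RDD layout below are assumptions for illustration, standing in for whatever unpicklable encoder the real job uses:

def encode_partition(rows):
    import geohash                       # assumed module; imported per partition, never pickled
    for lat, lon, payload in rows:
        yield geohash.encode(lat, lon), payload

# points_rdd: an RDD of (lat, lon, payload) tuples built in earlier steps
encoded = points_rdd.mapPartitions(encode_partition)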

Re: stopped SparkContext remaining active

2015-07-29 Thread Andres Perez
...5 at 1:10 PM, Ted Yu wrote: > bq. it seems like we never get to the clearActiveContext() call by the end > Looking at the stop() method, there is only one early return, after the stopped.compareAndSet() call. Is there any clue from the driver log? Cheers

stopped SparkContext remaining active

2015-07-29 Thread Andres Perez
Hi everyone. I'm running into an issue with SparkContexts when running on YARN. The issue is observable when I reproduce these steps in the spark-shell (version 1.4.1):

scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@7b965dee

Note the pointer address of sc. (Then y...

Re: Super slow caching in 1.3?

2015-04-27 Thread Christian Perez
Quoting the earlier reply: > Here are the types that we specialize; other types will be much slower. This is only for Spark SQL; normal RDD...

Re: Pyspark where do third parties libraries need to be installed under Yarn-client mode

2015-04-24 Thread Christian Perez

Re: Super slow caching in 1.3?

2015-04-16 Thread Christian Perez
...g back to Kryo, and even then there are some locking issues.) > If so, would it be possible to try caching a flattened version? CACHE TABLE flattenedTable AS SELECT ... FROM parquetTable

Super slow caching in 1.3?

2015-04-06 Thread Christian Perez
Hi all, has anyone else noticed very slow times to cache a Parquet file? It takes 14 s per 235 MB (one block) uncompressed, node-local Parquet file on M2 EC2 instances. Or are my expectations way off... Cheers, Christian

Re: input size too large | Performance issues with Spark

2015-04-02 Thread Christian Perez
...> We want to use Spark to give us the capability to process our in-memory data structure very fast, as well as to scale to a larger volume when required in the future.

Re: persist(MEMORY_ONLY) takes lot of time

2015-04-02 Thread Christian Perez

Re: saveAsTable broken in v1.3 DataFrames?

2015-03-20 Thread Christian Perez
Any other users interested in a DataFrame.saveAsExternalTable() feature for making _useful_ external tables in Hive, or am I the only one? Bueller? If I start a PR for this, will it be taken seriously?

Re: saveAsTable broken in v1.3 DataFrames?

2015-03-19 Thread Christian Perez
...> property, there will be a field called "spark.sql.sources.provider" and the value will be "org.apache.spark.sql.parquet.DefaultSource". You can also look at your files in the file system; they are stored by Parquet. Thanks, Yin

saveAsTable broken in v1.3 DataFrames?

2015-03-19 Thread Christian Perez
...alized properly on receive. I'm tracing execution through the source code... but before I get any deeper, can anyone reproduce this behavior? Cheers, Christian

Re: Problem starting worker processes in standalone mode

2014-03-24 Thread Yonathan Perez
Oh, I also forgot to mention: I start the master and workers (./sbin/start-all.sh) and then start the shell with:

MASTER=spark://localhost:7077 ./bin/spark-shell

Then I get the exceptions... Thanks

Problem starting worker processes in standalone mode

2014-03-24 Thread Yonathan Perez
Hi, I'm running my program on a single large-memory, many-core machine (64 cores, 1 TB RAM). To avoid having huge JVMs, I want to use several worker processes/instances, each using 8 cores (i.e., use SPARK_WORKER_INSTANCES). With 2 worker instances everything works fine, but when I try...
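A minimal conf/spark-env.sh sketch for that layout, assuming eight worker instances of eight cores each on the 64-core box; the memory figure is illustrative and should leave headroom for the OS, driver and other worker instances:

export SPARK_WORKER_INSTANCES=8
export SPARK_WORKER_CORES=8
export SPARK_WORKER_MEMORY=100g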

Re: OutOfMemoryError when loading input file

2014-03-03 Thread Yonathan Perez
Thanks for your answer yxzhao, but setting SPARK_MEM doesn't solve the problem. I also understand that setting SPARK_MEM is the same as calling SparkConf.set("spark.executor.memory", ...), which I do. Any additional advice would be highly appreciated.

OutOfMemoryError when loading input file

2014-03-01 Thread Yonathan Perez
Hello, I'm trying to run a simple test program that loads a large file (~12.4 GB) into the memory of a single many-core machine. The machine I'm using has more than enough memory (1 TB RAM) and 64 cores (of which I want to use 16 for worker threads). Even though I set both the executor memory (spark.executor.memory) and ...
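A hedged sketch for the Spark-0.9-era API this thread uses; the heap sizes are illustrative. The key point is that in local mode the "executors" live inside the driver JVM, so spark.executor.memory set from an already-running program has no effect on a JVM whose heap was fixed at launch (in that era via SPARK_MEM, later via --driver-memory); the launch-time heap is what has to be large enough for the ~12 GB of data plus deserialization overhead:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local[16]")          # 16 worker threads on the single machine
        .setAppName("load-test")
        .set("spark.executor.memory", "64g"))   # only meaningful when real executors are launched
sc = SparkContext(conf=conf)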