Hi Everyone,
I am facing an issue. The problem is explained in detail in the SO post below.
Any suggestions would be appreciated.
https://stackoverflow.com/questions/79574599/unable-to-configure-the-exact-number-of-dpus-for-the-glue-pyspark-job
Thanks
Hi Team,
I would appreciate any help with this.
https://stackoverflow.com/questions/79324390/aws-glue-pyspark-job-is-not-ending/79324917#79324917
Hi Team,
I need your help in understanding the problem below.
https://stackoverflow.com/questions/79324390/aws-glue-pyspark-job-is-not-ending/79324917#79324917
Hi Team,
It would be really helpful if I could get some help with this:
https://stackoverflow.com/questions/79121611/unable-to-get-the-postgres-data-in-the-right-format-via-kafka-jdbc-source-conne
Any hints would be appreciated.
Thanks,
Hi Team,
I need your help here:
https://stackoverflow.com/questions/79118843/unable-to-format-the-kafka-topic-data-via-pyspark
Thanks
Also, I checked your code, but it will give the same result even with
sampling, because the schema of the "data" attribute is not fixed.
Any suggestions?
Hi Mich,
Thanks a lot for your answer, but there is one more scenario to it.
The schema of the data attribute inside the steps column is not fixed. For
some records I see it as a struct, and for others I see it as an array of
objects.
So in the end it gets treated as a string, since the schema inference gets confused.
https://stackoverflow.com/questions/78835509/dynamically-infer-schema-of-json-data-using-pyspark
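One way I am thinking of normalizing this (a minimal sketch only; the column
name and element types are assumptions for illustration) is to keep the
attribute as a raw JSON string and parse it twice, once per shape, then
coalesce:

    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, MapType, StringType

    # "data" arrives as a JSON string; parse it both ways and keep whichever works.
    elem = MapType(StringType(), StringType())
    parsed = (df
        .withColumn("data_as_array", F.from_json(F.col("data"), ArrayType(elem)))
        .withColumn("data_as_struct", F.from_json(F.col("data"), elem))
        # wrap the single-object case in a one-element array so both shapes align
        .withColumn("data_norm",
                    F.coalesce(F.col("data_as_array"),
                               F.array(F.col("data_as_struct")))))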
Any help would be appreciated.
Thanks,
Hello everyone,
I have described my problem in an SO post:
Hi Team, I am facing one issue here:
https://stackoverflow.com/questions/78673228/unable-to-read-text-file-in-glue-job
TIA
Hi Team,
Any help in this matter would be greatly appreciated.
TIA
Hi Team,
this is the problem
https://stackoverflow.com/questions/78593858/unable-to-load-mongodb-atlas-data-via-pyspark-jdbc-in-glue
I can't go ahead with the *StructType* approach since my input record is huge,
and if the underlying attributes are added or removed my code might fail.
I can't change
Also, can I take my lower bound starting from 1, or is it an index?
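For reference, this is the kind of partitioned JDBC read I mean (the URL,
table and column names are placeholders); as I understand it, lowerBound and
upperBound are values of partitionColumn, not row indexes:

    # Placeholder connection details; lowerBound/upperBound bound the values
    # of partitionColumn (e.g. an id column starting at 1), not positions.
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://host:5432/db")
          .option("dbtable", "my_table")
          .option("partitionColumn", "id")
          .option("lowerBound", "1")
          .option("upperBound", "1000000")
          .option("numPartitions", "10")
          .load())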
On Thu, Jun 6, 2024 at 8:42 PM Perez wrote:
> Thanks again, Mich. It gives a clear picture, but again I have a couple of
> doubts:
>
> 1) I know that there will be multiple threads that will be executed with
> 10 seg
Hello experts,
I was just wondering if I could leverage the snippet below to speed up
the data loading process in Spark.
def extract_data_from_mongodb(mongo_config):
    df = glueContext.create_dynamic_frame.from_options(
        connection_type="mongodb",
        connection_options=mongo_config,
    )
    return df
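For reference, this is roughly how I call it (placeholder values; the exact
MongoDB option keys may differ between Glue/connector versions, so treat them
as an assumption):

    # Hypothetical connection options; adjust the keys to your Glue version.
    mongo_config = {
        "connection.uri": "mongodb+srv://<host>/",
        "database": "mydb",
        "collection": "mycollection",
        "username": "user",
        "password": "pass",
    }

    dyf = extract_data_from_mongodb(mongo_config)
    df = dyf.toDF()                      # DynamicFrame -> Spark DataFrame
    df.write.mode("overwrite").parquet("s3://my-bucket/raw/mycollection/")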
Hello,
Can I get some suggestions?
Hi Team,
I am planning to load and process around 2 TB of historical data. For that
purpose I was planning to go ahead with Glue.
So is it OK to use Glue if I calculate the DPUs I need correctly, or should
I go with EMR?
This will be a one-time activity.
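As a rough back-of-the-envelope check (the per-DPU throughput figure below is
an assumption of mine, not AWS guidance; a Glue G.1X worker is 1 DPU with
4 vCPUs and 16 GB of memory):

    # Hypothetical sizing sketch only; tune gb_per_dpu to your own benchmark.
    total_data_gb = 2048              # ~2 TB historical load
    gb_per_dpu = 32                   # assumed comfortable volume per DPU per run
    dpus_needed = total_data_gb / gb_per_dpu
    print(dpus_needed)                # -> 64.0 DPUs as a starting point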
TIA
Hi Team,
I need help with this
https://stackoverflow.com/questions/78547676/tox-with-pyspark
Hi Team,
I want to extract the data from the DB and just dump it into S3. I
don't have to perform any transformations on the data yet. My data size
would be ~100 GB (historical load).
Choosing the right number of DPUs (Glue jobs) should solve this problem,
right? Or should I move to EMR?
I don't feel the need t
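For concreteness, the whole job I have in mind is roughly this (a sketch with
placeholder connection details and paths, not a finished script):

    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    # Minimal extract-and-dump sketch; no transformations, just DB -> S3.
    glueContext = GlueContext(SparkContext.getOrCreate())
    spark = glueContext.spark_session

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://host:5432/db")   # placeholder
          .option("dbtable", "source_table")
          .option("user", "user")
          .option("password", "pass")
          .load())

    df.write.mode("overwrite").parquet("s3://my-bucket/historical/source_table/")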
You can try the 'optimize' command of Delta Lake. That will help you for
sure; it merges small files. Also, it depends on the file format. If you
are working with Parquet, then small files should still not cause any issues.
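A minimal sketch of what I mean, assuming Delta Lake 2.0+ and a placeholder
table path:

    from delta.tables import DeltaTable

    # Compact small files in a Delta table.
    dt = DeltaTable.forPath(spark, "s3://my-bucket/delta/events/")
    dt.optimize().executeCompaction()

    # Equivalent SQL form:
    # spark.sql("OPTIMIZE delta.`s3://my-bucket/delta/events/`")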
P.
On Thu, Oct 5, 2023 at 10:55 AM Shao Yang Hong
wrote:
> Hi Raghavendr
it was helpful.
Then, does the OS need to feel some pressure from the applications
requesting memory before it frees some of the memory cache?
Under exactly which circumstances does the OS free that memory to give it
to the applications requesting it?
I mean, if the total memory is 16GB and 10GB are used for OS cache,
I had it set up with three nodes, a master and 2 slaves. Is there anything that
would tell me it was in local mode? I also added the --deploy-mode cluster
flag and saw the same results.
Thanks,
Gabe
From: Mich Talebzadeh
Date: Friday, December 2, 2016 at 12:26 PM
To: Gabriel Perez
Cc: Jacek
Laskowski
Date: Friday, December 2, 2016 at 12:21 PM
To: Gabriel Perez
Cc: user
Subject: Re: Kafka 0.10 & Spark Streaming 2.0.2
Hi,
Can you post the screenshot of the Executors and Streaming tabs?
Jacek
On 2 Dec 2016 5:54 p.m., "Gabriel Perez" <gabr...@adtheorent.com> wrote:
11:47 AM
To: Gabriel Perez
Cc: user
Subject: Re: Kafka 0.10 & Spark Streaming 2.0.2
Hi,
How many partitions does the topic have? How do you check how many executors
read from the topic?
Jacek
On 2 Dec 2016 2:44 p.m., "gabrielperez2484" <gabr...@adtheorent.com> wrote:
at 1:38 AM, Nick Pentreath
wrote:
> Could you provide an example of what your input data looks like?
> Supporting missing values in a sparse result vector makes sense.
>
Hi everyone. org.apache.spark.ml.feature.VectorAssembler currently cannot
handle null values. This presents a problem for us as we wish to run a
decision tree classifier on sometimes sparse data. Is there a particular
reason VectorAssembler is implemented in this way, and can anyone recommend
the b
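For illustration, a minimal sketch of one possible workaround (placeholder
column names), though it loses the distinction between a missing value and a
real zero:

    from pyspark.ml.feature import VectorAssembler

    # Fill (or drop) nulls before assembling, since VectorAssembler
    # cannot handle them itself.
    filled = df.na.fill(0.0, subset=["f1", "f2", "f3"])
    assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
    features = assembler.transform(filled)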
I have seen that the problem is with the Geohash class, which cannot be
pickled, but in groupByKey I use another custom class and there is no
problem...
2015-11-06 13:44 GMT+01:00 Iker Perez de Albeniz :
> Hi All,
>
> I am new at this list. Before sending this mail i have searched on arch
dpickle.py",
line 107, in dump
File "/usr/lib/python2.7/pickle.py", line 224, in dump
self.save(obj)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python2.7/pickle.py", line 562, in save_tuple
save(element)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File
"/home/iker/Workspace//spark-1.4.1-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/cloudpickle.py",
line 199, in save_function
File
"/home/iker/Workspace//spark-1.4.1-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/cloudpickle.py",
line 236, in save_function_tuple
I do not know if the problem is that I am not understanding how Spark
works or something else, but I do not see how to make it work and continue
map/filter/reducing the data in several concatenated steps.
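A minimal sketch of one common workaround (the geohash module, class and RDD
names here are hypothetical): build the non-picklable object inside the
function shipped to the executors, so it never has to be pickled on the driver:

    # points_rdd is a placeholder RDD of (lat, lon) pairs.
    def encode_partition(rows):
        import geohash_lib                   # hypothetical library
        gh = geohash_lib.Geohash()           # constructed on the executor
        for lat, lon in rows:
            yield gh.encode(lat, lon)

    encoded = points_rdd.mapPartitions(encode_partition)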
Regards,
--
Iker Perez de Albeniz
5 at 1:10 PM, Ted Yu wrote:
> bq. it seems like we never get to the clearActiveContext() call by the end
>
> Looking at the stop() method, there is only one early return,
> after the stopped.compareAndSet() call.
> Is there any clue from the driver log?
>
> Cheers
>
> On Wed, Jul 29
Hi everyone. I'm running into an issue with SparkContexts when running on
Yarn. The issue is observable when I reproduce these steps in the
spark-shell (version 1.4.1):
scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@7b965dee
*Note the pointer address of sc.
(Then y
atabricks.com]
> Sent: Thursday, April 16, 2015 7:23 PM
> To: Evo Eftimov
> Cc: Christian Perez; user
>
>
> Subject: Re: Super slow caching in 1.3?
>
>
>
> Here are the types that we specialize, other types will be much slower.
> This is only for Spark SQL, normal RDD
g back to
> kryo and even then there are some locking issues).
>
> If so, would it be possible to try caching a flattened version?
>
> CACHE TABLE flattenedTable AS SELECT ... FROM parquetTable
>
> On Mon, Apr 6, 2015 at 5:00 PM, Christian Perez wrote:
Hi all,
Has anyone else noticed very slow times caching a Parquet file? It
takes 14 s per 235 MB (1 block) uncompressed node-local Parquet file
on M2 EC2 instances. Or are my expectations way off...
Cheers,
Christian
--
Christian Perez
Silicon Valley Data Science
Data Analyst
christ
we want to use Spark to provide us the capability to process our
>> in-memory data structure very fast, as well as scale to a larger volume
>> when required in the future.
Are any other users interested in a DataFrame.saveAsExternalTable() feature
for making _useful_ external tables in Hive, or am I the only one? Bueller?
If I start a PR for this, will it be taken seriously?
On Thu, Mar 19, 2015 at 9:34 AM, Christian Perez wrote:
> Hi Yin,
>
> Thank
>> property, there will be a field called "spark.sql.sources.provider" and the
>> value will be "org.apache.spark.sql.parquet.DefaultSource". You can also
>> look at your files in the file system. They are stored by Parquet.
>>
>> Thanks,
>>
>> Yi
alized
properly on receive.
I'm tracing execution through source code... but before I get any
deeper, can anyone reproduce this behavior?
Cheers,
Christian
--
Christian Perez
Silicon Valley Data Science
Data Analyst
christ...@svds.com
@cp_phd
Oh, I also forgot to mention:
I start the master and workers (call ./sbin/start-all.sh), and then start
the shell:
MASTER=spark://localhost:7077 ./bin/spark-shell
Then I get the exceptions...
Thanks
Hi,
I'm running my program on a single large-memory, many-core machine (64 cores,
1 TB RAM). But to avoid having huge JVMs, I want to use several processes /
worker instances, each using 8 cores (i.e. use SPARK_WORKER_INSTANCES).
When I use 2 worker instances, everything works fine, but when I try
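For reference, the kind of standalone-mode settings I mean in
conf/spark-env.sh (the exact per-worker memory split is an assumption):

    # 8 workers on the one machine, 8 cores each; per-worker memory is a guess.
    export SPARK_WORKER_INSTANCES=8
    export SPARK_WORKER_CORES=8
    export SPARK_WORKER_MEMORY=100g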
Thanks for your answer yxzhao, but setting SPARK_MEM doesn't solve the
problem.
I also understand that setting SPARK_MEM is the same as calling
SparkConf.set("spark.executor.memory",..) which I do.
Any additional advice would be highly appreciated.
Hello,
I'm trying to run a simple test program that loads a large file (~12.4GB)
into the memory of a single many-core machine.
The machine I'm using has more than enough memory (1TB RAM) and 64 cores
(of which I want to use 16 for worker threads).
Even though I set both the executor memory (spark.exe