Hi Marco,
I am not sure you will get access to the DataFrame inside the forEach, as
the Spark context is not serializable, if I remember correctly.
One thing you can do:
Use the cogroup operation on both datasets.
This will give you (key, iter(v1), iter(v2)).
And then use foreachPartition.
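A minimal sketch of that cogroup idea (the RDD names and data are only
illustrative, assuming an active SparkSession called spark):

// Cogroup two keyed RDDs and process each group inside foreachPartition
// on the executors, without touching the SparkContext there.
val rdd1 = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2)))
val rdd2 = spark.sparkContext.parallelize(Seq(("a", "x"), ("c", "y")))

rdd1.cogroup(rdd2).foreachPartition { part =>
  part.foreach { case (key, (vs1, vs2)) =>
    // vs1 holds the values from rdd1, vs2 the values from rdd2 for this key
    println(s"$key -> ${vs1.mkString(",")} | ${vs2.mkString(",")}")
  }
}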
Hi Sid,
Snappy itself is not splittable. But a container format that holds the
actual data, like Parquet (which is basically divided into row groups), can
be compressed using Snappy.
This works because the blocks (pages) inside a Parquet file are compressed
with Snappy independently, so the file can still be split.
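For example (a minimal sketch; the output path is illustrative):

// Write a DataFrame as Snappy-compressed Parquet. The codec is applied
// per page/row group inside the Parquet file, so the files remain
// splittable.
val df = spark.range(0, 1000).toDF("id")
df.write
  .option("compression", "snappy")
  .mode("overwrite")
  .parquet("/tmp/ids_snappy_parquet")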
Thanks
How do you ensure that the results are matched?
>
> Best,
> Sid
>
> On Sun, Jul 31, 2022 at 1:34 AM Amit Joshi
> wrote:
>
>> Hi Sid,
>>
>> Salting is normally a technique to add random characters to existing
>> values.
>> In big data we can use salt
values of the salt.
join_col
x1_1
x1_2
x2_1
x2_2
And then join them like:
table1.join(table2, table1.new_join_col == table2.join_col)
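A minimal sketch of that salting approach (table names, column names and
the salt count are all illustrative):

import org.apache.spark.sql.functions._
import spark.implicits._

val numSalts = 4

// Skewed side: append a random salt 0..numSalts-1 to the join key.
val table1 = Seq(("x1", 1), ("x1", 2), ("x2", 3)).toDF("join_col", "v1")
val salted1 = table1.withColumn(
  "new_join_col",
  concat(col("join_col"), lit("_"), (rand() * numSalts).cast("int")))

// Other side: explode each key into every possible salted variant.
val table2 = Seq(("x1", "a"), ("x2", "b")).toDF("join_col", "v2")
val salted2 = table2.withColumn(
  "new_join_col",
  explode(array((0 until numSalts).map(i =>
    concat(col("join_col"), lit("_" + i))): _*)))

val joined = salted1.join(salted2, "new_join_col")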
Let me know if you have any questions.
Regards
Amit Joshi
On Sat, Jul 30, 2022 at 7:16 PM Sid wrote:
> Hi Team,
>
> I was trying to understand th
Hi spark users,
Can anyone please provide any views on this topic?
Regards
Amit Joshi
On Sunday, October 3, 2021, Amit Joshi wrote:
> Hi Spark-Users,
>
> Hope you are doing good.
>
> I have been working on cases where a dataframe is joined with more than
> one data fr
"key2")
I was thinking of bucketing as a solution to speed up the joins. But if I
bucket df1 on key1, then join2 may not benefit, and vice versa (if I
bucket df1 on key2).
Or should we bucket df1 twice, once on key1 and once on key2?
Is there a strategy to make both joins faster?
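A rough sketch of the bucketing option in question (names, data and the
bucket count are illustrative):

import spark.implicits._

val df1 = Seq((1, 10, "a"), (2, 20, "b")).toDF("key1", "key2", "v")
val df2 = Seq((1, "x"), (2, "y")).toDF("key1", "w")

// Persist df1 bucketed on key1 so a join on key1 can avoid the shuffle;
// a second copy bucketed on key2 would be needed to help the other join.
df1.write
  .bucketBy(16, "key1")
  .sortBy("key1")
  .mode("overwrite")
  .saveAsTable("df1_bucketed_key1")

val join1 = spark.table("df1_bucketed_key1").join(df2, Seq("key1"))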
Regards
Amit Joshi
Hi Mich,
Thanks for your email.
I have tried it in batch mode,
and am still looking to try it in streaming mode.
Will update you accordingly.
Regards
Amit Joshi
On Thu, Jun 17, 2021 at 1:07 PM Mich Talebzadeh
wrote:
> OK let us start with the basic cube
>
> create a DF first
>
> scal
s as well?
I hope I was able to make my point clear.
Regards
Amit Joshi
On Wed, Jun 16, 2021 at 11:36 PM Mich Talebzadeh
wrote:
>
>
> Hi,
>
> Just to clarify
>
> Are we talking about *rollup* as a subset of a cube that computes
> hierarchical subtotals from left to ri
I would appreciate it if someone could give some pointers on the question below.
-- Forwarded message -
From: Amit Joshi
Date: Tue, Jun 15, 2021 at 12:19 PM
Subject: [Spark] Do rollups work with Spark structured streaming with
state?
To: spark-user
Hi Spark-Users,
Hope you are all
saved.
If rollups are not supported, then what is the standard way to handle this?
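For reference, a minimal batch rollup sketch (the data is illustrative);
whether the same works on a streaming dataframe with state is exactly the
question here:

import spark.implicits._
import org.apache.spark.sql.functions._

val sales = Seq(
  ("NL", "Amsterdam", 10),
  ("NL", "Utrecht", 5),
  ("DE", "Berlin", 7)
).toDF("country", "city", "amount")

// Hierarchical subtotals: (country, city), then country, then grand total.
sales.rollup($"country", $"city")
  .agg(sum($"amount").alias("total"))
  .orderBy($"country", $"city")
  .show()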
Regards
Amit Joshi
provide enough memory in
the spark cluster to run both.
Regards
Amit Joshi
On Sat, May 22, 2021 at 5:41 AM wrote:
> Hi Amit;
>
>
>
> Thank you for your prompt reply and kind help. I wonder how to set the
> scheduler to FAIR mode in Python. The following code does not seem to work
Hi Jian,
You have to use the same Spark session to run all the queries,
and use the following to wait for termination:
q1 = writestream1.start()
q2 = writestream2.start()
spark.streams.awaitAnyTermination()
Also set the scheduler in the Spark config to the FAIR scheduler.
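A minimal sketch of that setup (the sources and sinks here are only
illustrative):

import org.apache.spark.sql.SparkSession

// One session with the FAIR scheduler; both queries run from it.
val spark = SparkSession.builder()
  .appName("two-streams")
  .config("spark.scheduler.mode", "FAIR")
  .getOrCreate()

val rateDf = spark.readStream.format("rate").load()

val q1 = rateDf.writeStream.format("console").start()
val q2 = rateDf.writeStream.format("console").start()

// Returns when any one of the active queries terminates.
spark.streams.awaitAnyTermination()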
Regards
Amit Joshi
On
worker nodes.
Use these options to pass it to the driver:
--files /appl/common/ftp/conf.json --conf
spark.driver.extraJavaOptions="-Dconfig.file=conf.json"
And make sure you are able to access the file location from the worker nodes.
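A minimal sketch of reading it back on the driver (assuming the job was
submitted with the --files option above; SparkFiles resolves the local
copy, and -Dconfig.file=conf.json additionally lets Typesafe Config pick it
up when the file lands in the driver's working directory):

import org.apache.spark.SparkFiles
import scala.io.Source

val confPath = SparkFiles.get("conf.json")
val confJson = Source.fromFile(confPath).mkString
println(confJson)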
Regards
Amit Joshi
On Sat, May 15, 2021 at 5:14 AM KhajaAsmath Moh
Hope this helps.
Regards
Amit Joshi
On Wed, Apr 7, 2021 at 8:14 PM Mich Talebzadeh
wrote:
>
> Did some tests. The concern is SSS job running under YARN
>
>
> *Scenario 1)* use spark-sql-kafka-0-10_2.12-3.1.0.jar
>
>- Removed spark-sql-kafka-0-10_2.12-3.1.0.jar from any
>
> Note that unless I am missing something you cannot access spark session
> from foreach as code is not running on the driver.
>
> Please say if it makes sense or did I miss anything.
>
>
>
> Boris
>
>
>
> *From:* Amit Joshi
> *Sent:* Monday, 18 January 202
st map()/mapXXX() the kafkaDf with the mapping function
> that reads the paths?
>
> Also, do you really have to read the json into an additional dataframe?
>
>
>
> Thanks, Boris
>
>
>
> *From:* Amit Joshi
> *Sent:* Monday, 18 January 2021 15:04
> *To:* spark-user
is approach is fine? Specifically,
whether there is some problem with creating the dataframe after calling
collect.
If there is any better approach, please let me know.
Regards
Amit Joshi
Hi All,
Can someone please help with this?
Thanks
On Tuesday, December 8, 2020, Amit Joshi wrote:
> Hi Gabor,
>
> Pls find the logs attached. These are truncated logs.
>
> Command used :
> spark-submit --verbose --packages org.apache.spark:spark-sql-
> kafka-0-10_2.12:3.0.
f the consumer
> (this is printed also in the same log)
>
> G
>
>
> On Mon, Dec 7, 2020 at 5:07 PM Amit Joshi
> wrote:
>
>> Hi Gabor,
>>
>> The code is very simple Kafka consumption of data.
>> I guess, it may be the cluster.
>> Can you please
>> It's super interesting because that field has a default value:
>> *org.apache.kafka.clients.consumer.RangeAssignor*
>>
>> On Mon, 7 Dec 2020, 10:51 Amit Joshi, wrote:
>>
>>> Hi,
>>>
>>> Thanks for the reply.
>>> I did try removing the client version.
s that you are overriding the kafka-clients that comes
>> with spark-sql-kafka-0-10_2.12
>>
>>
>> I'd try removing the kafka-clients and see if it works
>>
>>
>> On Sun, 6 Dec 2020 at 08:01, Amit Joshi
>> wrote:
>>
>>> Hi All,
fault value. at org.apache.kafka.common.config.ConfigDef.parse(ConfigDef.java:124)
I have tried setting "partition.assignment.strategy" explicitly, but it is
still not working.
Please help.
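For reference, a minimal sketch of such a consumer (topic and servers are
illustrative), submitted with only the spark-sql-kafka-0-10 package so that
its own transitive kafka-clients version is used rather than a separately
pinned one:

val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .option("startingOffsets", "latest")
  .load()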
Regards
Amit Joshi
Can you please post the schema of both tables?
On Wednesday, September 30, 2020, Lakshmi Nivedita
wrote:
> Thank you for the clarification. I would like to know how I can proceed for
> this kind of scenario in pyspark.
>
> I have a scenario subtracting the total number of days with the number of
> ho
Hi,
As far as I know, it depends on whether you are using spark streaming or
structured streaming.
In spark streaming you can write your own code to checkpoint.
But in the case of structured streaming, it should be a file location.
But the main question is why you want to checkpoint in
NoSQL, as it's even
in earlier version?
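For comparison, a minimal sketch of the file-based checkpoint in structured
streaming (source, sink and paths are illustrative); the checkpoint
location is a per-query directory on an HDFS-compatible file system:

val streamDf = spark.readStream.format("rate").load()

val query = streamDf.writeStream
  .format("parquet")
  .option("path", "/tmp/stream_out")
  .option("checkpointLocation", "/tmp/checkpoints/stream_out")
  .start()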
>
> On Thu, Sep 17, 2020 at 11:35 PM Amit Joshi
> wrote:
>
>> Hi,
>>
>> I think the problem lies with driver memory. Broadcast in Spark works by
>> collecting all the data to the driver and then the driver broadcasting it
>> to all the executors.
Hi,
I think the problem lies with driver memory. Broadcast in Spark works by
collecting all the data to the driver and then the driver broadcasting it to
all the executors. A different strategy could be employed for the transfer,
like BitTorrent, though.
Please try increasing the driver memory and see if it works.
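A minimal sketch of the pattern (the sizes are illustrative): only the
small side is broadcast, and the driver needs enough memory, e.g.
spark-submit --driver-memory 4g, because it collects that data first.

import org.apache.spark.sql.functions.broadcast

val bigDf = spark.range(0, 1000000).toDF("id")
val smallDf = spark.range(0, 100).toDF("id")

val joined = bigDf.join(broadcast(smallDf), Seq("id"))
joined.count()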
Regards
Hi,
There are other options, like Apache Livy, which lets you submit the job
using a REST API.
Another option is to use AWS Data Pipeline to configure your job as an EMR
activity.
To activate the pipeline, you need the console or a program.
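A minimal sketch of the Livy route (the host, jar path and class name are
assumptions, not from this thread):

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// POST /batches submits a batch job through Livy's REST API.
val payload =
  """{"file": "s3://my-bucket/jobs/my-spark-job.jar",
    | "className": "com.example.MyJob"}""".stripMargin

val request = HttpRequest.newBuilder()
  .uri(URI.create("http://livy-host:8998/batches"))
  .header("Content-Type", "application/json")
  .POST(HttpRequest.BodyPublishers.ofString(payload))
  .build()

val response = HttpClient.newHttpClient()
  .send(request, HttpResponse.BodyHandlers.ofString())
println(response.body())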
Regards
Amit
On Thursday, September 3, 2020, Eric Beabes
wrote:
> Under
y added topic and partition, if your subscription covers the topic
> (like subscribe pattern). Please try it out.
>
> Hope this helps.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> On Fri, Aug 28, 2020 at 1:56 PM Amit Joshi
> wrote:
>
>> Any pointers will be apprec
Any pointers will be appreciated.
On Thursday, August 27, 2020, Amit Joshi wrote:
> Hi All,
>
> I am trying to understand the effect of adding topics and partitions to a
> topic in kafka, which is being consumed by spark structured streaming
> applications.
>
> Do we have
spark structured streaming application to read
from a newly added partition of a topic?
Kafka consumers have a metadata refresh property that works without
restarting.
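A minimal sketch of a subscription that also covers new topics (the pattern
and servers are illustrative): with subscribePattern the source picks up
new topics matching the pattern, and the question above is whether newly
added partitions are likewise picked up without a restart.

val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribePattern", "events_.*")
  .load()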
Thanks in advance.
Regards
Amit Joshi
Hi,
I have a scenario where a Kafka topic is being written with different types
of JSON records.
I have to regroup the records based on the type, then fetch the schema,
parse the records, and write them as Parquet.
I have tried Structured Streaming, but the dynamic schema is a constraint.
So I have used DStream
nk doesn't support multiple writers.
It assumes there is only one writer writing to the path. Each query needs
to use its own output directory.
Is there a way for both queries to write the output to the same path, as I
need the output at the same path?
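A minimal sketch of the one-writer-per-path constraint (paths, sources and
the type split are illustrative): each streaming query writes to its own
output and checkpoint directory.

import org.apache.spark.sql.functions.col

val source = spark.readStream.format("rate").load()
val typeA = source.filter(col("value") % 2 === 0)
val typeB = source.filter(col("value") % 2 =!= 0)

val qA = typeA.writeStream.format("parquet")
  .option("path", "/data/out/type_a")
  .option("checkpointLocation", "/data/chk/type_a")
  .start()

val qB = typeB.writeStream.format("parquet")
  .option("path", "/data/out/type_b")
  .option("checkpointLocation", "/data/chk/type_b")
  .start()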
Regards
Amit Joshi
"empl" => emplSchema
}
getGenericInternalRow(schema)
}
val data = udf(getData)
Spark version: 2.4.5
Please help.
Regards
Amit Joshi