Unless I am reading this wrong, this can be achieved with aws s3 sync?

aws s3 sync s3://my-bucket/ingestion/source1/y=2019/m=12/d=12 s3://my-bucket/ingestion/processed/src_category=other/y=2019/m=12/d=12
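(Note that sync only relocates the files; the src_category value would then exist only through Hive-style partition discovery on the directory name.) If you want to do it inside Spark instead, the usual pattern is: read each source separately, tag it with withColumn("src_category", lit(...)), combine the three DataFrames with unionByName, and write with partitionBy. Here is a minimal plain-Python sketch of that tag-and-merge step (no Spark dependency, purely illustrative; the source-to-category mapping is an assumption inferred from the output paths in the question):

```python
# Plain-Python sketch of the tag-and-merge step (runs without Spark).
# SOURCE_CATEGORIES is an assumed mapping, inferred from the output
# paths shown in the original question.
SOURCE_CATEGORIES = {
    "source1": "other",
    "source2": "windows-new",
    "source3": "windows",
}

def tag_and_merge(records_by_source):
    """Add a src_category field to every record, then merge all sources
    into one list -- the same shape as withColumn(..., lit(...))
    followed by unionByName in Spark."""
    merged = []
    for source, records in records_by_source.items():
        for rec in records:
            tagged = dict(rec)  # schemas are identical across sources
            tagged["src_category"] = SOURCE_CATEGORIES[source]
            merged.append(tagged)
    return merged

# Tiny stand-ins for the three JSON logs.
raw = {
    "source1": [{"ts": "2019-12-12T00:00:00Z", "msg": "a"}],
    "source2": [{"ts": "2019-12-12T00:00:01Z", "msg": "b"}],
    "source3": [{"ts": "2019-12-12T00:00:02Z", "msg": "c"}],
}

merged = tag_and_merge(raw)
print(len(merged))                                # 3
print(sorted(r["src_category"] for r in merged))  # ['other', 'windows', 'windows-new']
```

In Spark 2.4 this corresponds roughly to building one DataFrame per source with spark.read.json(path).withColumn("src_category", F.lit(category)), combining them with functools.reduce(DataFrame.unionByName, dfs), and writing with df.write.partitionBy("src_category").json(out_path), so the src_category=... partition directories are created automatically.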
Thanks,
-Shraddha

On Thu, Jan 9, 2020 at 7:05 AM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

> why s3a?
>
> On Thu, Jan 9, 2020 at 2:20 AM anbutech <anbutec...@outlook.com> wrote:
>
>> Hello,
>>
>> Version: Spark 2.4.3
>>
>> I have three different sources of JSON log data, all with the same
>> schema (same column order) in the raw files. I want to add one new
>> column, "src_category", to each of the three sources to distinguish
>> the source category, and then merge all three sources into a single
>> DataFrame for processing. What is the best way to handle this case?
>>
>> df = spark.read.json(merged_3sourcesraw_data)
>>
>> Input:
>>
>> s3a://my-bucket/ingestion/source1/y=2019/m=12/d=12/logs1.json
>> s3a://my-bucket/ingestion/source2/y=2019/m=12/d=12/logs1.json
>> s3a://my-bucket/ingestion/source3/y=2019/m=12/d=12/logs1.json
>>
>> Output:
>>
>> s3a://my-bucket/ingestion/processed/y=2019/m=12/d=12/src_category=other
>> s3a://my-bucket/ingestion/processed/y=2019/m=12/d=12/src_category=windows-new
>> s3a://my-bucket/ingestion/processed/y=2019/m=12/d=12/src_category=windows
>>
>> Thanks
>>
>> --
>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org