Unless I am reading this wrong, this can be achieved with aws s3 sync?

aws s3 sync s3://my-bucket/ingestion/source1/y=2019/m=12/d=12 s3://my-bucket/ingestion/processed/src_category=other/y=2019/m=12/d=12
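(Note that sync only relocates the files; the src_category value would then exist only through Hive-style partition discovery on the directory name.) If you want to do it inside Spark instead, the usual pattern is: read each source separately, tag it with withColumn("src_category", lit(...)), combine the three DataFrames with unionByName, and write with partitionBy. Here is a minimal plain-Python sketch of that tag-and-merge step (no Spark dependency, purely illustrative; the source-to-category mapping is an assumption inferred from the output paths in the question):

```python
# Plain-Python sketch of the tag-and-merge step (runs without Spark).
# SOURCE_CATEGORIES is an assumed mapping, inferred from the output
# paths shown in the original question.
SOURCE_CATEGORIES = {
    "source1": "other",
    "source2": "windows-new",
    "source3": "windows",
}

def tag_and_merge(records_by_source):
    """Add a src_category field to every record, then merge all sources
    into one list -- the same shape as withColumn(..., lit(...))
    followed by unionByName in Spark."""
    merged = []
    for source, records in records_by_source.items():
        for rec in records:
            tagged = dict(rec)  # schemas are identical across sources
            tagged["src_category"] = SOURCE_CATEGORIES[source]
            merged.append(tagged)
    return merged

# Tiny stand-ins for the three JSON logs.
raw = {
    "source1": [{"ts": "2019-12-12T00:00:00Z", "msg": "a"}],
    "source2": [{"ts": "2019-12-12T00:00:01Z", "msg": "b"}],
    "source3": [{"ts": "2019-12-12T00:00:02Z", "msg": "c"}],
}

merged = tag_and_merge(raw)
print(len(merged))                                # 3
print(sorted(r["src_category"] for r in merged))  # ['other', 'windows', 'windows-new']
```

In Spark 2.4 this corresponds roughly to building one DataFrame per source with spark.read.json(path).withColumn("src_category", F.lit(category)), combining them with functools.reduce(DataFrame.unionByName, dfs), and writing with df.write.partitionBy("src_category").json(out_path), so the src_category=... partition directories are created automatically.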
Thanks,
-Shraddha

On Thu, Jan 9, 2020 at 7:05 AM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

> why s3a?
>
> On Thu, Jan 9, 2020 at 2:20 AM anbutech <anbutec...@outlook.com> wrote:
>
>> Hello,
>>
>> Version: Spark 2.4.3
>>
>> I have three different sources of JSON log data, all with the same
>> schema (same column order) in the raw files. I want to add one new
>> column, "src_category", to each of the three sources to distinguish
>> the source category, and then merge all three sources into a single
>> DataFrame for processing. What is the best way to handle this case?
>>
>> df = spark.read.json(merged_3sourcesraw_data)
>>
>> Input:
>>
>> s3a://my-bucket/ingestion/source1/y=2019/m=12/d=12/logs1.json
>> s3a://my-bucket/ingestion/source2/y=2019/m=12/d=12/logs1.json
>> s3a://my-bucket/ingestion/source3/y=2019/m=12/d=12/logs1.json
>>
>> Output:
>>
>> s3a://my-bucket/ingestion/processed/y=2019/m=12/d=12/src_category=other
>> s3a://my-bucket/ingestion/processed/y=2019/m=12/d=12/src_category=windows-new
>> s3a://my-bucket/ingestion/processed/y=2019/m=12/d=12/src_category=windows
>>
>> Thanks
>>
>> --
>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org