[PySpark] Join using condition where each record may be joined multiple times

2022-11-27 Thread Oliver Ruebenacker
Hello, I have two DataFrames I want to join using a condition such that each record from one DataFrame may be joined with multiple records from the other. This means the original records should appear multiple times in the resulting joined DataFrame whenever the condition is fulfilled.

Re: [PySpark] Join using condition where each record may be joined multiple times

2022-11-27 Thread Artemis User
What if you just do a join on the first condition (equal chromosome) and then apply the rest of the conditions as a filter after the join? This lets you test your query step by step, perhaps with a visual inspection of the intermediate result, to figure out what the problem is. It may be a data quality problem as well.

Spark Partition Size Control

2022-11-27 Thread vijay khatri
Hi Team, I am reading data from SQL Server tables through PySpark and storing the data in S3 in Parquet format. Some tables have a lot of data, so the files in S3 for those tables run into gigabytes. I need help with the following: I want to assign 128 MB to each partition. How can we assign this?