Re: [pyspark] Load a master data file to spark ecosystem

2020-04-27 Thread Arjun Chundiran
Below is the reason why I didn't use DataFrames directly. As per my understanding, when creating a DataFrame, Spark splits the file into partitions and distributes them. But my tree file contains data structured in radix tree format. tree_lookup_value is the method which we use to look

Re: [pyspark] Load a master data file to spark ecosystem

2020-04-27 Thread Arjun Chundiran
Hi Gourav, I am first creating RDDs and converting them into DataFrames, since I need to map the values from my tree file while building the DataFrames. Thanks, Arjun On Sun, Apr 26, 2020 at 9:33 PM Gourav Sengupta wrote: > Hi, > > Why are you using RDDs? And how are the files stored in terms of >

Re: [pyspark] Load a master data file to spark ecosystem

2020-04-27 Thread Arjun Chundiran
Hi Roland, As per my understanding, when creating a DataFrame, Spark splits the file into partitions and distributes them. But my tree file contains data structured in radix tree format. tree_lookup_value is the method which we use to look up a specific key in that tree. So I don't

Re: [pyspark] Load a master data file to spark ecosystem

2020-04-27 Thread Arjun Chundiran
Hi Sonal, The tree file is a file in radix tree format. tree_lookup_value is a function which looks up the value for a particular key. Thanks, Arjun On Sat, Apr 25, 2020 at 10:28 AM Sonal Goyal wrote: > How does your tree_lookup_value function work? > > Thanks, > Sonal > Nube Technolo
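
For readers unfamiliar with this kind of lookup, here is a minimal sketch of a key lookup in a tree keyed by characters. It is not the poster's tree_lookup_value, whose code and signature are not shown in the thread. It uses a plain (uncompressed) trie rather than a true radix tree, and the names TrieNode, insert and lookup as well as the sample keys and values are made up for illustration.

```python
from typing import Dict, Optional


class TrieNode:
    """One node of a plain character trie (hypothetical stand-in for the tree file)."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.value: Optional[str] = None  # set only where a stored key ends


def insert(root: TrieNode, key: str, value: str) -> None:
    """Store value under key, creating intermediate nodes as needed."""
    node = root
    for ch in key:
        node = node.children.setdefault(ch, TrieNode())
    node.value = value


def lookup(root: TrieNode, key: str) -> Optional[str]:
    """Return the value stored under exactly this key, or None if absent."""
    node: Optional[TrieNode] = root
    for ch in key:
        node = node.children.get(ch)
        if node is None:
            return None  # no stored key starts with this prefix
    return node.value  # None when key is only a prefix of a stored key


# Made-up sample data, only to show the calls.
root = TrieNode()
insert(root, "10.0.0.1", "datacenter-a")
insert(root, "10.0.0.2", "datacenter-b")
```

A lookup of this shape walks the tree in memory, one key at a time, which is why it cannot be expressed directly as a distributed join.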

Re: [pyspark] Load a master data file to spark ecosystem

2020-04-26 Thread Edgardo Szrajber
In the code below you are preventing Spark from doing what it is meant to do. As mentioned below, the best (and easiest to implement) approach would be to load each file into a DataFrame and join them. Even doing a key join with RDDs would be better, but in your case you are forcing a one by on

Re: [pyspark] Load a master data file to spark ecosystem

2020-04-26 Thread Gourav Sengupta
Hi, Why are you using RDDs? And how are the files stored in terms of compression? Regards Gourav On Sat, 25 Apr 2020, 08:54 Roland Johann, wrote: > You can read both, the logs and the tree file into dataframes and join > them. Doing this spark can distribute the relevant records or even the >

Re: [pyspark] Load a master data file to spark ecosystem

2020-04-25 Thread Roland Johann
You can read both the logs and the tree file into DataFrames and join them. This way Spark can distribute the relevant records, or even the whole DataFrame via broadcast, to optimize the execution. Best regards Sonal Goyal wrote on Sat, Apr 25, 2020 at 06:59: > How does your tree_lookup_valu
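
The join suggested here could look roughly like the sketch below. It assumes the tree file can first be flattened into a small key/value DataFrame, which the thread does not confirm. The function name enrich_logs and the column name lookup_key are hypothetical. pyspark.sql.functions.broadcast and DataFrame.join are existing PySpark APIs.

```python
def enrich_logs(logs_df, master_df, key="lookup_key"):
    """Attach master-data columns to each log row with a broadcast join.

    logs_df and master_df are PySpark DataFrames that share the column
    named by key (a hypothetical column name).
    """
    # Imported inside the function so the sketch loads without Spark installed.
    from pyspark.sql.functions import broadcast

    # broadcast() hints that master_df is small enough to copy to every
    # executor, so the larger log data is joined without being shuffled.
    # A left join keeps log rows that have no match in the master data.
    return logs_df.join(broadcast(master_df), on=key, how="left")
```

Both arguments would typically come from spark.read. Whether a broadcast is appropriate depends on the master data fitting in executor memory.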

Re: [pyspark] Load a master data file to spark ecosystem

2020-04-24 Thread Sonal Goyal
How does your tree_lookup_value function work? Thanks, Sonal Nube Technologies On Fri, Apr 24, 2020 at 8:47 PM Arjun Chundiran wrote: > Hi Team, > > I have asked this question in stack overflow >

[pyspark] Load a master data file to spark ecosystem

2020-04-24 Thread Arjun Chundiran
Hi Team, I have asked this question on Stack Overflow and I didn't really get any convincing answers. Can somebody help me to solve this issue? Below is my problem. While building a log processing system, I