Re: A handy tool called spark-column-analyser

[email protected] Tue, 21 May 2024 13:24:07 -0700

 Great work. Very handy for identifying problems
thanks
    On Tuesday 21 May 2024 at 18:12:15 BST, Mich Talebzadeh 
<[email protected]> wrote:  
 
 A colleague kindly pointed out about giving an example of output which wll be 
added to README
Doing analysis for column Postcode
Json formatted output
{    "Postcode": {        "exists": true,        "num_rows": 93348,        
"data_type": "string",        "null_count": 21921,        "null_percentage": 
23.48,        "distinct_count": 38726,        "distinct_percentage": 41.49    }}
Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom




   view my Linkedin profile




 https://en.everybodywiki.com/Mich_Talebzadeh

 



Disclaimer: The information provided is correct to the best of my knowledge but 
of course cannot be guaranteed . It is essential to note that, as with any 
advice, quote "one test result is worth one-thousand expert opinions (Werner 
Von Braun)".


On Tue, 21 May 2024 at 16:21, Mich Talebzadeh <[email protected]> wrote:


I just wanted to share a tool I built called spark-column-analyzer. It's a 
Python package that helps you dig into your Spark DataFrames with ease. 

Ever spend ages figuring out what's going on in your columns? Like, how many 
null values are there, or how many unique entries? Built with data preparation 
for Generative AI in mind, it aids in data imputation and augmentation – key 
steps for creating realistic synthetic data.

Basics
   
   - Effortless Column Analysis: It calculates all the important stats you need 
for each column, like null counts, distinct values, percentages, and more. No 
more manual counting or head scratching!
   - Simple to Use: Just toss in your DataFrame and call the analyze_column 
function. Bam! Insights galore.
   - Makes Data Cleaning easier: Knowing your data's quality helps you clean it 
up way faster. This package helps you figure out where the missing values are 
hiding and how much variety you've got in each column.
   - Detecting skewed columns
   - Open Source and Friendly: Feel free to tinker, suggest improvements, or 
even contribute some code yourself! We love collaboration in the Spark 
community.

Installation: 


Using pip from the link: https://pypi.org/project/spark-column-analyzer/

pip install spark-column-analyzer

Also you can clone the project from gitHub

git clone https://github.com/michTalebzadeh/spark_column_analyzer.git

The details are in the attached RENAME file
Let me know what you think! Feedback is always welcome.

HTH

Mich Talebzadeh,

Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom



   view my Linkedin profile




 https://en.everybodywiki.com/Mich_Talebzadeh

 



Disclaimer: The information provided is correct to the best of my knowledge but 
of course cannot be guaranteed . It is essential to note that, as with any 
advice, quote "one test result is worth one-thousand expert opinions (Werner 
Von Braun)".

Re: A handy tool called spark-column-analyser

Reply via email to