A colleague kindly suggested adding an example of the output, which will
be added to the README.
Doing analysis for column Postcode
JSON-formatted output:
{
    "Postcode": {
        "exists": true,
        "num_rows": 93348,
        "data_type": "string",
        "null_count": 21921,
        "null_percentage": 23.48,
        "distinct_count": 38726,
        "distinct_percentage": 41.49
    }
}
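For reference, here is a minimal sketch of how output like this can be
produced. The import path and the exact analyze_column signature are
assumptions on my part (the README has the authoritative usage), and the
CSV file name is purely illustrative:

    import json
    from pyspark.sql import SparkSession
    # Assumed import path; check the project README for the exact module layout
    from spark_column_analyzer import analyze_column

    spark = SparkSession.builder.appName("ColumnAnalysis").getOrCreate()
    # Illustrative source file; any DataFrame with a Postcode column would do
    df = spark.read.csv("property_data.csv", header=True, inferSchema=True)

    # Assumed signature: pass the DataFrame and a column name, get a dict back
    result = analyze_column(df, "Postcode")
    print(json.dumps(result, indent=4))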
Mich Talebzadeh,
Technologist | Architect | Data Engineer | Generative AI | FinCrime
London
United Kingdom
View my LinkedIn profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
https://en.everybodywiki.com/Mich_Talebzadeh
*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, "one test result is worth one-thousand
expert opinions" (Wernher von Braun
<https://en.wikipedia.org/wiki/Wernher_von_Braun>).
On Tue, 21 May 2024 at 16:21, Mich Talebzadeh <[email protected]>
wrote:
> I just wanted to share a tool I built called *spark-column-analyzer*.
> It's a Python package that helps you dig into your Spark DataFrames with
> ease.
>
> Ever spent ages figuring out what's going on in your columns? Like how
> many null values there are, or how many unique entries? Built with data
> preparation for Generative AI in mind, it aids in data imputation and
> augmentation, key steps for creating realistic synthetic data.
>
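> As a concrete example of what that enables, here is a minimal plain-PySpark
> sketch of using a reported null percentage to choose between imputing and
> dropping; the DataFrame, threshold and placeholder value are purely
> illustrative:
>
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.getOrCreate()
> # Tiny illustrative DataFrame standing in for real property data
> df = spark.createDataFrame(
>     [("SW1A 1AA",), ("EC1A 1BB",), ("W1A 0AX",), (None,)], ["Postcode"]
> )
>
> # Suppose the analyzer reported this for the Postcode column
> stats = {"null_percentage": 25.0}
>
> if stats["null_percentage"] < 30.0:
>     # Mostly populated: fill the gaps with a placeholder before augmentation
>     df = df.fillna({"Postcode": "UNKNOWN"})
> else:
>     # Too sparse to impute safely: drop the incomplete rows instead
>     df = df.dropna(subset=["Postcode"])
>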
> *Basics*
>
> - *Effortless Column Analysis:* Calculates the important stats you need
> for each column, such as null counts, distinct values and percentages (a
> plain-PySpark sketch of these calculations follows this list). No more
> manual counting or head scratching!
> - *Simple to Use:* Just toss in your DataFrame and call the
> analyze_column function. Bam! Insights galore.
> - *Makes Data Cleaning Easier:* Knowing your data's quality helps you
> clean it up way faster. This package helps you figure out where the missing
> values are hiding and how much variety you've got in each column.
> - *Skew Detection:* Identifies columns with skewed value distributions.
> - *Open Source and Friendly:* Feel free to tinker, suggest
> improvements, or even contribute some code yourself! We love collaboration
> in the Spark community.
>
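> To make those stats concrete, this is roughly the plain-PySpark calculation
> behind numbers like the ones in the JSON above (the tiny DataFrame is
> illustrative, and the exact definitions the package uses, e.g. whether NULLs
> count towards distinct values, are in its README):
>
> from pyspark.sql import SparkSession, functions as F
>
> spark = SparkSession.builder.getOrCreate()
> df = spark.createDataFrame(
>     [("SW1A 1AA",), ("SW1A 1AA",), (None,), ("EC1A 1BB",)], ["Postcode"]
> )
>
> row = df.agg(
>     F.count(F.lit(1)).alias("num_rows"),
>     F.sum(F.col("Postcode").isNull().cast("int")).alias("null_count"),
>     F.countDistinct("Postcode").alias("distinct_count"),  # ignores NULLs
> ).collect()[0]
>
> print({
>     "num_rows": row["num_rows"],
>     "null_count": row["null_count"],
>     "null_percentage": round(row["null_count"] / row["num_rows"] * 100, 2),
>     "distinct_count": row["distinct_count"],
>     "distinct_percentage": round(row["distinct_count"] / row["num_rows"] * 100, 2),
> })
>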
> *Installation:*
>
> Using pip (https://pypi.org/project/spark-column-analyzer/):
>
>
> *pip install spark-column-analyzer*
> You can also clone the project from GitHub:
>
> *git clone https://github.com/michTalebzadeh/spark_column_analyzer.git*
>
> The details are in the attached README file.
>
> Let me know what you think! Feedback is always welcome.
>
> HTH
>