Re: Testing ETL with Spark using Pytest

Jerry Vinokurov Tue, 09 Feb 2021 07:48:09 -0800

Hi Mich,

I'm a bit confused by what you mean when you say that you cannot call a
fixture in another fixture. The fixtures resolve dependencies among
themselves by means of their named parameters. So that means that if I have
a fixture


@pytest.fixture
def fixture1():
    return SomeObj()

and another fixture

@pytest.fixture
def fixture2(fixture1):
    return do_something_with_obj(fixture1)

my second fixture will simply receive the object created by the first. As
such, you do not need to "call" the first fixture at all. Of course, if
you had some use case where you were constructing an object in the second
fixture, you could have the first return a class, or you could have it
return a function. In fact, I have fixtures in a project that do both. Here
they are:

@pytest.fixture
def func():
    def foo(x, y, z):
        return (x + y) * z

    return foo

That's a fixture that returns a function, and any test using the func
fixture would receive that actual function as a value, which could then be
invoked by calling e.g. func(1, 2, 3). Here's another fixture that's more
like what you're doing:


import pandas as pd
import pytest

@pytest.fixture
def data_frame():
    # Build a small two-row frame with named columns
    return pd.DataFrame.from_records(
        [(1, 2, 3), (4, 5, 6)], columns=['x', 'y', 'z']
    )

This one just returns a data frame that can be operated on.
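
To make the two patterns above concrete, here is a minimal sketch of how
tests and dependent fixtures consume them. The names combine, add_total,
rows and rows_with_total are made up for illustration, and plain tuples
stand in for a data frame so the example has no dependency beyond pytest.

```python
import pytest


def combine(x, y, z):
    # Same arithmetic as foo above, written as a plain function
    return (x + y) * z


def add_total(rows):
    # Hypothetical transform: append the sum of each row as a new column
    return [row + (sum(row),) for row in rows]


@pytest.fixture
def func():
    # Fixture that returns a function; tests receive the function itself
    return combine


@pytest.fixture
def rows():
    # Fixture that returns data
    return [(1, 2, 3), (4, 5, 6)]


@pytest.fixture
def rows_with_total(rows):
    # "rows" here is already the list built by the fixture above,
    # so it is used directly rather than called
    return add_total(rows)


def test_func(func):
    assert func(1, 2, 3) == 9


def test_rows_with_total(rows, rows_with_total):
    assert len(rows_with_total) == len(rows)
    assert rows_with_total[0] == (1, 2, 3, 6)
```

Keeping the logic in plain functions and wrapping them in thin fixtures
is just one way to lay this out; it also lets you exercise the logic
without going through pytest's fixture machinery.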

Looking at your setup, I don't want to say that it's wrong per se (it could
be very appropriate to your specific project to split things up among these
many files) but I would say that it's not idiomatic usage of pytest
fixtures, in my experience. It feels to me like you're jumping through a
lot of hoops to set up something that could be done quite easily and
compactly in conftest.py. I do want to emphasize that there is no
limitation on how fixtures can be used within functions or within other
fixtures (which are also just functions), since the result of the fixture
call is just some Python object.
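
As a rough sketch of what I mean by "compactly in conftest.py", something
along these lines might work. This is not your actual pipeline: the
fixture names (spark, source_df, transformed_df), the sample data and the
local[1] master are all assumptions for illustration, and the real
extract step would read from your source system instead.

```python
# conftest.py (hypothetical layout: all fixtures live in this one file)
import pytest

# Hypothetical sample data standing in for the real source table
COLUMNS = ["x", "y", "z"]
SAMPLE_ROWS = [(1, 2, 3), (4, 5, 6)]


@pytest.fixture(scope="session")
def spark():
    # Skip dependent tests cleanly when PySpark is not installed
    sql = pytest.importorskip("pyspark.sql")
    session = (
        sql.SparkSession.builder
        .master("local[1]")
        .appName("etl-tests")
        .getOrCreate()
    )
    yield session
    session.stop()


@pytest.fixture(scope="session")
def source_df(spark):
    # Extract step (assumed): build a small DataFrame in memory
    return spark.createDataFrame(SAMPLE_ROWS, COLUMNS)


@pytest.fixture(scope="session")
def transformed_df(source_df):
    # Transform step: source_df is the DataFrame itself, so no call is needed
    return source_df.withColumn(
        "total", source_df.x + source_df.y + source_df.z
    )


# Normally this would sit in a test module such as tests/test_etl.py
def test_transformed_df(source_df, transformed_df):
    assert transformed_df.count() == source_df.count()
    assert "total" in transformed_df.columns
```

pytest discovers fixtures defined in conftest.py automatically, so the
test modules do not need to import them.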

Hope this helps,
Jerry

On Tue, Feb 9, 2021 at 10:18 AM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> I was a bit confused with the use of fixtures in Pytest with the
> dataframes passed as an input pipeline from one fixture to another. I wrote
> this after spending some time on it. As usual it is heuristic rather than
> anything overtly by the book so to speak.
>
> In PySpark and PyCharm you can ETL from Hive to BigQuery or from Oracle
> to Hive etc. However, for pytest, I decided to use MySQL as the database of
> choice for testing with a small sample of data (200 rows). I mentioned
> fixtures. Simply put, "Fixtures are functions, which will run before each
> test function to which it is applied, to prepare data. Fixtures are used
> to feed some data to the tests such as database connections". If you have
> an ordering like read data (Extract), do something with it (Transform) and
> save it somewhere (Load) using Spark, then these all happen in
> memory with data frames feeding each other.
>
> The crucial thing to remember is that fixtures pass functions to each
> other as parameters not by invoking them directly!
>
> Example
>
> ## This is correct
> @pytest.fixture(scope = "session")
> def transformData(readSourceData):  ## fixture passed as parameter
>     # this is incorrect (cannot call a fixture in another fixture)
>     read_df = readSourceData()
>
> So this operation becomes
>
>     transformation_df = readSourceData. \
>         select( \
>         ....
>
> Say in PyCharm under tests package, you create a package "fixtures" (just
> a name nothing to do with "fixture") and in there you put your ETL python
> modules that prepare data for you. Example
>
> ### file --> saveData.py
> @pytest.fixture(scope = "session")
> def saveData(transformData):
>     # Write to test target table
>     try:
>         transformData. \
>             write. \
>             format("jdbc"). \
>             ....
>
>
> You then drive this test by creating a file called conftest.py under the
> tests package. You can then instantiate your fixture files by
> referencing them in this file as below
>
> import pytest
> from tests.fixtures.extractHiveData import extractHiveData
> from tests.fixtures.loadIntoMysqlTable import loadIntoMysqlTable
> from tests.fixtures.readSavedData import readSavedData
> from tests.fixtures.readSourceData import readSourceData
> from tests.fixtures.transformData import transformData
> from tests.fixtures.saveData import saveData
> from tests.fixtures.readSavedData import readSavedData
>
> Then you have your test Python file, say test_oracle.py, under the tests
> package, and you put your assertions there
>
> import pytest
> from src.config import ctest
>
> @pytest.mark.usefixtures("extractHiveData")
> def test_extract(extractHiveData):
>     assert extractHiveData.count() > 0
>
> @pytest.mark.usefixtures("loadIntoMysqlTable")
> def test_loadIntoMysqlTable(loadIntoMysqlTable):
>     assert loadIntoMysqlTable
>
> @pytest.mark.usefixtures("readSavedData")
> def test_readSourceData(readSourceData):
>     assert readSourceData.count() == ctest['statics']['read_df_rows']
>
> @pytest.mark.usefixtures("transformData")
> def test_transformData(transformData):
>     assert transformData.count() == ctest['statics']['transformation_df_rows']
>
> @pytest.mark.usefixtures("saveData")
> def test_saveData(saveData):
>     assert saveData
>
> @pytest.mark.usefixtures("readSavedData")
> def test_readSavedData(transformData, readSavedData):
>     assert readSavedData.subtract(transformData).count() == 0
>
> This is an illustration from PyCharm of the directory structure under tests
>
>
> [image: image.png]
>
>
> Let me know your thoughts.
>
>
> Cheers,
>
>
> Mich
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>

-- 
http://www.google.com/profiles/grapesmoker
