I’m just starting using PySpark (Databricks) for a education application. Part 
of this is monitoring children’s online behaviour to alert teachers whether 
there may be problems with bullying, extreme diets, suicide ideation, and so 
on. I have IoT data which I need to combine with information from MariaDB (this 
is all in Azure). I have SparkJDBC42 and mariadb_java_client_2_7_2 jars 
installed. The connection to the database is established, in that I can see it 
can retrieve the schema for tables.

I have a couple of issues. The first is that I can never retrieve any id 
columns (which are all defined as BigInt(20)), as I get a ‘long out of range’ 
error. I’m currently working around that by not including the ids themselves in 
the return. However, the big problem is that I get the column names returned in 
each row instead of the values for each row, where the columns are defined as 
strings (VARCHAR etc). Also, for columns defined as TinyInt they are returned 
as booleans, but reversed (0 is returned as True). I have tried running  the 
SQL outside of databricks/Spark (eg in DataGrip) and it returns perfectly 
sensible data every time.

The code at gist:412e1f3324136a574303005a0922f610 
<https://gist.github.com/alunap/412e1f3324136a574303005a0922f610>


Returned:
+----+------+----+-----------+----+--------------+
|uuid|gender| cpp|young_carer| spp|asylum_refugee|
+----+------+----+-----------+----+--------------+
|uuid|gender|true|       true|true|          true|
|uuid|gender|true|       true|true|          true|
|uuid|gender|true|       true|true|          true|
|uuid|gender|true|       true|true|          true|
|uuid|gender|true|       true|true|          true|
|uuid|gender|true|       true|true|          true|
|uuid|gender|true|       true|true|          true|
|uuid|gender|true|       true|true|          true|
|uuid|gender|true|       true|true|          true|
|uuid|gender|true|       true|true|          true|
+----+------+----+-----------+----+--------------+
only showing top 10 rows

On the database, device.uuid field is VARCHAR(255) and contains valid uuids (no 
nulls). 
children.gender is VARCHAR(255) and contains ‘M’, ‘F’, ‘MALE’, ‘FEMALE’, 
‘NONE’, or null.
children.cpp, young_carer, spp, and asylum_refugee are all tinyint(1) = 0. They 
are nearly all 0, but the first 10 rows contain some nulls.

I tried enclosing the query with brackets ‘(SELECT…) t’ as I gather it is a 
subquery, and I tried adding a WHERE d.uuid = ‘an id’ with an id being one 
where there are no nulls in the column, but no difference. So, completely 
baffled at this point.

Thanks for any suggestions,

Alun ap Rhisiart

Reply via email to