Hi Edmond,

Great to hear! A few follow-up questions:
- How do you perform the mapping: with the RDFLib API, or with SPARQL
CONSTRUCT queries?
- What's in your RDBs? Do they contain embedded JSON (that's what we have),
or plain tables?
- Do you use a PySpark schema?

 Best,

Miel

On Thu, 22 Apr 2021 at 02:23, Edmond Chuc <[email protected]> wrote:

> Hi Miel,
>
> You can definitely create ETL pipelines with Apache Spark (using PySpark)
> and RDFLib. Read the JSON records into a Spark dataframe
> <https://sparkbyexamples.com/pyspark/pyspark-read-json-file-into-dataframe/>
> and create triples in a graph with RDFLib. This is how we are producing
> triples from relational databases with Spark at my workplace. Don't forget
> to repartition the dataframe so that Spark can process its partitions in
> parallel.
>
> Cheers,
>
> Edmond
>
> On Wed, Apr 21, 2021 at 10:16 PM Miel Vander Sande <
> [email protected]> wrote:
>
>>
>> Hi all,
>>
>> I'm not sure whether this is the right place for this question, but
>> AFAIK the RDF python community does not have a general community mailing
>> list like RDF.js?
>>
>> I was wondering whether there were any libraries / efforts using RDFLib
>> to create ETL pipelines for constructing RDF from various sources? I could
>> definitely use something like that, but couldn't really find anything yet.
>> The RML-based tools don't really work that well for my use cases (JSON
>> records) and they lack transparency for debugging / interop with other
>> libraries when producing triples.
>>
>> I was already starting to think about a possible API and how it could
>> leverage Dask or Spark to really scale up. I'm not a Python/data
>> engineering expert, so this might come across as naive.
>>
>> ```
>> (
>>     Mapping()            # lazy execution pipeline object
>>     .load("file1.json")  # creates a graph from a direct JSON mapping
>>     .construct(query1)   # creates a new graph mapped from the file1.json graph
>>     .construct(query2)   # creates a new graph mapped from the file1.json graph
>>     .load("file2.json")  # creates a graph from a direct JSON mapping
>>     .construct(query3)   # creates a new graph mapped from the file2.json graph
>>     .collect()           # aggregates all constructed graphs into one
>>     .check(shacl)        # validates the constructed graph against the mapping
>>     .run()               # actually runs the pipeline
>> )
>>
>> ```
>>
>> Best,
>>
>> Miel
>>
>> --
>> http://github.com/RDFLib
>> ---
>> You received this message because you are subscribed to the Google Groups
>> "rdflib-dev" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/rdflib-dev/32767b1c-ad4c-4c4b-8447-1919154f2427n%40googlegroups.com
>> <https://groups.google.com/d/msgid/rdflib-dev/32767b1c-ad4c-4c4b-8447-1919154f2427n%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>>
>
