Hi Edmond,

Great to hear! A few follow-up questions:

- What do you use to perform the mapping? The RDFLib API, or SPARQL CONSTRUCT somehow?
- What's in your RDBs? Do they contain embedded JSON (that's what we have), or do you have plain tables?
- Do you use a PySpark schema?

Best,
Miel

On Thu, 22 Apr 2021 at 02:23, Edmond Chuc <[email protected]> wrote:

> Hi Miel,
>
> You can definitely create ETL pipelines with Apache Spark (using PySpark)
> and RDFLib. Read the JSON records into a Spark dataframe
> <https://sparkbyexamples.com/pyspark/pyspark-read-json-file-into-dataframe/>
> and create triples in a graph with RDFLib. This is how we are producing
> triples from relational databases with Spark at my workplace. Don't forget
> to repartition the dataframe into multiple chunks so that the chunks of the
> same dataframe are processed in parallel.
>
> Cheers,
>
> Edmond
>
> On Wed, Apr 21, 2021 at 10:16 PM Miel Vander Sande <[email protected]> wrote:
>
>> Hi all,
>>
>> I'm not sure whether this is the right place for this question, but
>> AFAIK the RDF Python community does not have a general community mailing
>> list like RDF.js?
>>
>> I was wondering whether there were any libraries / efforts using RDFLib
>> to create ETL pipelines for constructing RDF from various sources? I could
>> definitely use something like that, but couldn't really find anything yet.
>> The RML-based tools don't really work that well for my use cases (JSON
>> records) and they lack some transparency for debugging / interop with other
>> libraries when producing triples.
>>
>> I was already starting to think about a possible API and how it could
>> leverage Dask or Spark to really scale up. I'm not a Python/data
>> engineering expert, so this might come across as naive.
>>
>> ```
>> Mapping()              # lazy execution pipeline object
>>   .load(file1.json)    # creates graph from direct JSON mapping
>>   .construct(query1)   # creates new graph containing mapping from file1.json graph
>>   .construct(query2)   # creates new graph containing mapping from file1.json graph
>>   .load(file2.json)    # creates graph from direct JSON mapping
>>   .construct(query3)   # creates new graph containing mapping from file2.json graph
>>   .collect()           # aggregates all constructed graphs into one
>>   .check(shacl)        # validates the constructed graph against mapping
>>   .run()               # actually runs the pipeline
>> ```
>>
>> Best,
>>
>> Miel
>>
>> --
>> http://github.com/RDFLib
>> ---
>> You received this message because you are subscribed to the Google Groups
>> "rdflib-dev" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/rdflib-dev/32767b1c-ad4c-4c4b-8447-1919154f2427n%40googlegroups.com.
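
For reference, the approach Edmond describes in the thread (read JSON records, split them into chunks, produce triples per chunk) might look roughly like the sketch below. It is plain Python standing in for Spark, so nothing here is Edmond's actual code: `BASE`, `record_to_ntriples`, `partition_to_ntriples` and `chunked` are hypothetical names, the records are assumed to be flat with an `id` field and simple keys, and the triples are emitted as N-Triples lines rather than added to an RDFLib graph. A generator like `partition_to_ntriples` has the shape of a per-partition function that could be passed to PySpark's `RDD.mapPartitions`, and the resulting lines can be concatenated and parsed later.

```python
import json
from itertools import islice

BASE = "http://example.org/"  # hypothetical namespace, not from the thread


def escape_literal(value):
    """Escape a value for use inside an N-Triples string literal."""
    text = str(value)
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def record_to_ntriples(record):
    """Yield one N-Triples line per non-id field of a flat JSON record."""
    subject = f"<{BASE}record/{record['id']}>"
    for key, value in record.items():
        if key == "id":
            continue
        # Assumption: keys are simple names that are safe to append to an IRI.
        yield f'{subject} <{BASE}{key}> "{escape_literal(value)}" .'


def partition_to_ntriples(records):
    """Per-partition function: consumes an iterator of records, yields triples."""
    for record in records:
        yield from record_to_ntriples(record)


def chunked(items, size):
    """Split an iterable into lists of at most `size` items (stand-in for repartitioning)."""
    iterator = iter(items)
    while chunk := list(islice(iterator, size)):
        yield chunk


json_lines = [
    '{"id": 1, "name": "Alice"}',
    '{"id": 2, "name": "Bob", "city": "Ghent"}',
    '{"id": 3, "name": "Carol"}',
]
records = [json.loads(line) for line in json_lines]

# Each chunk is independent, so in Spark each one could be handled by a separate task.
partitions = list(chunked(records, 2))
ntriples = [line for part in partitions for line in partition_to_ntriples(part)]
```

Because N-Triples is line-based, the output of every chunk can simply be appended to one file, which avoids merging in-memory graphs across workers.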

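
The lazy pipeline API that Miel proposes in the original message could be prototyped along these lines. This is only a sketch under loud assumptions: the `Mapping` class does not exist in RDFLib or anywhere else, graphs are modelled as plain sets of `(subject, predicate, object)` tuples, `construct` takes a Python function instead of a SPARQL CONSTRUCT query, and `check` takes a simple predicate instead of SHACL shapes. What it does show is the lazy part: every method only queues a step, and nothing is executed until `run()`.

```python
import json


class Mapping:
    """Hypothetical lazy pipeline: methods queue steps, run() executes them."""

    def __init__(self):
        self._steps = []

    def _queue(self, step):
        self._steps.append(step)
        return self  # returning self enables method chaining

    def load(self, json_text):
        """Queue a direct mapping of JSON records into a source graph."""
        def step(state):
            records = json.loads(json_text)
            state["source"] = {
                (rec["id"], key, value)
                for rec in records
                for key, value in rec.items()
                if key != "id"
            }
        return self._queue(step)

    def construct(self, rule):
        """Queue a rule that builds a new graph from the latest source graph."""
        def step(state):
            state["outputs"].append(
                {out for triple in state["source"] for out in rule(triple)}
            )
        return self._queue(step)

    def collect(self):
        """Queue merging all constructed graphs into one."""
        def step(state):
            state["outputs"] = [set().union(*state["outputs"])]
        return self._queue(step)

    def check(self, is_valid):
        """Queue a validation step (a predicate stands in for SHACL here)."""
        def step(state):
            bad = [t for graph in state["outputs"] for t in graph if not is_valid(t)]
            if bad:
                raise ValueError(f"invalid triples: {bad}")
        return self._queue(step)

    def run(self):
        """Execute all queued steps in order and return the constructed graphs."""
        state = {"source": set(), "outputs": []}
        for step in self._steps:
            step(state)
        return state["outputs"]


people = '[{"id": "p1", "name": "Alice"}, {"id": "p2", "name": "Bob"}]'


def name_to_label(triple):
    subject, predicate, obj = triple
    if predicate == "name":
        yield (subject, "rdfs:label", obj)


def mark_person(triple):
    subject, _, _ = triple
    yield (subject, "rdf:type", "foaf:Person")


pipeline = (
    Mapping()
    .load(people)
    .construct(name_to_label)
    .construct(mark_person)
    .collect()
    .check(lambda triple: all(part != "" for part in triple))
)
result = pipeline.run()[0]  # only here does any work happen
```

Keeping the steps as a queued list is what would let a real implementation hand them to Dask or Spark instead of running them in a local loop.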