Peter Ansell raised a valid question about why in Commons RDF I made the 
RDFParser interface as an immutable factory-like builder pattern.

https://commons.apache.org/proper/commons-rdf/apidocs/org/apache/commons/rdf/experimental/RDFParser.html


Here is how it can be used today:

 RDF rdf = new JenaRDF();
 Graph g1 = rdf.createGraph();
 Future future = new JenaRDFParser()
         .source(Paths.get("/tmp/graph.ttl"))
         .contentType(RDFSyntax.TURTLE)
         .target(g1).parse();
 future.get(30, TimeUnit.Seconds); // block
 

You may notice it is not a factory as-such, as you only get a Future
of the parser result, rather than the actual "parser". Also it is
currently an interface, with multiple implementations (one per RDF
implementation).



Although the javadoc only suggests for it to be immutable and
threadsafe, the abstract class AbstractRDFParser
https://commons.apache.org/proper/commons-rdf/apidocs/org/apache/commons/rdf/simple/experimental/AbstractRDFParser.html
https://github.com/apache/commons-rdf/blob/master/simple/src/main/java/org/apache/commons/rdf/simple/experimental/AbstractRDFParser.java
is both, and thus every "option setter" method therefore return a cloned
copy of the factory with the mutated fields set there. This simply uses 
.clone() at the moment.

This abstract class and immutability pattern is used by each of the
three RDF implementations of the interface.


This means it's safe to parse many files like this:

 JenaRDFParser parser = new JenaRDFParser()
         .contentType(RDFSyntax.TURTLE)
         .target(g1);
  
 // Parse multiple files in parallell
 Future future1 = parser.source(Paths.get("/tmp/graph1.ttl")).parse();
 Future future2 = parser.source(Paths.get("/tmp/graph2.ttl")).parse();

 // Wait for both to finish
 future1.get(30, TimeUnit.Seconds); 
 future2.get(30, TimeUnit.Seconds);


Above the second call to parser.source(..) is not a problem (even if
it happens in another thread) as both .source() calls return separate
cloned JenaRDFParser instances and don't mutate the "parser" instance.


Now a valid critique of such a pattern is that it's wasteful to make
clones that are immediately thrown away, for instance the above
future1/future2 example would cause 7 instantiations of JenaRDFParser()
(.parse() does some finalization in a final clone).  

I have not done any benchmarking, but envision that as parsing RDF files
generally creates many thousands of RDFTerm instances, a couple of
extra parser instantiations shouldn't make a big dent in the bill.


The typical builder pattern is rather to return "this" and mutate fields
directly. This is obviously not thread-safe but generally achives the
same. Multiple threads would however instead have to create a brand new
parser from start and set all the options themselves.  (Or they could a
similar .clone() method from a carefully locked down "proto" instance
that everyone has to be careful not to call any mutating methods on)


Now multiple threads is now very common because of Java 8 streams, and
for instance it would not be unthinkable that a parallelStream() is
mapped as a .source() or .target() on such a parser builder. The Futures
fit somewhat into this picture as well.   I think this would get
complicated with having to set everything again rather than just pass
"parser::source" as a method reference.




As this was just thrown together as an experiment (and we've not yet
tried writing the corresponding RDFWriter interface), I think we could
revisit how to do this pattern, hopefully drawing on the wider
experience of the Apache Commons developers on this list.



Roughly here are my requirements:

* Ability to set various options (e.g. path of file to parse, format,
  destination) in any order
* Avoid mega-constructors/multi-arg methods 
* Some options take multiple types, e.g a source can be given as a Path,
  InputStream or IRI (URL) -- but would override each other

Nice to have:

* Reusable (e.g. set options that are common, but modify
  source() to parse multiple files)
* Thread-safe (e.g. immediate reuse of builder before a parsing session
  has started or finished) 
* Easy to serialize / keep around (should not keep references to big
  objects in memory uneccessarily)
* Parsing in the background, e.g. return Future. (implementation manages
  threads/queues as it pleases)

Unclear:

* How to select between the multiple RDFParser instances? 
  I was thinking the RDF interface could have a .makeParser() method, 
  but not all parsers can take all formats, so possibly some kind of 
  registry or "what can you handle" mechanism might be needed.


How do you think we should proceed? Mutable or immutable? How should the
settings be kept? Fields, map, or what?  Does it make sense with an
interface, abstract class (keeps settings) and implementations
(processess settings), or should we have a single ParserFactory class
and have a new internal interface below?


-- 
Stian Soiland-Reyes
University of Manchester
http://www.esciencelab.org.uk/
http://orcid.org/0000-0001-9842-9718


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org

Reply via email to