High Assurance Functional RDF in Haskell

As 2011 winds to a close, I want to share the results of my recent coding project. The purpose of the project is to demonstrate each of the capabilities suggested by the title above in a working solution, and to develop a set of practices for delivering such a solution. After quite some time in the Haskell woodshed, I finally have enough tacit knowledge to run the interpreter and compiler and productively diagnose their feedback. This post is also an opportunity to get feedback from the Haskell community on how to improve both the code and the practices.

The simple use case is to display, at a terminal, the elapsed time in seconds to load a file of RDF data. The single assumption is that the data is encoded in JSON according to this Talis specification. The single constraint is that the load must take less than 10 seconds for a 2.5 MB file.
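Measuring the elapsed load time can be sketched with a small timing helper. The `timed` name and the stand-in computation below are mine, not from the project; in the real program the timed action would be the RDF/JSON load:

```haskell
import Data.Time.Clock (getCurrentTime, diffUTCTime, NominalDiffTime)

-- Wall-clock timing: run an IO action, return its result and elapsed time.
timed :: IO a -> IO (a, NominalDiffTime)
timed act = do
  t0 <- getCurrentTime
  x  <- act
  t1 <- getCurrentTime
  return (x, diffUTCTime t1 t0)

main :: IO ()
main = do
  -- A stand-in computation keeps the example self-contained;
  -- `seq` forces the work before the clock is stopped.
  (n, dt) <- timed (let s = sum [1 .. 1000000 :: Integer] in s `seq` return s)
  putStrLn $ "Result " ++ show n ++ " computed in " ++ show dt
```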

The non-functional capabilities of the solution are more interesting. I use the encodings in Paul Tarau’s Isomorphism library both to encode the data as numbers and later to recover the original text. The function that reduces the data to a directed graph (r2dig), the function that finds the indices (ndcs), and those that recover the text from the encoding (recover and decode) are the most original work. The solution also includes a graph-matching function only indirectly related to the use case. This solution could be developed further into a triple store with various functions that take advantage of the encodings.
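To give a flavor of the idea of encoding data as numbers, here is a sketch using the classic Cantor pairing function. This is only my illustration: Tarau’s library supplies much richer isomorphisms, and the project’s actual encoding differs. An RDF triple, once its terms are mapped to naturals, folds into a single natural and is recovered exactly:

```haskell
-- Cantor pairing: a bijection between pairs of naturals and naturals.
-- NOTE: an illustrative stand-in, not the encoding Tarau's library uses.
cantorPair :: Integer -> Integer -> Integer
cantorPair x y = (x + y) * (x + y + 1) `div` 2 + y

cantorUnpair :: Integer -> (Integer, Integer)
cantorUnpair z = (w - y, y)
  where
    w = (isqrt (8 * z + 1) - 1) `div` 2
    y = z - w * (w + 1) `div` 2
    isqrt :: Integer -> Integer
    isqrt = floor . sqrt . (fromInteger :: Integer -> Double)  -- fine for small inputs

-- A (subject, predicate, object) triple becomes one natural by nesting.
encodeTriple :: (Integer, Integer, Integer) -> Integer
encodeTriple (s, p, o) = cantorPair s (cantorPair p o)

decodeTriple :: Integer -> (Integer, Integer, Integer)
decodeTriple n =
  let (s, po) = cantorUnpair n
      (p, o)  = cantorUnpair po
  in (s, p, o)

main :: IO ()
main = print (decodeTriple (encodeTriple (7, 11, 13)))  -- prints (7,11,13)
```

A round-trip property like `decodeTriple . encodeTriple == id` is exactly the kind of mathematical property QuickCheck can verify over random input.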

The high assurance reference speaks to use of the GHC type system as well as the practice of verification through QuickCheck properties. This practice implies more than coding properties; it implies devising a verification approach that both avoids selection bias and proves mathematical properties specific to the solution. No doubt the Haskell type system supports the high assurance claim. With some excellent advice from David McBride, the Talis specification of the JSON encoding provided sufficient guidance to construct and then refactor the data types and the associated instances of the FromJSON type class in Bryan O’Sullivan’s Aeson library.
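As a rough sketch of what such FromJSON instances can look like — the `Node` and `Graph` names and fields here are mine, not the project’s, and the code assumes aeson 2.x — the Talis RDF/JSON shape (subjects mapping to predicates mapping to lists of typed value objects) decodes naturally:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Aeson (FromJSON (..), decode, withObject, (.:))
import qualified Data.Aeson.KeyMap as KM
import Data.Text (Text)
import qualified Data.ByteString.Lazy as BL

-- One RDF object node in the Talis RDF/JSON encoding.
data Node = Node
  { nodeType  :: Text   -- "uri", "literal", or "bnode"
  , nodeValue :: Text
  } deriving (Show, Eq)

instance FromJSON Node where
  parseJSON = withObject "Node" $ \o ->
    Node <$> o .: "type" <*> o .: "value"

-- Subjects map to predicate maps; predicates map to lists of nodes.
type Graph = KM.KeyMap (KM.KeyMap [Node])

sample :: BL.ByteString
sample = "{\"http://example.org/s\":{\"http://example.org/p\":[{\"type\":\"literal\",\"value\":\"hello\"}]}}"

main :: IO ()
main = print (decode sample :: Maybe Graph)
```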

The functional reference speaks to evolving my Haskell style. The functional style became more natural to me as I further developed the solution. Haskell is the fourth programming language I’ve adopted in my career.

As important as the language are the practices for delivering a high assurance, high(er) performing solution. In addition to documentation with HUnit and verification with QuickCheck, those practices include the following: examining symmetric multiprocessor parallelism (SMP), profiling, exploring heuristics in the run-time system (the -K and -H flags), inlining, and optimizing for fitness with a genetic algorithm.
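For concreteness, the profiling and RTS experiments might look like the following. The program and data-file names are placeholders, and the flag spellings are for a modern GHC:

```shell
# Build with optimization and profiling support
ghc -O2 -prof -fprof-auto -rtsopts Main.hs -o hafr

# Produce a time/allocation profile in hafr.prof
./hafr +RTS -p -RTS data.json

# RTS heuristics: -H suggests a heap size, -K raises the stack limit;
# -s prints GC and allocation statistics afterwards
./hafr +RTS -H512m -K64m -s -RTS data.json

# SMP: build threaded, then run across four capabilities
ghc -O2 -threaded -rtsopts Main.hs -o hafr
./hafr +RTS -N4 -RTS data.json
```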

Inlining and heuristics were the most useful practices for reducing execution time. A manual binary search above and below abductively chosen thresholds allowed the heuristics to reduce execution time. Inlining the r2dig and ndcs functions cut execution time from 10s to a little more than 5s on my quad-core i7. Pure SMP was not fruitful: early indications were that an SMP approach would be comparable to a single thread. I did not explore parallel IO. I welcome any suggestion on how to incorporate SMP. I ran a half dozen simply configured Acovea sessions with some optimization. See the configuration and reports.
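The inlining itself is just pragmas on the hot functions. The bodies below are stand-ins I wrote for illustration — the real r2dig and ndcs have different types and logic — so only the pragma placement is the point:

```haskell
-- INLINE asks GHC to substitute the definition at call sites, which in
-- the project roughly halved the load time. Bodies here are stand-ins.
{-# INLINE r2dig #-}
-- Reduce edges to an adjacency-style structure (stand-in body).
r2dig :: [(Int, Int)] -> [(Int, [Int])]
r2dig es = [ (s, [o | (s', o) <- es, s' == s]) | s <- dedup (map fst es) ]
  where dedup = foldr (\x acc -> if x `elem` acc then acc else x : acc) []

{-# INLINE ndcs #-}
-- Find the indices at which an element occurs (stand-in body).
ndcs :: Eq a => a -> [a] -> [Int]
ndcs x xs = [ i | (i, y) <- zip [0 ..] xs, y == x ]

main :: IO ()
main = do
  print (r2dig [(1, 2), (1, 3), (2, 3)])
  print (ndcs 'a' "banana")
```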

You will find the files on github/hafr. I welcome feedback on refactoring, style, performance tuning, etc.

Looking ahead to 2012, I will be extending my high assurance capabilities by investing more time in Hets. The functional target is Information Flow. I’m still waffling on the Isabelle vs. Agda decision. You can do that in Washington.