crepererum - (Re-)Writing Digital Library Software

Last summer I worked for the Invenio team. Now, after a break of some months, I think I can talk about this piece of software with a certain amount of distance. To get to the point: I would rewrite it, and here is why and how.

Preface

You may wonder what this Digital Library Software thing is about: it is (mostly web-based) software that enables libraries to

store information about their books and other analog “records” as well as digital publications (like papers, but also data, graphics and source code)
search it using quite complex queries
provide metadata using various formats
handle different user roles
link to other libraries
handle archiving of nearly everything during nearly every step of data input while following quite strict standards of Digital Preservation (that is the magic term)
enable additional services (e.g. messaging, book renting, rating, social media integration).

The software is highly customizable using plugins and the website should always match the branding guidelines of the library/university/organization.

Programming Language

Invenio is written in Python. I like that language, but for this kind of project it is a disaster. Apart from the fact that the dynamic nature of the language led to some horribly overengineered constructs, the fact that you do not need to compile it wastes so many working hours. It is not only about not checking types ahead of production, but also about checking function signatures and the existence of the methods you are calling. Refactoring the code is hell.

Another point that you might believe or not: the performance is not great. You can do better than just buying more and more servers for a small library or waiting over 200ms until you get a usable response from the server.

Getting a proper, statically typed and compiled language is key for distributed development and refactoring (which will occur at some point in your development process). Two languages to consider: Scala and Go.

Data Storage

Currently Invenio stores everything useful as a dynamically typed record (JSON + JSON Schema) and some stuff as normal database tables. Make everything (apart from configs) a record. Period. This simplifies plugin handling, archiving, backups (these two are not necessarily the same) and signal handling.

Also, make records immutable. Once they are stored, you cannot change them. You can create a new one based on the old one in a non-linear, Git-like history (by referring to IDs, see the section about cryptography).
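A minimal sketch of what such immutable records with a Git-like history could look like. The names `Record`, `revise` and `merge` are hypothetical, and deriving the ID from a SHA-256 hash over content and parent IDs is an assumption made for illustration, not how Invenio works.

```python
import hashlib
import json
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class Record:
    """An immutable record; all field names are hypothetical."""
    content: str          # serialized payload
    parents: tuple = ()   # IDs of the records this one is based on

    @property
    def record_id(self) -> str:
        # Assumption: the ID is a hash over content and parents,
        # so any change necessarily produces a different ID.
        canonical = json.dumps(
            {"content": self.content, "parents": list(self.parents)},
            sort_keys=True,
        )
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def revise(old: Record, new_content: str) -> Record:
    """Create a new record based on an old one; the old one stays untouched."""
    return Record(content=new_content, parents=(old.record_id,))


def merge(a: Record, b: Record, merged_content: str) -> Record:
    """Non-linear history: a record may refer to more than one parent."""
    return Record(content=merged_content, parents=(a.record_id, b.record_id))
```

Trying to assign to `content` on an existing `Record` raises `FrozenInstanceError`, so the only way to "change" a record is to call `revise` or `merge` and store the result as a new record.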

Storing records as JSON gives you great flexibility to store highly nested records and extend given JSON Schemas however you want. But how often do you have deeply nested records, and how fast can you process billions or even trillions of records (which happens when you store HEP data)? Sorry, but my small library cannot afford 4 database servers. So choose a database schema for your different records and store them in a proper, old-school database. It is faster, safer and flexible enough for 99% of the use cases.
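As a rough illustration of the "old-school database" idea, here is a small sketch using Python's built-in `sqlite3` module. The `book` table, its columns and the helper functions are hypothetical and only show how a fixed schema gives you type and constraint checks plus indexed lookups for free.

```python
import sqlite3

# Hypothetical schema for one record type ("book"); column names are illustrative.
SCHEMA = """
CREATE TABLE book (
    id     INTEGER PRIMARY KEY,
    title  TEXT    NOT NULL,
    author TEXT    NOT NULL,
    year   INTEGER NOT NULL CHECK (year > 0),
    isbn   TEXT    UNIQUE
);
CREATE INDEX book_author_idx ON book (author);
"""


def open_library(path=":memory:"):
    """Open a database and create the schema (in memory by default)."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_book(conn, title, author, year, isbn=None):
    """Insert a book; the database itself rejects incomplete rows."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO book (title, author, year, isbn) VALUES (?, ?, ?, ?)",
            (title, author, year, isbn),
        )
    return cur.lastrowid


def titles_by_author(conn, author):
    """Indexed lookup, ordered by publication year."""
    rows = conn.execute(
        "SELECT title FROM book WHERE author = ? ORDER BY year", (author,)
    )
    return [title for (title,) in rows]
```

A row with a missing title is refused with `sqlite3.IntegrityError` before it ever reaches application code, which is the kind of safety a schema-less JSON blob does not give you by itself.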

Frontend and API

Instead of using an internal API to generate the web output and a REST API to serve the remaining world, build one proper, fast API and serve the web page from it as well.

Some words about the frontend: no matter what framework you use, use one and force plugin developers to use it as well. Also use one icon set, one CSS style and a single JavaScript standard (or TypeScript, or CoffeeScript or whatever). Build your assets during product compilation, not at runtime. Make the CSS styling and icon sets exchangeable. Teach people how to use modern standards (e.g. responsive design, vector graphics for icons and logos, accessibility).

Be smart when it comes to translations. Purely string-based translations, like most systems based on gettext, can lead to dumb results, because you assume that

Same English text ⇒ Same context
Translation is possible only by seeing the string
Non-Experts can grasp the context only based on strings, but without ever seeing them used by the frontend
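To make the first assumption concrete, here is a small sketch of a context-aware lookup. The catalog, the context names and the `translate` helper are hypothetical; GNU gettext offers a comparable mechanism through message contexts (`msgctxt` / `pgettext`), but it only helps if developers actually use it.

```python
# Hypothetical German catalog, keyed by (context, English text).
CATALOG_DE = {
    ("button", "Open"): "Öffnen",  # verb: open this record
    ("status", "Open"): "Offen",   # adjective: the request is still open
}


def translate(context: str, text: str, catalog: dict) -> str:
    """Look up a translation by context AND text; fall back to the source text."""
    return catalog.get((context, text), text)


button_label = translate("button", "Open", CATALOG_DE)
status_label = translate("status", "Open", CATALOG_DE)
```

The same English string ends up as two different German strings. A catalog keyed only by the English text would have to pick one of them and be wrong in the other place.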

Developers First

A major obstacle to good contributions is a long and complex project setup. If it takes more than 3 commands in your terminal to get a fully working and debuggable instance of the entire project, including dependencies like databases and message queues, you are doing something wrong. And do not expect that developers will pollute their entire system with your product. In fact, do not even require them to install any database on their system.

Another thing is short but precise contribution guidelines that do not implicitly rely on other (sometimes very old) guidelines like the GNU coding standards (which, by the way, are a glorious example of being too long and outdated). If it takes more than 3 minutes to read them, they are too long. Period. Ensure that contributors do not have to wait longer than 2 workdays to get a first comment on their (reasonably sized) pull requests and never longer than 4 workdays from “ready to merge” to merge without a good reason. To achieve this, do not rip your software into 100 pieces and put them into separate repositories, especially when you only rely on GitHub without any external issue and review solution. Ideally, automate merging and testing with tools like Homu and forbid direct commits even by project leaders.

It’s sufficient if it pops up when typing in the right thing into the search box.

That is my very favorite bullshit sentence about documentation. How should someone know what to type in? Provide a short but complete overview, not only about the APIs, but also about the assumptions and standards of your project. For Digital Library Software that is for example: SIP, AIP and (sadly) MARC 21. Do not forget to keep your docs up-to-date, especially when changing core technologies or deployment methods.

Last but not least: Simplify testing, both the execution and the creation of new tests. It is unacceptable that a developer has to run a one-hour test on a local system for every single commit only to ensure some very basic components are tested. Provide a complete infrastructure to run integration and regression tests so developers can use their machines to develop and let servers do the automated hard work. BTW: Everyone has to provide and accept tests, including team leaders.

Trust and Hard Cryptography

Nowadays libraries are not only providing historical books and novels, but also information that is critical for political, economic, scientific and ethnic reasons and that the users have to trust. Shockingly, no major online repository or library provides a useful trust model, so here is my humble proposal: Assuming that

Everything is an immutable record as described above and \(C\) is the content of the record.
Every submitter has a public key \(P\) and a secret key \(S\). (see Public-key Cryptography)

we use the following procedure:

Sign the payload and append the signature to it to create a record \(R=C\,|\,\mathrm{sign}(C,S)\).
Keep the record secret and hand in (optionally through an anonymizing channel like Tor) a hash of the record \(I=\mathrm{hash}(R)\), which at the same time will be the record ID.
The library (or a trustworthy foundation) adds the record ID to a public block chain (not the Bitcoin one, but a similar design approach).
When the number of confirmations in the block chain is sufficient, transmit the record \(R\) to the library to request the final publication.
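The four steps above can be sketched in a few lines. Everything here is an assumption made for illustration: the `Ledger` class is a toy in-memory hash chain standing in for the public block chain, SHA-256 stands in for \(\mathrm{hash}\), and `sign` uses HMAC-SHA256 purely as a stand-in because the Python standard library has no public-key signatures. A real system would sign with the secret key \(S\) and let the library verify with the public key \(P\).

```python
import hashlib
import hmac


def sign(content: bytes, secret_key: bytes) -> bytes:
    # STAND-IN ONLY: HMAC is symmetric and NOT a public-key signature.
    return hmac.new(secret_key, content, hashlib.sha256).digest()


def make_record(content: bytes, secret_key: bytes) -> bytes:
    """Step 1: R = C | sign(C, S), with '|' read as concatenation."""
    return content + sign(content, secret_key)


def record_id(record: bytes) -> str:
    """Step 2: I = hash(R); only this value is handed in at first."""
    return hashlib.sha256(record).hexdigest()


class Ledger:
    """Toy append-only hash chain (hypothetical stand-in for the block chain)."""

    def __init__(self):
        self.blocks = []  # list of (record_id, block_hash)

    def add(self, rid: str) -> None:
        """Step 3: chain the new record ID to the previous block."""
        prev = self.blocks[-1][1] if self.blocks else "0" * 64
        block_hash = hashlib.sha256((prev + rid).encode("ascii")).hexdigest()
        self.blocks.append((rid, block_hash))

    def confirmations(self, rid: str) -> int:
        """Blocks from the one holding rid to the chain tip (0 if absent)."""
        for index, (stored, _) in enumerate(self.blocks):
            if stored == rid:
                return len(self.blocks) - index
        return 0


def publish(record: bytes, ledger: Ledger, required: int) -> bool:
    """Step 4: accept R only if hash(R) is sufficiently confirmed."""
    return ledger.confirmations(record_id(record)) >= required
```

Because the ledger only ever sees \(I\), the submitter can keep \(R\) private until enough blocks have been added on top. A record that was altered after the ID was handed in hashes to a different ID and is simply not found. The sketch leaves out signature verification, which the library would have to do before publishing.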

This method ensures that you cannot alter a record after publication (integrity) and that the submitter can prove that she was the first submitter (authorship claiming). Some notes:

The record ID can be used to upgrade a record (i.e. creating a new immutable one referring to the old one) or to annotate records (e.g. with comments, ratings, reviews, extracted metadata, citations).
Records can be encrypted by encrypting the content before submitting it, so that the record then is \(R=C'\,|\,\mathrm{sign}(C',S)\) with \(C'=\mathrm{encrypt}(C,P')\) and \(P'\) being any public key.
Records can be deleted from the library but you cannot remove them from the block chain.
The number of confirmations between handing in the record ID and doing the actual publication depends on the level of security you want. For messages or extracted metadata that might be lower than for a highly controversial paper.
If the Digital Library System generates a record automatically, it also needs a key pair. The secret key must be kept secret, like a private key for TLS.
To keep the signature overhead low, I suggest using elliptic curve cryptography and a trustworthy scheme like Ed25519.
I also suggest implementing a key registry to simplify key handling and giving authors the ability to approve (i.e. sign) each other's keys.
Ensure that you use hashes that are long enough to avoid collisions but short enough to enable efficient processing.
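As a rough guide for the last note, the usual birthday bound can be used. Assuming an ideal, uniformly distributed hash with \(b\) output bits and \(n\) stored records, the probability of at least one collision is approximately

\[
p \approx \frac{n(n-1)}{2 \cdot 2^{b}} \approx \frac{n^{2}}{2^{b+1}} \qquad (\text{valid for } p \ll 1).
\]

For \(n = 10^{12}\) records and \(b = 256\) this gives \(p \approx 10^{24} / 2^{257} \approx 4 \times 10^{-54}\), which is negligible. For \(b = 64\) the same expression exceeds 1, so the approximation breaks down and collisions are practically certain. The numbers are only an illustration under the stated assumptions, not a recommendation for a specific hash function.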

Disclaimer: I have worked for the Invenio core developer team at CERN. While this text is loosely based on my experience there, it does not contain internal information. The development of Invenio is publicly available on GitHub. Also, I can say (without getting paid some extra amount 😉) that I learned a lot during my time at CERN and am thankful for the opportunity to work with the guys there. Feel free to check out my final presentation about my work there.

Image: “1800s Library” by Barta IV, CC BY 2.0, 2013
