Love metadata, love technology
Crossref makes research outputs easy to find, cite, link, and assess. They are a not-for-profit membership organization that exists to make scholarly communications better.
I spoke with Joe Wass, Principal R&D Engineer in Crossref Labs, to understand how Crossref are using Clojure and what problems it solves for them.
Mark: Can you tell us a little bit about the history of Crossref?
Joe: Crossref was founded by a group of scholarly publishers who essentially wanted to solve link-rot for citations. When an author publishes a journal article, their references section can span articles published by any number of different publishers. Up to the point that Crossref was founded, they made individual agreements between themselves about how to identify and link articles, and how to link into to each others' websites. This kind of thing doesn't scale, so they created Crossref, a non-profit organization, that provides shared infrastructure and technology for the scholarly community.
We allow our members to register content and assign DOIs, Digital Object Identifiers, to their content items. Those DOIs are used to uniquely identify each piece of content. They are expressed as URLs, and when you follow one it redirects to the article's landing page. This link redirection allows for publishers to update their links if their websites change or, as often happens, journals get transferred between publishers.
DOIs are the foundation for all the work we do, but there's a lot more to it than that. We operate a central store of metadata - think titles, authors, links to funders, tables of contents. It mostly boils down to links, which means that we bridge quite a few communities. We share the metadata without restriction, and it ends up in a lot of different places. For each of the things we link to, we need to find and, if necessary create, an identifier system to do the job. One notable example is ORCID.org, an identifier system for authors.
Today we have 11,000 members and 100 million registered content items.
Mark: That's awesome. What kind of tech stack are you using to make all of that possible?
Joe: Our core tech stack goes back at least 15 years, and is written in Java. In Crossref Labs we do new development, for example our REST API and our new Event Data services, in Clojure. For Event Data, where I currently spend most of my time, we have services written in Clojure. They communicate using HTTP, and we lean heavily on Liberator and Apache Kafka.
These services run in Docker containers, which makes development, deployment and configuration management so much easier. We've chosen to use Docker Swarm for service management, as it strikes a nice balance between simple configuration, service discovery, scaling and load balancing. To provision our infrastructure we use Terraform against AWS and Hetzner Cloud.
In one way or another 'immutability' plays a role in all layers of our stack. Obviously Clojure's an excellent choice for stateless functional programming. Docker containers provide a well-understood stateless abstraction. And the "immutable infrastructure" that Terraform brings -- declarative specification of infrastructure deployment -- means that we're able to check nearly everything into source control. That means we can take shared ownership, where everything is versioned and self-documenting. It also means that we could press a button and have everything re-create itself from scratch.
Mark: You already touched on it but where is Clojure making life easier for you?
Joe: There are two broad reasons we use Clojure. Firstly, a lot of the work we do is basically taking data in one format and converting it into another, and the data pipeline oriented approach aligns well with the Clojure philosophy. We don't need stateful systems of objects mutating each other, but we do need to model various domains and transform representations. Running on the JVM is also handy, there are some indispensable Java libraries out there that we already use, for example the AWS client SDK and Kafka.
The second reason is simply that it's a really cool language. I'm afraid this isn't a very scientific answer, but when you have to come into work every day, sit down and write code, doing it in a language you love is great. I think you write better software when you can focus at the right level of abstraction without irrelevant things getting in your way.
My background includes C#, Objective-C, Python and friends, so I was initially skeptical about losing mutability and object-orientation, but it turns out most things I want to do are well modelled in the language, and there are some nicely designed little affordances that are handy when you need them. From threading macros to atoms to pmap.
Mark: You mentioned having 100 million registered content items, any notable performance stories, issues with either Clojure or the JVM?
Joe: We use SOLR and ElasticSearch, we have had to do a bit of tuning with those. The pipeline-oriented architecture we have means that we can scale out easily, bit of a boring answer, but from where I'm standing, boring is good!
Mark: What are you using for front end development, have you looked at ClojureScript?
Team & Recruitment
Mark: How many developers are coding Clojure at Crossref?
Joe: Currently two, with two more new colleagues learning at the moment. Introducing someone to Clojure is a fascinating process because it makes you look at the language with fresh eyes. There are so many idioms common to other languages but with subtly and sharply different semantics. Unpicking assumptions that are valid in almost any other language, like bound symbols vs variables, has cemented my admiration for Clojure!
Mark: Are you a remote team? If so what is the balance, how much of your work is local vs remote?
Joe: Although we have offices in Oxford, UK and near Boston, USA, a substantial portion of us are distributed around Europe and the USA and work remotely full time. All the work we do is effectively remote, using Slack, GitHub and plenty of voice calls. We also make a point to get together on a regular basis to keep up the human contact side of things.
Because the work we do in Crossref Labs is generally open source, we work extra hard to make sure everything is thoroughly commented, documented and tested, so that anyone could pick it up. This hopefully makes it easier to collaborate when we need to.
Mark: How have you found recruitment?
Joe: In Crossref Labs we perform a very interesting and, I think, slightly unusual role. Yes, we write software, but we're also out in the scholarly publishing and academic communities trying to spot new trends and collaborating with groups and individuals to try and build new infrastructure.
It's certainly the most domain-intensive developer job I've done, so when we're recruiting we look for more than just experience in a certain language. A history of multiple languages is beneficial, and it's more important to be able to learn on the job than to arrive with Clojure knowledge.
Because we're looking for a particular profile of developer, recent hiring took a little longer. There was a period of time when it was very useful to be able to work with JUXT to keep our velocity up!
Mark: Have you had any notable bumps in the road using Clojure?
Joe: I think the only annoyance that springs to mind is error handling. Anonymous and higher-order-functions sometimes cause head-scratching, and when an exception is uncaught it's usually something unexpected and counter-intuitive, ergo harder to debug.
Reading other people's highly abstracted code, especially when it's highly idiomatic, can be a bit of a challenge sometimes. I love the level of abstraction that Clojure encourages, but sometimes the concepts that functions express can be a little difficult to get a handle on.
Mark: Have you had much involvement with the Clojure community? How have you found it? Have you found libraries easy to come by?
Joe: I'd like to say that I have, but in truth I'm probably a bit behind the times. I've identified the tools I need and find them very effective to solve the problems I have. Some of the tools out there are exceptionally good. Liberator, clj-time, http-kit, lein, ring to name a few. So the community of developers writing libraries is top notch.
I went to Euroclojure in Krakow a few years ago and ended up chatting to Rich Hickey. He was so friendly, accessible and had time for a newbie Clojure developer. I have boundless gratitude to him, his vision, execution and everyone working to make the Clojure ecosystem a better place.
I do attend and speak at conferences, but they're mostly about science communication and scholarly infrastructure!