Understanding records in Koha

Throughout the years, I’ve come across several open source ILSs, and most of them try to water down the way librarians have catalogued resources for years. Yes, we all agree ISO 2709 is obsolete, but MARC has proven to be very complete, and most of the alternatives out there (Dublin Core, etc.) reduce the level of expression a librarian can achieve. If your beef is with ISO 2709, there’s MARC-XML, which is easier to debug in terms of encoding and the like.

That said, Koha faces a challenge: it needs to balance the expressiveness of MARC with the rigidness of SQL. It also needs to balance the convenience of SQL against the shortcomings of its database of choice (MySQL) with large collections (over a couple of thousand records), particularly around searching and indexing.

Koha’s approach to this problem is to add Zebra to the mix. Zebra is a very elegant, but very difficult to understand, piece of Danish open source software that is very good at indexing and searching resources that can come from, say, MARC. It runs as a separate process (not part of the Web stack), and it can also be enabled as a Z39.50 server (Koha itself is a Z39.50 consumer, courtesy of Perl).
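To see Zebra’s Z39.50 side in action, you can talk to it with Index Data’s yaz-client. A sketch of a session follows; the host, port and database name are assumptions, so check your Koha/Zebra configuration for the real ones:

```
$ yaz-client
Z> open localhost:2100/biblios      # hypothetical host/port/database
Z> find @attr 1=4 "annual report"   # search by title (Bib-1 use attribute 4)
Z> show 1                           # display the first hit
Z> format xml                       # ask for MARC-XML instead of ISO 2709
Z> show 1
```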

The purpose of this post is to help readers navigate how records are managed in Koha and avoid frustrations when deploying Koha instances and migrating existing records.

Koha has a very simple workflow for cataloguing new resources, either from Z39.50, from a MARC (ISO 2709 or XML) file or from scratch. It has templates for cataloguing, it has the Z39.50 and MARC capabilities, and it has authorities. The use case of starting a library from scratch in Koha is actually a very solid one.

But all of the libraries I’ve worked with in the last 7 years already have a collection. This collection might live in ISIS, Documanager, another SQL database or even a spreadsheet. Few of them have MARC files, and even when they do (e.g., vendors provided them), they still want ETLs applied (normalization, Z39.50 validations, etc.) that require processing.

So, how do we incorporate records massively into Koha? There are two methods, MARC import or fiddling with SQL directly, but only one answer: MARC import.

See, MARC can potentially have hundreds of fields and subfields, and we don’t necessarily know beforehand which ones are catalogued by the librarians, by other libraries’ librarians or even by the publisher. Trying to water it down by removing the fields we don’t “want” is simply denying a full fidelity experience for patrons.

But, on the other hand, MySQL is not designed to accommodate a random, variable number of columns. So Koha takes the most used attributes (like title or author) and “burns” them into SQL columns. For multivalued attributes, like subjects or items, it uses additional tables. And then it takes the whole MARC-XML record and shoves it into a single field.
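As a sketch of this layout (the table and column names follow a Koha 3.x schema; newer versions moved the XML from biblioitems.marcxml to biblio_metadata, so check your own installation):

```sql
-- The most used attributes are "burned" into columns:
SELECT biblionumber, title, author FROM biblio WHERE author LIKE 'Arnosky%';

-- Multivalued data, like items, lives in additional tables:
SELECT barcode, holdingbranch FROM items WHERE biblionumber = 42;

-- And the full record is shoved, as MARC-XML, into a single field:
SELECT marcxml FROM biblioitems WHERE biblionumber = 42;
```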

Whoa. So what happens if a conservatorium is making heavy use of 383$b (opus number) and then wants to search massively on this field/subfield combination? Well, you can’t just tell Koha to wait while MySQL loads all the XML into memory, blows it up and traverses it – it’s just not going to happen within the timeout.

At this point you must have figured out that the obvious solution is to drop the SQL database and go with a document-oriented database. If someone just wants to catalogue 14 field/subfield combinations, and eventually a super detailed librarian comes in and starts using 150, you would be fine.

Because right now, without that, it’s Zebra that kicks in. It behaves more like an object storage, and it’s very good at searching and indexing (and it serves as a Z39.50 server, which is nice), but it’s a process running separately, and managing it can sometimes be harsh.

Earlier we discussed the use case where Koha excels: creating records from scratch. Does this mean that Koha won’t work for an existing collection? No. It just means the workflows are a tad more complicated.

I write my own Perl code to migrate records (some scripts available here, on the move to GitHub), and the output is always MARC. In the past I’ve done ISO 2709, yes, but I only do MARC-XML now. Although it can use up more disk space, and it can be a bit slower to load, it has a quick benefit for us non-English speakers: it makes encoding issues faster to solve (with the binary format, I had to do hexadecimal sed’s and other weird things, and it messed up the headers, etc.).

Sometimes I do one record per file (depending on the I/O reality I have to face), but you can do several at a time: a “collection” in just one file, which tends to use more RAM and also makes it more difficult to pinpoint and solve problems with specific records. I use the bulkmarcimport tool. I make sure the holdings (field 952 in Koha unless you change it) are there before loading; otherwise I really mess up the DB. And my trial/error process usually involves using mysql’s dump and restore facilities and removing the contents of the /var/lib/koha/zebradb directory, effectively starting from scratch.
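The trial/error loop looks roughly like this (a sketch: database name, file names and paths are assumptions, and bulkmarcimport.pl’s flags may vary between Koha versions, so check its --help):

```shell
# Snapshot the pristine database before importing
mysqldump koha_mybiblio > pristine.sql

# Bulk import biblios from a MARC-XML file
perl bulkmarcimport.pl -b -v -m=MARCXML -file records.xml

# If the import went wrong: restore the DB and wipe Zebra's index files,
# effectively starting from scratch
mysql koha_mybiblio < pristine.sql
rm -rf /var/lib/koha/zebradb/*
```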

Koha requires indexing, and it can be very frustrating to learn that, after you import all your records, you still can’t find anything in the OPAC. Most distro packages for Koha ship a helper script called koha-rebuild-zebra that assists with this process. Actually, in my experience deploying large Koha installations, most of the management and operational issues have something to do with indexing. The APT packages for Koha install a cron task to rebuild Zebra, which underscores how much everything depends on this process.

Since Koha now works with instance names (a combination of Zebra installations, MySQL databases and template files) you can rebuild using something like:

koha-rebuild-zebra -b -v -f mybiblio

Feel free to review how that script works and what other (Perl) scripts it calls. It’s fun and useful to understand how old pieces of Koha fit a generally new paradigm. That said, it’s time to embrace cloud patterns and practices for open source ILS – imagine using a bus topic for selective information dissemination or circulation, and an abstract document-oriented cloud storage for the catalogue, with extensive object caching for searches. And to do it all without VMs, which are usually a management nightmare for understaffed libraries.


Considerations for migrating CDS/ISIS databases to fully MARC-based ILSs

CDS/ISIS is an obsolete information storage and retrieval system (and also an information storage format) designed some 30 years ago, filling a need for libraries around the world. For several years UNESCO unfortunately invested time and money in supporting it and distributing it freely (free as in beer, but as proprietary software) to several countries. Altogether, CDS/ISIS is now responsible for the overall underdevelopment of library technology, especially in Latin America. Sadly, since UNESCO now seems reluctant to continue draining resources, there is an effort in LatAm to open-source CDS/ISIS-related technologies and bring them to the Web. Fair enough, but this doesn’t change the fact that CDS/ISIS is dead.

So, since it’s already dead, we’ll need to retrieve our records from CDS/ISIS databases and migrate them to less ancient systems. Talk about safeguarding our heritage. MARC is an equally ancient format designed by the Library of Congress that is actually the standard (ISO 2709) for storing bibliographic records (CDS/ISIS never was), and the flavour we use in LatAm, MARC21, is a binary storage format. But of course we do have MARC-XML, which is widespread in Integrated Library Systems, both proprietary and open source. In Koha 3 we use MySQL to store MARC-XML when representing a bibliographic record. Specialized open source software such as Zebra allows us to efficiently index and search MARC-XML data.

Perl is the natural language of choice for migrating this kind of data, and there are libraries for both ISIS (Biblio::Isis) and MARC (MARC::*), already available in Debian, BTW. The following are some caveats I’ve found when migrating data from ISIS to MARC-XML:

  • “Indexing”. Records in CDS/ISIS are referred to by the MFN (master file number), a sequential integer assigned by CDS/ISIS. This is useless for migration, since end users (patrons) won’t search the catalogue using the MFN, and librarians would rather refer to a single item by its call number. In MARC you don’t have a unique number to refer to records. The whole logic of the MARC::* modules eases understanding: you create an object (your record) containing objects (the fields), which you dump to the screen or to a file, all cat’ed together. MARC is a format; indexing is not the format’s problem.
  • Encoding. Given an MFN, Biblio::Isis gives you a hash. With little manipulation, this hash can be used to create a MARC::Record. So, if librarians have been using MARC fields and subfields in their CDS/ISIS database, the migration can involve very little logic (search for isis2marc on Google) – however, in my scenarios, encoding is always a problem. I’d like to cite two of these scenarios: one had a source encoding of cp850, requiring me to disassemble and then reassemble the whole Biblio::Isis data structure to create a properly UTF-8 encoded record; the other had binary garbage coming from a mainframe, where accented vowels (Spanish) were preceded by a hex 0x82, except for ñ, preceded by a hex 0x84 (IBM437), and I preferred to use sed before running my code.
  • Holdings. CDS/ISIS doesn’t implement any logic for your holdings (also called items, or existencias in most Spanish-speaking countries), but it might store information about them, such as location and number of copies. You’re forced to implement custom logic here, since not only is your source picky regarding holdings; your target will be, too. Nowadays, ILSs are expected to be tweakable regarding which MARC field is used for holdings. Koha uses the 952 field.
  • Data quality. In the broadest sense of the term, you’ll want to collapse multiple space characters, maybe even build a thesaurus, skip undesirable subfields (indicators, subfields under 10) and such. You’ll need custom logic, and you’ll have to disassemble Biblio::Isis data structures; Data::Dumper proves noteworthy for this.

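Putting these caveats together, a minimal migration skeleton might look like this. This is a sketch, not my production code: the ISIS field mapping and paths are hypothetical, and you’ll need Biblio::Isis, MARC::Record and MARC::File::XML from CPAN:

```perl
use strict;
use warnings;
use Biblio::Isis;
use Encode qw(decode);
use MARC::Record;
use MARC::Field;
use MARC::File::XML ( BinaryEncoding => 'utf8' );  # provides as_xml_record

binmode STDOUT, ':encoding(UTF-8)';

# Open the CDS/ISIS master file (path is an assumption)
my $isis = Biblio::Isis->new( isisdb => '/data/isis/CEPAL' );

for my $mfn ( 1 .. $isis->count ) {
    my $row = $isis->to_hash($mfn) or next;

    my $marc = MARC::Record->new;

    # Hypothetical mapping: ISIS field 24 holds the title, subfield ^a.
    # The source data is cp850, so decode it before building the record.
    my $title = decode( 'cp850', $row->{24}[0]{a} // '' );
    $marc->append_fields(
        MARC::Field->new( 245, '1', '0', a => $title )
    );

    # Holdings: Koha expects item data in 952 (values made up here)
    $marc->append_fields(
        MARC::Field->new( 952, ' ', ' ', a => 'MAIN', b => 'MAIN' )
    );

    print $marc->as_xml_record, "\n";
}
```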
Such procedures are not particularly CPU- or RAM-intensive (I can migrate tens of MBs of data on my laptop in under two minutes while running a full-fledged desktop), but they are not instantaneous. With migration logic that is quite profuse – it goes deep inside Biblio::Isis, does decoding/encoding, queries external hashes and so on – I get roughly 390 records per second. But this is blazingly fast compared with the time a modern ILS takes to bulk import a huge amount of records (Koha’s bulkmarcimport.pl gives me some 15 records/second) or with a state-of-the-art indexer such as Zebra (similar times).

Using arrays instead of hashes in MARC::Field constructor

Whenever you’re creating a MARC field using Perl’s excellent MARC::Field module, you can always use an array instead of a hash to specify the subfield/data pairs. This is useful since some field/subfield combinations in MARC are repeatable (and duplicate keys collide in a hash), while the POD leads you toward writing hashes for storing subfield/data pairs before creating the field. For example, the POD says:

my $field = MARC::Field->new(
    245, '1', '0',
    'a' => 'Raccoons and ripe corn / ',
    'c' => 'Jim Arnosky.'
);

But this is also valid and works:

my @array = (
    'a', 'Raccoons and ripe corn /',
    'c', 'Jim Arnosky.',
    'k', 'typescript',
    'k', 'diaries'
);
my $field = MARC::Field->new( 245, '1', '0', @array );
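The same concern applies one level up: many fields are themselves repeatable, so you often build them in a loop and append them to the record. A sketch (the subject headings are made up):

```perl
use MARC::Record;
use MARC::Field;

my $record = MARC::Record->new;

# Repeatable field: one 650 per subject heading (made-up data)
for my $subject ( 'Raccoons', 'Corn', 'Picture books' ) {
    $record->append_fields(
        MARC::Field->new( 650, ' ', '0', a => $subject )
    );
}

# Repeatable subfields within one field, via the array form
my @pairs = ( 'k', 'typescript', 'k', 'diaries' );
$record->append_fields(
    MARC::Field->new( 245, '1', '0', 'a', 'Raccoons and ripe corn /', @pairs )
);
```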

MARC import/export plugins for GNU EPrints

After a few days of work, I’ve finally published the MARC import/export plugins for the GNU EPrints3 archiving software. Thanks go to Ailé Filippi, who wrote the MARC-to-EPrints-to-MARC mappings!

Writing software that uses MARC records is not easy. There are two mainstream MARC flavours, MARC21 and UNIMARC, so the software must allow the collection administrator to specify any correspondence between EPrints metadata and MARC fields. This must happen in both MARC input and output, therefore developers must aim for a centralized configuration mechanism (Koha, for example, does this in dedicated DB tables).

The other problem is that libraries introduce separators in MARC fields so that the data recovered through public catalogues pops up with the right format. Therefore, when working with MARC, developers should trim and add those characters accordingly. It helps if you have a librarian as your girlfriend.

Finally, MARC mappings are never one-to-one, since MARC has repeatable fields/subfields and replicated values all around the record. This means that, in Perl, you need to handle data mapping using hashes of array references or hashes of hash references. That’s some beautiful code.

The MARC import/export plugins for GNU EPrints3 were developed with these considerations in mind:

  • Write the least quantity of code for data processing; that is, do not fiddle with each field’s punctuation and special characters, since this would require additional logic.
  • Reuse code. For example, the whole MARC-XML support for the plugins uses the MARC::File::XML module from CPAN in order to work with traditional MARC21 records.
  • Centralized configuration. Administrators can use marc.pl to specify new mappings, and even write their own field-processing routines.

Another interesting application I experimented with was combining the (already written) Dublin Core import/export plugins for GNU EPrints 3 with the Library of Congress’ unqualified Dublin Core to MARC21 (and back) crosswalks. I could even use the already written CPAN module for that!