
Considerations for migrating CDS/ISIS databases to fully MARC-based ILSs

CDS/ISIS is an obsolete information storage and retrieval system (and also an information storage format) designed some 30 years ago, filling a need for libraries around the world. For several years UNESCO unfortunately invested time and money supporting it and distributing it freely (as in free beer, but as proprietary software) to several countries. Altogether, CDS/ISIS is now responsible for the overall underdevelopment of technology for libraries, especially in Latin America. Sadly, since UNESCO now seems reluctant to continue draining resources, there is an effort in LatAm to open-source CDS/ISIS-related technologies and bring them to the Web. Fair enough, but this doesn’t change the fact that CDS/ISIS is dead.

So, since it’s already dead, we’ll need to retrieve the records in our CDS/ISIS databases and migrate them to less ancient systems. Talk about safeguarding our heritage. MARC is an equally ancient format designed by the Library of Congress, but it is actually the standard (ISO 2709) for storing bibliographic records (CDS/ISIS never was), and the flavour we use in LatAm, MARC21, is a binary storage format. But of course we do have MARC-XML, which is widespread in Integrated Library Systems, both proprietary and open source. In Koha 3 we use MySQL to store MARC-XML when representing bibliographic records. Specialized open source software such as Zebra allows us to efficiently index and search MARC-XML data.

Perl is the natural language of choice for migrating this kind of data, and there are libraries for both ISIS (Biblio::Isis) and MARC (MARC::*), already available in Debian, BTW. First, a minimal sketch of the basic conversion loop, then some caveats I’ve found when migrating data from ISIS to MARC-XML.
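
A minimal sketch, assuming the librarians already used MARC-style tags and subfield delimiters inside the ISIS database, and that the source bytes need no re-encoding (see the encoding caveat below); the database path is a placeholder:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Biblio::Isis;
    use MARC::Record;
    use MARC::Field;
    use MARC::File::XML ( BinaryEncoding => 'utf8' );

    # 'isisdb' is the basename of the .mst/.xrf pair (placeholder path).
    my $isis = Biblio::Isis->new( isisdb => '/path/to/BIBLIO' );

    for my $mfn ( 1 .. $isis->count ) {
        my $row  = $isis->to_hash( $mfn ) or next;
        my $marc = MARC::Record->new;

        for my $tag ( sort keys %$row ) {
            # Control fields (under 010) take a different constructor; skip them here.
            next unless $tag =~ /^\d+$/ && $tag >= 10;
            for my $occ ( @{ $row->{$tag} } ) {
                # Occurrences without ^a-style subfields come back as plain scalars.
                next unless ref $occ eq 'HASH';
                my @subfields = map { $_ => $occ->{$_} } sort keys %$occ;
                $marc->append_fields(
                    MARC::Field->new( $tag, ' ', ' ', @subfields )
                );
            }
        }
        print $marc->as_xml_record, "\n";
    }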

  • “Indexing”. Records in CDS/ISIS are referred to by the MFN (master file number), a sequential integer assigned by CDS/ISIS; this is useless, since end users (patrons) won’t search the catalogue by MFN, and librarians would rather refer to a single item by its call number. In MARC you don’t get a unique record number either. The MARC::* modules make this easy to understand: you create an object (your record) containing objects (fields), which you dump to the screen or to a file, all cat’ed together. MARC is a format. Indexing is not the format’s issue.
  • Encoding. Given an MFN, Biblio::Isis throws you a hash. With little manipulation, this hash can be used to create a MARC::Record. So, if librarians have been using MARC fields and subfields in their CDS/ISIS database, the migration needs little logic (search for isis2marc on Google). In my scenarios, however, encoding is always a problem. I’d like to cite two of them: one had a source encoding of cp850, requiring me to disassemble and then reassemble the whole Biblio::Isis data structure to create a properly UTF-8 encoded record (see the decoding sketch after this list); the other had binary garbage coming from a mainframe, where accented vowels (Spanish) were preceded by a hex 0x82, except for ñ, preceded by a hex 0x84 (ibm437), and there I preferred to run sed over the data before running my code.
  • Holdings. CDS/ISIS doesn’t implement any logic about your holdings (also called items, or existencias in most Spanish-speaking countries), but it might store information about them, such as location and number of copies. You’re forced to implement custom logic here, since not only is your source picky regarding holdings, your target will be, too. Nowadays ILSs are expected to be tweakable regarding which MARC field is used for holdings; Koha uses field 952 (see the 952 sketch after this list).
  • Data quality. In the broadest sense of the term: you’ll want to collapse multiple space characters, maybe even build a thesaurus, skip undesirable subfields (indicators, fields under 010) and such. You’ll need custom logic here too, again disassembling the Biblio::Isis data structures; Data::Dumper proves noteworthy for this (see the clean-up sketch after this list).
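
For the cp850 case, this is roughly what the disassemble/reassemble step looks like: a sketch (the names are mine) that walks the Biblio::Isis hash and decodes every value into Perl’s internal Unicode representation, so that MARC::File::XML can later emit proper UTF-8:

    use Encode qw( decode );

    # $row is the hashref returned by $isis->to_hash($mfn), still holding
    # raw cp850 bytes. Decode in place before building the MARC::Record.
    sub decode_row {
        my $row = shift;
        for my $tag ( keys %$row ) {
            for my $occ ( @{ $row->{$tag} } ) {
                if ( ref $occ eq 'HASH' ) {
                    # Fields with subfields: decode each subfield value.
                    $_ = decode( 'cp850', $_ ) for values %$occ;
                }
                else {
                    # Fields without subfields are plain scalars.
                    $occ = decode( 'cp850', $occ );
                }
            }
        }
        return $row;
    }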
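
For Koha, attaching an item is just one more field on the record; a sketch, assuming Koha’s default 952 mapping ($a home branch, $b holding branch, $o call number), with placeholder values:

    # One 952 field per item; check your own Koha framework mapping first.
    $marc->append_fields(
        MARC::Field->new( '952', ' ', ' ',
            a => 'MAIN',          # home branch code (placeholder)
            b => 'MAIN',          # holding branch code (placeholder)
            o => 'QA76.9 .D3',    # item call number (placeholder)
        )
    );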
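
And the kind of clean-up logic I mean, plus the Data::Dumper call that makes the Biblio::Isis structure visible in the first place (a sketch, not a complete policy):

    use Data::Dumper;

    # Inspect the raw structure before writing any mapping logic.
    print Dumper( $isis->to_hash( $mfn ) );

    # Typical per-value clean-up: collapse runs of whitespace and trim.
    sub clean_value {
        my $v = shift;
        $v =~ s/\s+/ /g;            # collapse multiple spaces into one
        $v =~ s/^\s+|\s+$//g;       # trim leading/trailing whitespace
        return $v;
    }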

Such procedures are not particularly CPU- or RAM-intensive (I can migrate tens of MBs of data on my laptop in under two minutes while running a full-fledged desktop), but they are not instantaneous. With migration logic that is quite profuse, goes deep inside Biblio::Isis, does decoding/encoding, queries external hashes and so on, I get roughly 390 records per second. But this is blazingly fast compared with the time a modern ILS takes to bulk-import a huge amount of records (Koha’s bulkmarcimport.pl gives me some 15 records/second), or compared with a state-of-the-art indexer such as Zebra (similar times).
