Rebasing CoreOS for ephemeral cloud storage

The convenience and economy of cloud storage are indisputable, but cloud storage also presents an I/O performance challenge. For example, applications that rely too heavily on filesystem semantics and/or shared storage generally need to be rearchitected, or at least have their performance reassessed, when deployed on public cloud platforms.

Some of the most resilient cloud-based architectures out there minimize disk persistence across most of the solution components and try to either consume tightly engineered managed services (for databases, for example) or persist in a very specific part of the application. This reality is more evident in container-based architectures, despite the many methods to cooperate with the host operating system to provide cross-host storage functionality (i.e., volumes).

Like other public cloud vendors, Azure presents an ephemeral disk to all virtual machines. This device is generally /dev/sdb1 in Linux systems, and is mounted either by the Azure Linux agent or cloud-init in /mnt or /mnt/resource. This is an SSD device local to the rack where the VM is running so it is very convenient to use this device for any application that requires non-permanent persistence with higher IOPS. Users of MySQL, PostgreSQL and other servers regularly use this method for, say, batch jobs.

Today, you can roll out Docker containers in Azure via Ubuntu VMs (the azure-cli and walinuxagent components will set it up for you) or via CoreOS. But a seasoned Ubuntu sysadmin will find that simply moving or symlinking /var/lib/docker to /mnt/resource in a CoreOS instance and restarting Docker isn’t enough to run the containers on a higher-IOPS disk. This article is designed to help you do that by explaining a few key concepts that are different in CoreOS.

First of all, in CoreOS stable Docker runs containers on btrfs. /dev/sdb1 is normally formatted with ext4, so you’ll need to unmount it (sudo umount /mnt/resource) and reformat it with btrfs (sudo mkfs.btrfs /dev/sdb1). You could also change Docker’s behaviour so it uses ext4, but it requires more systemd intervention.
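
The two commands above can be wrapped in a small dry-run helper. This is a minimal sketch, not something CoreOS ships: the function names run and reformat_ephemeral and the DRY_RUN variable are made up for illustration, and the defaults are simply the /dev/sdb1 and /mnt/resource values mentioned in the text. Since mkfs destroys whatever is on the device, the sketch only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Hypothetical helper (not part of CoreOS): print each command by default,
# and only execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Unmount the ephemeral disk and reformat it with btrfs.
# Defaults are the device and mount point named in the article.
reformat_ephemeral() {
  dev="${1:-/dev/sdb1}"
  mnt="${2:-/mnt/resource}"
  run sudo umount "$mnt" &&
  run sudo mkfs.btrfs "$dev"
}
```

Calling reformat_ephemeral with no arguments prints the two commands for review, and setting DRY_RUN=0 before calling it runs them. Some versions of mkfs.btrfs may refuse to overwrite an existing filesystem unless -f is passed; the flag is left out here to match the command in the text.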

Once this disk is formatted with btrfs, you need to tell CoreOS it should use it as /var/lib/docker. You accomplish this by creating a unit that runs before docker.service. This unit can be passed as custom data to the azure-cli agent or, if you have SSH access to your CoreOS instance, by dropping /etc/systemd/system/var-lib-docker.mount (file name needs to match the mountpoint) with the following:

Description=Mount ephemeral to /var/lib/docker
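
Only the Description line of the unit is shown above. As a hedged sketch of what a complete mount unit could look like, the helper below writes one to a directory of your choice. The function name write_docker_mount_unit is hypothetical, and every directive other than Description is an assumption reconstructed from the surrounding text (run before docker.service, mount /dev/sdb1 as btrfs at /var/lib/docker), not the original listing.

```shell
#!/bin/sh
# Sketch: write a systemd mount unit for /var/lib/docker into a directory.
# Only the Description= line comes from the article; the other directives
# are assumptions based on the surrounding text.
write_docker_mount_unit() {
  unit_dir="${1:-/etc/systemd/system}"
  # The file name must match the mount point:
  # /var/lib/docker -> var-lib-docker.mount
  cat > "$unit_dir/var-lib-docker.mount" <<'EOF'
[Unit]
Description=Mount ephemeral to /var/lib/docker
Before=docker.service

[Mount]
What=/dev/sdb1
Where=/var/lib/docker
Type=btrfs
EOF
}
```

Running write_docker_mount_unit /tmp writes the file somewhere harmless so you can inspect it first; writing to the default /etc/systemd/system needs root. Before=docker.service only orders the two units, so whether starting Docker also pulls the mount in depends on how docker.service is defined on your image.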

After systemd reloads the unit (for example, after you issue sudo systemctl daemon-reload), the next time you start Docker this unit should be called and /dev/sdb1 should be mounted at /var/lib/docker. Try it with sudo systemctl start docker. You can also start var-lib-docker.mount independently. Remember, there’s no service command in CoreOS and /etc is largely irrelevant thanks to systemd. If you wanted to use ext4, you’d also have to replace the Docker service unit with your own.
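
To confirm that the mount actually happened after reloading systemd and starting Docker, you can look the mount point up in the kernel's mount table. The sketch below assumes a Linux host with /proc/mounts; the function names is_mounted and mounted_device are made up for illustration, and the optional second argument exists only so the check can be tried against a sample file.

```shell
#!/bin/sh
# Sketch: report whether a mount point appears in a mounts table.
# The table defaults to /proc/mounts (Linux).
is_mounted() {
  mountpoint="$1"
  table="${2:-/proc/mounts}"
  # Field 1 of each line is the device, field 2 is the mount point.
  awk -v m="$mountpoint" '$2 == m { found = 1 } END { exit !found }' "$table"
}

# Print the device backing a mount point, or nothing if it is not mounted.
mounted_device() {
  awk -v m="$1" '$2 == m { dev = $1 } END { if (dev != "") print dev }' "${2:-/proc/mounts}"
}
```

After sudo systemctl daemon-reload and sudo systemctl start docker, running is_mounted /var/lib/docker && mounted_device /var/lib/docker should print /dev/sdb1 if the unit did its job.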

This is a simple way to rebase your entire CoreOS Docker service to an ephemeral mount without using volumes or changing how prebaked containers write to disk (CoreOS describes something similar for EBS). Just extrapolate this to, say, your striped LVM, RAID 0 or RAID 10 for higher IOPS and persistence across reboots. And, while not meant for benchmarking, here’s the difference between the out-of-the-box /var/lib/docker and the ephemeral-based one:

# In OS disk

--- . ( ) ioping statistics ---
20 requests completed in 19.4 s, 88 iops, 353.0 KiB/s
min/avg/max/mdev = 550 us / 11.3 ms / 36.4 ms / 8.8 ms

# In ephemeral disk

--- . ( ) ioping statistics ---
15 requests completed in 14.5 s, 1.6 k iops, 6.4 MiB/s
min/avg/max/mdev = 532 us / 614 us / 682 us / 38 us


Understanding records in Koha

Throughout the years, I’ve found several open source ILS and most of them try to water down the way librarians have catalogued resources for years. Yes, we all agree ISO 2709 is obsolete, but MARC has proven to be very complete, and most of the efforts out there (Dublin Core, etc.) try to reduce the expression level a librarian can have. If your beef is with ISO 2709, there’s MARC-XML if you want something that is easier to debug in terms of encoding, etc.

That said, Koha faces a challenge: it needs to balance the expressiveness of MARC with the rigidity of SQL. It also needs to balance the convenience of SQL with the potential shortcomings of its database of choice (MySQL) with large collections (over a couple thousand records), particularly with searching and indexing.

Koha’s approach to solving this problem is to add Zebra to the mix. Zebra is a very elegant, but very difficult to understand, piece of Danish open source software that is very good at indexing and searching resources that can come from, say, MARC. It runs as a separate process (not part of the Web stack) and it can also be enabled as a Z39.50 server (Koha itself is a Z39.50 consumer, courtesy of Perl).

The purpose of this post is to help readers navigate how records are managed in Koha and avoid frustrations when deploying Koha instances and migrating existing records.

Koha has a very simple workflow for cataloguing new resources, either from Z39.50, from a MARC (ISO 2709 or XML) file or from scratch. It has templates for cataloguing, it has the Z39.50 and MARC capabilities, and it has authorities. The use case of starting a library from scratch in Koha is actually a very solid one.

But all of the libraries I’ve worked with in the last 7 years already have a collection. This collection might be in ISIS, Documanager, another SQL database or even a spreadsheet. Few of them have MARC files, and even if they do (e.g., vendors provide them), they still want ETLs applied (normalization, Z39.50 validations, etc.) that require processing.

So, how do we incorporate records massively into Koha? There are two methods, MARC import or fiddling with SQL directly, but only one answer: MARC import.

See, MARC can potentially have hundreds of fields and subfields, and we don’t necessarily know beforehand which ones are catalogued by the librarians, by other libraries’ librarians or even by the publisher. Trying to water it down by removing the fields we don’t “want” is simply denying a full fidelity experience for patrons.

But, on the other hand, MySQL is not designed to accommodate a random, variable number of columns. So Koha takes the most used attributes (like title or author) and “burns” them into SQL. For multivalued attributes, like subjects or items, it uses additional tables. And then it takes the MARC-XML and shoves the whole thing into a single field.

Whoa. So what happens if a conservatorium is making heavy use of 383b (Opus number) and then wants to search massively for this field/subfield combination? Well, you can’t just tell Koha to wait until MySQL loads all the XMLs in memory, blows them up and traverses them. It’s just not gonna happen within the timeout.

At this point you must have figured out that the obvious solution is to drop the SQL database and go with a document-oriented database. If someone just wants to catalog 14 field/subfields and eventually a super detailed librarian comes in and starts doing 150, you would be fine.

Because right now, without that, it’s Zebra that kicks in. It behaves more like an object storage and it’s very good at searching and indexing (and it serves as Z39.50 server, which is nice) but it’s a process running separately and management can sometimes be harsh.

Earlier we discussed the use case where Koha excels: creating records from scratch. Does this mean that Koha won’t work for an existing collection? No. It just means the workflows are a tad more complicated.

I write my own Perl code to migrate records (some scripts available here, on the move to GitHub), and the output is always MARC. In the past I’ve done ISO 2709, yes, but I only do MARC-XML now. Although it can potentially use up more disk space, and it could be a bit slower to load, it has a quick benefit for us non-English speakers: it lets me solve encoding issues faster (with the binary format, I had to do hexadecimal sed’s and other weird things, and it messed up the headers, etc.).

Sometimes I do one record per file (depending on the I/O reality I have to face) but you can do several at a time: a “collection” in just one file, which tends to use up more RAM and also makes it more difficult to pinpoint and solve problems with specific records. I use the bulkmarcimport tool. I make sure the holdings (field 942 in Koha unless you change it) are there before loading, otherwise I really mess up the DB. And my trial-and-error process usually involves using MySQL’s dump and restore facilities and removing the content of the /var/lib/koha/zebradb directory, effectively starting from scratch.
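
The pre-load check described above (making sure the expected field is there before importing) can be approximated with a quick count. This is a rough sketch under stated assumptions: the function names count_matches and check_tag_present are hypothetical, the file is assumed to use un-prefixed record and datafield elements with tag as the first attribute, and the tag defaults to the 942 mentioned in the text. It compares counts only, so it is a coarse sanity check and not a substitute for validating each record.

```shell
#!/bin/sh
# Sketch: count MARC-XML records and the datafields carrying a given tag.
# Assumes un-prefixed <record> and <datafield tag="..."> elements.
count_matches() {
  # grep -o prints one line per match, so wc -l counts occurrences, not lines.
  grep -o "$1" "$2" | wc -l | tr -d ' '
}

check_tag_present() {
  file="$1"
  tag="${2:-942}"
  records="$(count_matches '<record[ >]' "$file")"
  tagged="$(count_matches "<datafield tag=\"$tag\"" "$file")"
  echo "records=$records tag_$tag=$tagged"
  # Coarse check: succeed only if there is at least one record and at
  # least as many tagged datafields as records.
  [ "$records" -gt 0 ] && [ "$tagged" -ge "$records" ]
}
```

For example, check_tag_present collection.xml || echo "some records lack 942" flags a file worth inspecting before it goes anywhere near the import tool (collection.xml is a placeholder file name).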

Koha requires indexing, and it can be very frustrating to learn that after you import all your records, you still can’t find anything on the OPAC. Most distro packages for Koha have a helper script called koha-rebuild-zebra which helps you in the process. Actually, in my experience deploying large Koha installations, most of the management and operational issues have something to do with indexing. APT packages for Koha will install a cron task to rebuild Zebra, which shows how heavily Koha depends on this process.

Since Koha now works with instance names (a combination of Zebra installations, MySQL databases and template files) you can rebuild using something like:

koha-rebuild-zebra -b -v -f mybiblio

Feel free to review how that script works and what other (Perl) scripts it calls. It’s fun and useful to understand how old pieces of Koha fit a generally new paradigm. That said, it’s time to embrace cloud patterns and practices for open source ILS – imagine using a bus topic for selective information dissemination or circulation, and an abstract document-oriented cloud storage for the catalogue, with extensive object caching for searches. And to do it all without VMs, which are usually a management nightmare for understaffed libraries.

Surviving NYE: Times Square

Ailé and I spent the last week of 2013 travelling in the United States’ Northeast. We had decided on spending December 31st in New York City, but we wondered whether receiving 2014 in Times Square would be a sane decision. But we did it, and we loved it. And in the process, we noticed that a lot of blogs and webpages, as well as most of the people we met in New York City, actually discouraged it. So in true Internet spirit, we’ll share our learnings from a great evening with a million of our friends from around the world.

Spending New Year’s Eve in Times Square means different things for different people. Coming to terms with your expectations is the first step in this process. Is seeing the ball drop your main expectation? Or is it to be on TV? Or is it to receive party favors? Or to see fireworks? Or getting drunk with your friends? Or enjoying the artists that play during the evening? I would say that if any of these is a priority for you, then spending the evening in Times Square like we did may not be the best option for you.

We committed to spending the evening, and receiving the new year in Times Square because we wanted to be part of it: being with strangers from around the globe, feeling the energy of 1 million people and just being dazzled by Times Square. We didn’t care much about Macklemore or Miley Cyrus, or about the fireworks, or even about the ball itself which frankly is a very small object when compared to the screens and everything else in Times Square.

With this in mind, there are options for everyone. You can reserve early in the year for one of the few rooms with a view, you can pay from a couple hundred to a couple grand for one of the parties (with no actual view), you can spend it elsewhere in New York City, like in Central Park, or, you can do as we did and join the public event in Times Square, the one that is actually broadcast on TV, and the one attended by some 40% of the people that are in Manhattan. Keep reading if you want to learn more about this option.

Surprisingly, there are few sources of information about what exactly happens in Times Square on NYE. One of the most complete sources is the Times Square Alliance, which has a useful FAQ and a discussion on NYE parties and tickets, etc., but also a wealth of discussion on the actual public event. It might be worth monitoring local news channels’ sites 72 hours before, as well as sources like Twitter.

But, since we found a lot of New Yorkers didn’t actually attend the party in Times Square, there are few first person accounts of how exactly this goes. So we’ll try to explain from our experience, and hopefully provide some useful insights if you ever plan to do this.

The first question is when to arrive. We arrived at 2 PM. By that time they had closed some 3 blocks and we were at 48th Street. After we got in they closed 49th, 50th and so on all the way up to Central Park. This means we spent 10 hours there. After the ball drops, confetti rains and both Auld Lang Syne and New York, New York play, the NYPD will allow people to leave. This is around 12:10 AM.

How do you enter the event? The event actually happens on Broadway and 7th Avenue, so NYPD will close cross streets and put entry points on 6th and 8th Avenues. Our recommendation is to take the subway and exit at 50th or even 59th Street, walk towards Times Square on either 8th or 6th Avenue and try to get in as far up front as you can. The worst option is to exit at 42nd Street and start walking north to try and find an entry point.

Once you get in, you’ll see pens set up and, depending on the time, NYPD might have started allowing people in. There will be a metal detector wand and a bag check. Then you’re free to go stand wherever inside the pen you want. Our suggestion is to secure a spot where you can lean against the fences. Up front is best, because you have an uninterrupted view. The fence helps you cope with the crowds pushing: you can lean on it, you can sit and spread your legs, etc.

Notice the crowd will be a living organism. People will start leaving when they realize they have to stand for 10 hours. People will definitely leave to use restrooms (nowhere to be found) only to discover that they can’t come back to the pen. All this will result in natural crowd movements. Every time someone leaves the crowd will push to the front, and depending on what NYPD says, even move to the pen in front of yours. Try to anticipate. Choose good crowd neighbors at least for the first couple hours. Be polite, smile!

Lack of restrooms does not necessarily mean you will stand on a puddle of urine and feces from other attendees, as some blogs say. We sat on the pavement most of our 10 hours there and we never had to cope with such a situation. Note we were in the front of our pen. The pavement is very cold, though. More on cold below.

For most of the evening there will be people selling pizza, water and hot chocolate. Drinks are a no-no because there are no restrooms, and pizza is a no-no because it will make you thirsty. If you have to, it’s something like $20 and I suggest you wait until late-ish, around 10 PM. Don’t wait TOO long, though, because NYPD will tighten security as midnight approaches and this includes not letting pizza be sold. We had just one bottle of water reserved for both of us, had a pizza before 11 PM and had a few sips of water just before midnight. We also had granola bars for earlier in the afternoon.

The evening was very cold. We actually had something similar to snow for a few minutes when the sun was still up and then we had the fortune of a clear but very cold evening, around -4 °C or so. Preparation was key. I had thermal underwear and snow pants, and then a thermal shirt, a T-shirt, a sweater, a synthetic fleece jacket and a heavy fleece jacket on top. I had ear muffs, a thermal hat and thermal gloves. We had double wool socks and both hand and feet warmers (feet warmers are not awesome, but hand warmers are amazing). Also, you will get a runny nose.

You have to think how you will kill 10 hours. I had bought a couple e-books to read on my Kindle app, but I did not read as much as I wanted as the gloves were not touch-ready. I had to control the phone with my nose. We talked a lot to each other (most of our crowd neighbors were Chinese, Japanese or Korean) and listened to music for a while. AT&T data service was not bad for such a big crowd. The event officially starts at 6 PM, and they will keep you entertained with some stuff like the sound checks, an hourly countdown, some videos, etc. We really liked the NASA NYE Video and the AP 2013 Review.

We learned some interesting things. For example, we found out that some people from some parties were allowed to go out into “expanded pens” 15 minutes before the ball dropped. They just didn’t know which lucky ones were going to be allowed to do so. Also, some bystanders were allowed entrance some 40-50 seconds before the ball dropped, so if you just want to see the confetti and take a quick picture before the crowds are released, that’s also an option.

Going back to sleep requires preparation, too. Going underground is impossible and so is taking a cab. We had simply committed to walking to Columbus Circle, but we were surprised that most people were walking towards Times Square and not away from it. So trying to walk against the crowd and within the NYPD barricades was a bit awkward. We did end up in Central Park near the Bolívar statue, which was a photo-op for Ailé, and then were surprised that the Columbus Circle station for the Uptown 1 was not crowded.

And finally, was it worth it? Just take a look at this or this. It was absolutely worth it!

Thoughts on wearable APIs and measly data economy

In the high tech industry we continuously reach peak breadth of business models, only to discover a new way of monetizing stuff. Just some 5 years ago it was licensing, ads, subscriptions, monthly/yearly fees, hardware margins, accessories, paid apps (and soon in-app purchases), consulting fees, bundles, training fees, carrier subsidies and maybe a handful of others that were indicative of the diversity of business models available for creators.

However, a lot of the variations were built on top of the existing foundations that go back to the 90s or earlier: licensing fees, services fees, content fees.

One really disruptive model was driven by social computing and data explosion, and it is the API economy. The fact that people that spent OPEX in a library or research center getting numbers or CAPEX buying research from others are now investing their money in consuming APIs and data markets across the globe is a good example of the sophistication and idealization level that the knowledge society has achieved.

With wearables, we are being presented “our numbers” in devices that have become socially acceptable enough that we can wear them. There are accurate and small sensors for time, position, distance, temperature, steps, calories, heart rate, cadence and many others. I find it realistic to think that in less than a couple of years we will also have small, fashionable devices that you can attach to the sides of your head to add EEG signals.

But one important part of wearables is the interaction among wearables: an API for your self. Provided you are willing, your wearables can share and request data from others, like your current mood, cultural and personality traits, preferences for this day (food, activities, agenda) and present it to the user in an assertive way to improve relationships, productivity and effectiveness.

Your wearables can also gather environmental data that, either aggregated or in a context, can be transacted for market value. Let’s say you are the first one off a very hot subway train and you find a refreshment vendor on the street. You ask for a Coke. Your wearables can interact in fractions of a second, the vendor’s system can get this data from you, prepare an impromptu marketing campaign for the people that are coming behind you and in return give you a Coke for free.

This also has the potential to change how we do primary consumer research for marketing. No longer do I need to worry about the predisposition of survey takers or the inherent entropy of a sample. If your wearable knows that when you do groceries you just fly by the Doritos, without even looking at them, it can sell this “measly” data to researchers (someone observing you over CCTV can get the same information, at a higher cost). This can also happen automatically, so you just walk around in your day and by the end of the day you will have earned some money just by being who you are.

One important blocker for all this is payment mechanisms. I wrote about the value of Bitcoin as a payment mechanism, and I recently had an experience in Vancouver where I spent 30 minutes on the phone and tens of dollars on roaming charges to talk to my bank and be able to withdraw Canadian dollars from an ATM, whereas I only spent 7 minutes on a Bitcoin ATM to buy 20 CAD worth of Bitcoin (which is now worth 30 CAD, although those fluctuations are normal in BTC and can also be negative, as with some currencies and securities).

And of course… the creepy factor. Such an idea requires a granularity on policies, not only the boring tl;dr “privacy policies” (that are one of the results of ads in high-tech) but interaction policies. A new social order and interaction rules. Clinical medicine still technically considers our technology relationships as insanity. How would we change this?

I see it being either a highly personal or a highly anonymized interaction. If I’m meeting with you on a 1:1, I’m willing to share with you my mood, whether I have hard stops or pending work that I’d rather do, and how I would like to set lighting, music and climate based on my experience today. I can even share with you what my current satisfaction level with our relationship or this specific project is so we can set expectations. This can work around the cultural traits in multidiverse work environments. And you would derive some of this from non-conversational cues anyway. Otherwise, I’d rather have a Jason Bourne-like, highly anonymized interaction. “Someone” walked by a market researcher and “left” this data. Nothing that couldn’t have been also obtained by observing you on CCTV or following you via mesh networks or whatever.

That said, I think it’s pretty realistic to think that within 5 years such an economy would be sailing ahead.
But if the current state of the communications infrastructure, financial system and payment mechanisms can’t enable someone to pay by tap, chip or band in fractions of a second, but rather in several seconds (and I still see places where Internet access is being “solved” only by cell phone towers or satellite), an API for your self and a market for derisory metrics would be too cumbersome to implement. It just wouldn’t flow.

Instant Debian – Build a Web Server is now available

During the past few months I worked on a book project with Packt to make it easier for people new to Debian to leverage it for Web-based applications. I’m happy to announce that Instant Debian – Build a Web Server is now available. Although it is not my first project with Packt (I’ve reviewed some nginx books before) it is the first one that I’m authoring, and I’m already working on some new projects.

I had the fortune of having a senior leader that I deeply respect from the Debian Project as my technical reviewer, and the full support of the Packt team. The motivation for the book is simple: in a world of elastic clouds, simpler NoSQL’s and explosive growth, developers, sysadmins and business leaders are less concerned about the operating system and more about their time-to-market. In this book, I use my 10 years of experience with Debian to provide a simpler path to a solid Web platform.

In fact, all of my immediate writing projects are related to those low-hanging fruits that add incredible value for business decision makers in today’s broader technology conversations: elasticity, information security and privacy, and performance. In a way, this book answers the “why” I get from the business side when explaining technical decisions related to Debian: why use noexec in /tmp, why use codenames in sources.list for APT, why use sudo, etc., all with one goal: reduce time-to-market.

This is a beginner’s book. If you haven’t heard about Debian before, and would like to leverage virtualization or cloud technologies to create a “template” for your Web deployment, Instant Debian – Build a Web Server will provide exactly that, while exploring the rationales and laying a solid foundation for you to continue exploring the system.