Why I’m writing about hackers and parenting in Spanish

Earlier this month I went live with a self-published e-book in Spanish titled Cómo criar un hacker: how to raise a hacker.

This book is aimed at parents who are not IT professionals but wish to understand the hacker ethos and a key set of things they can do to build that ethos that is more than just teaching kids to code.

Sure, in the book I say it’s OK if your child isn’t interested in coding, I mention 2600 and Club Mate and I quote ESR profusely (largely for his authority on this topic, and not because I validate or agree with every position of his) but I hope all of that isn’t controversial because it isn’t at the core of the motivations for the book.

Instead, I challenge the inconsistent and unjustified battle against screens, I highlight the relevance of open source and I stake a claim for hackers on note taking and humor as a tool for building character and I do this all under the assumption that digital natives (whatever that means) aren’t set for excellence in a generation where every world leader is also a digital native: only hackers are.

Since this is a short book on parenting, I evidently don’t go into building actual infosec skills at any significant depth. I don’t even discuss the tactics of screen management, covered eloquently and empathetically in books like Screenwise.

In fact, it was by reading reviews on similar work that I found a huge gap in Spanish-speaking markets on this and decided to start there (plus, self-publishing in English would probably have taken me longer as I’m not a native English speaker)

I’m ultimately neither a family counselor, not an expert on parenting. I was just raised a hacker by folks who didn’t even know what that was. Feedback is always welcome. (Cómo criar un hacker is available on Amazon and Smashwords)

Perspective at //build

Recently I had the opportunity to share stage with some brilliant internal and external colleagues advancing open source in the cloud at //build, Microsoft’s developer conference in San Francisco. Beyond having been able to talk to about 400 attendees about how we’re approaching open source in the cloud, how customers are building open source applications in Azure and much more, speaking at //build had a very special meaning for me.

Before joining Microsoft, I didn’t have a lot of exposure to the Microsoft developer ecosystem, the Microsoft subsidiaries themselves or the employees working there. I was focused in a number of open source projects such as Canaima (Venezuela’s national distro) a number of communities and expanding a small open source system integrator in the region.

Most of my interaction with Microsoft was limited to public debates, at industry events or in Congress, or to the ISO/IEC 29500 discussion back in the days (both of which I’ve covered in this blog, in Spanish) However, around 2009 or so, the company I was a CTO for and Microsoft decided to create an Open Source Interoperability Lab in Venezuela. The idea was to document common hybrid technology use cases (such as Samba-based DCs in Windows environments, or PHP and ASP.NET communicating via ESB) and transfer that knowledge to customers.

As a result of that effort, I ended up being invited to and participating in PDC09 in Los Angeles. PDC was the precursor of //build, a yearly conference aimed at Microsoft-centric developers. There are 3 things I remember clearly from PDC09: one, was the “convertible tablet PC” they offered attendees (running Windows 7 bits that rapidly became Debian bits), the second one was the PHP SDK for Azure, and the preview access to that new “cloud” thingy, and the third one was an open source roundtable led by Miguel de Icaza that mainly talked about governance and CodePlex.

While I didn’t know it back then, a lot of the things discussed in that roundtable influenced my decision, about a year later, to join Microsoft and work in open source strategy; a journey that brought me to Azure in less than 5 years. But I digress, and that whole story deserves another post.

Maybe some of the attendees then foresaw that Microsoft would end up acquiring Xamarin, or that attention would be put in non-CodePlex initiatives, like GitHub. What I really didn’t expect was that all of that new reality would converge into a PDC-like event, less than 10 years after. This year at //build it did, and then some.

For me, speaking at //build was a humbling opportunity to reconcile the many worlds increasingly pulled together by the force of open source. From the announcements to the content and all other metasignals at the conference, it was incredibly exciting to see this transformation manifesting itself within Microsoft’s developer community.

It highlights the importance of leaving no one behind when we explore new paradigms and technologies in the cloud, and how every individual in the open source community can exert change in this industry.

Why we go to LinuxFest Northwest

For the second year in a row since I moved to Redmond, I’ll be joining the Microsoft crew sponsoring and attending LinuxFest Northwest in Bellingham, Washington. This is one of the largest, if not the largest Linux & open source event in the region and draws large crowds of smart geeks from Canada, the United States and other countries, as well as corporate sponsors like us.

One of the questions I get the most is why does Microsoft sponsors and participates this event? Microsoft has been sponsoring and participating in many open source conferences, projects and events in many parts of the world but some people are wondering why a non-corporate, pure Linux event, and some others are naturally skeptical about it.

I don’t think there’s a single reason why we rally to convince our bosses to do it, but we have been trying to do more closer to home, when it comes to open source. There is a vibrant Linux and open source ecosystem in Redmond, the Puget Sound area and the Pacific Northwest and while we have been very active in Europe and in the Bay Area, we haven’t done a good job of connecting with the people closer to home.

For example, I recently had the fantastic opportunity to help the Pacific Northwest Seismic Network from the University of Washington to run their Ubuntu-based Node.js applications for their “Quake Shake“. I think being able to help with that project or with any other project or conference in any other part of the globe is a good thing – but there’s no distance excuse for Bellingham!

Another great reason is the LFNW community itself. We love the crowd, the lively discussions, the sharing and learning spirit. And as long as we are welcome by the community we’ll continue to seek opportunities to connect with it. Plus, this is a really cool conference. This year, I’m cutting my vacations to attend the event. A coworker is skipping church duty to help. We have heard from many engineers and program managers that they will be attending and want to carpool and staff the booth. And my friend has been investing all this time in logistic ensuring we are having a meaningful presence.

The community invites some of the sponsors to bring unique content that is relevant to the participants. Last year I had the opportunity to demo a Raspberry Pi device connected to Office via Azure. Most people in the room didn’t know Office runs in a browser, or that Azure could run Linux. But they listened and they thought it was cool. Some of them are now partners, helping customers do more with open source in Azure.

This year, I want to bring more Debian to this event because I have been working a lot inside of Microsoft to get more people up to speed with Debian-based development and we have serious community momentum around Debian in Azure. In true Microsoft tradition, we will have a cake to celebrate the arrival of Debian 8. I’ll have in mind all of those friends in the Debian community with whom I’ve been working with for years to make sure we don’t drop the ball when it comes to responding to what our customers, partners and the community want when it comes to Debian.

And, hopefully, next year we’ll be back again in Bellingham for LinuxFest Northwest 2016!

Rebasing CoreOS for ephemeral cloud storage

The convenience and economy of cloud storage is indisputable, but cloud storage also presents an I/O performance challenge. For example, applications that rely too heavily on filesystem semantics and/or shared storage generally need to be rearchitected or at least have their performance reassessed when deployed in public cloud platforms.

Some of the most resilient cloud-based architectures out there minimize disk persistence across most of the solution components and try to consume either tightly engineered managed services (for databases, for examples) or persist in a very specific part of the application. This reality is more evident in container-based architectures, despite many methods to cooperate with the host operating system to provide cross-host volume functionality (i.e., volumes)

Like other public cloud vendors, Azure presents an ephemeral disk to all virtual machines. This device is generally /dev/sdb1 in Linux systems, and is mounted either by the Azure Linux agent or cloud-init in /mnt or /mnt/resource. This is an SSD device local to the rack where the VM is running so it is very convenient to use this device for any application that requires non-permanent persistence with higher IOPS. Users of MySQL, PostgreSQL and other servers regularly use this method for, say, batch jobs.

Today, you can roll out Docker containers in Azure via Ubuntu VMs (the azure-cli and walinuxagent components will set it up for you) or via CoreOS. But a seasoned Ubuntu sysadmin will find that simply moving or symlinking /var/lib/docker to /mnt/resource in a CoreOS instance and restarting Docker won’t cut it to run the containers in a higher IOPS disk. This article is designed to help you do that by explaining a few key concepts that are different in CoreOS.

First of all, in CoreOS stable Docker runs containers on btrfs. /dev/sdb1 is normally formatted with ext4, so you’ll need to unmount it (sudo umount /mnt/resource) and reformat it with btrfs (sudo mkfs.btrfs /dev/sdb1). You could also change Docker’s behaviour so it uses ext4, but it requires more systemd intervention.

Once this disk is formatted with btrfs, you need to tell CoreOS it should use it as /var/lib/docker. You accomplish this by creating a unit that runs before docker.service. This unit can be passed as custom data to the azure-cli agent or, if you have SSH access to your CoreOS instance, by dropping /etc/systemd/system/var-lib-docker.mount (file name needs to match the mountpoint) with the following:

Description=Mount ephemeral to /var/lib/docker

After systemd reloads the unit (for example, by issuing a sudo systemctl daemon-reload) the next time you start Docker, this unit should be called and /dev/sdb1 should be mounted in /var/lib/docker. Try it with sudo systemctl start docker. You can also start var-lib-docker.mount independently. Remember, there’s no service in CoreOS and /etc is largely irrelevant thanks to systemd. If you wanted to use ext4, you’d also have to replace the Docker service unit with your own.

This is a simple way to rebase your entire CoreOS Docker service to an ephemeral mount without using volumes nor changing how prebaked containers write to disk (CoreOS describes something similar for EBS) Just extrapolate this to, say, your striped LVM, RAID 0 or RAID10 for higher IOPS and persistence across reboots. And, while not meant for benchmarking, here’s the difference between the out-of-the-box /var/lib/docker vs. the ephemeral-based one:

# In OS disk

--- . ( ) ioping statistics ---
20 requests completed in 19.4 s, 88 iops, 353.0 KiB/s
min/avg/max/mdev = 550 us / 11.3 ms / 36.4 ms / 8.8 ms

# In ephemeral disk

--- . ( ) ioping statistics ---
15 requests completed in 14.5 s, 1.6 k iops, 6.4 MiB/s
min/avg/max/mdev = 532 us / 614 us / 682 us / 38 us

Understanding records in Koha

Throughout the years, I’ve found several open source ILS and most of them try to water down the way librarians have catalogued resources for years. Yes, we all agree ISO 2709 is obsolete, but MARC has proven to be very complete, and most of the efforts out there (Dublin Core, etc.) try to reduce the expression level a librarian can have. If your beef is with ISO 2709, there’s MARC-XML if you want something that is easier to debug in terms of encoding, etc.

That said, Koha faces a challenge: it needs to balance the expressiveness of MARC with the rigidness of SQL. It also needs to balance the convenience of SQL with the potential shortcomings of their database of choice (MySQL) with large collections (over a couple thousand records) and particularly with searching and indexing.

Koha’s approach to solve this problem is to incorporate Zebra to the mix. Zebra is a very elegant, but very difficult to understand piece of Danish open source software that is very good at indexing and searching resources that can come from, say, MARC. It runs as a separate process (not part of the Web stack) and it can also be enabled as a Z39.50 server (Koha itself is a Z39.50 consumer, courtesy of Perl)

The purpose of this post is to help readers navigate how records are managed in Koha and avoid frustrations when deploying Koha instances and migrating existing records.

Koha has a very simple workflow for cataloguing new resources, either from Z39.50, from a MARC (ISO 2709 or XML) file or from scratch. It has templates for cataloguing, it has the Z39.50 and MARC capabilities, and it has authorities. The use case of starting a library from scratch in Koha is actually a very solid one.

But all of the libraries I’ve worked with in the last 7 years already have a collection. This collection might be ISIS, Documanager, another SQL database or even a spreadsheet. Few of them have MARC files, and even if they had (i.e., vendors provide them), they still want ETLs to be applied (normalization, Z39.50 validations, etc.) that require processing.

So, how do we incorporate records massively into Koha? There are two methods, MARC import or fiddling with SQL directly, but only one answer: MARC import.

See, MARC can potentially have hundreds of fields and subfields, and we don’t necessarily know beforehand which ones are catalogued by the librarians, by other libraries’ librarians or even by the publisher. Trying to water it down by removing the fields we don’t “want” is simply denying a full fidelity experience for patrons.

But, in the other hand, MySQL is not designed to accommodate a random, variable number of columns. So Koha takes the most used attributes (like title or author) and “burns” them into SQL. For multivalued attributes, like subjects or items, it uses additional tables. And then it takes the MARC-XML and shoves it on a entire field.

Whoa. So what happens if a conservatorium is making heavy use of 383b (Opus number) and then want to search massively for this field/subfield combination? Well, you can’t just tell Koha to wait until MySQL loads all the XMLs in memory, blows them up and traverse them – it’s just not gonna happen within timeout.

At this point you must have figured out that the obvious solution is to drop the SQL database and go with a document-oriented database. If someone just wants to catalog 14 field/subfields and eventually a super detailed librarian comes in and starts doing 150, you would be fine.

Because right now, without that, it’s Zebra that kicks in. It behaves more like an object storage and it’s very good at searching and indexing (and it serves as Z39.50 server, which is nice) but it’s a process running separately and management can sometimes be harsh.

Earlier we discussed the use case where Koha excels: creating records from scratch. Does this mean that Koha won’t work for an existing collection? No. It just means the workflows are a tad more complicated.

I write my own Perl code to migrate records (some scripts available here, on the move to GitHub), and the output is always MARC. In the past I’ve done ISO 2709, yes, but I only do MARC-XML now. Although it can potentially use up more disk space, and it could be a bit more slow to load, it has a quick benefit for us non-English speakers: it allows to solve encoding issues faster (with the binary, I had to do hexadecimal sed’s and other weird things and it messed up with headers, etc.)

Sometimes I do one record per file (depending on the I/O reality I have to face) but you can do several at a time: a “collection” in just one file, that tends to use up more RAM but also makes it more difficult to pinpoint and solve problems with specific records. I use the bulkmarcimport tool. I make sure the holdings (field 942 in Koha unless you change it) are there before loading, otherwise I really mess up the DB. And my trial/error process usually involves using mysql’s dump and restore facilities and removing the content of the /var/lib/koha/zebradb directory, effectively starting from scratch.

Koha requires indexing, and it can be very frustrating to learn that after you import all your records, you still can’t find anything on the OPAC. Most distro packages for Koha have a helper script called koha-rebuild-zebra which helps you in the process. Actually, in my experience deploying large Koha installations, most of the management and operational issues have something to do with indexing. APT packages for Koha will install a cron task to rebuild Zebra, pointing at the extreme importance (dependency) on this process.

Since Koha now works with instance names (a combination of Zebra installations, MySQL databases and template files) you can rebuild using something like:

koha-rebuild-zebra -b -v -f mybiblio

Feel free to review how that script works and what other (Perl) scripts it calls. It’s fun and useful to understand how old pieces of Koha fit a generally new paradigm. That said, it’s time to embrace cloud patterns and practices for open source ILS – imagine using a bus topic for selective information dissemination or circulation, and an abstract document-oriented cloud storage for the catalogue, with extensive object caching for searches. And to do it all without VMs, which are usually a management nightmare for understaffed libraries.