When process becomes ritual
There's an old joke about turkeys. A college student, celebrating his first Thanksgiving away from home, prepares a turkey for his dorm mates just as his mother always did. The turkey turns out excellent, but someone asks, "Why did you cut off the drumsticks?" "That's the way Mom always did it," replies the student. That night he calls his mom to ask about it. "Well," she tells him, "that's the only way it would fit into the oven."
The lesson here is that the process has become ritualized. The original rationale has been forgotten, but we continue the process because that's the way it has always been done.
This poses an interesting problem for IT folks. On the one hand, we must be wary that our processes do not devolve into rituals. On the other, we are often asked to standardize our products and our processes and, in a sense, are forced to ritualize our procedures ("Just do it. Don't ask why."). Or, to be less facetious, "Here, run this Python script before you start the application."
The reasons for the latter approach can be complex. Perhaps we are turning a process over to a first-tier resource or an end user. They may not grasp or care about the Why but can perhaps understand the How. And the thing is, if we pay heed to industry best practices, we will obscure the Why: we reduce the apparent complexity of our processes by hiding it behind orchestration and clever front-ends.
And that's where a danger lies. To understand the Why of many processes, one must often be intimately familiar with the technical aspects of the problem, and down that path lies madness. It's like trying to measure the coastline*. It is, at heart, a problem of complexity: the closer we inspect a process, the more details we need to grasp. When building a website, we cannot be worrying about how the underlying LVMs are assembled at the LUN level.
In fact, just such a thing occurred the other day. We were tasked with developing a backup/restore process for a set of cloud-based systems. On the face of it, the process is straightforward: each night, run a script that snapshots the underlying volumes. This approach works beautifully in our on-premises environment, and from the OS support standpoint it also met our requirements.
Alas, inspecting the coastline revealed some other problems. For one, cloud snapshots across multiple volumes are not guaranteed to run synchronously. The script may submit the snapshot job for each volume within a second or two of the others, but that second can wreak all sorts of merry havoc with Logical Volumes spread across multiple Physical Volumes -- i.e., exactly what we had in the cloud environment.
The solution turned out to rely on the operating system to make these snapshots at the LVM layer. Oh no, weren't folks saying that the OS is irrelevant in the cloud?!
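An LVM-layer snapshot captures an entire Logical Volume atomically, no matter how many Physical Volumes (cloud disks) happen to back it, which is what sidesteps the consistency problem. As a minimal dry-run sketch -- volume group, LV, and snapshot names are made up, and our actual script isn't shown here -- the command such a nightly job might build looks like this:

```go
package main

import "fmt"

// lvmSnapshotCmd builds an lvcreate invocation that snapshots a
// Logical Volume atomically at the LVM layer. All names here are
// illustrative; a real job would execute (not just print) the command.
func lvmSnapshotCmd(vg, lv, snapName, size string) string {
	return fmt.Sprintf("lvcreate --snapshot --name %s --size %s /dev/%s/%s",
		snapName, size, vg, lv)
}

func main() {
	// Dry run: print the command rather than executing it.
	fmt.Println(lvmSnapshotCmd("datavg", "applv", "applv_snap", "10G"))
}
```

Because the snapshot happens inside the OS, it is consistent across however many cloud volumes make up the volume group -- exactly what the per-volume cloud snapshots could not guarantee.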
And this of course leads to a larger question: Should we be doing system-level backups in the cloud? That sounds like another ritual to me.
My point here is not to say that we throw out our processes in the name of change. Rather, we must be diligent (and hopefully rigorous) in how we decide what processes and standards move forward as the IT environment changes.
Thanks for reading.
* Measuring the coastline refers to the Coastline Paradox: the better one measures, the longer the coastline becomes. Infinitely long, in fact. (http://en.wikipedia.org/wiki/Coastline_paradox)
Occam's razor and IT
Ever heard of Occam's Razor (also spelled Ockham's)? Of course you have. No, it's not a new gadget that will topple the billion-dollar disposable shaving industry, but a principle.
Except that many get it wrong.
Many think Occam's Razor means, "Simpler is better." Alas, it's not quite so simple. Wikipedia sums it up as:
Among competing hypotheses, the one with the fewest assumptions should be selected.
It's not saying that simpler is better, but that given alternative hypotheses that satisfy the constraints, the one with the fewest assumptions should be selected. "Simplify, but not too much," in other words.
Years ago it was expected that a system administrator wore many hats. We managed the mail services, maintained file shares, added storage, ran backups, wrote usage reports. In the fledgling days of the Internet, we were even web masters and programmers as the need arose.
As IT grew, specialization naturally occurred. Things that were once just a routine part of an administrator's week became specialized disciplines. Maybe the admin who was particularly adept with tuning the NICs on a Sun E6500 became a network engineer. Maybe the operator who helped update the procmail rules and knew sendmail became a messaging solutions expert.
Today, some of the old school admins (the BOFHs, the sysops) look at these highly specialized titles and chuckle.
And that's a mistake.
The IT infrastructure has become several orders of magnitude more complex than it was in the past. The amount of change in a year is so great that -- to be on top of our fields -- we are constantly in class and constantly learning and experimenting with technology.
If we're really good, we make the complexities of our discipline transparent to our peers. If we manage the OS, we are thankful that there are storage experts who worry about MTBF rates and exponential growth year over year. We are glad that the business continuity experts figure out how to recover a server when an application owner completely neglected to plan for backups. Heck, I'm grateful that someone else figures out optimal MTUs and wires together disparate networks riding on competing vendors and technologies. These are just the complexities I know about. The unknown unknowns are thankfully hidden away from me by other engineers.
And there's the rub...
These really good engineers, by making the complexities seem simple, may convince others that what they do is indeed simple. And we all know what happens next. During a planning meeting someone dismisses the complexity. "I can buy a 2TB disk for $100. Why does my storage cost $2K?" Or, "Why can't you load Ubuntu? You can download it for free."
In other words, they've simplified too much.
There are no easy solutions. DevOps and CI and Scrum and other methods can be used to foster communication between different specializations. They can work but are highly dependent on corporate culture.
Most importantly, I realize that respect for one's peers is the only way to manage the complexity without over-simplifying. Assume that what others do is as complex as what I do, and where I think it may be simple, assume that they have gone to great lengths to make that complexity manageable to those outside their discipline.
Brave new cloud
(Originally published October 2014)
Like many other Linux admins, we're patching systems for the Shellshock vulnerability.
I'm sort of old and doddering, with a wont to reminisce about the good old days (back in January 2014, say) before Clouds and DevOps and IoT. Back then we had standards and best practices honed over decades: least privilege, defense in depth, layered security, DMAIC methodologies, etc.
Now I keep hearing that the OS is pretty much irrelevant. It's just an extension of the Application. Those controls from back in the day (January 2014, as I mentioned) don't really apply anymore. Why add LVMs and firewalls and configuration agents and monitoring when all those things are really just there to support an OS?
Then we get another Shellshock.
That command first runs whoami to find out the name of the user running the web server. That's especially useful because if the web server is being run as root (the superuser who can do anything) then the server will be a particularly rich target.
That's from another website that discusses attack vectors for the vulnerability. In the olden days we required applications to never run as the root user. Users complained. A regular account just does not have the permissions we need. So we worked with them, chmod'ed a few directories, added some file capabilities, maybe set a few ACLs. It took some coordination between the teams but we got it working. We even automated the process so that future builds automatically had our somewhat complex requirements in place.
Then cloud methodologies roll out. It's faster. It's arguably better. It takes advantage of the horde of itty bitty machines in massive, Matrix-like pod farms.
We can now spin up an entire server faster than we could add a user a few years ago.
Suddenly the extra coordination required to meet our old baseline requirements and still spin out applications at Interweb speeds seems a throwback to that bygone era of relational databases and redundant power supplies.
So they run the applications as root. And because of this they're more vulnerable to inevitable bugs like Shellshock.
Don't get me wrong. I'm not arguing to retain all legacy processes. I don't agree with managing endpoint authentication. I think template approaches to managing virtual systems are very much a holdover from the 1990s.
Thing is, systems are complex. We often hear, "Simpler is better," but that's not quite true. The goal is not to simplify but to simplify while meeting complex requirements. Our users often don't understand why our processes are in place, so they simplify and often misunderstand. It's human nature, and it's why folks still believe that "we only use 10% of our brain" or that Occam's razor applies generally to all problems. Or that running an application as root is perfectly acceptable because "Cloud".
Configuring the OS is not simple but that does not mean it cannot be automated. The danger occurs when we leave the configuration of complex systems to those who do not understand those complexities.
Cloud is not about raw speed but about automating complex environments so the product is reproducible and auditable and meets requirements from disparate areas.
DevOps is not about application administrators taking over the role of the system engineers, but about both working as a team to produce a cohesive product.
Anyhoo, the point of this rant is to say that even in this Cloud and Nebula and DevOps age, we can't throw out proven processes so easily. The OS still matters.
(Originally published February 2019)
Over the long President's Day weekend, I decided to learn Go. The Go programming language (golang) is a relatively recent language designed to improve programming productivity. Highly performant, it has features similar to C or Java without requiring a complex compiler toolchain and build environment. Though intended for multi-core concurrency and massive codebases, it has features that also make it ideal for infrastructure tasks.
But there's Python. For several years, Python (and an occasional Bash script) has been my de facto language for everything from ad hoc reporting scripts to managing cloud infrastructure to scheduled infrastructure jobs that copy, deploy, audit and maintain the environment. It has performed admirably in this role and has saved me countless hours.
So why look elsewhere?
A little over ten years ago, Perl was my de facto language for everything from ad hoc reporting scripts to managing VMWare infrastructure to scheduled jobs that… You get the idea.
Then last week I was debugging a Python script that moved data from Kafka to Azure Data Lake. There were some issues with a decoding library that broke when trying to deserialize the Avro stream. In the end, I punted and used some working Java code to run the deserialization, Python to move the file to ADLS, and a Bash wrapper to tie it all together. In defense of this Frankensteinian melange, these components already existed and needed only minor tweaks to get working. And the solution was just an interim hack until proper development could be started. This is what is known as "famous last words."
A few annoyances occurred. The system Python version was 2.7. No big deal. Add a couple of from __future__ imports and most of the code just worked. Most of it. One particular library existed solely in Python 3. I set up a virtualenv and moved my code over. The ConfigParser library required some minor tweaks. The optparse calls got replaced with argparse. I checked my code into Bitbucket and wrote an Ansible playbook to deploy it. Alas, it would run as the data transfer user, so I had to tweak the code to figure out the Python virtualenv.
Then there was the Java bit, which I had received as a Maven project. Arguably, if I were in an IDE the setup would have taken minutes. Alas, my "IDE" is a GNU screen session and Vim. I manually edited the pom.xml to include dependencies, installed Larry's Java, and built the Development binary. It worked.
But then came time for the Staging environment. Oh durn, it was an interim hack so there were some hard-coded bits that needed to be changed. OK, managing three sets of binaries for the three (or four) environments was not feasible, even for a hack. I added some Java code to read from a configuration file, re-built and redeployed.
Durn, the Staging environment didn't have a Java8 environment. I reworked the configuration to use OpenJDK-8. I re-deployed. I pushed the Bash wrapper script. I pushed the schedule to Jenkins. And yes, it all worked.
But it never sat right with me. The part of me that twitched when a vendor misspelled a blob storage container practically screamed at a solution that involved Java, Python, Bash, Jenkins and even a bit of awk. So I went about manually decoding the Avro stream byte by tortuous byte (OK, I exaggerate a bit). I finally got some working code that built a JSON object from the Avro stream.
Oh, but during debugging I'd built a bleeding-edge version of the rdkafka library pulled straight from GitHub, with a C-based Avro decoder there purely for speed. Getting this working required adding a few development headers and rebuilding. Of course, since it lived in my home directory, some LD_LIBRARY_PATH and other path manipulation was needed. So now my stack included Java, Python, Bash, awk, Jenkins and C. And how the heck would I deploy this?
A Better Way
Go could potentially solve this mess. Versioning would not be a problem: the maintainers of the language and standard libraries go to great pains to ensure compatibility. And because Go is compiled, I could ship the binary and it would just run, without needing a full environment deploy. The fact that the import statement functions as a de facto requirements.txt also helps with reproducing an environment.
To learn Go, I decided to convert a few Python and Bash scripts over. The first hour was spent familiarizing myself with the toolchain, installing the language, setting up an IDE, and getting a "Hello, World!" running.
My first exercise was to convert a simple Python sorting script to Go. This script read some default parameters from a configuration file, examined the files in a staging directory, and depending on a field in the filename, moved it to a new directory.
The first step was to read the defaults. In Python, I used the ConfigParser library. In Go, the TOML format seemed most accepted and Go-like. Though there were Go INI readers, I decided to go the TOML route for the novelty.
Converted into Go, it looks like this:
Next, we had to sort the files by filename. In Python, I had a list of filetypes and each filename was compared against two lists. If there was a match, the file would be moved.
In Go, there is no "in" operator, so I grabbed an equivalent function from a Google search. The overall logic is otherwise identical.
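A sketch of that helper and the matching logic -- the list contents, filename layout, and directory names are illustrative, not the originals (and, for what it's worth, Go 1.21's slices.Contains now does this out of the box):

```go
package main

import (
	"fmt"
	"strings"
)

// contains reports whether needle is present in haystack,
// the usual stand-in for Python's "in" operator.
func contains(haystack []string, needle string) bool {
	for _, s := range haystack {
		if s == needle {
			return true
		}
	}
	return false
}

// destinationFor decides where a file should move based on a field
// in its name. The field position and type lists are made up here.
func destinationFor(filename string, rawTypes, reportTypes []string) string {
	fields := strings.Split(filename, "_")
	if len(fields) < 2 {
		return "" // unrecognized layout; leave the file alone
	}
	fileType := fields[1]
	switch {
	case contains(rawTypes, fileType):
		return "raw"
	case contains(reportTypes, fileType):
		return "reports"
	default:
		return ""
	}
}

func main() {
	raw := []string{"avro", "csv"}
	reports := []string{"pdf"}
	for _, name := range []string{"feed_avro_20190218", "daily_pdf_20190218"} {
		fmt.Println(name, "->", destinationFor(name, raw, reports))
	}
}
```

The actual move is then just an os.Rename on the matched file, exactly as the Python version used shutil.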
Even for this trivial example, there were benefits to using Go. For example, having a single binary to deploy, rather than pushing a virtual environment, saves time. For this script the savings are measured in minutes, but the potential savings on a more complex task, such as the Kafka reader, are easily in hours.
I am in no way knocking Python. It will continue to play a large role in infrastructure automation and tooling, but Go does impress. In just a couple of hours I was able to get something functional. After one day, I was already uploading files to Azure, reading Kafka queues, and even had Dockerized versions ready to upload to our K8s cluster.
IT Lessons Learned While Dragon Boating
I was a dragon boater for several years. It was one of the more fulfilling periods in my life and I took away many lessons from those training and racing experiences. We competed around the country from California to Rhode Island to Florida, in rivers and lakes and oceans, in small boats and big boats. Being at an inflection point in my career, I was thinking today about the many lessons learned on the boat and off through literal blood, sweat and tears.
Many Hands Make Light Work
At the start of practice when we lifted the boat from the rack to the waterway and vice versa, the coaches insisted that everyone helped. This wasn't always fun because, damn, that boat was heavy. And at the end of a practice our arms and legs were exhausted and carrying several hundred pounds of dragon boat to shore wasn't easy. But everyone did their part.
During practice we did solo drills. Instead of the full complement of paddlers paddling, only two would be active. It's near impossible to describe how difficult it was to move a boat built for twenty with just one other person. And when the solo drills ended and everyone joined in, it seemed so easy.
And therein lay the danger. When everyone was paddling, it was sometimes easier to "coast" and let your teammates do the work. Do this in a race and you lose.
Sure, in an era of distributed systems and GPU computing with thousands of compute units, it seems that this is patently obvious. But it's deeper than this. Paddling with friends and teammates, we understood that sometimes it was better to leave paddlers on shore if they weren't "pulling their weight" in a literal sense.
I'm glad for my teammates in IT who did their part. Whether it was taking the duty phone or working the extra hour or six to get a server built, they pulled their weight. I'm thankful for the managers who stayed right alongside the engineering team on a 4 AM break-fix call.
The Drummer and Steersman are Critical
On the dragon boat there are two team members who do not paddle -- the drummer and the steersman. Their functions are critical because if the paddlers are not in perfect synchrony, chaos ensues. Paddles crash together, momentum is lost, and there is a very real danger that the boat will overturn.
The drummer is not just drumming; he must be fully cognizant of the paddlers and the course. Too fast a pace at the beginning and the team will be exhausted by the critical final burst.
The steersman is not just steering; he must guide not only the boat but the paddlers. In one race, the left side of the boat "held" while the right side paddled. This allowed the boat to execute a sharp turn while other boats went far wide and lost critical seconds. And if the boat approached a turn too quickly, so much time would be lost in the turn that a slower boat could overtake it.
And in IT we have "sprints," and these are so similar to a dragon boat race that this writes itself. Whether the coordination comes from an enlightened manager, a process, team discipline or pure willpower, without synchrony, races will be lost. With "show me the code" and "fail fast" approaches (valid processes, if done right), it's sometimes easy to rush forward.
Equipment Must be Maintained
On a routine basis we patched our boat and our equipment. This was not just for the occasional leak but for broken seats, rough gunnels, and even drum mounts. We maintained our life vests for regulations and our seat cushions for comfort. When new carbon fiber paddles were available, we bought them because they were lighter and stronger.
With budget priorities often focused on the new shiny, it's easy to neglect that maintenance. It's just as easy to neglect training on new technologies.
The Boat Goes On
Finally, during my time on the team, we gained and lost many teammates. Some moved away and left the sport. Some went to other teams. Some passed away. The team remained. A boat without paddlers is just a hunk of wood.
And this is true not just in IT, but everywhere. Co-workers come and go. They served critical functions, but the team must remain even after teammates move on.