R for System Administrators

R Quickies: Parsing JSON Data

Author: Kwan Lowe

Date: 2015.07.15


So yesterday I was tasked with getting some information from AWS for later analysis. AWS has a complete and easy-to-use command-line front-end that works well with whatever text-processing tools are in your toolkit. I happen to be using R at the moment, and though it was not designed as a general-purpose language for system administration, it works really well for certain tasks: when your input is formatted data of some sort and you need to automate a process and export the results in a format consumable by non-Linux folks (e.g., Excel).


Now, one caveat: everything I do here could just as easily be done in other languages. The AWS CLI can output text or JSON, and it can filter and export in different ways. That output can then be parsed with awk, jq, Perl, Python or Ruby using a multitude of libraries.


So let's try to parse some data in R.


In the AWS text format, the output looks something like the following, which may be fine for reading but is not so easy to parse by machine:

TAGS aws:autoscaling:groupName DIGITALHERMIT-DEV-005

RESERVATIONS 810123459738 r-abcd1234

INSTANCES 0 x86_64 LqjVd1401234567890 False xen ami-6e09e006 i-045deb47 m2.xlarge aki-08ed0761 RCCL 2014-05-20T05:11:54.000Z ip-192-24-1-101.ec2.internal 192.22.1.101 None /dev/sda1 ebs True None subnet-df2c0599 paravirtual vpc-24626b46

BLOCKDEVICEMAPPINGS /dev/sda1

EBS 2014-05-20T05:11:58.000Z True attached vol-22e21234

MONITORING disabled

NETWORKINTERFACES Primary network interface eni-3a671234 81074942abcd 192.23.1.101 True in-use subnet-df2c1234 vpc-2462abcd


Since AWS can output JSON, which works well for this sort of data, I just let AWS do the work for me:

aws --output json ec2 describe-volumes --query 'Volumes' > /src/R/aws_volumes.json


For the purposes of your script, you could opt either to save the file locally, as we did here, or to retrieve it each time the script is run. Because my data doesn't change often, I saved it. If you do save the file, instead of filtering out the volumes at the source you may opt to pull the entire output from "describe-volumes" and save that.
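If you'd rather retrieve the data each time the script runs, one option (a sketch, assuming the aws CLI is installed and configured wherever R runs) is to call the CLI from R and parse its output directly with the jsonlite package introduced below:

library(jsonlite)

# Hypothetical live pull: run the same aws command and parse its JSON output.
raw_json <- system("aws --output json ec2 describe-volumes --query 'Volumes'",
                   intern = TRUE)
aws_volumes <- data.frame(fromJSON(paste(raw_json, collapse = "\n")))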


Like other languages, R has a rich library of functions. Half the job of writing scripts is knowing your library. Check out http://cran.r-project.org/web/packages/ for lots more. For this script I'll pull in the dplyr and jsonlite packages.

library(dplyr)

library(jsonlite)



Next, read the output we saved earlier into a data frame, an internal data structure that R can access easily (adjust the path if you saved the file elsewhere):

aws_volumes <- data.frame(fromJSON("aws_volumes.json"))
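A quick look at the structure confirms which fields came back and which names we can reference below:

str(aws_volumes)    # column names and types
head(aws_volumes)   # first few rows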


Here's where R pays off. We can now create a report based on whichever fields we need. In this case, we can pull the VolumeId, Size, VolumeType and CreateTime. These were needed to determine how much we were spending and how long ago we'd created them:

select(aws_volumes, VolumeId, Size, VolumeType, CreateTime)


If I want to generate some summary metrics on the Volume size, it's as easy as:

summarize(aws_volumes, sum(Size))

summarize(aws_volumes, mean(Size))
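If you want the numbers broken out by volume type, dplyr's group_by() chains in naturally (a sketch; the Count/TotalSize/MeanSize names are just labels I picked):

aws_volumes %>%
  group_by(VolumeType) %>%
  summarize(Count = n(), TotalSize = sum(Size), MeanSize = mean(Size))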


And because the finance folks speak in the strange tongue of Spreadsheet, we can easily generate reports for them:

write.csv(select(aws_volumes, VolumeId, Size, VolumeType, CreateTime), file = "outfile.csv" )

Or if they prefer, the Excel dialect of Spreadsheet:

library(xlsx)

output <- select(aws_volumes, VolumeId, Size, VolumeType, CreateTime)

write.xlsx(output, file = "awsVolumes.xlsx")

We can assemble it all into a script using the Rscript executable:

#!/usr/bin/Rscript


library(xlsx)

library(dplyr)

library(jsonlite)

aws_volumes <- data.frame(fromJSON("aws_volumes.json"))


output <- select(aws_volumes, VolumeId, Size, VolumeType, CreateTime)

write.xlsx(output, file = "awsVolumes.xlsx")

Though Python, Ruby, awk or other scripting languages may be more versatile, R makes quick work of manipulating many types of data. I've found it useful for processing sysstat output, running benchmark analyses, generating graphs, and even running anomaly detection.

R for System Administrators

R Quickies: Working with sysstat Data


Author: Kwan Lowe

Date: 2015.09.08


The previous introduction to R for SysAdmins received a good response. Several readers asked about using R to graph sysstat output, as I mentioned in the conclusion. This tutorial walks through using R to produce graphs of that data.


Keep in mind that R is not traditionally considered a language for system administration. Normally we turn to bash, awk, Python or Ruby for many tasks. However, R is particularly well suited to working with data and is arguably the best tool for the job.


This tutorial will walk through graphing the output from the sysstat/sar utility. There are other tools, such as kSar and Cacti, that perform similar functions. R, however, can automate some of the analysis that is missing in most (all?) of the freely available tools. For example, we can easily run Bayesian analyses on our data to alert on resource trends, or run an anomaly detection function to spot hotspots in activity. Creating interactive applications for end users, though not trivial, is made much easier by the many web packages available for R.


First, we generate some data using the sar utility.

sar > sar.out

This generates a file containing this output:

Linux 3.10.0-229.11.1.el7.x86_64 (cerberus.digitalhermit.com) 09/08/2015 _x86_64_ (2 CPU)


12:00:01 AM CPU %user %nice %system %iowait %steal %idle

12:10:01 AM all 0.11 0.00 0.17 0.02 0.00 99.70

12:20:01 AM all 0.08 0.00 0.14 0.01 0.00 99.77

[...]

02:10:01 PM all 1.17 0.00 0.30 0.06 0.00 98.48

02:20:01 PM all 0.86 0.00 0.21 0.02 0.00 98.92

Average: all 0.16 0.00 0.12 0.03 0.00 99.70


There are many options to sar, but these are beyond the scope of this tutorial.

Now we can start constructing our R script. The first step is to load a few libraries to handle dates, data manipulation and graphics. Check the resources section below for more information on these packages.

library(lubridate)

library(dplyr)      # also provides select(), mutate() and the %>% pipe used below

library(lattice)


Next, we read the input into a raw character vector. This will make it easier to deal with the non-data lines later. The sar output from my machine contains three lines of header and a final line with averages.

sar_raw <- readLines("sar.out")


The first line of our input is useful to keep, so we pull it into its own variable for later use. We use the strsplit() and unlist() functions to parse the character vector (a "string" in normal Linux parlance). We also use the %>% operator to chain operations, much like the pipe (|) character does on the command line.

sar_title <- sar_raw[1] %>% strsplit("\t") %>% unlist()


From the sar_title header we can extract some useful bits, such as the date of the report. This is important because the date is only kept in the header. To extract it we use the lubridate::mdy (month-day-year) function, which parses the date from a character string:

report_date <- mdy(sar_title[2])


This allows us to pull the year, month, day, etc. with

year(report_date)
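The other accessors work the same way; with the sample header above, month() and day() return 9 and 8:

month(report_date)

day(report_date)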


Next, we read the data into a proper data frame (the readLines() call above was just a raw read into a temporary character vector). This makes some manipulation easier, such as dropping the header and the last few lines.

sar_dat <- read.table(text=sar_raw, skip=2, nrows=(length(sar_raw) - 4), header=TRUE )
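It's worth a quick check that the header lines and the trailing Average: line were dropped before going further:

str(sar_dat)     # one row per sample interval
tail(sar_dat)    # the last rows should be real samples, not the averages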


At this point we can delete the temporary sar_raw table:

rm("sar_raw")

Now we rename the columns to make the output prettier when graphed. There are other ways of labeling, but this works in a pinch.

names(sar_dat) <- c("Time", "MM", "CPU", "User", "Nice", "System", "IOWait", "Steal", "Idle")

The main issue with assigning names() like this is that the input column order may change. Not a big worry here, but keep this in mind if you are reading in other types of data. To avoid this, you can explicitly rename individual columns using the setnames() function:

# data.table package.

# library(data.table)

# setnames(mydata, "oldname", "newname")


Then, add a new column, pTime, that combines the Time and MM columns. Though this character representation of the date/time is perfectly readable for humans, to use it properly in R we need to convert it to a POSIX date format. A couple of notes:

  • We use %I to indicate a 12-hour clock. If %H were used, it would assume a 24-hour format and drop the %p specifier.

  • By default, the POSIXct timestamp gets the current date and time. We clean this up by using the date info we saved earlier in report_date.


We first use within() to build the pTime column, then use dplyr::mutate to add a pDate column that parses pTime into a POSIX date. Finally, we correct the timestamp's year, month and day to those of the report date.

sar_dat <- within(sar_dat, pTime <- paste(Time, MM))

sar_dat <- mutate(sar_dat, pDate = parse_date_time(pTime, "%I%M%S %p", tz="America/New_York"))

year(sar_dat[,"pDate"]) <- year(report_date)

month(sar_dat[,"pDate"]) <- month(report_date)

day(sar_dat[,"pDate"]) <- day(report_date)

Now, say we're only interested in the CPU information. We can create a new dataframe containing just that information:

sar_cpu <- select(sar_dat, pDate, User, System)

Finally, graph it using the lattice graphics package. Here we plot User against pDate using the data in sar_cpu.

xyplot(User ~ pDate, sar_cpu, type = c("l"))

To make it complete, we add a title using the main= parameter, substituting the header from the sar_title we saved earlier.

sar_plot <- xyplot(User ~ pDate, sar_cpu, type = c("l"), main = sar_title[1])

print(sar_plot)


This produces the following graph:
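If the graph needs to go into an email or a report rather than an interactive session, the same plot object can be written to an image file (a minimal sketch; the filename and dimensions are just placeholders):

png("sar_user_cpu.png", width = 800, height = 400)
print(sar_plot)
dev.off()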

For these examples I deliberately added a few extra steps for clarity. Even so, this is fewer than 20 lines of code. My actual R script is under ten lines. It would be difficult to do this as easily in any other language. Even in Excel, which is notoriously difficult to automate, it takes multiple lines of code just to format the data properly.

Anyhow, I hope this whets your appetite for learning more about R.

More examples:

library(gridExtra)

plot1 <- xyplot(System ~ pDate, sar_cpu, type = c("l"), main = "System CPU")

plot2 <- xyplot(User ~ pDate, sar_cpu, type = c("l"), main = "User CPU")

grid.arrange(plot1, plot2)