
2018

Writing on the Dunnhumby Engineering Blog

Dunnhumby is a retail data science company that I've been working with lately. I've enjoyed writing a couple of articles for their Data Science and Engineering blog.

The first is a slightly extended version of an article from this blog, Scala Types in Scio Pipelines.

The more recent article is original and talks about the experiences we've had putting together live demos of real-time streaming data processing solutions. If you're interested, you can find that article at Building Live Streaming Demos.

It was also an opportunity to try out medium.com as a technical author. The main downsides were the difficulty of inserting code snippets (you embed CodePen or GitHub snippets, but it's a bit inconvenient compared to just writing the code) and the lack of version control. The polish and the support for writing for a third-party publication were the upsides.

Thanks to Dunnhumby for the opportunity to write on their blog!

Setting up this site with GatsbyJS and Netlify

No better time to grab an old Geocities-style under construction gif...

Every company needs a website, and Tempered Works is no exception! Having bought the domain names when I set the company up, I've been putting off getting a website up and running because I'm not really a front-end creative type. When I heard Jason Lengstorf talking to The Changelog about GatsbyJS, I was intrigued... so I tried it out.

Why GatsbyJS?

I think GatsbyJS is interesting when compared to other static site generators because it's based on GraphQL and React. I've never worked with React, so there's an opportunity to learn about that, but I think the GraphQL part is most interesting. The idea is that you can generate content on your site based on queries to other datasources. The queries are done at build time, so you still get a static site, with the associated benefits. Benefits like fewer security considerations (although there are still some - we'll get back to that), many options for cheap or free hosting, great reliability and the potential for super-fast page load times.

Where to Start?

Gatsby provides loads of "starters", projects you can use as a basis for your own. A quick look down the list and I settled on gatsby-starter-lumen. I felt it had a clean, professional look, and it seemed really quick on page loads. A quick gatsby new my-blog https://github.com/alxshelepenok/gatsby-starter-lumen later, and I had a basic project. If you're trying it out for yourself, check out the Gatsby docs to fill in the details I leave out.

I'm not sure whether I'll stick with the theme. Aside from the clean styling, it's the blog aspect and markdown support for posts that I like. After adding a couple of links and a company footer to the sidebar, the mobile view is mostly links and footer! It also feels unnecessarily narrow on my laptop, so code snippets are particularly hard to read without scrollbars. We'll see; hopefully it won't be too difficult to switch if I decide to.

Where does GraphQL fit?

After creating a dummy blog post and gatsby develop-ing my site up on localhost:8000, I decided to add social links for LinkedIn and Stack Overflow. Each component and each page looks up the data it needs with a GraphQL query. Where were the social links coming from?

The social links appear in the sidebar on every page, so the details are kept in the gatsby-config.js file, under siteMetadata > author. This config file is available to query, and each page does exactly that; the index page, for example, uses a query like the one sketched below. These pieces of data are then rendered in the Links component, which is used in the Sidebar component, which is itself used in almost every page.
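Roughly, and with illustrative field names and values rather than the starter's exact schema, the config entry looks like this:

// gatsby-config.js (simplified sketch; the author fields are illustrative)
module.exports = {
  siteMetadata: {
    author: {
      name: 'Jane Doe',
      linkedin: 'https://www.linkedin.com/in/jane-doe',
      stackoverflow: 'https://stackoverflow.com/users/000000'
    }
  }
};

and a page query that reads it looks something like this. Gatsby runs the query at build time and hands the result to the page component as props.data (newer Gatsby versions import graphql from 'gatsby'; older ones provide it globally):

// A page component's query, e.g. for the index page
export const pageQuery = graphql`
  query IndexQuery {
    site {
      siteMetadata {
        author {
          name
          linkedin
          stackoverflow
        }
      }
    }
  }
`;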

So, to add these links, I need to:

  • add the details for my new social links to gatsby-config.js,
  • update the queries to fetch those new links,
  • update the Links component to render the new links.

Unfortunately, I need to update the query to include the new links on every page that uses the sidebar! That got tedious fast, but Gatsby and GraphQL have a solution - fragments. After defining a query fragment to fetch the author details, I swapped the fragment into every query that used them. Adding or removing author details can now be done in one place. Gatsby's GraphQL documentation is a must-read!
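Gatsby makes any fragment defined in a graphql tagged template available to every other query, so the fragment can live alongside the Links component and be spread wherever the author details are needed. A rough sketch, reusing the illustrative field names from above:

// Define the fragment once, for example next to the Links component.
export const authorFragment = graphql`
  fragment AuthorDetails on Site {
    siteMetadata {
      author {
        name
        linkedin
        stackoverflow
      }
    }
  }
`;

// Each page query spreads the fragment instead of repeating the fields.
export const pageQuery = graphql`
  query IndexQuery {
    site {
      ...AuthorDetails
    }
  }
`;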

Why Host when you can Netlify!

Netlify was the obvious choice to host this site. It's free for a simple, single-user site like this and it knows how to deploy a Gatsby site. All I had to do was authorise access to my GitHub account, select the repository I wanted to deploy, and wait a few seconds for the site to build and deploy to an https:// URL with a randomly generated hostname. That leads us neatly to security and performance!

What About Security?

Even though this is a static site, there are still ways it could be abused. We don't have the traditional backend attack vectors because we don't have a server or a database. Bad actors could still get creative with JavaScript, iframes, and so on to compromise your computer or influence what you're seeing on this site. I used Mozilla's Observatory to scan the site that Netlify launched for me, and it got a D+ rating. Could be worse, I guess, but that's not good enough!

It's possible to influence the headers that Netlify serves. To keep things tidy, there's gatsby-plugin-netlify, a Gatsby plugin that can make the header configuration part of your Gatsby configuration. I started by adding the headers that Observatory recommended, to get an A+ rating. Then I relaxed the rules until the site worked again!

I like that approach, particularly when I'm using an open source project like Gatsby and the Lumen theme, because you essentially get a guided tour of what the site is doing that has security implications. I also caught a mistake because of these headers. I'd left a Giphy link to an image instead of using the site's local copy. The CSP headers disallowed it because they only allow images to be served from 'self' and Google Analytics.

It took about 10 commits before I was happy-ish with the headers and the site was working without any errors in the JavaScript console. The site gets a B+ right now, with the remaining issues being Content Security Policy specifications that are a little more lenient than we'd ideally like. It looks like the Gatsby team is working on dealing with those remaining issues.

The CSP headers I ended up with were quite verbose, and Gatsby's config file is JS, so I added a bit of code to make things a little more maintainable.

const cspDirectives = [
  "default-src 'self'",
  "script-src 'self' 'unsafe-inline' https://www.google-analytics.com",
  "font-src 'self' https://fonts.googleapis.com https://fonts.gstatic.com",
  "style-src 'self' 'unsafe-inline' https://fonts.googleapis.com",
  "img-src 'self' https://www.google-analytics.com"
];

const directivesToCspHeader = directives => directives.join(';');

I can now use these in the config like this:

{
  resolve: 'gatsby-plugin-netlify',
  options: {
    headers: {
      '/*': [
        'X-Frame-Options: DENY',
        'X-XSS-Protection: 1; mode=block',
        'X-Content-Type-Options: nosniff',
        `Content-Security-Policy: ${directivesToCspHeader(cspDirectives)}`,
        'Referrer-Policy: no-referrer-when-downgrade'
      ]
    }
  }
}

Here's the Observatory's advice on those headers.

Mozilla's Observatory, showing the summary for the website

What About Performance?

I took a similar approach to benchmarking performance, using Google's Page Speed tool. Right now, we're getting 71% on the mobile optimisation benchmark and 90% on the desktop benchmark. Whilst the site feels very snappy to me, there's probably work to do there when I have time, but at least I have a measurement to start from.

Google's Page Speed tool, showing the poor mobile performance for the website

Monitoring

The last thing to touch on is the boring operations stuff. How will I know if the site goes down or gets slow, particularly as I don't have any servers to alert me? My go-to tool for this kind of thing was Pingdom, but it looks like they've done away with their free tier. If I recall correctly, it used to be free to health-check two URLs; now you get a 14-day trial.

We can't really complain when previously free services change their terms, but before signing up I checked whether anyone else was doing this basic health checking, and I found UptimeRobot. They have a generous free tier, so I signed up there instead and pointed them at the test site. It's been checking for three hours now and everything looks good. I can also see that the response times are between 150ms and 250ms, which is a useful measure to have historical data on!

Uptime Robot's dashboard for availability and latency history, showing 100% availability and latency between 150-250ms

Finally... DNS and TLS Setup

The last thing to do is point the DNS records at Netlify, so that https://tempered.works serves the Netlify site! I bought the domain through Hover after recommendations by Steve Gibson on the Security Now! podcast. Hover is fine, but they don't support the CNAME flattening, ANAME or ALIAS records that Netlify needs to give an apex domain the full benefits. tempered.works is an apex domain; www.tempered.works would be a non-apex alternative. I want tempered.works to be my domain! I could move my DNS to Netlify, but for now I'm just pointing A records at Netlify's load balancer. You may want to choose a DNS provider that supports those newer record types if you intend to host on cloud services!

Of course, now I'm using my own domain name I need a TLS certificate that matches. Netlify's got me covered - it automatically provisioned me a free Let's Encrypt certificate for my domain. It took over half an hour, but that's no problem. Once the certificate was provisioned, I got the option of forcing connections to https://, so I turned it on. Why would you want to access this site over plaintext anyway?

That's it - tempered.works is online!

Credits

  • Under construction gif courtesy of https://giphy.com/stickers/please-construction-patient-JIejyxfnKRVv2

Scala Types in Scio Pipelines

Data pipelines in Apache Beam have a distinctly functional flavour, whichever language you use. That's because they can be distributed over a cluster of machines, so careful management of state and side-effects is important.

Spotify's Scio is an excellent Scala API for Beam. Scala's functional ideas help to cut out much of the boilerplate present in the native Java API.

Scio makes good use of Scala's tuple types, in particular pairs (x, y). Its PairSCollectionFunctions add some neat, expressive functionality to the standard SCollection to compute values based on pairs.
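For example (a sketch of my own rather than anything from Scio's docs), given an SCollection of (userId, path) pairs, counting requests per user is a one-liner thanks to those pair functions:

import com.spotify.scio.values.SCollection

// countByKey comes from PairSCollectionFunctions and yields one
// (key, count) pair per distinct key - here, requests per user.
def requestsPerUser(requests: SCollection[(String, String)]): SCollection[(String, Long)] =
  requests.countByKey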

That capability lets you write really concise code, but can make it hard to make sense of types in the middle of your pipeline. Using Scala's type keyword to alias common types can bring more clarity to your code.

An Example: Counting in Access Logs

At this point, I think I need an example. Let's say we're processing simple web server access logs. I want to know how many times each user accesses each URL and the status code they received.

Here's an example of a line from our logs:

1.2.3.4,bob,2017-01-01T00:00:00.001Z,/,200

We don't need to worry too much about where the logs are coming from. Aside from this just being an example, Beam has numerous adapters for different data sources.

I first write a case class and a parse function to turn these useless strings of characters into something nicer to work with.

import org.joda.time.Instant // Beam and Scio timestamps use Joda-Time

object AccessLog {

  case class Entry(clientIp: String, userId: String, timestamp: Instant, path: String, statusCode: Int)

  def parseLine(line: String): Entry = line.split(",") match {
    case Array(clientIp, userId, timestamp, path, statusCode) =>
      Entry(clientIp, userId, new Instant(timestamp), path, statusCode.toInt)
  }
}
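
Applying the parser to the example log line from earlier gives us a nicely-typed value to work with (the result shown as a comment):

val entry = AccessLog.parseLine("1.2.3.4,bob,2017-01-01T00:00:00.001Z,/,200")
// entry: AccessLog.Entry = Entry(1.2.3.4,bob,2017-01-01T00:00:00.001Z,/,200)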

Now, we can build a pipeline starting with this parse function. We'll build up the pipeline step by step, detailing the type signature at each point. The type at the end of the pipeline will be indicated with a comment on the next line.

sc.textFile(args("input"))
  .map(AccessLog.parseLine)

// SCollection[AccessLog.Entry]

So far so good. Now, let's map AccessLog.Entry onto the key we want to group by.

sc.textFile(args("input"))
  .map(AccessLog.parseLine)
  .map(x => (x.userId, x.path, x.statusCode))

// SCollection[(String, String, Int)]

Yuk. Now we need to remember that the first String is the userId, the second is the path and the final Int is the statusCode. It gets worse when we start aggregating, adding more complexity and numbers into the mix.

sc.textFile(args("input"))
  .map(AccessLog.parseLine)
  .map(x => (x.userId, x.path, x.statusCode))
  .countByValue

// SCollection[((String, String, Int), Long)]

This is a very simple pipeline. When you've got something more complex it gets harder to keep track of what these types mean, and when you are working with more than one pipeline it's harder still. The type system can help more than it is, so let's use it.

Once More, With Type Aliases

OK, so let's back up, and use Scala's type keyword to make the type signatures a bit more useful. Our parsing function is a convenient place to introduce additional type information to flow through the pipeline.

type ClientIp = String
type UserId = String
type Path = String
type StatusCode = Int
case class Entry(clientIp: ClientIp, userId: UserId, timestamp: Instant, path: Path, statusCode: StatusCode)

That's it. Everything still type-checks, as the "real" types haven't changed. Our new aliases will now flow through the pipeline code, allowing us to see what the types really meant at each point. Let's retrace our steps and see how these new types help us out. This time, I'll comment the types at each step for brevity.

sc.textFile(args("input"))
  .map(AccessLog.parseLine)
  // SCollection[AccessLog.Entry]
  .map(x => (x.userId, x.path, x.statusCode))
  // SCollection[(UserId, Path, StatusCode)]
  .countByValue
  // SCollection[((UserId, Path, StatusCode), Long)]

An IDE like IntelliJ (keyboard shortcut Alt-=, probably something slightly different on a Mac) will tell you the types of the values you're dealing with as you code. The type alias syntax is concise too, much better than having to create classes. It's not a lot of extra thinking or typing for a significant increase in the amount of information you have as you're writing or debugging a pipeline. The custom parse function early in the pipeline provides a neat starting point to inject this type information and have it flow through the rest of our pipeline.

Source code for this example can be found at https://github.com/brabster/beam-scala-types-example