With experience in software development, data engineering and machine learning, I specialise in data-intensive problems and decentralised data engineering at scale. My experience extends to leading teams, technical architecture and product development. Find out more about my experience and publications in my portfolio.
Contact me to see how I can help at paul@tempered.works.
I recently discovered and responsibly disclosed a vulnerability in the dbt analytics engineering solution. Google Cloud services are my default choice to process data, so I looked into how I could protect myself from data theft when I'm using BigQuery.
My work on dealing with multiple tables was interrupted when I discovered a subtle scenario that leads to DMS CDC output that cannot be correctly interpreted. I was unable to find a solution, but I will update this post if new information emerges.
In the CDC output, I get a row for each statement executed in the transaction. Each row reflects the state of the database at the moment that statement ran. How do I filter out the transient rows to get the final state of each row once the transaction has finished?
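To make the question concrete, here is a minimal sketch of one way to collapse per-statement rows into a final state. It assumes each CDC row carries a transaction identifier, a primary key and a sequence value that orders changes reliably within a transaction. The field names `transaction_id`, `id` and `seq` are hypothetical placeholders, not the names DMS emits, and the approach only works if that ordering really is trustworthy.

```python
def final_rows_per_transaction(rows, key_field="id",
                               txn_field="transaction_id", seq_field="seq"):
    """Keep only the last change seen for each (transaction, primary key) pair.

    All field names are hypothetical; substitute whatever metadata
    columns your CDC output actually provides.
    """
    latest = {}
    for row in rows:
        group = (row[txn_field], row[key_field])
        # Replace the stored row only if this one comes later in the sequence.
        if group not in latest or row[seq_field] > latest[group][seq_field]:
            latest[group] = row
    # Return the surviving rows in sequence order.
    return sorted(latest.values(), key=lambda r: r[seq_field])


# Illustrative data only: two statements touch row 10 in transaction 1.
cdc_rows = [
    {"transaction_id": 1, "id": 10, "seq": 1, "name": "a"},
    {"transaction_id": 1, "id": 10, "seq": 2, "name": "b"},
    {"transaction_id": 1, "id": 11, "seq": 3, "name": "c"},
    {"transaction_id": 2, "id": 10, "seq": 4, "name": "d"},
]
final_rows = final_rows_per_transaction(cdc_rows)
```

With this input, the first change to row 10 in transaction 1 is dropped as transient, and one row per transaction and key remains.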
Yesterday, Safety told me about CVE-2019-8341, a security issue affecting Jinja2. I'll walk through how I investigated and assessed the risk to my website and a dbt pipeline I operate in the public domain. I finish up with a commentary on why I think this vulnerability is real and should be fixed, and why I think we need to risk breaking potentially insecure usage to make vulnerability management manageable in the real world.
Last time, I set up a CDC system using the AWS RDS and DMS services. Now, I'll run some operations through the source database and show what they look like in the CDC output. I'll introduce some metadata fields that are critical to working out what the CDC output means. That sets us up to look at the specific challenges I've had interpreting these outputs reliably enough to solve real-world problems.
In this post, I set up Change Data Capture from Aurora Serverless PostgreSQL to S3 via the AWS DMS service. I'll walk through the demo setup, using the venerable Northwind dataset, calling out the problems and solutions along the way. The next post in this series will show the challenges we hit trying to work with this kind of CDC data and how we dealt with them.
CVE-2018-20225 in all versions of pip tripped my vulnerability alerting this morning. If you're scanning for vulnerabilities using Safety, you've probably seen the same alarm. This post captures my reasoning and decision-making process to understand the risk and impact of this vulnerability and then deal with it.
Updating dbt-bigquery to the latest version, 1.8.0, can fail with: No module named 'dbt.adapters.factory'. TL;DR - running pip install --force-reinstall dbt-adapters after the broken upgrade should resolve the problem. If it doesn't, delete the venv and reinstall from scratch. See my comment in dbt core issue 10135 for an explanation of the cause and why this solution works.
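As a quick way to check whether an environment is affected, or whether the reinstall worked, a sketch like the following may help. It uses only the standard library; the helper name `module_available` is my own, not part of dbt or pip. Note that looking up a dotted name imports its parent packages as a side effect.

```python
import importlib.util


def module_available(dotted_name: str) -> bool:
    """Return True if the named module can be located in this environment."""
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except (ImportError, ValueError):
        # A missing parent package raises ModuleNotFoundError (an ImportError);
        # ValueError covers modules with an unset __spec__.
        return False


if __name__ == "__main__":
    # Prints False in an environment showing the error described above.
    print(module_available("dbt.adapters.factory"))
```

If this prints False after the upgrade, the force-reinstall described above is the first thing to try.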
We data practitioners - data scientists, data engineers, analytics engineers, et al. - have a hard time when it comes to security. We're exposed to tools that demand we write code and deal with the messy world of programming languages and packages. We often have little choice but to drag insights out of real and sensitive data, exposing us to risks other developers can avoid, because insights don't hide in test data. Training, career paths and dev-experience efforts typically overlook data folks, depriving them of knowledge about the risks they're exposed to and how to mitigate them. Read on and I'll share what I do (and why) to protect myself, Equal Experts and my clients from the security risks lurking behind every piece of software.