Avoiding CAST ROW in AWS Athena SQL
There is more than one way to build a ROW in AWS Athena and the underlying Trino engine. It turns out I was doing it the verbose, brittle and really annoying way. Row subqueries are so much better!
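To sketch the difference (illustrative values and names, not taken from the post itself), here's the verbose CAST ROW construction next to the row subquery equivalent, assuming Athena/Trino's support for multi-column row subqueries, which is what the post relies on:

```sql
-- The verbose way: name and type every field by hand.
SELECT CAST(ROW('alice', 42) AS ROW(name VARCHAR, score INTEGER)) AS player;

-- The row subquery way: field names and types come from the
-- column aliases, so there's nothing to repeat or get wrong.
SELECT (SELECT 'alice' AS name, 42 AS score) AS player;
```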
Paul Brabban, Lead Consultant at Equal Experts
With experience in software development, data engineering and machine learning, I specialise in data-intensive problems and decentralised data engineering at scale. My experience extends to leading teams, technical architecture and product development. Find out more about my experience and publications in my portfolio.
Contact me at paul@tempered.works to see how I can help.
I noticed failing builds this week, with a dbt-core vulnerability flagged by SafetyCLI as the culprit. There was no actionable information in the command line output, so here's what I found and the action I took.
The Oct 15, 2024 update of Google Chrome stable (130.0.6723.58) suddenly broke some sites, such as GitHub Codespaces and 1Password, due to a JavaScript-related setting.
In the last article of the series, I ran into difficulties as I tried to "time-travel" back to earlier points in time. That's an important capability both for correct operation and for reproducing results. This article shows how to use window functions (also known as analytic functions) to simplify the handling of processing time and avoid the earlier problems.
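As a rough sketch of the technique (a hypothetical change log with invented id and processed_at columns, not the article's own schema), a window function ranks the versions of each record so that a simple filter recovers the state at any chosen point in processing time:

```sql
-- Hypothetical change log: one row per captured version of each
-- record, stamped with the processing time at which it arrived.
WITH ranked AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY id              -- one winner per record
      ORDER BY processed_at DESC   -- latest version first
    ) AS rn
  FROM changes
  -- "time travel": only consider versions processed by the cutoff
  WHERE processed_at <= TIMESTAMP '2023-01-01 00:00:00'
)
SELECT *
FROM ranked
WHERE rn = 1;
```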
This walkthrough shows how I can use the functional programming techniques map and filter that I already know and love in SQL engines like BigQuery. These techniques give me a lot of processing power whilst keeping my SQL simple and relatively easy to understand. Unlike custom code, I can use the same SQL and infrastructure to process ten rows or ten billion rows in seconds.
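For a flavour of what that looks like (an inline array invented for illustration), BigQuery expresses map and filter as subqueries over UNNEST:

```sql
-- An inline array to play with.
WITH example AS (
  SELECT [1, 2, 3, 4, 5] AS xs
)
SELECT
  -- map: square every element
  ARRAY(SELECT x * x FROM UNNEST(xs) AS x) AS squared,
  -- filter: keep only the even elements
  ARRAY(SELECT x FROM UNNEST(xs) AS x WHERE MOD(x, 2) = 0) AS evens
FROM example;
```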
I recently discovered and responsibly disclosed a vulnerability in the dbt analytics engineering solution. Google Cloud services are my default choice to process data, so I looked into how I could protect myself from data theft when I'm using BigQuery.
In previous posts, I disambiguated transactions by filtering out any transient statements and noted that changes in primary key values cause big problems. Now I can start to answer useful questions about the current and historical changes in a source table, learning something important about window functions along the way.
My work on dealing with multiple tables was interrupted when I discovered a subtle scenario in which DMS CDC output cannot be correctly interpreted. I was unable to find a solution, but I will update this post if new information emerges.
In the CDC output, I get a row for each statement executed in a transaction. Each row reflects the state of the database at the point that statement ran. How do I filter out all the transient statements to get the final state of each row once the transaction has finished?
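The shape of the answer is a window function that ranks statements within each transaction and keeps only the last one (a sketch over a hypothetical cdc_output table with invented column names; the post works through the details):

```sql
-- Hypothetical CDC output: one row per statement in a transaction,
-- each reflecting the database state as that statement executed.
WITH ordered AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY transaction_id, id   -- each row the transaction touched
      ORDER BY statement_position DESC  -- the last statement wins
    ) AS rn
  FROM cdc_output
)
SELECT *
FROM ordered
WHERE rn = 1;  -- the final state of each row in each transaction
```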
Yesterday, Safety told me about CVE-2019-8341, a security issue affecting Jinja2. I'll walk through how I investigated and assessed the risk to my website and to a dbt pipeline I operate in the public domain. I finish with a commentary on why I think this vulnerability is real and should be fixed, and why we need to risk breaking potentially insecure usage to make vulnerability management manageable in the real world.