A data science blog

Wanderings in data science, data applications and data engineering.

Implementing a Decision Tree from scratch using C++

Lessons from a Data Scientist: Python has risen to become the king of languages for data science. Most new data scientists and programmers continue to learn Python as their first language. This is for good reason: Python has a shallow learning curve, a strong community and a rich ecosystem of data science libraries. I started my Data Science journey with Python, and it continues to be my most common tool of choice for solving Data Science problems....

 · 7 min · Hamish Lamotte

AWS Lambda Batch and Trigger Parser

Lambdas are an ideal tool on AWS for parsing files landing in S3 as part of an ETL pipeline. Setting up a parsing process that handles both catch-up (historical files previously landed in S3) and new-file parsing requires a bit of extra work. We have created a framework for quickly building a parsing solution with both capabilities, and the code is available here. Solution details: The diagram below shows the architecture of the solution....

 · 2 min

AWS generate S3 objects manifest using Python

AWS S3 Batch Operations can quickly process large quantities of ETL data by invoking a Lambda function. However, you first need to create a manifest file describing all the objects you want to process. I couldn’t find any quick way online to create these manifests, so I put together a solution in Python. You can find the GitHub code here. Instructions to generate an S3 manifest CSV: For creating a CSV manifest list of all files in an S3 bucket with a certain prefix and suffix....

 · 1 min · Hamish Lamotte
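As a rough illustration of the prefix and suffix filtering described in the excerpt above, here is a minimal sketch. It is not the code from the linked repository. It assumes the manifest is a headerless CSV with one `bucket,key` row per object and that keys are URL-encoded; the function names, bucket name and keys are hypothetical. The keys are passed in as a plain list so the sketch runs without AWS access.

```python
import csv
import io
from urllib.parse import quote


def manifest_rows(bucket, keys, prefix="", suffix=""):
    """Yield (bucket, url_encoded_key) rows for keys matching prefix and suffix."""
    for key in keys:
        if key.startswith(prefix) and key.endswith(suffix):
            # Assumption: keys are URL-encoded in the manifest, with "/" left intact.
            yield bucket, quote(key, safe="/")


def manifest_csv(bucket, keys, prefix="", suffix=""):
    """Return the manifest as CSV text: one 'bucket,key' row per object, no header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(manifest_rows(bucket, keys, prefix, suffix))
    return buffer.getvalue()


# Hypothetical keys; in practice they would come from listing the bucket.
example_keys = [
    "raw/2021/a.json",
    "raw/2021/b.csv",
    "raw/2021/c d.json",
    "other/e.json",
]
example_manifest = manifest_csv(
    "my-example-bucket", example_keys, prefix="raw/", suffix=".json"
)
print(example_manifest)
```

Separating the filtering from the listing keeps the sketch easy to test; the list of keys could be replaced by whatever paginated listing call you use against the bucket.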

Incremental window functions using AWS Glue Bookmarks

The out-of-order data landing problem: Applying window functions over data is non-trivial if the data arrives out of order (with respect to the dimension the window function is applied across). For clarity, let's take timeseries data as our example, with time as the window dimension. If timeseries data arrives for Tuesday through Thursday of a week, and data for Monday of that week arrives later, the data has arrived out of order. Because the output of a window function depends on the surrounding rows in time, the results of the window function would be altered by the new out-of-order data that landed....

 · 3 min · Hamish Lamotte
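The effect described in the excerpt above can be seen in a small sketch. This is plain Python rather than AWS Glue, and the two-point trailing average, the function name and the values are all made up for illustration: Tuesday through Thursday land first, then Monday lands late and changes a result that had already been computed.

```python
def trailing_mean(points, window=2):
    """Trailing moving average over (day_index, value) points, ordered by day."""
    ordered = sorted(points)
    result = {}
    for i, (day, _) in enumerate(ordered):
        # Take up to `window` values ending at the current day.
        values = [v for _, v in ordered[max(0, i - window + 1): i + 1]]
        result[day] = sum(values) / len(values)
    return result


# Day indices: 0 = Monday, 1 = Tuesday, 2 = Wednesday, 3 = Thursday.
first_batch = [(1, 2.0), (2, 4.0), (3, 6.0)]  # Tuesday to Thursday land first
before = trailing_mean(first_batch)

late_monday = [(0, 10.0)]  # Monday lands afterwards
after = trailing_mean(first_batch + late_monday)

print(before[1], after[1])  # Tuesday's value before and after Monday arrives
```

Only Tuesday's value changes here because the window spans two points; a wider window would invalidate more of the previously computed results.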

Incremental join using AWS Glue Bookmarks

The problem: I was recently presented with the challenge of joining two timeseries datasets on their timestamps without requiring the corresponding data from each dataset to arrive at the same time. For example, data for one day last month from one dataset may have landed on S3 a week ago, while the corresponding data for that day from the other dataset may have landed yesterday. This is an incremental join problem....

 · 4 min · Hamish Lamotte
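To make the incremental join problem in the excerpt above concrete, here is a minimal in-memory sketch. It does not use AWS Glue or bookmarks and is not the post's solution; the class name `IncrementalJoiner` is hypothetical, and it assumes at most one record per timestamp on each side. Records that have no partner yet are held back until the matching record lands in a later batch.

```python
class IncrementalJoiner:
    """Join two streams on timestamp, batch by batch, holding unmatched records."""

    def __init__(self):
        self.left = {}   # timestamp -> left value still waiting for a match
        self.right = {}  # timestamp -> right value still waiting for a match

    def add_batch(self, side, records):
        """Add (timestamp, value) records for one side; return newly joined rows."""
        if side not in ("left", "right"):
            raise ValueError("side must be 'left' or 'right'")
        own, other = (self.left, self.right) if side == "left" else (self.right, self.left)
        joined = []
        for ts, value in records:
            if ts in other:
                match = other.pop(ts)
                # Always emit rows as (timestamp, left_value, right_value).
                joined.append((ts, value, match) if side == "left" else (ts, match, value))
            else:
                own[ts] = value  # keep until the other side arrives
        return joined


joiner = IncrementalJoiner()
first = joiner.add_batch("left", [("2021-03-01", 1), ("2021-03-02", 2)])
second = joiner.add_batch("right", [("2021-03-01", "a")])
print(first, second, joiner.left)
```

The held-back dictionaries play the role that persisted state would play in a real pipeline, where unmatched data has to survive between job runs.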