
Arrows, Ducks and Time in NodeJS

Some months ago I stumbled upon DuckDB, which is often described as SQLite for analytics. Since I’m a huge fan of anything that doesn’t require me to manage its own process, I immediately added DuckDB to my list of things to check out, with maximum priority. A few weeks ago I finally found the time to play with it, and I was very impressed.

As a quick proof of concept I decided to try the NodeJS client API, since NodeJS is the runtime powering the majority of our infrastructure, but DuckDB has bindings for many other languages, such as Python and Rust.


// duckdb-async is an async wrapper around the official duckdb api
const duckdb = require('duckdb-async');

async function main() {
  // In-memory database: nothing is persisted to disk
  const db = await duckdb.Database.create(':memory:');
  // httpfs is the extension that lets DuckDB read from S3/HTTP
  await db.all('INSTALL httpfs');
  await db.all('LOAD httpfs');
  await db.all('SET enable_object_cache=true;');
  await db.all("SET s3_region='MY_REGION';");
  await db.all("SET s3_access_key_id='MY_KEY';");
  await db.all("SET s3_secret_access_key='MY_SECRET';");
  const output = await db.all("SELECT time,\"absorbedActivePower\" FROM read_parquet(['s3://my_bucket/my_file1.parquet']);");
  console.log(output);
}

main();

There’s a lot going on here, but the abstraction is so good that it only took a dozen lines of code to:

  • Create an in-memory instance of DuckDB
  • Load the httpfs extension required to query S3
  • Set the AWS S3 credentials
  • Query two columns from a Parquet file stored in S3 (ZSTD-compressed, automatically detected)

The output is something like:

[
  {
    time: 1625011200000,
    absorbedActivePower: 100
  }
]

This output format is just perfect for me, since it’s the format we typically use at work to represent points in a time series, and we happen to have a TypeScript data frame library to handle such data. This library is obviously outperformed by Pandas and Polars, but it has served us well in both backend and frontend TypeScript-based apps.

const { TimeFrame } = require('@apio/timeframes');
// duckdb-async is an async wrapper around the official duckdb api
const duckdb = require('duckdb-async');

async function main() {
  const db = await duckdb.Database.create(':memory:');
  await db.all('INSTALL httpfs');
  await db.all('LOAD httpfs');
  await db.all('SET enable_object_cache=true;');
  await db.all("SET s3_region='MY_REGION';");
  await db.all("SET s3_access_key_id='MY_KEY';");
  await db.all("SET s3_secret_access_key='MY_SECRET';");
  const output = await db.all("SELECT time,\"absorbedActivePower\" FROM read_parquet(['s3://my_bucket/my_file1.parquet']);");
  // Compute the 15-minute average of the active power: from 1 full day of data we get 96 points
  const tf = new TimeFrame({ data: output })
    .resample({ interval: 1000 * 60 * 15, operation: 'avg' });

  console.log(tf.shape()); // outputs [ 96, 1 ]
}

main();

Not having to transform data is a big deal for us: we handle devices that log around 50 measurements every minute, and our API servers are written in NodeJS, which is notoriously not great at CPU-heavy operations.

Another nice feature of DuckDB is that it can return query output in the Arrow IPC format, which lets different processes share data with zero copies and without serializing/deserializing it.
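As a rough sketch of what this looks like on the server side: recent versions of the official duckdb Node bindings expose an arrowIPCAll method on a connection, and depending on the version the Arrow IPC helpers may require loading the arrow extension first. duckdb-async may not wrap these methods, so treat the exact calls below as an assumption to verify against your version.

// Sketch: query DuckDB and get the result as Arrow IPC buffers instead of JS objects.
// Assumes the official 'duckdb' Node bindings with Arrow IPC support.
const duckdb = require('duckdb');

const db = new duckdb.Database(':memory:');
const con = db.connect();

// Depending on your DuckDB version, the Arrow IPC helpers may require the arrow extension
con.exec('INSTALL arrow; LOAD arrow;', (err) => {
  if (err) throw err;
  con.arrowIPCAll(
    "SELECT time, \"absorbedActivePower\" FROM read_parquet(['s3://my_bucket/my_file1.parquet'])",
    (err, ipcResult) => {
      if (err) throw err;
      // ipcResult holds the Arrow IPC bytes; they can be sent to a client as-is,
      // e.g. as the body of an HTTP response, with no JSON serialization step
      console.log(ipcResult.length);
    }
  );
});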

If you were to pass time series data from your server to a client application for visualization using JSON or other formats, you’d have to:

  • Serialize your in memory data to JSON in the backend
  • Pass it to the client
  • Deserialize from JSON back into the in-memory format
  • Render the data

With Arrow you just send the Arrow-formatted bytes over the wire and the client can load them straight into memory, as in the sketch below.
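On the client side, the apache-arrow JS package can map those bytes directly into a table. A minimal sketch, assuming a recent apache-arrow version (which exposes tableFromIPC) and a hypothetical endpoint that returns the IPC bytes produced by the server:

// Sketch: loading Arrow IPC data in a browser or Node 18+ client with the apache-arrow package.
// The endpoint URL is hypothetical; it is assumed to return the Arrow IPC bytes from the server.
const { tableFromIPC } = require('apache-arrow');

async function loadTimeseries() {
  const response = await fetch('/api/timeseries.arrow');
  // No JSON.parse: the IPC buffer is read into an Arrow Table as-is
  const table = tableFromIPC(new Uint8Array(await response.arrayBuffer()));
  console.log(table.numRows, table.schema.fields.map((f) => f.name));
  return table;
}

loadTimeseries();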

All of this fits very well into our use cases, primarily because storing a huge amount of time series data in regular databases, even the time series oriented ones, becomes very expensive very quickly: we write hundreds of thousands of data points every minute across several instances (on-premise and managed) of our platform. For this reason we usually rotate old data to S3 (or similar cloud storage offerings) as ZSTD-compressed Parquet files. These files are indexed by:

  • namespace: e.g. organization/client
  • table: the device that produced the measurements
  • columns: the individual measurements

Then we have an API that translates queries into file paths, reads the requested columns, and transforms the data according to the query; the sketch below illustrates the idea.
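A hypothetical sketch of that path-building step (the bucket name, path layout and helper function are assumptions for illustration, not our actual code):

// Hypothetical sketch: turn a query into an S3 glob and let DuckDB read only the requested columns.
// The namespace/table/*.parquet layout and bucket name are assumptions.
const duckdb = require('duckdb-async');

function buildQuery({ namespace, table, columns }) {
  const path = `s3://my_bucket/${namespace}/${table}/*.parquet`;
  const cols = columns.map((c) => `"${c}"`).join(', ');
  return `SELECT time, ${cols} FROM read_parquet(['${path}']);`;
}

async function run() {
  const db = await duckdb.Database.create(':memory:');
  await db.all('INSTALL httpfs');
  await db.all('LOAD httpfs');
  // S3 credentials omitted, see the snippets above
  const sql = buildQuery({
    namespace: 'org_a/client_b',
    table: 'device_123',
    columns: ['absorbedActivePower'],
  });
  return db.all(sql);
}

run().then((rows) => console.log(rows.length));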

With DuckDB we just have to write a query and it handles the rest for us; on top of that, the output is compatible with our tooling, so we don’t need to perform expensive transformations.
