Retrieving data from a data warehouse is a common operation for any data scientist. In August 2021, Databricks released a blog post describing how they achieved high-bandwidth connectivity with BI tools. In it, they introduced Cloud Fetch, promising a 12x experimental speedup on a dataset with 4M rows and 20 columns, achieved mainly by performing downloads in parallel. When I read this, I immediately dove head-first into the rabbit hole, hoping to reduce the time from running a SQL query to having the result in a pandas DataFrame. This blog post details how I achieved a significant (although more modest) speedup for our Databricks queries.

## The baseline

Our reference dataset is a sample of 2M rows from a table with 146 columns of mixed types. Initially I used a basic setup of the Databricks SQL Connector. It was very easy to set up and worked great for smaller queries, but for larger queries it quickly became slow:

| Method   | Speed |
|----------|-------|
| Baseline | 6m57s |
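For reference, the baseline looked roughly like this. This is a sketch, not our exact code: the connection parameters are placeholders, and the third-party import is done lazily so the snippet can be read without the `databricks-sql-connector` package installed.

```python
import pandas as pd


def fetch_baseline(
    query: str, server_hostname: str, http_path: str, access_token: str
) -> pd.DataFrame:
    """Run a query and load the result via plain fetchall()."""
    # Imported lazily; provided by the databricks-sql-connector package
    from databricks import sql

    with sql.connect(
        server_hostname=server_hostname,  # placeholder
        http_path=http_path,              # placeholder
        access_token=access_token,        # placeholder
    ) as connection:
        with connection.cursor() as cursor:
            cursor.execute(query)
            rows = cursor.fetchall()  # list of Row objects (tuple-like)
            columns = [desc[0] for desc in cursor.description]
            return pd.DataFrame(rows, columns=columns)
```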

I had already figured out one optimization while browsing the documentation: using `.fetchall_arrow()` (link) instead of `.fetchall()`. This "gets all (or all remaining) rows of a query, as a PyArrow table". That helped a lot:

| Method          | Speed |
|-----------------|-------|
| Baseline        | 6m57s |
| .fetchall_arrow | 3m38s |
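The switch is small: fetch a PyArrow table and convert it to pandas at the end. A hedged sketch (connection parameters are placeholders, and the driver import is lazy):

```python
import pandas as pd


def fetch_arrow(
    query: str, server_hostname: str, http_path: str, access_token: str
) -> pd.DataFrame:
    """Load a query result via fetchall_arrow() instead of fetchall()."""
    # Imported lazily; provided by the databricks-sql-connector package
    from databricks import sql

    with sql.connect(
        server_hostname=server_hostname,  # placeholder
        http_path=http_path,              # placeholder
        access_token=access_token,        # placeholder
    ) as connection:
        with connection.cursor() as cursor:
            cursor.execute(query)
            table = cursor.fetchall_arrow()  # returns a pyarrow.Table
            return table.to_pandas()
```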

## Cloudfetch

Hoping for blazing speeds, I set up Databricks' custom 'Simba' ODBC driver as instructed. Getting the connection string exactly right, together with Active Directory tokens, took quite an effort, but once I got connected I ran the benchmark:

| Method          | Speed |
|-----------------|-------|
| Baseline        | 6m57s |
| .fetchall_arrow | 3m38s |
| Cloudfetch      | 4m24s |
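For anyone attempting the same, the connection string looked roughly like the following. Every value is a placeholder; the attribute names are the ones I understand the Simba Spark ODBC driver to use (AuthMech=11 for Azure AD token auth), so double-check them against your driver's documentation:

```python
def build_connection_string(host: str, http_path: str, aad_token: str) -> str:
    """Assemble a hypothetical Simba Spark ODBC connection string.

    All values are placeholders; attribute names may differ per driver version.
    """
    return (
        "Driver=Simba Spark ODBC Driver;"
        f"Host={host};"
        "Port=443;"
        f"HTTPPath={http_path};"
        "SSL=1;"
        "ThriftTransport=2;"   # 2 = HTTP transport
        "AuthMech=11;"         # 11 = Azure AD token authentication
        "Auth_Flow=0;"
        f"Auth_AccessToken={aad_token};"
    )
```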

Significantly slower! This was disappointing. I re-ran the benchmark at different times of day, but busy clusters could not explain the slower results. I had to dig deeper.

Reading everything I could find online about Cloud Fetch and the Databricks ODBC driver, it seems you cannot see whether Cloud Fetch is actually enabled or working (update: you can now explicitly set the Cloud Fetch override in a cluster configuration). I did find a section stating that Databricks automatically disables Cloud Fetch for S3 buckets with versioning enabled. I checked with an infra engineer, and this was not the case for us. We tried running queries through a custom proxy to monitor traffic, and multiple connections did indeed seem to be opened. The ODBC logs showed the file connections as well.

So Cloud Fetch was likely working, but something else was going on.

## Back to basics

I estimated the final pandas DataFrame size using pandas' .memory_usage(deep=True) (link) to be ~1.5 GB. The benchmark timings translate to roughly 4.5 MB/s, while the Cloud Fetch blog post claims 500 MB/s.
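As a concrete illustration of that estimate, `.memory_usage(deep=True)` sums per-column byte usage, and `deep=True` makes pandas inspect object (string) columns for their real size instead of just counting pointers. A small self-contained example:

```python
import pandas as pd

# Small frame with mixed types, mimicking a wide table of mixed columns
df = pd.DataFrame({
    "id": range(1000),
    "value": [float(i) for i in range(1000)],
    "label": [f"row-{i}" for i in range(1000)],
})

# deep=True measures the actual bytes held by string objects
total_bytes = df.memory_usage(deep=True).sum()
total_mb = total_bytes / 1e6
print(f"{total_mb:.2f} MB")
```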

I ran a speedtest on my compute instance and confirmed bandwidth was not the problem (a comfortable 8000 MB/s down and 400 MB/s up).

Together with the infra engineer, we were unable to detect anything wrong with the Databricks / Cloud Fetch setup, so I tried something else.

## arrow-odbc

Given that the first speedup in the Databricks SQL Connector came from using Arrow tables, I searched around and found the arrow-odbc-py project. It "reads Apache Arrow batches from ODBC data sources in Python".

| Method          | Speed |
|-----------------|-------|
| Baseline        | 6m57s |
| .fetchall_arrow | 3m38s |
| Cloudfetch      | 4m24s |
| arrow-odbc      | 1m25s |

That is a very nice speedup! This connector also lets you tweak the batch size, so as a proper data scientist I ran some more benchmarks to optimize the batch size parameter. I highly recommend the memo package for this kind of analysis. Tweaking the batch size helped, but the performance gains were not huge across data sizes.
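The benchmark itself can be as simple as timing the fetch over a grid of batch sizes. A minimal sketch, where `fetch` stands in for whatever fetch function you are measuring (the memo package adds nicer bookkeeping of results, but is not required):

```python
import time


def benchmark(fetch, batch_sizes, repeats=3):
    """Time `fetch(batch_size)` for each candidate batch size."""
    results = {}
    for batch_size in batch_sizes:
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            fetch(batch_size)
            timings.append(time.perf_counter() - start)
        # Best-of-n reduces the impact of noisy neighbours / busy clusters
        results[batch_size] = min(timings)
    return results


# Usage with a real fetch function (hypothetical):
# results = benchmark(fetch_with_batch_size, [1_000, 20_000, 100_000])
```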

## Turbodbc

Another project worth mentioning is Turbodbc. It's a Python project that uses many optimizations (like Arrow and batched queries) to offer superior performance over 'vanilla' ODBC connections.

| Method          | Speed |
|-----------------|-------|
| Baseline        | 6m57s |
| .fetchall_arrow | 3m38s |
| Cloudfetch      | 4m24s |
| arrow-odbc      | 1m25s |
| turbodbc        | 1m10s |

Using the memo package I tweaked many settings, including using async I/O, strings_as_dictionary, and adaptive_integers. The gains were minor, but the combinations were still worth exploring.
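A hedged sketch of what the Turbodbc variant could look like. The `make_options` and `fetchallarrow` keyword arguments are from Turbodbc's documentation; the DSN is a placeholder, and the import is lazy because Turbodbc may not be installed everywhere:

```python
import pandas as pd


def fetch_turbodbc(query: str, dsn: str) -> pd.DataFrame:
    """Fetch a query result as pandas via turbodbc's Arrow support."""
    # Imported lazily: turbodbc needs a C++ toolchain to build
    import turbodbc

    options = turbodbc.make_options(use_async_io=True)
    connection = turbodbc.connect(dsn=dsn, turbodbc_options=options)
    try:
        cursor = connection.cursor()
        cursor.execute(query)
        # Dictionary-encoded strings and downsized integers shrink the table
        table = cursor.fetchallarrow(
            strings_as_dictionary=True,
            adaptive_integers=True,
        )
        return table.to_pandas()
    finally:
        connection.close()
```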

A downside of Turbodbc, however, is that you need additional software to compile the C++ code required for installation. The package is also available on conda, but installation was still less straightforward.

## The final setup

Ease and reliability of installation are important: we need the Databricks connection during batch deployments, in CI builds, and on various compute instances. So we decided to drop Turbodbc and go with the simpler (and only slightly slower) arrow-odbc. Here's what usage could look like:

```python
import pandas as pd
from arrow_odbc import read_arrow_batches_from_odbc


def query_to_pandas(query: str) -> pd.DataFrame | None:
    reader = read_arrow_batches_from_odbc(
        query=query,
        connection_string=get_your_connection_string(),
        batch_size=20_000,
    )
    if reader is None:
        # The statement produced no result set
        return None

    dfs = []
    # Process Arrow batches as they stream in
    for batch in reader:
        dfs.append(batch.to_pandas(timestamp_as_object=True))
    if dfs:
        return pd.concat(dfs, ignore_index=True)
    return None
```


## Conclusion

Investing some time in optimizing frequent and slow operations definitely pays off. In this case, queries are now more than 4x faster.

The rabbit hole goes much deeper, however, and the potential for further speedups is still significant. For example, Databricks Delta tables use Parquet files under the hood, which means it might be possible to hook them up to DuckDB, which in turn has many optimizations for fetching data. And there's the Apache Arrow Flight SQL project, announced in February 2022, which aims to remove many intermediate steps and natively support columnar, batched data transfers.