Featured image of post Make Excel PowerQuery/PowerBI fly with DuckDB

Make Excel PowerQuery/PowerBI fly with DuckDB

Read gigabytes of CSV files ludicrously fast in Power Query thanks to DuckDB

In my previous post, I talked about DuckDB. To keep things concise, I left out a cool trick: its performance advantages can also be integrated into Excel and Power BI.

But first, a note on DuckDB reading CSVs

In my previous article, I completely forgot to mention a limitation I ran into when reading CSV files with DuckDB: it only supports UTF-8 encoding. If your files use CP-1252 (common in Windows apps if you’re using them in Spanish), you’ll need to convert them externally first. Even though DuckDB’s read_csv() function includes an encoding='latin-1' flag, it didn’t work for me in this scenario.

Making Power Query read gigabytes ludicrously fast

With that out of the way, let’s put the “Power” in Power Query by combining it with DuckDB to load CSVs much faster and gain the ability to read Parquet files. The majority of you will probably want to use it with PowerBI via the DuckDB Connector, but in this case I’ll be using Excel with just the ODBC driver, because I find it funnier.

Installation is straightforward: download the DuckDB ODBC driver and extract the ZIP file as a folder, put it in a “simple” place like the C:\ directory, and run the odbc_install.exe file. Once done, a System DSN for DuckDB will be created, pointing to the in-memory temporal database. Note that this only works in 64-bit mode, so you will need a 64-bit version of Microsoft Office for it to work.

So before we start with the craziness, let’s run a test with the built-in CSV reader in Power Query. I’ll be using one of the files from the dataset I used on my previous article, with 5.689.513 lines:

let
    Origen = Csv.Document(File.Contents("C:/Users/Edu/Documents/eduardo-tests/common/flight-status-prediction-rob-mulla/Combined_Flights_2018.csv"),[Delimiter=",", Columns=61, Encoding=1252, QuoteStyle=QuoteStyle.None]),
    [...]
    #"Filas agrupadas" = Table.Group(#"Tipo cambiado", {"Airline"}, {{"Time Lost", each List.Sum([ArrDelay]), type nullable number}, {"Flights", each Table.RowCount(_), Int64.Type}})
in
    #"Filas agrupadas"

Unlike just loading the file without further steps, which takes around 15 minutes to load the entire thing before eventually throwing an error for exceeding Excel’s row limit; interestingly by inserting a ‘Group By’ step made it load in 3 minutes and 50 seconds.

Native Power Query CSV connection

This suggests that Excel performs some kind of optimisation for the entire query. Still, that’s embarrassingly slow, especially considering that the inefficient pd.read_csv() from Pandas that I complained about in my previous article managed to load the same file in just 57 seconds.

Clearly, Excel isn’t winning any speed contests here, so let’s bring DuckDB into the picture. To do that in Excel, go to Data > Get Data > From Other Sources > From ODBC.

How to get to “From ODBC”

In the window that appears, select the DuckDB DSN and expand the Advanced options section. Under SQL Statement (optional), enter the DuckDB command to load the CSV file(s). In my case:

SELECT Airline, sum(ArrDelay)"Time Lost", count(Airline)"Flights", count(DISTINCT FlightDate)"Days Operated"
FROM read_csv("C:/Users/Edu/Documents/eduardo-tests/common/flight-status-prediction-rob-mulla/Combined_Flights_2018.csv")
GROUP BY Airline
ORDER BY Airline

Then click OK.

“From ODBC” window should end up looking like this

The first time you do this, a window asking for credentials will appear, just ignore it by clicking Connect.

The credentials window

Now, when I load the data, it takes… only 10 seconds! This nearly brought tears to my eyes. The native Power Query CSV reader doesn’t even come close.

Power Query From ODBC/DuckDB connection

And remember, that’s not the only benefit. With DuckDB, you also gain the ability to read Parquet files and even DuckDB databases directly. A neat trick: when working with DuckDB databases, you don’t need to create a new User DSN for each one. In the From ODBC window with the DuckDB DSN selected, under Connection string type the option database=C:\path\to\database.duckdb with the path to your database.

I’ll leave it here so you can experiment further on your own. For now, this will be my last post related to DuckDB… or maybe not. We’ll see.