At hotglue, we power thousands of data syncs every day — improving the performance of these syncs translates into huge cost savings and better experiences for our users. Our customers count on us to deliver connectors that are reliable and fast, and that ensure high data quality.
In this post, we’ll walk through how our team optimized our Microsoft SQL target and significantly improved throughput. We’ll compare three approaches — pymssql, pyodbc, and bcp — and share the results of our testing.
Let’s dive in!
Connectivity issues with pymssql
Our original Microsoft SQL Server target was built using SQLAlchemy with pymssql.
An AI startup was trialing our product, and their use case centered around reading large amounts of data from CRMs like HubSpot and Salesforce and writing it to their Azure SQL server.
The first issue we encountered was with connecting to Azure SQL Server. It turns out this was a common issue with pymssql, as Azure SQL Server requires encryption and the pymssql wheel does not have SSL bindings by default (there’s a nice StackOverflow thread about this).
Switching to pyodbc
Rather than fiddle with installing pymssql from source, we switched to using pyodbc with the official Microsoft SQL Server driver. This was easy enough as we were already using SQLAlchemy – just changing our connection URL and installing the ODBC driver was enough:
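The exact snippet isn't reproduced here, but the change amounts to building an `mssql+pyodbc` connection URL that names the Microsoft driver. A minimal sketch (server, database, and credentials below are placeholders):

```python
from urllib.parse import quote_plus

# Placeholder Azure SQL credentials -- substitute your own.
odbc_params = quote_plus(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net,1433;"
    "DATABASE=mydb;"
    "UID=sqladmin;"
    "PWD=secret;"
    "Encrypt=yes;"  # Azure SQL requires encrypted connections
)
connection_url = f"mssql+pyodbc:///?odbc_connect={odbc_params}"
# engine = sqlalchemy.create_engine(connection_url)
```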
This solved the initial issue and we were able to connect to Azure SQL Server! However, we immediately ran into performance issues – writing even relatively small numbers of records (we tested with batches of 10,000) was slow. Worse, we hit intermittent timeout errors from the driver while trying to write data.
Trying to boost pyodbc performance
Doing some research we found that other folks had hit similar performance constraints with pyodbc – the recommendation was to try using the fast_executemany=True flag. Again, this was straightforward thanks to SQLAlchemy:
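With SQLAlchemy, the flag is just a keyword argument to create_engine. A minimal sketch (the helper name is ours; the connection URL is the pyodbc one from the previous step):

```python
from sqlalchemy import create_engine

def make_engine(connection_url: str):
    # The mssql+pyodbc dialect forwards fast_executemany to pyodbc's
    # cursors, which batch parameterized INSERTs client-side instead of
    # issuing one round trip per row.
    return create_engine(connection_url, fast_executemany=True)
```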
However, this broke our core logic as our target initially creates local temporary tables (denoted in MS SQL using a #). The problem is that fast_executemany breaks with local temp tables – that rabbit hole took us to a GitHub issue thread on the pyodbc repo discussing possible solutions.
According to that thread, we could have avoided this error by adding the UseFMTONLY flag or by switching to global temp tables. However, we still weren’t convinced that the fast_executemany flag would solve our core performance issue, so we wrote a simple workaround to create “fake” temp tables prefixed with TEMP_.
So we dropped the original temp table creation logic:
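The dropped code isn't reproduced here, but it followed the standard local-temp-table pattern: a SELECT ... INTO against a #-prefixed name. A simplified sketch (table names are illustrative):

```python
def create_local_temp_table_sql(table_name: str) -> str:
    # Local temp tables in MS SQL are prefixed with '#'. Selecting with a
    # WHERE clause that matches nothing yields an empty table with the
    # same columns as the main table.
    return f"SELECT * INTO #{table_name} FROM {table_name} WHERE 1 = 0"
```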
and replaced it with creating this “fake” temp table.
In doing this, we switched the approach for creating the table and built the column_definitions manually. We did this by querying the sys.columns and sys.types tables in MS SQL:
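Sketched out with hypothetical helper names, the schema query and the resulting CREATE TABLE look roughly like this (the real target handles more types and sizing rules):

```python
# Reflect the main table's columns and types; pass the table name as the
# query parameter (e.g. cursor.execute(COLUMN_TYPES_QUERY, table_name)).
COLUMN_TYPES_QUERY = """
    SELECT c.name AS column_name, t.name AS type_name, c.max_length
    FROM sys.columns c
    JOIN sys.types t ON c.user_type_id = t.user_type_id
    WHERE c.object_id = OBJECT_ID(?)
"""

def build_create_temp_table_sql(table_name: str, columns) -> str:
    # columns: iterable of (column_name, type_name, max_length) rows,
    # e.g. cursor.fetchall() for COLUMN_TYPES_QUERY.
    defs = []
    for name, type_name, max_length in columns:
        if type_name in ("varchar", "nvarchar"):
            size = "MAX" if max_length == -1 else max_length
            defs.append(f"[{name}] {type_name}({size})")
        else:
            defs.append(f"[{name}] {type_name}")
    column_definitions = ", ".join(defs)
    return f"CREATE TABLE TEMP_{table_name} ({column_definitions})"
```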
Finally, we tested our change with a batch of 10,000 records.
Yikes… 4 minutes 40 sec just to insert 10,000 records was not even close to enough of a performance gain. Worse still, the timeout issues persisted.
Using bcp for the inserts
After that realization, we went back to the drawing board and researched the most performant ways to write to Microsoft SQL Server. This is when we found Microsoft’s bulk copy program utility (aka bcp) – if you’re like me, that name made you chuckle :)
The concept of bcp is pretty simple: it’s a CLI that enables you to insert CSV files into Microsoft SQL Server super quickly. Plus, it’s designed specifically for loading large amounts of data.
We were a little dubious that bcp could be that much faster, so we did a local test. It did not disappoint – we were able to write 10k rows in just 8 seconds. We had found our solution!
Updating the target was a little trickier.
We decided to continue using pyodbc for everything except the inserts. We would use it to create the tables, add missing columns, ensure schemas, and create the “fake” temp tables. Note that we kept our “fake” temp table solution intact because bcp (similar to fast_executemany) behaved strangely with local temporary tables.
To generate the CSV files for bcp to insert, we decided to use pandas – the code was pretty straightforward. We converted our array of insert_records (a list of dictionaries) into a dataframe. Notice that we had to remove newlines and tabs from the input data to avoid parsing errors with bcp. Finally, we serialized the dataframe to a tab-delimited CSV with no header.
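A sketch of that step (the function name is ours; the regex cleanup mirrors the tab/newline stripping described above):

```python
import pandas as pd

def records_to_bcp_csv(insert_records, path_or_buf):
    # insert_records is a list of dicts, one per row.
    df = pd.DataFrame(insert_records)
    # Strip tabs and newlines so they can't be mistaken for field or row
    # terminators when bcp parses the file.
    df = df.replace(to_replace=[r"\t", r"\n", r"\r"], value=" ", regex=True)
    # Tab-delimited, no header, no index column.
    df.to_csv(path_or_buf, sep="\t", header=False, index=False)
```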
Now, for the fun part! We dropped the old INSERT code and replaced it with a call to bcp, passing the fully qualified table name, database credentials, and the path to the CSV file.
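A sketch of that call, using a hypothetical helper to assemble the bcp arguments (server and credentials are placeholders); the insert itself is then a single subprocess.run(command, check=True):

```python
def build_bcp_command(full_table_name, csv_path, host, user, password):
    # full_table_name is fully qualified, e.g. "mydb.dbo.TEMP_orders".
    return [
        "bcp", full_table_name, "in", csv_path,
        "-S", host,   # server, e.g. myserver.database.windows.net
        "-U", user,
        "-P", password,
        "-c",         # character mode: fields are plain text
        "-t", "\t",   # field terminator matches our tab-delimited CSV
    ]

# subprocess.run(build_bcp_command(...), check=True) raises if bcp fails.
```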
The rest of the code stayed the same (and used pyodbc) – once the insert completes, we simply merge this temp table into the main table + drop the temp table.
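The merge-and-drop step can be sketched as generated T-SQL (a hypothetical helper, assuming a single key column; the real target handles composite keys and quoting):

```python
def merge_and_drop_sql(table_name: str, key: str, columns: list) -> str:
    # columns should include the key; the key is excluded from the UPDATE.
    col_list = ", ".join(columns)
    update_set = ", ".join(
        f"target.{c} = source.{c}" for c in columns if c != key
    )
    source_cols = ", ".join(f"source.{c}" for c in columns)
    return (
        f"MERGE {table_name} AS target "
        f"USING TEMP_{table_name} AS source "
        f"ON target.{key} = source.{key} "
        f"WHEN MATCHED THEN UPDATE SET {update_set} "
        f"WHEN NOT MATCHED THEN INSERT ({col_list}) VALUES ({source_cols}); "
        f"DROP TABLE TEMP_{table_name};"
    )
```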
With all our changes done, we ran a test in production and saw a huge impact: runtime was down to 1.6 sec to upload the data to the temp tables!
Conclusion
There were a few small cleanup tasks left (e.g. adding support for databases on custom ports), but these changes transformed our Microsoft SQL target into a far more performant and reliable connector. Our customer was able to sync millions of CRM records into their Azure SQL Server instance with ease.
If you’d like to explore the code in more detail, check out the GitHub repo: https://github.com/hotgluexyz/target-mssql/
Thanks for reading, and stay tuned for more deep dives into how we are building high-performance connectors!