If you’re a B2B developer building a product, one of the earliest product development phases is creating a data integration pipeline to import customer data.
In this article, I’ll show you how to leverage Singer’s tap-salesforce to extract data from Salesforce. From there I’ll walk you through how to parse the JSON output data from Singer using target-csv and standardize it using a simple Python script.
Notes
The code for these examples is available publicly on GitHub here, along with descriptions that mirror the information I’ll walk you through.
These samples rely on a few open source Python packages:
- **tap-salesforce:**a Singer tap to extract data from Salesforce. More info on GitHub.
- **target-csv:**a Singer target which converts input JSON data to CSV files. More info on GitHub. We’ll use the hotglue fork which uses updated dependencies.
- singer-discover: an open source utility to select streams from a Singer catalog. More info on GitHub.
- **pandas:**a widely used open source data analysis and manipulation tool. More info on their site and PyPi.
- **gluestick:**a small open source Python package containing util functions for ETL maintained by the hotglue team. More info on PyPi and GitHub.
Without further ado, let’s dive in!
Step 1: Setup our environment
Create the virtualenv
Singer taps tend to have a lot of dependency conflicts with each other — to avoid dependency hell, I highly recommend running through this example in a virtual environment.
These commands may vary depending on your OS and Python version. For more info on venvs with Jupyter, check out this TowardsDataScience article.
Step 2: Configure the Singer tap
Get the OAuth credentials
First off, you’re going to need Salesforce OAuth credentials. This process is already well-documented by Salesforce, so I’ll assume you can follow that guide.
Create the Singer tap config
Now we have to create a Singer config. This will specify our OAuth credentials and some Singer specific settings. Their example config is of the following format:
Fill in your credentials, and save this to a file called config.json in the local directory.
Run Singer discover
The first step of getting data from Salesforce is to figure out what data is actually available. Singer taps offer a discover command which prints a JSON object describing all of this. Let’s run it now:
If this worked successfully, your catalog.json should resemble this:
Tell Singer what we want
From here, we want to select what objects we actually want to sync. We’ll use the singer-discover utility we downloaded earlier for this.
This will launch an interactive utility to select what streams(objects) you want from Salesforce. I am going to select Lead (space) and press enter. This will prompt you the option to select specific fields. I’ll accept the default and press enter.
This should give you the following output
Run Singer sync
We can now finally get the data from Salesforce using the files we’ve generated, using the following command:
This will output two files:
- the CSV containing the data from Salesforce (something like
Lead-20210128T125258.csv) - a JSON file
state.jsontellingtap-salesforcewhat it last synced. This can be fed back to the tap-salesforce in the future to avoid syncing the same data again.
Finally! We’ve pulled our data from Salesforce! Not too bad, right? If you wanted to use this in production, you’d have to automate the process of creating the properties.json and likely stick all of this into a Docker container (very similar to how hotglue and Airbyte work).
Step 3: Standardize the data
You can follow along with this part directly in the Jupyter Notebook (feel free to clone and try your own transformations).
You can check out recipes here.
Look at the data
Let’s take a peek at what tap-salesforce gave us.
Not too bad, right? Let’s load the data into a Jupyter Notebook and clean the data up a bit. For this article, I’ll keep it very simple but if you’d like to learn about other ETL operations check out my TowardsDataScience article.
Launch Jupyter
Let’s launch Jupyter
This should start Jupyter in the current directory and open the browser.
JupyterLab started
If all the setup commands worked, you should see tap-salesforce available under the Notebook sections. Let’s create a new Notebook with the tap-salesforce kernel. I am going to name mine salesforce.ipynb
Load the data
Let’s use the gluestick and pandas libraries to load the data and take a look. Our goal here is to be able to easily manipulate the output from tap-salesforce.
Clean up the data
Now that we have the data in a Panda’s dataframe, you can transform it however you like. Of course, you’re not limited to use Pandas — you could use Spark, or any other Python based data transformation tool you like.
Conclusion
Next steps
This is really just a starting point for a data integration pipeline. If you’re looking to take this further (orchestrating this on the cloud, connecting it to your product) it’s worth taking a look at tools like hotglue and Meltano, both of which aim to make data integration easier for developers.
Considerations
I recently published an article on TowardsDataScience about the pros and cons of building off Singer. I would recommend checking out Airbyte before resolving to build your pipeline off Singer.
Feel free to check out the open source hotglue recipes for more samples in the future. Thanks for reading! I’d love to answer any comments or questions below.