Using Tableau’s Data Tools for Data Prep and PDF SourcesBy: Greg Herzing | September 7, 2022
Data visualization is fun, interesting, and beautiful. Data preparation, on the other hand, is none of these things. Data can provide valuable insights to the organization, but only once it has been extracted from wherever it lives, cleaned up, and shaped into a form that can support visualization.
Experienced analysts will tell you the data we need is often found in hard to reach places such as multiple, unconnected databases. It is often messy with duplicate records, missing data, and formatting that make it difficult to parse. Far from Tableau-ready. The Extract, Transform, and Load (ETL) process, the formal name for data preparation, is regularly 60-80% of any analytical project.
One place where good data used to be able to hide in plain sight is the ubiquitous Portable Document Format (PDF). These documents are designed for human, not machine, consumption but they may represent an excellent source of otherwise hard-to-find data. Prior to the advent of the PDF data connector, in version 10.3, using a PDF as a data source meant copying each table from a PDF into a spreadsheet or other database, maybe manually.
Just as the engineers at Tableau have added new data connectors with practically every release, they have also worked hard to build ETL tools into Desktop. In this post, I’m going to show you how to use some of these tools to get to this hard-to-reach data and make it usable. For the purposes of this demonstration, I have created a PDF version of the Coffee Chain Quarterly Sales dataset. We will connect to the PDF and reshape the data to create a working data source. Then we’ll visualize it.
Step 1: Connect to the PDF
1. Select PDF from the “To a File” list in the Connect pane on the Start page. Tableau will open the Scan PDF File dialog box.
2. Choose the page where your data is located. You can select a page or range of pages if your data table runs into multiple pages.
3. Select OK to open the Data Source page. The Data Source page will open with a list of tables present in the selected range.
Tableau recognizes the pages in a PDF and even separate tables within a page. When you connect to your PDF, each table will be listed in the Tables pane in the Data Source page. This allows you to select exactly the data you need to analyze. Note that Tableau uses optical character recognition (OCR) to read the values in the document into its built-in database. This is accurate but not foolproof. Be sure to check that columns have been partitioned properly and their values rendered accurately.
Now that we have connected to our data, we need to choose which table to analyze. We will use Tableau’s View Data ability to make our choice.
Step 2: Examine the Data
1. Hover over a table under the Tables pane. Tableau will highlight the table and show the View Data button.
2. Select the View Data button to display a preview of the data.
Looking over the data, we can see that Page 1 Table 1 is a bit messy because it includes all tables and headers in our report. Page 1 Table 2 is the last table, which contains totals for each quarter. Since we can calculate totals in Tableau, we do not need this table. The remaining four tables contain our data.
Now that we have connected to our data and found the required sources, we can prepare our data for analysis. Since we are working with multiple, similarly formatted tables, we will union them to create a single, complete data source.
Step 3: Prepare the Data
1. Drag the Page 1 Table 3 table to the Canvas on the Data Source page.
2. Drag the Page 1 Table 4 table directly under the Page 1 Table 3 data source on the Canvas. The words Drag Table to Union will appear and will highlight orange when the table is close enough to union.
3. Drop the Page 1 Table 4 table on the orange Drag Table to Union panel. The table image will now change to indicate a union.
4. Repeat this step until Tables 3, 4, 5, and 6 have been added to the union. To verify that all four tables have been added, right-click on the data source in the canvas and select Edit Union. You should see four tables, as shown below.
5. View the data preview shown in the Preview Data Source pane.
Notice how the data is formatted. Each cell value is the intersection of a month in the column header and the region on rows. This structure repeats our four regions four times, once for each source table, which means most of the table values are null. It also prevents us from using Date as a data field, since each month is rendered as a separate field. We will want to pivot the date fields, so that all the months appear in one column and their associated sales values in another.
6. Select the column heading for the first Date column (1-Jul).
7. Hold down the Shift button and select the column heading farthest to the right. Tableau will highlight all the columns in between.
8. Right-click on any of the highlighted columns. Tableau will show a menu.
9. Select Pivot.
You now have four data fields: Sales by Quarter, Table Name, Pivot Field Names, and Pivot Field Values. Pivot Field Names and Pivot Field Values are created when Tableau rearranges and aggregates the selected columns.
10. Complete data preparation with the following cleanup steps:
a. Right-click the header to rename the Pivot Field Names column, “Date”.b. Rename Pivot Field Values to “Sales”.
c. Rename Table Name to “Quarters”.
d. Right-click on Quarters and select Aliases…
e. Change the names of each table to Q1, Q2, Q3, and Q4 respectively, to match the data they represent.
f. Rename Sales by Quarter, “Regions”.
g. Switch to a worksheet.
The data fields are now listed in the Data pane, but Sales is a discrete dimension with a data type of String.
h. Right-click the Sales field and Change Data Type to Number(decimal).
i. Right-click the Sales field again and select Convert to Measure. The Sales field is now a continuous measure.
In just a few minutes, we have taken the raw data from several pages of a PDF and transformed it into a single data source ready for analysis. Our final step is to build a visualization that delivers insights from our newly rationalized data source, and you know have the data source to build that visualization.
Quick Tip: The Data Interpreter
The Data Interpreter works ETL magic on file-based data. This tool can clean up Excel workbooks and PDFs by removing extra headers and spaces, and shaping data to support visualization.
1. Connect to a file-based data source.
2. Check the Use Data Interpreter checkbox. Tableau will clean the data and show a link for “Review the Results.”
3. Select “Review the Results” to open an Excel file generated by Tableau.
The Results report shows what changes Tableau made to the file, while also giving you a processed file which itself could be used as a data source. The image on the left (below) shows the key to interpreting Tableau’s changes to the file on the right (below). Both sheets are included in the Results report.
We have now built a high-level visualization from a human-readable report using Tableau’s built in data prep tools. These tools are great for working with PDF data, difficult spreadsheets, and any data source where we are not able to change the source data. We have seen how easy working with data in Tableau can be and are now better equipped to extract data from hard-to-reach places and turn it into a data source we can build on.
Subscribe to our resources!
Sign up to receive our latest eBooks, webinars, blog posts, newsletter, event invitations, and much more.