Normalising and Sorting Data in Arbutus Analyzer

Data normalisation can be a challenge, particularly when you need to normalize your data before performing fuzzy duplicates tests. To most effectively normalize and sort your data, Arbutus Analyzer offers a SortNormalize function which performs both commands in one step to match data to set conventions or standards.

To use the SortNormalize function, follow the following steps.

  1. Prepare your normalization substitution table, and ensure it is saved in the same directory as your Arbutus project so that it can be read.

  2. Right click on the column headers in your chosen table and select Add Column. Click Expression to open the Expression Builder.

  3. Under the Functions menu, select the SortNorm function.

  4. Fill in the chosen field name and substitution table directory.

  5. Name the output table.

  6. Click OK.

In the example below, a U.S. postal text (.txt) file (downloaded from the U.S. Postal Service website) is applied to the Address field in a Vendor table. The addresses, which have been entered in various different ways, are normalized according to the U.S. postal text file.

The normalization substitution table which will be applied to the Address field is shown below.

The SortNormalize expression will normalise the Address field to the Address_Substitution table and will also sort components of the data entry in descending order (for example, “123 Main St. West” and “123 WEST MAIN ST” are both normalised as “WEST ST MAIN 123” which renders them an exact match).

First, prepare and save the file in the same directory as your Arbutus project. Right click on the project view and select Add columns, the click Expression to open the Expression Builder.

Under functions, select Fuzzy Search to find the SortNormalize() function. In the brackets, fill the <field> with the Address field by double clicking the Address field name under Available fields. Next, enter the directory of the saved substitution file on your machine between quotation marks. In the example below the substitution table is saved under L:\Arbutus\Demonstration files\DATA\Fuzzy Duplicates\Fuzzy_duplicates\Address_Substitution.txt.

Check the expression is valid. Validating the expression will also prompt Analyzer to identify if the file cannot be found i.e. if the directory has been entered wrong or if the file is saved in another location.

In this example the new SortNormalize() address column is labelled “address sortnorm” and is shown next to the original Address column. Now, all addresses are normalised, sorted, and ready to be tested for duplicates.

47 views0 comments

Recent Posts

See All