Deduplicator

Using the deduplicator

The deduplicator allows you to clean up site data. You can identify sets of sites that are potential duplicates, choose which ones to keep and which to delete, and safely transfer surveys from the duplicate sites to the ones being kept. A site should be mapped only once in the system, but it may occur more than once, for example if historical site data has been imported into the system. See this guide for help on importing site data.

Identifying potential duplicates

First, identify potential duplicate sites. 
To create a query for potential duplicates, first set the five available filters (Entity type, Managed by, Created by, Created on and Admin region) to narrow the set of data you want to check for duplicates. Note that Managed by refers to the organization in the platform, not necessarily the party that maintains the site out in the world. Then select the parameter(s) you want to match by: 

  • Match by distance compares the GPS locations of the sites. 
  • Match by property allows you to compare one or more properties of sites, for example Name, Description and/or Custom ID. Note that you can select more than one property to match on. 
  • Match known duplicates allows you to enter the unique mWater IDs of two or more sites you know are duplicates as a comma-separated list.

The query returns one or more sets of potential duplicates matched by the parameters you selected. You can go through these sets one at a time, marking known duplicates and also sites that are known not to be duplicates of each other.
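To make the idea of matching concrete, here is a minimal Python sketch of how sites could be grouped into sets of potential duplicates by distance or by property. It is an illustration only and does not describe how the deduplicator is actually implemented: the function names, the site fields (id, lat, lon, name) and the distance threshold are all assumptions made for this example.

```python
import math
from collections import defaultdict


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two GPS points."""
    r = 6371000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def match_by_distance(sites, max_m):
    """Group sites so that each one is within max_m of another in its set.

    Hypothetical helper: `sites` is a list of dicts with "id", "lat", "lon".
    """
    parent = {s["id"]: s["id"] for s in sites}

    def find(x):
        # Follow parent links to the representative of x's set.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(sites):
        for b in sites[i + 1:]:
            if haversine_m(a["lat"], a["lon"], b["lat"], b["lon"]) <= max_m:
                parent[find(a["id"])] = find(b["id"])

    groups = defaultdict(list)
    for s in sites:
        groups[find(s["id"])].append(s["id"])
    # Only sets with more than one site are potential duplicates.
    return [sorted(g) for g in groups.values() if len(g) > 1]


def match_by_property(sites, props):
    """Group sites whose chosen properties are all equal (ignoring case and spaces)."""
    groups = defaultdict(list)
    for s in sites:
        key = tuple(str(s.get(p) or "").strip().lower() for p in props)
        if all(key):  # skip sites where any chosen property is empty
            groups[key].append(s["id"])
    return [sorted(g) for g in groups.values() if len(g) > 1]
```

For example, two sites about 11 metres apart would be returned as one set by match_by_distance(sites, 50), and two sites named "Borehole 1" and " borehole 1 " would be returned as one set by match_by_property(sites, ["name"]).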

Selecting Show non duplicate sites displays any sites marked as non duplicates in the dataset you are currently using. Otherwise, sites that have already been marked as non duplicates are hidden so that you do not need to review them again.

Note that sites are marked as non duplicates only within a given dataset, not for everybody. By default the dataset is your own personal one, so others will not see the sites you mark as non duplicates. You can manage datasets and share your dataset with others using the Change button in the top right. Because the deduplicator defaults to the personal dataset, you need to select a shared dataset each time you open the deduplicator. See below for more details. 

Marking sites as duplicates

Once you have run a query and potential duplicates have been found, you will be shown the first set of possible duplicates, one per column. If you identify one or more of the sites as duplicates on the basis of properties such as the name, location or image, select one as the Main site to keep. Then mark each site that duplicates it by clicking the Duplicate button. Mark any sites which you know are not the same as Not duplicate. Clicking these buttons does nothing in the system until you click Merge. You can also leave a site with no button selected.
Comparing two sites before merging

Merging sites

When you click Merge, the duplicate sites are deleted from the system entirely and all the surveys attached to them are attached to the main site instead. If you want to keep any of the properties on sites you mark as duplicate, you must copy them manually to the main site before merging. They are not moved automatically by the Merge function. Sites marked as Not duplicate are not deleted, but the system remembers that the Main site and the Not duplicate sites are not duplicates of each other, so they no longer show up in normal queries as potential duplicates of each other. If no button is selected for a site, nothing happens to that site and it is not marked as a duplicate of other sites in the set. 
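The effect of Merge described above can be summarized in a short Python sketch. This is a simplified model for illustration, not the platform's actual code: the merge function, the in-memory dictionaries and the "duplicate" / "not_duplicate" labels are all hypothetical.

```python
def merge(main_id, choices, sites, surveys, non_duplicates):
    """Apply the choices made for one set of potential duplicates.

    main_id        -- id of the site chosen as the Main site
    choices        -- dict of site id -> "duplicate" or "not_duplicate";
                      sites with no button selected are simply absent
    sites          -- dict of site id -> site properties (hypothetical store)
    surveys        -- list of dicts, each with a "site_id" field
    non_duplicates -- set of frozenset pairs held by the selected dataset
    """
    for site_id, choice in choices.items():
        if choice == "duplicate":
            # Surveys move to the main site...
            for survey in surveys:
                if survey["site_id"] == site_id:
                    survey["site_id"] = main_id
            # ...and the duplicate is deleted. Its properties are NOT copied.
            del sites[site_id]
        elif choice == "not_duplicate":
            # The site is kept; only the "not duplicates" relation is recorded.
            non_duplicates.add(frozenset((main_id, site_id)))
```

In this model, a site that does not appear in choices is left untouched, which mirrors the behaviour of leaving a site with no button selected.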

Note: Even if you only have Main site and Not duplicate selected, remember to click Merge to make the system remember that these sites are not duplicates of each other.
What happens when you click Merge

Datasets

Datasets keep track of which sites have been marked as non duplicates of other sites. They allow multiple people to work on the same deduplication effort without needing to mark sites as non duplicates more than once. 
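One way to picture a dataset is as a list of site pairs that have been marked as not duplicates of each other. The Python sketch below shows how such a list could be used to hide already reviewed sets from query results. It rests on assumptions made for this example: the visible_sets function is hypothetical, and the rule that a set is hidden only when every pair in it has been marked is a simplification, not a description of the deduplicator's actual behaviour.

```python
from itertools import combinations


def visible_sets(candidate_sets, non_duplicates, show_non_duplicates=False):
    """Return the candidate sets that still need review.

    candidate_sets      -- list of lists of site ids returned by a query
    non_duplicates      -- set of frozenset pairs stored in the selected dataset
    show_non_duplicates -- mirrors the "Show non duplicate sites" option
    """
    if show_non_duplicates:
        return list(candidate_sets)
    result = []
    for ids in candidate_sets:
        pairs = [frozenset(p) for p in combinations(ids, 2)]
        # Assumption: hide a set only if every pair in it is already marked.
        if not all(p in non_duplicates for p in pairs):
            result.append(ids)
    return result
```

Because the pairs live in the dataset, two people who select the same shared dataset would see the same sets hidden, while someone using a different dataset would still see them.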

As soon as you mark any sites as non duplicates, your personal dataset is created. You can then change permissions on this dataset by clicking the Change link in the top right of the deduplicator screen. Here you can see which datasets you have created and which have been shared with you. You can also create a dataset for a specific purpose, for example for the deduplication efforts of a given geographical area, team or timespan, and name it accordingly.

By default, your personal dataset is selected each time you open the deduplicator, so remember to select the right dataset before you begin deduplicating to make sure any sets you mark as non duplicates are added to the right dataset. 

Note that sites that are merged are merged for everyone, not just for people who are part of a dataset.
For more information, please email info@mwater.co. General advice on how to get started with mWater can be found here.