Managing Duplicate Files

This documentation is intended for users who need to understand how duplicate files are detected and handled in their Starter-Kit application. With the default Starter-Kit settings, only the front-end upload offers duplicate detection at upload time, through the massimportitem object.

To enable asset duplicate detection, add a "duplicates" field of type text to the list of fields. This field is used to store the duplicate files that are found. The massimportitem object already has this field, but the DAM objects (objects with the label "#damobject") do not.

Duplicate search metadata:

The following properties are used for duplicate detection. They store specific hashes of the uploaded file and are filled automatically upon file upload (a sketch of how such hashes are typically computed follows the list):

  • phavg

  • phdiff

  • sha
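
The internal hashing implementation is not documented here. As a hedged illustration only: phavg and phdiff are presumably perceptual "average" and "difference" hashes, and sha a checksum of the binary. A minimal Python sketch using the third-party Pillow and imagehash libraries (assumptions, not Starter-Kit APIs; the exact SHA variant is also assumed):

  import hashlib
  import imagehash                       # third-party perceptual hashing library
  from PIL import Image                  # Pillow

  def compute_duplicate_metadata(path):
      """Illustrative only: values comparable to sha, phavg and phdiff."""
      with open(path, "rb") as f:
          sha = hashlib.sha256(f.read()).hexdigest()   # exact binary fingerprint
      img = Image.open(path)
      phavg = str(imagehash.average_hash(img))         # perceptual "average" hash
      phdiff = str(imagehash.dhash(img))               # perceptual "difference" hash
      return {"sha": sha, "phavg": phavg, "phdiff": phdiff}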

Duplicate search process:

  • Search in massimportitem and all objects labeled "#damobject"

  • Search with sha equality (same binary)

  • Search by proximity with phavg and phdiff (tolerance and other options can be set by activating and configuring the WXM_SimilarHash plugin; see the sketch after this list)
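
As a hedged illustration of the proximity step (the real comparison and its tolerance live in the WXM_SimilarHash plugin; the threshold value and function names below are assumptions), a Hamming-distance check over hex-encoded hashes could look like this:

  def hamming_distance(hash_a: str, hash_b: str) -> int:
      """Number of differing bits between two hex-encoded hashes."""
      return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")

  def find_potential_duplicates(candidate, assets, tolerance=8):
      """Illustrative only: flag assets whose hashes are close to the candidate's."""
      duplicates = []
      for asset in assets:
          if asset["sha"] == candidate["sha"]:              # same binary
              duplicates.append(asset)
          elif (hamming_distance(asset["phavg"], candidate["phavg"]) <= tolerance or
                hamming_distance(asset["phdiff"], candidate["phdiff"]) <= tolerance):
              duplicates.append(asset)                      # visually similar
      return duplicates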


The duplicates field is then used by Portal to display, during indexing, the massimportitem objects for which potential duplicates were detected (the field contains a JSON document).
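
The schema of that JSON is not specified here; the snippet below is a purely hypothetical illustration of what the duplicates field might contain (ids, keys and values are invented):

  # Hypothetical shape of the JSON stored in the "duplicates" field;
  # every key and value below is invented for illustration only.
  duplicates_field = [
      {"id": "asset-1234", "match": "sha"},                   # identical binary
      {"id": "asset-5678", "match": "phavg", "distance": 3},  # visually similar
  ]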


To replicate this process in other user interfaces, such as the back-office, here are the steps to follow.

  • Create a customized trigger to display possible duplicates (i.e., customize how duplicates are displayed)

  • Add new buttons to resolve the duplicate (the user can declare that it is not a duplicate, or that it is a duplicate but the new element is of better quality)

  • A trigger to ensure that the duplicates are bi-directional (see the sketch after this list):

    • A exists

    • B is uploaded as a duplicate of A

    • B knows that A is a duplicate

    • A does not know about B → this requires either complex queries, or triggers that run when an element is detected as a duplicate, with all the locking issues this generates.

  • A trigger to remove references to a deleted asset that is referenced as a duplicate by other assets (with all the locking issues this generates)
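
As a hedged sketch of the two triggers above (the repo object and its lock/load/save methods are hypothetical stand-ins for the platform's persistence API, and the duplicates field is treated as a simple list of ids):

  def on_duplicate_detected(new_asset, existing_asset, repo):
      """Hypothetical trigger: keep the duplicates relation bi-directional."""
      # B is uploaded as a duplicate of A: B records A...
      new_asset.setdefault("duplicates", []).append(existing_asset["id"])
      repo.save(new_asset)
      # ...but A must also learn about B, hence the locking issues mentioned above.
      with repo.lock(existing_asset["id"]):
          existing_asset.setdefault("duplicates", []).append(new_asset["id"])
          repo.save(existing_asset)

  def on_asset_deleted(deleted_asset, repo):
      """Hypothetical trigger: remove dangling references to a deleted asset."""
      for other_id in deleted_asset.get("duplicates", []):
          with repo.lock(other_id):                # same locking caveat as above
              other = repo.load(other_id)
              if deleted_asset["id"] in other.get("duplicates", []):
                  other["duplicates"].remove(deleted_asset["id"])
                  repo.save(other)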

All of this is easier to manage in Portal because the massimportitem has a limited lifespan. The assumption is that what is already in the DAM is correct (no duplicates) and that, when uploading via Portal, duplicates are displayed and resolved, so that what goes into the DAM is not a duplicate.

It is still possible to obtain information about duplicates via the back-office workflow, but the points raised above must be addressed in the project. The first step is to copy the duplicates field from massimportitem to the collection, then decide what to implement (see the triggers and customized processes described above).

For your information, please review the documentation on DAM_Utils Metadata and on AI extraction with DAM_Utils.

We recommend:

  • Analyzing the default DAM_Utils configuration

  • Referring to the documentation to understand the following:

    • duplicatesFinder

    • resourceProperty

    • Callbacks

    • workflowTrigger
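
To make these entries concrete, here is a purely hypothetical sketch of how a duplicatesFinder configuration could be organized. The structure, keys and values are assumptions derived from the names above, not the actual DAM_Utils format; refer to the DAM_Utils documentation for the real syntax:

  # Hypothetical illustration only; the real DAM_Utils format may differ entirely.
  dam_utils_config = {
      "duplicatesFinder": {
          "resourceProperty": "duplicates",    # field receiving the matches (assumed)
          "workflowTrigger": "onUpload",       # when the search runs (assumed)
          "callbacks": ["notifyUploader"],     # hooks invoked on detection (assumed)
      },
  }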