Data deduplication with AI: Entity Resolution – Automated data cleansing for greater efficiency

Entity Resolution is the process of identifying records that refer to the same real-world entity, so that multiple entries can be avoided or merged. We rely on AI embedding models that convert the data records into vectors.


Entity Resolution: The problem

Multiple entries occur when the same person or object is listed several times in a data set. A common example is a list of employees in which the same person appears under different variants.

Let’s take the head of department “Ms. Erika Mustermann” as an example. She may be listed under different entries such as “Erika”, “Ms. Mustermann”, “e.mustermann@firma.de” or “Head of Sample Department”. Although all these entries refer to the same person, they appear individually.

To filter out these multiple entries, each entry must be clearly assignable to the entity it refers to. The result is a list in which the duplicates are consolidated into individual, unique data records.

Multiple entries lead to higher operating costs due to redundant data processing, impair data analysis and cause unnecessary work time for cleansing. They can also lead to confusion, for example when different departments access inconsistent data or when wrong decisions are made based on incomplete information.

Example from everyday life

Consider a customer who needs to assign products from different e-commerce websites to the correct categories, either in the course of a purchase or for integration into a central product database. For example, a product could be listed in the category “Devices > Pneumatics > Control devices” on one website and under “Pneumatic control devices” on another. These different categorizations must be correctly merged to ensure that the products are displayed consistently and correctly everywhere.

How is this implemented?

There are several approaches, but our focus is on mapping by vector similarity. In this process, the different data records are converted into vectors using an embedding model. The vectors are then compared pairwise using cosine similarity, i.e. the cosine of the angle between two vectors (the cosine distance is 1 minus this value). In this way, the ‘sense’ of the data records is compared and expressed as a number: the closer the similarity is to 1, the more likely it is that two records describe the same entity.
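
The comparison described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation used in production: `toy_embed` is a hypothetical stand-in that hashes character trigrams and therefore only captures similar spelling, whereas a real embedding model would capture meaning. The names `toy_embed`, `group_duplicates`, `DIM` and the threshold of 0.6 are likewise illustrative choices.

```python
import math
import zlib
from itertools import combinations
from typing import Callable, Sequence

DIM = 256  # size of the toy vectors; a real embedding model defines its own


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an embedding model (character trigram hashing)."""
    normalized = " ".join(text.lower().replace(">", " ").split())
    padded = f"  {normalized} "
    vec = [0.0] * DIM
    for i in range(len(padded) - 2):
        trigram = padded[i:i + 3].encode("utf-8")
        vec[zlib.crc32(trigram) % DIM] += 1.0  # crc32 is stable across runs
    return vec


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_duplicates(
    records: Sequence[str],
    threshold: float = 0.6,  # assumed value; must be tuned on real data
    embed: Callable[[str], Sequence[float]] = toy_embed,
) -> list[list[str]]:
    """Group records whose pairwise cosine similarity reaches the threshold."""
    vectors = [embed(r) for r in records]
    parent = list(range(len(records)))  # union-find forest over record indices

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if cosine_similarity(vectors[i], vectors[j]) >= threshold:
            parent[find(j)] = find(i)  # merge the two groups

    groups: dict[int, list[str]] = {}
    for i, record in enumerate(records):
        groups.setdefault(find(i), []).append(record)
    return list(groups.values())


if __name__ == "__main__":
    categories = [
        "Devices > Pneumatics > Control devices",
        "Pneumatic control devices",
        "Office chairs",
    ]
    for group in group_duplicates(categories):
        print(group)
```

Because `group_duplicates` takes the embedding function as a parameter, the toy function can be replaced by a call to a real embedding model without changing the grouping logic. Comparing all pairs grows quadratically with the number of records, so larger data sets would typically need an index or a pre-filtering step.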

Entity Resolution – Your benefits

Saving resources

The automated approach primarily saves working time.

Immediate effects

Once developed, the process can be used immediately without tying up additional resources.

Plannable fault tolerance

The data deduplication process provides consistent, predictable fault tolerance.

Automation

Entity Resolution makes it possible to automate processes that were previously not feasible because of high costs.

