Monday, 7 March 2016

Dirty data: hands-on guidance for the IP community

This weblog is pleased to host another guest contribution from Donal O’Connell (Chawton Innovation Services Ltd). Donal, who is also a consultant to Aistemos, has been giving thought to what he calls "dirty data" and its implications for businesses, strategists, advisers and investors in the field of intellectual property. This is what he writes:
Dirty IP Data 
Intellectual property rights are valuable assets for any business, possibly among the most important that it possesses. It is therefore imperative that the associated IP data is also treated with the respect that it deserves.

Data has integrity when it has a complete or whole structure. All characteristics of the data, including business rules, rules for how pieces of data relate to each other, dates, definitions and lineage, must be correct for the data to be complete. This paper explores problems with data integrity within an IP data management system, and applies equally to systems that reside within a corporate environment or in private practice.

IP data that has integrity is identically maintained during any operation on the IP System, such as data entry, data transfer, storage or retrieval. Put in simple business terms, IP data integrity is the assurance that the IP data is consistent, certified and can be reconciled. Dirty data refers to the lack of data integrity to one degree or another. 'Dirty data' is a term used by information technology professionals when referring to inaccurate information or data, and this term 'dirty data' will be utilised throughout this post.

The definition of 'dirty data'

Dirty data can have a variety of meanings:
• Missing data
• Incorrect data, wrongly entered into the tool
• Incorrectly formatted data
• Data entered into the wrong field
• Stale data, that was once correct but is now out of date
• Missing links such as the relationship between the data in two or more fields
• Duplicated data, where the data exists in more than one place
All of the above are valid and these examples qualify as dirty data. To summarise, dirty data can be incorrect, lacking in basic/general formatting, incorrectly spelled or punctuated, entered into the wrong field or duplicated, all of which will make the data generally misleading.
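Several of the categories listed above can be detected mechanically. What follows is only a minimal sketch, assuming records are held as simple dictionaries; the field names (`application_number`, `owner`, `filing_date`, `last_reviewed`) and the one-year staleness threshold are hypothetical and not taken from any particular IP system. It flags missing values, a wrongly formatted date, stale records and duplicates.

```python
from collections import Counter
from datetime import date, datetime

# Hypothetical field names and threshold, chosen only for illustration.
REQUIRED_FIELDS = ("application_number", "owner", "filing_date", "last_reviewed")
STALE_AFTER_DAYS = 365


def parse_iso_date(value):
    """Return a date for a YYYY-MM-DD string, or None if it does not parse."""
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except (TypeError, ValueError):
        return None


def find_issues(records, today):
    """Map each record index to the list of dirty-data issues found in it."""
    counts = Counter(r.get("application_number") for r in records)
    issues = {}
    for index, record in enumerate(records):
        found = []
        # Missing data: a required field is absent or empty.
        for field in REQUIRED_FIELDS:
            if not record.get(field):
                found.append(f"missing:{field}")
        # Incorrectly formatted data: a date that is present but unparseable.
        filing = record.get("filing_date")
        if filing and parse_iso_date(filing) is None:
            found.append("bad_format:filing_date")
        # Stale data: not reviewed within the chosen window.
        reviewed = parse_iso_date(record.get("last_reviewed"))
        if reviewed and (today - reviewed).days > STALE_AFTER_DAYS:
            found.append("stale:last_reviewed")
        # Duplicated data: the same application number appears more than once.
        number = record.get("application_number")
        if number and counts[number] > 1:
            found.append("duplicate:application_number")
        issues[index] = found
    return issues


records = [
    {"application_number": "A-001", "owner": "Example Ltd",
     "filing_date": "2014-05-01", "last_reviewed": "2016-01-15"},
    {"application_number": "A-002", "owner": "",
     "filing_date": "01/05/2014", "last_reviewed": "2013-02-01"},
    {"application_number": "A-001", "owner": "Example Ltd",
     "filing_date": "2014-05-01", "last_reviewed": "2016-01-15"},
]

for index, found in find_issues(records, today=date(2016, 3, 7)).items():
    print(index, found or "clean")
```

Data entered into the wrong field and missing links between fields are harder to catch this way, and will usually need rules specific to the system in question or a manual review.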

The root causes which lead to data becoming dirty 
There are a number of possible root causes of dirty data:
• Migration errors
• Data entry errors
• System design errors
• Synchronisation problems
• Data reporting problems
• Maintenance problems 
Migration is where data is transferred into an IP system from another system, perhaps as a result of a system upgrade or as a result of M&A activity, where data has been transferred and incorporated from an external IP system. If the data is dirty before the migration, it is likely to remain dirty after the migration unless concrete steps have been taken to address the problem.

Data entry mistakes can be made by IP personnel within the organisation, by non-IP personnel within the organisation who are given access to the IP system and by external IP personnel who have been provided with access to the IP system. A certain amount of human error is inevitable, but what is the solution when the mistakes are constantly occurring, the fix would make an auditor cringe and the person or persons making the errors are taking zero responsibility, while blaming it all on the system?

IP system design and implementation errors can lead to dirty data. However, good system design can for example help to greatly reduce data entry errors, by focusing on such issues as catching exceptions, formatting, buffering and the way in which choices and selections are provided to the user.
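To illustrate how design can reduce entry errors, here is a rough sketch of validation at the point of entry. It assumes a hypothetical controlled list of status values and a made-up application number pattern (two letters followed by 6 to 12 digits); real number formats vary by IP office, so treat both as placeholders. The idea is simply that the system normalises what it can and rejects the rest with a clear message, rather than storing whatever is typed.

```python
import re
from datetime import datetime

# Hypothetical controlled vocabulary and number format, for illustration only.
ALLOWED_STATUSES = {"pending", "granted", "lapsed", "abandoned"}
APPLICATION_NUMBER = re.compile(r"[A-Z]{2}\d{6,12}")


class EntryError(ValueError):
    """Raised when a submitted record cannot be accepted as entered."""


def clean_entry(raw):
    """Validate and normalise one submitted record, or raise EntryError."""
    errors = []

    # Normalise harmless variations (case, spaces) before checking the format.
    number = raw.get("application_number", "").strip().upper().replace(" ", "")
    if not APPLICATION_NUMBER.fullmatch(number):
        errors.append("application_number: expected two letters then 6-12 digits")

    # Offer a fixed set of choices rather than free text.
    status = raw.get("status", "").strip().lower()
    if status not in ALLOWED_STATUSES:
        errors.append(f"status: choose one of {sorted(ALLOWED_STATUSES)}")

    # Insist on a single, unambiguous date format.
    filing_date = None
    try:
        filing_date = datetime.strptime(raw.get("filing_date", ""), "%Y-%m-%d").date()
    except ValueError:
        errors.append("filing_date: expected YYYY-MM-DD")

    if errors:
        raise EntryError("; ".join(errors))
    return {"application_number": number, "status": status, "filing_date": filing_date}


accepted = clean_entry(
    {"application_number": "ep 1234567", "status": "Granted ", "filing_date": "2014-05-01"}
)
print(accepted)

try:
    clean_entry({"application_number": "1234", "status": "live", "filing_date": "1 May 2014"})
except EntryError as exc:
    print("rejected:", exc)
```

Collecting every problem before rejecting the record means the person entering the data sees everything that needs fixing at once, instead of one error at a time.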

Synchronisation, in this instance, is the maintenance of one operation in the IP System in step with another operation in another system, to ensure overall data integrity.
Synchronisation challenges with other company systems can lead to problems with the data as it is not uncommon for the corporate IP System to be linked electronically with other corporate systems in the company, used for example by HR or Finance. Combine this with systems possibly belonging to an IP Renewals/Annuities Payment provider and/or belonging to an IP Agent network and your synchronisation challenges can be even greater.
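A periodic reconciliation between linked systems is one way to spot synchronisation problems early. The sketch below assumes each system can export its records as a mapping keyed by a shared identifier; the identifiers, the field names and the "renewals provider" data are all invented for the example.

```python
def reconcile(ip_system, other_system, fields):
    """Compare two {key: record} mappings and report where they disagree."""
    report = {"only_in_ip_system": [], "only_in_other_system": [], "mismatched": {}}
    for key in sorted(ip_system.keys() | other_system.keys()):
        if key not in other_system:
            report["only_in_ip_system"].append(key)
        elif key not in ip_system:
            report["only_in_other_system"].append(key)
        else:
            # Record (ip_system value, other_system value) for each differing field.
            differences = {
                field: (ip_system[key].get(field), other_system[key].get(field))
                for field in fields
                if ip_system[key].get(field) != other_system[key].get(field)
            }
            if differences:
                report["mismatched"][key] = differences
    return report


# Invented sample data: the corporate IP system versus a renewals provider.
ip_system = {
    "A-001": {"owner": "Example Ltd", "next_renewal": "2016-09-30"},
    "A-002": {"owner": "Example Ltd", "next_renewal": "2016-11-15"},
    "A-003": {"owner": "Example Ltd", "next_renewal": "2017-01-31"},
}
renewals_provider = {
    "A-001": {"owner": "Example Ltd", "next_renewal": "2016-09-30"},
    "A-002": {"owner": "Example Ltd", "next_renewal": "2017-11-15"},
    "A-004": {"owner": "Example Ltd", "next_renewal": "2016-12-01"},
}

report = reconcile(ip_system, renewals_provider, fields=("owner", "next_renewal"))
print(report)
```

The report does not say which system is right. It only shows where the two disagree, so that someone can investigate before a key date is missed.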

Creating reports using the data can itself introduce dirty data into the reports, if there are errors in the scripts or problems with the reporting functionality of the system. It can also be due to a lack of understanding of the data structure within the system.

The data within the IP data management system may also not be properly maintained. If data within the system is not updated on a regular basis, as it should be, this can lead to dirty data problems within the system.

I should add another dimension related to maintenance. When reviewing and cleaning data from the jurisdictional IP databases (USPTO, EPO, etc.) certain dirty data issues may be identified, and the information contained in the jurisdictional databases may be inconsistent with what a company's IP function or an IP firm believes. For example, a maintenance payment may have been missed and the patent subsequently may not have been renewed.

More importantly, a patent may have been purchased but not re-assigned, or assigned to a bank as a security interest, without the jurisdictional data being updated.

So, there are several causes of dirty data. 
Where is the 'dirty data'? 
Dirty data can exist in the data fields associated with any of the key IP process areas such as IP creation, IP portfolio management, IP enforcement, IP exploitation and IP risk management. 
Problems can be linked to data fields used in the front end, for example in the patent creation process from inventor and invention report through to Patent Committee or Patent Board decisions. Problems can also exist in the data fields used in the actual patenting process, from drafting to first filing or foreign filing, and on through prosecution to the granted patent.

Dirty data can also occur in the IP portfolio management process, in data fields used during management of the IP assets, and in the IP utilisation phase, in data fields used in licence agreements and contracts.

Dirty data can occur with any form of IP, including patents, trade marks, copyright, designs and trade secrets.

Why is 'dirty data' an issue for IP?

If it exists, then dirty data is a serious issue for any corporate IP Department or any IP Agency as it can lead to liability issues or a loss of rights. For example, the 'rules' for the proper creation of patent families may not run, key dates may be missed or the wrong data may be sent to the IP Office. Correspondence may be sent to the wrong person, or IP reports with incorrect data may be created and used in the decision making process. IP data is ultimately used for IP management purposes and will be utilised for well informed decision making. Dirty data may lead to the wrong decisions being made.

Why is 'dirty data' an issue outside of IP?

IP data is utilised not just by the IP department. It matters to the wider business wherever technologies, products and services are concerned, as it forms an integral part of many legal agreements and contracts. IP data is also increasingly being reported to, and utilised by, Senior Management within the corporation, so 'dirty data' in IP can adversely impact activities and decision making outside the corporate IP department.

Cleaning up the data

Firstly, some understanding is needed of how serious the problem is with 'dirty data'. Questions to ask include how and why it has occurred and where is it happening? If the challenge with dirty data is large, then what is the prioritisation? Only when all the previous questions have been considered should the clean-up exercise be undertaken. Cleaning the data may involve using dedicated IP Service Providers and/or developing some automatic scripts and tools. It will almost definitely involve some manual hard work.

A three stage process is strongly recommended:
• Corrective actions to fix any problems
• Understanding of the root cause
• Preventative actions to stop problems repeating (processes, systems, education, checks)
'Dirty data' cannot be tackled in isolation

Data quality issues cannot be tackled in isolation. Data quality is interlinked with the IP processes or ways of working which are adopted in the company, the IP systems and tools in use, various legal matters and of course the actual people involved. Last but not least it involves management and leadership.

Best practices

A number of best practices exist to help address dirty data issues within an IP System:
• Control the data entry
• Define mandatory and optional data fields properly
• Assign rights and roles both for IP and non IP personnel with access to the system
• Assign personal responsibility
• Keep a change history
• Design 'intelligent' data fields
• Use tools to measure and clean the data on a regular basis
• Make data management a living process
• Measure, measure, measure.
The best approach is to make data quality management an on-going process and an integral part of IP management within the organisation.
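Of the practices listed, keeping a change history is perhaps the easiest to picture in code. The following is a small sketch, assuming an in-memory store; the class names `Change` and `AuditedStore` are hypothetical, and a real IP system would normally rely on its database's own audit facilities. Each change records who altered which field, from what, to what, and when, which also supports assigning personal responsibility.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Change:
    """One immutable entry in the change history."""
    record_id: str
    field_name: str
    old_value: object
    new_value: object
    changed_by: str
    changed_at: datetime


@dataclass
class AuditedStore:
    """Hypothetical record store that logs every field change."""
    records: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

    def update(self, record_id, field_name, new_value, user):
        """Apply one field change and append it to the history."""
        record = self.records.setdefault(record_id, {})
        old_value = record.get(field_name)
        if old_value == new_value:
            return  # nothing changed, so nothing to log
        record[field_name] = new_value
        self.history.append(
            Change(record_id, field_name, old_value, new_value, user,
                   datetime.now(timezone.utc))
        )

    def history_for(self, record_id):
        """Return the changes made to one record, oldest first."""
        return [change for change in self.history if change.record_id == record_id]


store = AuditedStore()
store.update("A-001", "owner", "Example Ltd", user="alice")
store.update("A-001", "owner", "Example Holdings Ltd", user="bob")
store.update("A-001", "owner", "Example Holdings Ltd", user="bob")  # no-op

for change in store.history_for("A-001"):
    print(change.changed_by, change.field_name, change.old_value, "->", change.new_value)
```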


To address dirty data problems properly within an IP system, it is important to adopt a recognised iterative four step problem solving process: 'Plan, Do, Check, Act'.

The first step is to evaluate and analyse the problem thoroughly and decide if, what, where and how dirty data is a problem and what needs to be done to rectify the situation. The second step involves making the necessary improvements, often on a small scale initially. The third step involves checking the situation and comparing actual results versus planned results. The final step is to analyse the differences to determine their causes.

When your dirty data challenge has been addressed, it is most important not to just forget the problem and move onto the next issue. Metrics should be defined, agreed and implemented and regular data reports created so that you know precisely the situation with your data integrity going forward and so that you can react quickly if things go amiss again in the future.
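One simple metric of this kind is field completeness: the share of records in which a given field is populated, compared against an agreed target. The sketch below is an illustration only; the field names and target values are invented, and completeness is just one of several metrics an organisation might choose to agree.

```python
def completeness(records, fields):
    """Return the share of records (0.0 to 1.0) with a non-empty value per field."""
    total = len(records)
    if total == 0:
        # With no records there is nothing to measure; report zero by convention.
        return {name: 0.0 for name in fields}
    return {
        name: sum(1 for r in records if r.get(name) not in (None, "")) / total
        for name in fields
    }


def below_target(scores, targets):
    """List the fields whose score falls short of its agreed target."""
    return sorted(
        name for name, target in targets.items() if scores.get(name, 0.0) < target
    )


# Invented sample data and targets.
records = [
    {"owner": "Example Ltd", "filing_date": "2014-05-01", "inventor": "A. Smith"},
    {"owner": "Example Ltd", "filing_date": "2015-02-11", "inventor": ""},
    {"owner": "Example Ltd", "filing_date": "", "inventor": "B. Jones"},
    {"owner": "Example Ltd", "filing_date": "2013-09-30", "inventor": None},
]
targets = {"owner": 1.0, "filing_date": 1.0, "inventor": 0.5}

scores = completeness(records, fields=targets.keys())
print(scores)
print("needs attention:", below_target(scores, targets))
```

Run on a regular schedule and kept over time, figures like these show whether data quality is holding steady or starting to slip.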

As stated at the beginning of this post, intellectual property rights are valuable assets for any business, possibly among the most important it possesses. It is therefore imperative that the associated IP data is also treated with the respect that it deserves, and that any dirty data challenges are tackled and resolved.
As a footnote, Aistemos reminds readers of its continued support for the ORoPO project, which seeks to promote the importance of an open register of accurate and reliable patent ownership records.  For information concerning ORoPO and earlier Aistemos blogposts on this subject, just click here.


  1. Thanks Donal. I'm convinced that dirty data is a bad thing, but in terms of cost of putting all of this into operation versus the risk of damage from dirty data, I could still do with a little persuasion. My business must have lived with dirty-ish data since it was launched, but doesn't seem to have hit any problems yet.

  2. Great post as always.

    Here's a multi-jurisdictional copyright law/royalty (supply chain) forensic view.

    There is an element, a thread perhaps, that is conspicuous by its absence – that being the vagaries of human nature and the behaviour of the corporate psychopath when adequate checks and balances are not in place to “check” the “checkers” so to speak.

    So in the definition of dirty data one might consider that any one of the following could form ‘dirty data’:-

    · Deliberately altered/amended data

    · Distorted data

    · Deleted (dev-null) data (so long as the databases are programmed to track such)

    · Error report data

    · Filtered data

    · Manipulated (by function programming) data (think of rounding up or down)

    The root causes might also include

    · Inadequate programmer checks and balances

    · Inadequate programme checks and balances

    · Inadequate data checks and balances

    · Hacking

    · Internal abuse of system access

    · So-called broom code (fixit code or manipulate database or data code, all the way to ‘pick up from the last programmer left’ code)

    Cleaning up the data might include inter alia

    · An audit of the database engine (transaction tracking system with rollback)

    · An audit of any algorithms deployed

    · DC (data capture) checks and balances

    · DI (data integrity) checks and balances

    · Data mapping – a schematic, including any IP addresses where data is stored and located, of the data and the underlying technology and programming, scripts, certificates

    · Data security review

    Best practice, in addition to all the points raised which all are relevant, might consider addressing jargon such as

    · Data capture

    · Data integrity

    · Data audit procedures

    · Underlying database programming code (including unit testing) , structure and technology

    · Vetting and checking, overtly and covertly of the programmers and coders, including skill and capacity tracking

    · Data (server) location and storage

    Lastly, much more can be made of “measure, measure, measure” – such is so important – critical. It takes one into the field of measuring what others are measuring.

    All the best with the great work coming out of Aistemos.