3 Ways to Eliminate Bad Data

3 Ways to Eliminate Bad Data

I’m privileged to work for a business unit that was one of the first in our company to have a web presence. The upside is that we have lots of history to build upon and learn from. That also happens to be the downside. I wrote before on how we are reducing our technical debt, though lately I ran into an issue that seemed worthy of mention in its own right: bad data.

We’re in the midst of a multi-phase project transitioning to a data model that more accurately fits our business today and eases potential integration with a CRM solution. Not surprisingly, this means some fundamental changes to our database structures and massive migration scripts. It was while working on one of these scripts that I noticed lots of inconsistent and bad data in our ten years’ worth of records. Fields supposed to hold phone numbers contained text notes, person entities had empty strings for names, etc.

Knowing hindsight is always 20/20, we like to help people learn from our missteps. So here are three ways to eliminate bad data:

  • Utilize database constraints when applicable. The simplest solution to some of our problems would have been to require the right information at the database level. Correctly setting up foreign keys, non-null constraints and text field limits goes a long way toward keeping your data pure.
  • Don’t let bad data in. It sounds simple, but as any developer knows, keeping bad data out of your tables is a challenge. Thankfully, there are many good off-the-shelf validation libraries ready to handle that at every turn. Is a field an email address? Enforce the format according to the spec. Is it a name? Then an empty string is probably not valid. This principle becomes especially important when enforcing business rules. Does every person need to have both a billing and a shipping address? That belongs in centralized logic. Cover your bases, especially on public-facing forms.
  • Know when to let go. If bad data has made its way into your system, it will eventually rear its head, whether by breaking reports or causing obscure errors. Though it can be hard, sometimes all we can do is clean house and move on. It’s probably wise to archive records on the off chance they are needed later, but if bad data is breaking your systems, it is likely already past its expiration date. The workarounds needed to accommodate it are an irritating form of technical debt.

When the time comes, making adjustments will be much easier if you know exactly what data lies in the depths of your tables.

What are some best practices you implement to keep bad data out? How have you successfully done migrations when having to work around it?

    • Matthew Pfluger

      Great article, Ty. u00a0I can certainly appreciate this sentiment.nnI think all three of these points are extremely important, particularly the one to not let bad data into your system. u00a0We’re undergoing a lot of pain and resolving technical debt now as a result of NOT performing extensive validation on several of our legacy applications.nnAlso, with respect to technical debt and legacy app support, if you have applications built usingu00a0technologiesu00a0that do not integrate well with validation libraries or techniques, that may be one of the most convincing arguments to upgrade to a new platform or language. u00a0That could also be a reason to “let go” of the data as well!nnWould you mind sharing one or more of the open source libraries your team uses to perform data validation? u00a0I’d appreciate it.nnThanks!nMatthew

      • tydelong

        Matthew,nnThanks for the feedback. u00a0I haven’t dug into open-source validation pieces other than jQuery Validate – http://docs.jquery.com/Plugins/Validation.nnI was referring more to the basic validation solutions most modern languages have bundled (.NET, PHP, ActiveRecord in Ruby, and even some in ColdFusion). For Java, it looks likeu00a0Apache Commons Validator is a proven open-source solution – http://commons.apache.org/validator/nnWe actually have a great home-grown ColdFusion solution that handles both server-side and client-side validation (via produced JavaScript). u00a0I feel your pain about legacy app support; older parts of our applications are missing dependencies for our validation framework, so we have to roll our own in those areas. u00a0As you suggest, that’s one of the catalysts we have for wanting to move to a better platform :) nnWas there a language in particular where you were looking for an open-source validation library?

    • Richie Rump

      Don’t forget CHECK constraints. This could also help the query optimizer on certain db platforms to filter predicates without even executing the query.

    © 2014 Lampo Licensing, LLC. All rights reserved.

    www.daveramsey.com