Processing Unformatted Data

I recently began a project to import data from unstructured documents. I have done this before, and it can be challenging.
Unlike a technology such as IBM Watson, that can read prose and other unstructured data, I am speaking of data that is only structured by its formatting. Thus, it is not simply unstructured but rather not meant to be structured as a JSON result would be. Think HTML, PDFs, or scanned documents.

There is nothing really new here, but it is interesting to see the different ways that data can be represented and how it can be matched.

I will discuss several ways that the data can be processed. The first step is identifying where the data is located on the document. Second is creating a map or a heuristic that can be reused for each document to identify the data. Third is extracting the data using the map and processing it into a known format that can be used programmatically. Forth is validating the data against expected values and reporting on any problems with the data. Fifth is actually placing the data in its destination.

Identifying Data

The first thing to do is to look at the document structure and identify source data. You also need to identify the varying ways that data can be represented in the source document and document that also. For example, lets say an order comes in as either pre-paid, cash, or balance due. There may be a section of the source document that says Prepaid, Unpaid, FOB: Sender, COD, etc. You would look at many samples of the data and decide where this data can be found.

You will also need to determine how you will extract the data from the source document. You may want to just read it, based on some identifier, and then put it where it goes in your system. You may need to pre-process it, such as using OCR to read characters. If you have access to cognitive services, you may want to train them to read the data using input and sample output. You may have to use identifiers, such as labels, or even position in the document or on the page to identify the data. It will all depend on what your source document looks like and how much formatting is already there.

Sometimes it is useful to use an intermediate format. You can take unformatted plain text or PDF data and convert it to XML or JSON. This provides for separation of concerns and decouples data validation and import from data recognition.

Mapping Data

Next, you will map data by looking at the identified data and then figuring out how you want to put it in your system or database. For example, for a payment record, you may want to record an amount due, or you may want to record a payment of a specific type (Visa, Cash). You may want to record payment due based on COD and shipping amount due based on FOB (free on board sender, sender doesn???t pay shipping).

Once you have this map, you will need to codify it. Put it in a mapping table or write some code. It???s often a useful exercise to map out the data in Excel or some other format that is easy to visualize.

There are tools that let you map data from a source into a database like SQL Server. These tools include SQL Server Integration Service and Azure Logic Apps. There are other companies that have other tools such as Informatica and SAP, but these Microsoft tools are easy to start with and fairly common. When you use these tools, you simply have to extract the data and them map the data to your final destination. They are a real time saver.

Extracting the Data

Once you have a way to extract the data, you then implement the code to extract the data. For example, if the source data is HTML, you can use an HTML library to read the data. Similarly with PDFs, you can use a library like Apache PDF Box. iText is a commercial library that is good for extracting data from PDF files.

One way to map the data is to read the map, for example an XPath to an HTML or XML node, and then read the data at that location.

Validating Data and Reporting Problems

Now that the data has been extracted, it needs to be validated. The data should be validated for type and for range. For example, a birthday should be validated to be a date in the past. Decimal values for money should be validated and/or rounded as necessary.

Also, you need to validate whole entity data. For example, what if you receive an order record with no items? It may not make sense and the entire input may be rejected.

Placing Data in the Destination

If you???ve done everything above, you should be ready to insert your data. The data should be clean and sanitized and ready for insertion.

What Else?

Going forward, you would just implement the data import. The steps above aren???t that much different from users entering data into a web app. The main difference is that this process is usually fully automated and unattended. If there are validation errors, they need to go into a log and be sent to someone as a message.

There are other data types to consider also. What if you want to import image data? You would need to figure out how to separate the images in the input and then put them in storage. This could also extend to text to speech or even video.

Also, since machine learning and cognitive services are not discussed, that gives you yet another avenue to research in processing unformatted data.

I may come back to this topic in the future and give examples of each. If you need to import some unformatted data, try this process and let me know how it goes.