Exploit unstructured content in your applications
Getting value from data created for human – not machine – consumption
Unstructured text holds immense business value, but that value is hard to extract without the right tools. We’re talking electronic and printed documents, e-mail, Web pages and social postings, chat and message text, health records, and corporate reports. This content was created for human rather than machine consumption. When in electronic form, it is managed by operational systems whose job is to move bits rather than to decode, read, and make sense of it, whether it is a corporate financial report, an Amazon review, or a call-center transcript.
Software is eating the world
Marc Andreessen said that “software is eating the world.” In the content world, powerful natural language processing (NLP) tools and text analytics solutions pull entity mentions, facts, topics, events, and opinion (sentiment) from source text. Established NLP relies on language rules and statistical analysis. Adding machine learning to the mix boosts information discovery, especially in social media: it uncovers new patterns and features and facilitates work with text across the broad range of human languages.
The first challenge
This bit of software magic has no value if the source content is in a form your application can’t access, for example, if it’s in an e-mail archive, Word or PDF document, HTML page, or an image file representing a scanned paper document. So the first challenge is to identify the source format and extract both descriptive metadata and the text that’s the body of the document.
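Format identification usually starts with the file’s leading bytes (its magic numbers) rather than its name. A minimal sketch, assuming a byte string as input; the format labels and the handful of signatures checked here are illustrative, not a complete detector:

```python
def identify_format(data: bytes) -> str:
    """Guess a document's format from its leading bytes (magic numbers)."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"PK\x03\x04"):
        # ZIP container: covers Office Open XML files such as .docx
        return "zip-container"
    stripped = data.lstrip()
    if (stripped[:14].lower().startswith(b"<!doctype html")
            or stripped.lower().startswith(b"<html")):
        return "html"
    if data.startswith(b"\x89PNG"):
        # Image, e.g. a scanned paper document: route to OCR
        return "image"
    return "plain-text"
```

A production system would hand each identified format to a dedicated parser (PDF text extractor, Office XML reader, HTML stripper, OCR engine) to recover the metadata and body text.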
To get there, software developers need to implement a multi-step process:
- Data acquisition: from online, social, or corporate systems.
- Ingestion: file type identification and metadata and content extraction.
- Information extraction: entities, topics, facts, relationships, and sentiment – as well as functions such as disambiguation and entity resolution.
- Text analytics: descriptive statistics and predictive modeling steps including classification and mining for linkages and associations – integrated (when appropriate) with analysis of transactional, behavioral, and sensor data.
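The four steps above can be sketched end to end as a chain of functions. This is a toy illustration under stated assumptions: the function names are invented, and the “entity” rule (capitalized tokens) stands in for a real NLP engine:

```python
import re

def acquire(source: str) -> str:
    # Data acquisition: in practice, pull from online, social, or corporate
    # systems; here the "source" is already raw text.
    return source

def ingest(raw: str) -> dict:
    # Ingestion: identify the file type and separate metadata from body text.
    return {"format": "plain-text", "body": raw}

def extract_information(doc: dict) -> dict:
    # Information extraction (toy rule): capitalized tokens as entity mentions.
    doc["entities"] = re.findall(r"\b[A-Z][a-z]+\b", doc["body"])
    return doc

def analyze(doc: dict) -> dict:
    # Text analytics: descriptive statistics over the extracted mentions.
    doc["entity_count"] = len(doc["entities"])
    return doc

def pipeline(source: str) -> dict:
    return analyze(extract_information(ingest(acquire(source))))
```

The value of structuring the work this way is that each stage can be swapped out independently, for example replacing the toy entity rule with a trained model, without touching acquisition or analytics.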
The four essential elements of software
Really, there are few limits, given the power of modern computing hardware, ranging from devices to cloud platforms, and the flexibility of software toolkits and frameworks. When it comes to software, four elements are essential: feature set, accuracy, usability, and performance.
For software developers, usability and performance mean a well-documented application programming interface (API), provided by a Web service or code library, that delivers results quickly and reliably, preferably via a software development kit (SDK) for the developer’s coding environment.
The importance of accuracy
When dealing with text, accuracy has two components: recall and precision. Recall means that all salient information in the text is identified; precision means that what is identified is identified correctly, with few false positives.
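The two measures can be computed by comparing a system’s extracted mentions against a hand-labeled gold set. A minimal sketch; the mention sets in the test are invented examples:

```python
def precision_recall(predicted: set, gold: set) -> tuple:
    """Precision: fraction of predicted mentions that are correct.
    Recall: fraction of gold-standard mentions that were found."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall
```

A system that extracts almost nothing can score perfect precision with terrible recall, and one that extracts everything scores the reverse, which is why both numbers matter.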
Meeting the expectations of consumers
Although all four software elements are equally important, the feature set often stands out. Users, whether consumers or business users, expect an application to pull in all relevant data regardless of its source, form, or type. They expect context-aware answers that are appropriate to the situation at hand.
These expectations include the ability to integrate text-sourced data, extracted from social, online, and enterprise sources, with profiles, transactional records, behavioral data (for instance, from clickstreams and geotracking), and demographic and reference data. Text metadata extraction is key: it captures the author, creation and modification dates, title, topic, and keywords, for example. Metadata extraction provides the data provenance, descriptions, and context that applications require, both operationally and for descriptive and predictive analytics.
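As one concrete illustration, metadata such as title, author, and keywords can be pulled from an HTML page with Python’s standard-library parser. The sample page and its metadata values are invented for the example:

```python
from html.parser import HTMLParser

class MetadataExtractor(HTMLParser):
    """Collect the <title> text and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            if "name" in attr and "content" in attr:
                self.metadata[attr["name"]] = attr["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.metadata["title"] = data.strip()

page = """<html><head><title>Q3 Report</title>
<meta name="author" content="Jane Doe">
<meta name="keywords" content="earnings, outlook"></head>
<body>...</body></html>"""

extractor = MetadataExtractor()
extractor.feed(page)
```

The resulting dictionary of title, author, and keywords is exactly the kind of provenance and context record that downstream analytics can join with transactional and behavioral data.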
The result is unparalleled ability to exploit unstructured content in cutting-edge business and consumer applications.