DWA: The Unstructured Data Repository

February 26, 2013

In the DW Architecture block diagram I presented the Unstructured Data Repository (UDR) as being part of the Data Warehouse Architecture.

The UDR contains data in different formats, such as MS Word documents, PDF files, image files, etc. It can also contain long character fields such as weblogs. The data can be saved in-database in binary or long character format, or as files (on distributed or regular file systems, depending on various factors not covered here). We extract the different attributes of each document (including the text and binary patterns) and save them in database tables; then we create indexes on the columns of those tables (among the most important for unstructured data are the text indexes).
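To make the attribute extraction step more concrete, here is a minimal Python sketch. It is only an illustration: the function name extract_attributes, the MAGIC_NUMBERS table, and the chosen attributes (size, format, checksum) are my assumptions, not a fixed part of the UDR. It guesses the format from the leading bytes of the file, one simple kind of binary pattern.

```python
import hashlib

# Hypothetical lookup table: leading bytes ("magic numbers") -> format label.
MAGIC_NUMBERS = {
    b"%PDF": "pdf",
    b"\x89PNG": "png",
    b"\xff\xd8\xff": "jpeg",
    b"PK\x03\x04": "zip-based (e.g. docx)",
}


def extract_attributes(name: str, content: bytes) -> dict:
    """Return a few attributes of a document, ready to be stored in table columns."""
    fmt = "unknown"
    for magic, label in MAGIC_NUMBERS.items():
        if content.startswith(magic):
            fmt = label
            break
    return {
        "name": name,
        "size_bytes": len(content),
        "format": fmt,
        "sha256": hashlib.sha256(content).hexdigest(),
    }


# Example with a made-up document body.
attrs = extract_attributes("report.pdf", b"%PDF-1.4 sample body")
```

Each key of the returned dictionary would map to one column of the attribute table, and those columns are the candidates for indexing.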

Two examples of implementations are:

  1. MapReduce
  2. In-database

With MapReduce, the documents are saved in a distributed file system across multiple nodes; then the content and the different attributes of each document are programmatically extracted and processed. Optionally, the extracted information can be saved in a database (e.g., HBase) and indexed.
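The MapReduce flow can be sketched in a few lines of plain Python. This is a single-process simulation, not a real Hadoop job: the sample documents and the function names map_phase, shuffle and reduce_phase are invented for illustration, and on a cluster the framework would run the map and reduce steps in parallel on the nodes holding the files. The sketch builds a word-to-documents index, which is the kind of extracted information that could then be saved in a database.

```python
from collections import defaultdict

# Made-up sample documents; in practice these would live in a distributed file system.
documents = {
    "doc1": "data warehouse architecture",
    "doc2": "unstructured data repository",
}


def map_phase(doc_id, text):
    """Emit one (word, doc_id) pair per word in the document."""
    for word in text.lower().split():
        yield word, doc_id


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, doc_ids):
    """Collapse all document ids for a word into a sorted, de-duplicated list."""
    return word, sorted(set(doc_ids))


mapped = [pair for doc_id, text in documents.items() for pair in map_phase(doc_id, text)]
index = dict(reduce_phase(word, ids) for word, ids in shuffle(mapped).items())
```

Here index maps each word to the documents that contain it, so index["data"] lists both sample documents.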

With In-database, the documents are saved as binary objects or as long character objects (columns); in both cases, we programmatically extract all the attributes of the document and save them in other columns of the same table. If the document is saved as binary (a word processing document in this example), we programmatically extract the text, save it in a long character column, and then index the table.
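A small sketch of the In-database approach, using SQLite through Python's standard sqlite3 module. The table and column names are assumptions made for this example, and extract_text is a placeholder that simply decodes bytes; a real word processing file would need a format-specific parser. The terms table with an ordinary index is a hand-rolled stand-in for the text index that many databases provide natively.

```python
import sqlite3


def extract_text(blob: bytes) -> str:
    # Placeholder (assumption): real binary formats need a proper parser.
    return blob.decode("utf-8", errors="ignore")


conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE documents (
           doc_id         INTEGER PRIMARY KEY,
           file_name      TEXT,
           size_bytes     INTEGER,
           content        BLOB,   -- the document itself, saved as binary
           extracted_text TEXT    -- the text extracted from the binary
       )"""
)
# Stand-in for a text index: one row per distinct term per document.
conn.execute("CREATE TABLE terms (term TEXT, doc_id INTEGER REFERENCES documents(doc_id))")
conn.execute("CREATE INDEX idx_terms_term ON terms (term)")


def store_document(file_name: str, blob: bytes) -> int:
    """Save the binary, its attributes and its extracted text, then index the terms."""
    text = extract_text(blob)
    cur = conn.execute(
        "INSERT INTO documents (file_name, size_bytes, content, extracted_text) "
        "VALUES (?, ?, ?, ?)",
        (file_name, len(blob), blob, text),
    )
    doc_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO terms (term, doc_id) VALUES (?, ?)",
        [(term, doc_id) for term in set(text.lower().split())],
    )
    return doc_id


def search(term: str) -> list:
    """Return the file names of documents containing the term."""
    rows = conn.execute(
        "SELECT d.file_name FROM terms t JOIN documents d ON d.doc_id = t.doc_id "
        "WHERE t.term = ? ORDER BY d.file_name",
        (term.lower(),),
    )
    return [row[0] for row in rows]


store_document("a.txt", b"Quarterly sales report")
store_document("b.txt", b"Sales forecast")
```

With the two sample rows loaded, search("sales") finds both files and search("forecast") finds only b.txt.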

Read Further

The Data Warehouse Architecture
