SAP Sybase IQ: Text Mining and Case Ignore Property

August 22, 2013

The SAP Sybase IQ Unstructured Data Analytics (UDA) option extends the capabilities of SAP Sybase IQ to support text analysis (data mining). This option allows the creation of Binary Large Object (BLOB) and Character Large Object (CLOB) columns. BLOB columns store and manipulate binary documents (MS Excel, MS Word, etc.), while CLOB columns hold long text, such as the filtered content of those binary objects.

To obtain insight from those CLOB columns, we need to index them and use string functions to retrieve, compare and extract information.

A case-sensitive database can:

  1. Add complexity to the mining process by requiring complex query predicates, and
  2. Lead to omissions, because a term may appear in the text in any combination of upper and lower case characters (erroneous or not).

Several options can be used to minimize the impact of case sensitivity during data mining. Let's look at some of them:

  1. Use every possible combination of upper and lower case in the predicates of your queries (the number of combinations grows quickly with the length of the term, so this is not recommended).
  2. Use a function in the query predicate to convert the content of the column to upper or lower case before applying a comparison operator.

    Select * from MyUser.Mytable
    Where lcase(mycolum) like '%term%'

    This works well for string columns that are not CLOB; the LCASE, UCASE, LOWER and UPPER functions are not supported on CLOB columns.

  3. Convert the pre-filtered text to upper or lower case before storing it in the CLOB column, and use that same case in all the predicates of your queries.
  4. Create the database with the CASE IGNORE option; this option cannot be changed after the database has been created.
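
To make options 1 and 3 more concrete, here is a minimal sketch. The table MyUser.MyDocs and its columns doc_id and doc_text (a CLOB column) are hypothetical names used only for illustration, and the sketch assumes the LIKE predicate is available on CLOB columns in your installation. In practice the case conversion of option 3 would more likely happen in the filtering or loading program than in an INSERT statement.

```sql
-- Option 1: enumerate case variants by hand. Every variant is one more
-- predicate, and any variant left out (for example 'tERM') is silently missed.
SELECT doc_id
FROM MyUser.MyDocs
WHERE doc_text LIKE '%term%'
   OR doc_text LIKE '%Term%'
   OR doc_text LIKE '%TERM%';

-- Option 3: normalize the case while the text is still an ordinary string,
-- before it is stored in the CLOB column (hypothetical sample text).
INSERT INTO MyUser.MyDocs (doc_id, doc_text)
VALUES (1, LOWER('Quarterly Report: TERM sheet attached'));

-- Every query then uses the same (lower) case in its predicates.
SELECT doc_id
FROM MyUser.MyDocs
WHERE doc_text LIKE '%term%';
```

The trade-off with option 3 is that the original capitalization of the text is lost in the CLOB column, so it only fits when case carries no meaning for the analysis.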

If the SAP Sybase IQ database will be used primarily for data mining and case can be ignored, it is recommended to create the database with the CASE IGNORE property; by default, all SAP Sybase IQ databases are created with the CASE RESPECT property.
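
As a rough sketch of what that looks like, the statement below creates a database with CASE IGNORE. The file names are placeholders, and a real CREATE DATABASE statement would normally carry additional clauses (sizes, page size, collation and so on) suited to your environment. The property check afterwards assumes the CaseSensitive database property is exposed through DB_PROPERTY in your version; treat it as something to verify rather than a guaranteed interface.

```sql
-- Placeholder file names; adjust paths and add sizing clauses as needed.
CREATE DATABASE 'textmining.db'
    CASE IGNORE
    IQ PATH 'textmining.iq';

-- After connecting to the new database, check the setting
-- (assumed property name: CaseSensitive).
SELECT DB_PROPERTY('CaseSensitive');
```

Because the setting is fixed at creation time, switching an existing CASE RESPECT database to CASE IGNORE means creating a new database and reloading the data.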

