Skip to content

SAP Sybase IQ UDA: Full Text Search by Examples

August 29, 2013

The Unstructured Data Analytics (UDA) option extent the capabilities of SAP Sybase IQ to perform Data Mining.  Here you will find examples of queries that are possible using the UDA option. You will need a UDA license to run these queries.

For these examples I created a table as follow:

CREATE TABLE DBA.MyReport ( 
MyRptID integer NOT NULL,
MyRptText CLOB NULL,
 PRIMARY KEY(MyRptID)
);

The table was populated with 1.7 millions of records, the MyRptText column was populated with clinical laboratory results saved in text format. The database was created with CASE RESPECT.

Two (2) indexes and one(1) text configuration object were created, a text index using the default_char configuration object, and a NGRAM text index using a new text configuration object.

Create text index RptTextIdx on DBA.MyReport 
(MyRptText) configuration default_char;

— Create a new configuration object

CREATE TEXT CONFIGURATION myFuzzyTextConfig FROM default_char;

— Change to NGRAM and set max length to 5

ALTER TEXT CONFIGURATION myFuzzyTextConfig
TERM BREAKER NGRAM;

ALTER TEXT CONFIGURATION myFuzzyTextConfig
MAXIMUM TERM LENGTH 5;

CREATE TEXT INDEX myFuzzyTextIdx
ON DBA.MyReport (MyRptText) CONFIGURATION myFuzzyTextConfig;

All the queries were successfully executed, the results are not included in this document. No performance data is provided, but all of the queries executed in 1 minute or less.

Use Case 1: Find the 10 most relevant records relative to a term using the calculated score from the CONTAINS clause.

select MyRptID, ct.score, MyRptText
FROM MyReport CONTAINS (MyReport.MyRptText, ‘Difficile | DIFFICILE’) as ct
where ct.score > 10
order by ct.score desc;

Use Case 2: Search records containing a phrase (can also be done with LIKE)

select count(*)FROM MyReport 
CONTAINS (MyReport.MyRptText, ‘Clostridium difficile toxin A/B DETECTED’ ) as ct;

select MyRptID, MyRptTextFROM MyReport 
CONTAINS (MyReport.MyRptText, ‘Clostridium difficile toxin A/B DETECTED’ ) as ct;

Use Case 3: Search records using Boolean expression

select *FROM MyReport 
CONTAINS (MyReport.MyRptText, ‘”Clostridium difficile toxin A/B DETECTED” and “ER”‘ ) as ct;

select
FROM MyReport CONTAINS (MyReport.MyRptText,‘(Clostridium | difficile | DIFFICILLE) AND DETECTED’) as ct;

select
FROM MyReport CONTAINS (MyReport.MyRptText,‘((Clostridium | difficile | DIFFICILLE) AND DETECTED) AND NOT (” by PCR”)’) as ct;

select
FROM MyReport CONTAINS (MyReport.MyRptText,‘((Clostridium | difficile | DIFFICILLE) AND DETECTED) AND (” by PCR”)’) as ct;

Use case 4: Find terms in proximity

select
FROM MyReport CONTAINS (MyReport.MyRptText,‘difficile NEAR blood’) as ct;

select
FROM MyReport CONTAINS (MyReport.MyRptText,‘difficile NEAR negative’) as ct;

select
FROM MyReport CONTAINS (MyReport.MyRptText,‘cholesterol’) as ct
order by ct.score desc;

select
FROM MyReport CONTAINS (MyReport.MyRptText,‘gallbladder BEFORE cholesterol’) as ct
order by ct.score desc;

select MyRptID, ct.score, MyRptText 
FROM MyReport CONTAINS (MyReport.MyRptText,‘cholesterol NEAR ultrasound’) as ct
order by ct.score desc;

select MyRptID, ct.score, MyRptText 
FROM MyReport CONTAINS (MyReport.MyRptText,‘cholesterol NEAR[2,5] ultrasound’) as ct
order by ct.score desc;

Use case 5: Find terms using prefixes

select count(*) 
FROM MyReport CONTAINS (MyReport.MyRptText,‘Urin*’) as ct;

select count(*) 
FROM MyReport CONTAINS (MyReport.MyRptText,‘Urin* AND NOT Nega*’) as ct;

select count(*) 
FROM MyReport CONTAINS (MyReport.MyRptText,‘Urin* AND Aty* AND NOT Nega* ‘) as ct;

select count(*) 
FROM MyReport CONTAINS (MyReport.MyRptText,‘Urin* AND Aty* AND NOT Nega* AND blood ‘) as ct;

select MyRptID, ct.score, MyRptText 
FROM MyReport CONTAINS (MyReport.MyRptText,‘Urin* AND Aty* AND NOT Nega* AND blood ‘) as ct
order by ct.score desc;

Use case 6: Search for term similar to a given term (FUZZY SEARCH)

select count(*) FROM MyReport 
 CONTAINS (MyReport.MyRptText, ‘FUZZY “difficile”‘ ) as ct;

select count(*) FROM MyReport 
CONTAINS (MyReport.MyRptText, ‘difficile AND FUZZY “DETECTED”‘ ) as ct;

select count(*) FROM MyReport 
CONTAINS (MyReport.MyRptText, ‘difficile AND FUZZY “DETECTED” AND NOT DETECTED’ ) as ct;

select top 100 * FROM MyReport 
CONTAINS (MyReport.MyRptText, ‘difficile AND FUZZY “DETECTED” AND NOT DETECTED’ ) as ct
order by ct.score desc;

Use case 7: Find the number of indexed values in which a term appears

select * from sa_text_index_vocab(‘RptTextIdx’,‘MyReport’,‘DBA’)
where term like ‘%[Cc]holesterol%’;

select * from sa_text_index_vocab(‘RptTextIdx’,‘MyReport’,‘DBA’)
where term like ‘%[Cc][Hh][Oo][Ll][Ee][Ss][Tt][Ee][Rr][Oo][Ll]%’;

select * from sa_text_index_vocab(‘RptTextIdx’,‘MyReport’,‘DBA’)
where term like ‘%[dD][Ii][Ff][Ff][Ii][Cc][Ii][Ll][Ee]%’;

These are only a few examples of the type of queries that can be used on a CLOB column using the UDA option; This is not a complete list of possibilities.

Advertisements

From → Data Mining, SAP IQ

Leave a Comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: