Our suppliers tag their records with huge numbers of keywords (which we then index with a context index). They use every possible variant of a word, just in case. So I might see 'surfboard, surfboarder, surfboarding, surfboarded, surfboards, surfborders, surfboardings' and so on.
When someone searches for contains(textfield,$surfboard)>0, they will match on all (or many, anyway) of those words -meaning that all bar the first keyword is really redundant.
What I'd like to do is take a keyword ("surfboard") and see what its stemmed version is actually equal to. Then I'd take the next keyword and see what it's stemmed version was equal to. And if the two things were equal to the same stemmed value, I would know to dispose of one of them.
Is there a way to see what a $word actually evaluates to? Or to see whether $word1 evaluates to the same thing as $word2, and thus be able to declare word1 is effectively equal to word2?
I have already been through this exercise using things like SOUNDEX and UTL_MATCH.EDIT_DISTANCE_SIMILARITY. I am NOT interested in those techniques at this time, as their results are not ideal for what we need to do. I am focussed on just this very specific issue: is it possible to effectively compare or otherwise make use of the Oracle Text stem output for a word for similarity comparisons?
Barbara Boehmer
Posts: 2,805
Registered: 04/28/00
Re: Stemming evaluation
Posted: Mar 25, 2009 11:34 PM in response to: Catfive Lander in response to: Catfive Lander
Click to report abuse... Click to reply to this thread Reply
http://forums.oracle.com/forums/thread.jspa?messageID=3244855
Catfive Lander
Posts: 302
Registered: 03/02/08
Re: Stemming evaluation
Posted: Mar 26, 2009 1:42 PM in response to: Barbara Boehmer in response to: Barbara Boehmer
Click to report abuse... Click to reply to this thread Reply
Thank you Barbara, as ever. I need to study your example in that thread more closely, but at first glance, that looks to be exactly what is required. Much appreciated.
Catfive Lander
Posts: 302
Registered: 03/02/08
Re: Stemming evaluation
Posted: Mar 29, 2009 4:07 PM in response to: Catfive Lander in response to: Catfive Lander
Click to report abuse... Click to reply to this thread Reply
I am having trouble making your example work. Here is my script, lifted directly from yours, I think (I hope I haven't missed anything, anyway!):
/* One-off */
BEGIN
CTX_DDL.CREATE_PREFERENCE ('your_lex', 'BASIC_LEXER');
CTX_DDL.SET_ATTRIBUTE ('your_lex', 'INDEX_STEMS', 'ENGLISH');
END;
/
/* Routine */
create index index1 on search_table(mainterm)
indextype is ctxsys.context;
CREATE INDEX index2 ON dr$index1$i (token_text)
INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS ('LEXER your_lex');
SELECT DISTINCT d.token_text keyword, i.token_text AS root_word
FROM dr$index1$i d, dr$index2$i i
WHERE CONTAINS (d.token_text, '$' || i.token_text) > 0
AND d.token_text != i.token_text
AND i.token_type = 9
ORDER BY keyword, root_word
When I created index1 on just the first 1000 records from my search_table, there was no problem, and the final query worked well. But when I created it on my entire search_table, the final query now returns the following error message:
ORA-29902: error in executing ODCIIndexStart() routine
ORA-20000: Oracle Text error
DRG-50901: text query parser syntax error on line 1, column 2
We presumably have some weird keywords in our search table's main terms, but I am not sure which one(s) would be causing this problem. It seems odd to me that we can index them properly in the first place, but then have trouble querying them!
Anyway, is there a way of making the query robust so that it doesn't choke at whatever it is that is causing the problem? Is there a way of re-casting the query into a function that allows you to feed it one word at a time so that the problem words are then made more obvious?
Barbara Boehmer
Posts: 2,805
Registered: 04/28/00
Re: Stemming evaluation
Posted: Mar 29, 2009 11:46 PM in response to: Catfive Lander in response to: Catfive Lander
Correct
Click to report abuse... Click to reply to this thread Reply
In the following example, I added a special character as a printjoin to the lexer and a word starting with that character to the data, because that was the simplest thing I could think of that might cause such an error, in order to create an example that would allow me to demonstrate some error handling. Notice that I changed a few things, including which index the lexer is used on, and the queries, and such. The usage of the function and the exception handling should enable it to run without errors. However, if you try to do some things differently, so that the value to be searched for is not just passed to the function, but selected from a table within the function then matched to the parameter, it can cause every row to produce an error, which is why I had to change some things around to prevent that.
Thursday, June 18, 2009
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment