PDF documents not indexed by OracleText using auto-filter and format_column [message #463107] Tue, 29 June 2010 12:50
leon_buijsman
Messages: 13
Registered: March 2009
Location: Rotterdam
Junior Member

We have created an Oracle Text index on a document table. The table contains documents in many different formats. For some reason, PDFs are not indexed.

This is an example of our setup:

ORACLETEXT INDEX

ctx_ddl.create_preference('WD_CAND_DOC_LOB_TEXT_FIL','AUTO_FILTER');

create index wd_cand_doc_lob_text
on wd_cand_doc_lob (doc_blob)
indextype is ctxsys.context
parameters('
format column ORACLETEXT_FMT
datastore WD_CAND_DOC_LOB_TEXT_DST
etc...

DOCUMENT LINE IN DB

452900 (HugeClob) (HUGEBLOB) 2/9/2010 11:03:58 am 22482 4/24/2010 12:38:00 pm 3145 1428383 CV BINARY

where BINARY is the value in the ORACLETEXT_FMT column.

Leaving the ORACLETEXT_FMT column empty makes no difference. Word documents are indexed, by the way.

DB/OS
Oracle10g running on Linux


Does anybody have any clue?
Re: PDF documents not indexed by OracleText using auto-filter and format_column [message #463111 is a reply to message #463107] Tue, 29 June 2010 13:28
Barbara Boehmer
Messages: 9077
Registered: November 2002
Location: California, USA
Senior Member
http://download.oracle.com/docs/cd/B19306_01/text.102/b14218/afilsupt.htm#sthref2464
Re: PDF documents not indexed by OracleText using auto-filter and format_column [message #463119 is a reply to message #463111] Tue, 29 June 2010 13:56
leon_buijsman

Hi Barbara,

I did some more research: SOME PDFs get indexed and some do not. The only difference can be the documents themselves. However, the list you sent me does not explain why Oracle Text fails to index certain PDFs altogether.

From what I am seeing, the PDFs differ as follows:
> generated with Word2007/OpenOffice.org3.1 (fails) vs Acrobat Distiller (works)
> fonts used: encoding Built-in (fails) vs encoding ANSI (works)

The document you sent me states "Embedded fonts in a PDF document are not filtered correctly". Does that correspond to "encoding Built-in", and could it mean that Oracle Text sees the document as empty? If so, ANY document saved by MS Word or OpenOffice would have this problem.

Is there a workaround as far as you know?

Re: PDF documents not indexed by OracleText using auto-filter and format_column [message #463154 is a reply to message #463119] Tue, 29 June 2010 18:58
Barbara Boehmer
If none of the PDFs get indexed, I tend to suspect a privilege problem. Oracle may not have sufficient privileges on the directory, or there may be a security issue as described here:

http://download.oracle.com/docs/cd/B28359_01/text.111/b28304/afilsupt.htm#sthref2472

The filtering gets a little better with each version, so 11g would probably filter better than 10g.

You might experiment with simple documents and more complex ones to see if you can narrow the problem down, and check ctx_user_index_errors to see if there are any clues there.
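
A minimal sketch of that check, assuming the index name from the first post (wd_cand_doc_lob_text) and that the query runs as the index owner. ERR_TEXTKEY holds the rowid of the row that failed, so it can be matched back to the document table:

```sql
-- Most recent indexing errors for the index from the first post.
-- Assumption: the index was created without quotes, so the dictionary
-- view stores its name in upper case.
select err_timestamp,
       err_textkey,   -- rowid of the document that failed
       err_text       -- DRG-nnnnn message raised while filtering/indexing
from   ctx_user_index_errors
where  err_index_name = 'WD_CAND_DOC_LOB_TEXT'
order  by err_timestamp desc;
```

If the failing rows all turn out to be PDFs from the same source, that points at the documents rather than at privileges.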

Re: PDF documents not indexed by OracleText using auto-filter and format_column [message #463159 is a reply to message #463154] Tue, 29 June 2010 21:27
ebrian
Messages: 2794
Registered: April 2006
Senior Member
You could try to manually run the filter against the document using the following:

$ORACLE_HOME/bin/ctxhx problem_document.pdf problem_document.log

This may produce additional error messages.

I also had success filtering certain PDF files with the approach described in "Using Alternative Filters for Filtering PDF Files" when the native Oracle filters encountered problems.
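
As a rough sketch of what an alternative filter setup can look like: Oracle Text has a USER_FILTER preference type whose COMMAND attribute names an external executable. The preference name WD_PDF_USER_FIL and the script name pdf_to_text.sh below are hypothetical, and the script itself (some wrapper around a PDF-to-text converter) is not shown.

```sql
begin
  -- Hypothetical preference name; USER_FILTER is the Oracle Text filter type.
  ctx_ddl.create_preference('WD_PDF_USER_FIL', 'USER_FILTER');

  -- Hypothetical script name. Oracle Text expects the executable in
  -- $ORACLE_HOME/ctx/bin and calls it with two arguments:
  -- the input file name and the output file name.
  ctx_ddl.set_attribute('WD_PDF_USER_FIL', 'COMMAND', 'pdf_to_text.sh');
end;
/
```

The index would then name this preference after the filter keyword in its parameters string. Note that with a user filter every document in the column is passed to the script, so for a table with mixed formats the script would have to handle or pass through the non-PDF documents as well.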
Re: PDF documents not indexed by OracleText using auto-filter and format_column [message #463185 is a reply to message #463159] Wed, 30 June 2010 02:23
leon_buijsman

I have now narrowed the problem down to PDF 1.5 (Acrobat 6). We are running Oracle 10g Release 2 with AUTO_FILTER. Apparently the support for this format is far more limited than the documentation (http://download.oracle.com/docs/cd/B19306_01/text.102/b14218/afilsupt.htm) states. As soon as I save the document from Word 2010 as PDF 1.3 (Acrobat 5), it gets indexed.
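
For anyone who wants to see which stored documents are affected, a sketch along these lines might help. It assumes doc_blob is a BLOB and relies on the fact that a PDF file normally starts with a header such as %PDF-1.5; the table and column names are the ones from the first post.

```sql
-- List PDF documents together with the version string from their header.
-- dbms_lob.substr on a BLOB returns RAW, so it is cast to VARCHAR2 first.
select rowid,
       utl_raw.cast_to_varchar2(dbms_lob.substr(doc_blob, 8, 1)) as pdf_header
from   wd_cand_doc_lob
where  utl_raw.cast_to_varchar2(dbms_lob.substr(doc_blob, 4, 1)) = '%PDF'
order  by pdf_header;
```

Rows whose header shows a newer PDF version can then be compared against the rowids reported in the ERR_TEXTKEY column of CTX_USER_INDEX_ERRORS.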

CTX_USER_INDEX_ERRORS is showing DRG-11222: Third-party filter does not support this known document format.

I fear that says it all.

We are now investigating the possibility of adding alternative filters. If that does not work out, Oracle 11 is probably the way to go. Does anybody know whether docx is supported under Oracle 11? The documentation states "Versions through 2007", but does that include docx?
Re: PDF documents not indexed by OracleText using auto-filter and format_column [message #471753 is a reply to message #463185] Tue, 17 August 2010 03:07
leon_buijsman

I have received confirmation from a customer and heavy Oracle user that Oracle 11g Release 2 solves the indexing problem (PDF 1.5 (Acrobat 6) and DOCX documents not being indexed).