Re: WO & Lucene
- Subject: Re: WO & Lucene
- From: Chuck Hill <email@hidden>
- Date: Wed, 27 Oct 2004 10:01:04 -0700
I don't have anything of importance to add to Dov's comments but I will
emphasize two of his points. First, Lucene is document centric, not
database row or EO centric. Lucene documents are what you make of
them. It can take a while to wrap your head around this view of your
data. Second, the idea of putting non-indexed data into the document
is an important one for performance optimization. It is well worth
spending some time to plan out what you will need.
Finally, the CooperatingEditingContext from chapter 3 of
Practical WebObjects is a good candidate for sending the
notifications to reindex when an EO is created, updated, or deleted.
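A minimal sketch of notification-driven reindexing, in plain Java with no WebObjects or Lucene dependency. Every name in it (ReindexSketch, Change, ChangeBus, SearchIndex) is hypothetical: ChangeBus stands in for whatever posts a notification after a successful save, and SearchIndex stands in for the Lucene index. It only illustrates the create/update/delete flow and is not the book's implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical sketch; none of these types are WebObjects or Lucene classes.
class ReindexSketch {
    enum ChangeType { CREATED, UPDATED, DELETED }

    // One saved change: which record changed, how, and its current searchable text.
    static final class Change {
        final ChangeType type;
        final String primaryKey;
        final String text; // ignored for DELETED

        Change(ChangeType type, String primaryKey, String text) {
            this.type = type;
            this.primaryKey = primaryKey;
            this.text = text;
        }
    }

    // Stand-in for the notification mechanism: delivers each change to all subscribers.
    static final class ChangeBus {
        private final List<Consumer<Change>> listeners = new ArrayList<>();

        void subscribe(Consumer<Change> listener) {
            listeners.add(listener);
        }

        void post(Change change) {
            for (Consumer<Change> listener : listeners) {
                listener.accept(change);
            }
        }
    }

    // Stand-in for the search index: one entry per record, keyed by primary key.
    static final class SearchIndex {
        final Map<String, String> docs = new HashMap<>();

        void apply(Change change) {
            if (change.type == ChangeType.DELETED) {
                docs.remove(change.primaryKey);
            } else {
                // CREATED and UPDATED both replace the whole entry for that key.
                docs.put(change.primaryKey, change.text);
            }
        }
    }

    public static void main(String[] args) {
        ChangeBus bus = new ChangeBus();
        SearchIndex index = new SearchIndex();
        bus.subscribe(index::apply);

        bus.post(new Change(ChangeType.CREATED, "1", "first draft"));
        bus.post(new Change(ChangeType.UPDATED, "1", "final version"));
        bus.post(new Change(ChangeType.CREATED, "2", "another page"));
        bus.post(new Change(ChangeType.DELETED, "2", null));

        // Only record 1 remains, carrying its updated text.
        System.out.println(index.docs);
    }
}
```

Treating an update as a full replacement of the entry keeps the listener simple: it never has to work out which fields changed, only which record did.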
Chuck
On Oct 26, 2004, at 9:46 PM, Dov Rosenberg wrote:
SiteConnect Server is a web based CMS product that uses Lucene for all
of our full text and attribute level search functionality. We have well
over 50,000 records in our production database that are indexed any time
a record is updated. It takes less than a few seconds to index our
database. Some things to consider when creating your indexes:
* Lucene lets you specify the type of information you want to store in
the Lucene document. Not all of the data you store in the document needs
to be indexed. For example, we store our primary keys and other meta
information in the Lucene document but we do not index them. Index just
the important information.
* Try to store enough metadata in the Lucene document to minimize
subsequent trips to the database after getting the search results. For
example, store a summary as a non-indexed field that can be displayed as
part of the search results.
* EOF is not a good solution for bulk processing of lots of records on a
frequent basis. I would use JDBC for the initial bulk loading.
* We use a notification when we save or delete content to trigger a
reindex of a specific Lucene record.
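The first two points in the list above can be illustrated with a toy model in plain Java. This is deliberately not the Lucene API: ToyIndex and Field are hypothetical names, and the tokenizer is a bare split on non-word characters. It only shows that "stored" (returned with a hit) and "indexed" (searchable) are independent choices per field.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model of stored vs. indexed fields. This is NOT the Lucene API.
class ToyIndex {
    // A field value plus two independent flags: searchable and/or retrievable.
    static final class Field {
        final String value;
        final boolean indexed;
        final boolean stored;

        Field(String value, boolean indexed, boolean stored) {
            this.value = value;
            this.indexed = indexed;
            this.stored = stored;
        }
    }

    // Stored field values per document id (the list position is the id).
    private final List<Map<String, String>> storedFields = new ArrayList<>();
    // Inverted index: lower-cased term -> ids of documents containing it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Adds a document and returns its internal id.
    int add(Map<String, Field> doc) {
        int id = storedFields.size();
        Map<String, String> kept = new HashMap<>();
        for (Map.Entry<String, Field> entry : doc.entrySet()) {
            Field field = entry.getValue();
            if (field.stored) {
                kept.put(entry.getKey(), field.value);
            }
            if (field.indexed) {
                for (String term : field.value.toLowerCase().split("\\W+")) {
                    if (!term.isEmpty()) {
                        postings.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
                    }
                }
            }
        }
        storedFields.add(kept);
        return id;
    }

    // Returns the stored fields of every document whose indexed text contains the term.
    List<Map<String, String>> search(String term) {
        List<Map<String, String>> hits = new ArrayList<>();
        Set<Integer> ids = postings.getOrDefault(term.toLowerCase(), Collections.<Integer>emptySet());
        for (int id : ids) {
            hits.add(storedFields.get(id));
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        Map<String, Field> doc = new HashMap<>();
        doc.put("pk", new Field("42", false, true));                    // stored, not indexed
        doc.put("summary", new Field("Quarterly report", false, true)); // stored, not indexed
        doc.put("body", new Field("full text of the quarterly sales report", true, false)); // indexed, not stored
        index.add(doc);

        // The hit carries pk and summary, so the result list needs no database trip.
        System.out.println(index.search("sales"));
    }
}
```

A search on "sales" finds the document through its body, yet the hit carries only the primary key and summary. That is the combination described above: search on the large text, display from the small stored fields.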
HTH
--
Dov Rosenberg
Conviveon Corporation
http://www.conviveon.com
On 10/26/04 1:59 PM, "Michael Parlee" <email@hidden> wrote:
Chuck and Dov,
Very cool to see that both of you are using Lucene. I've just begun
integrating it into a small project I'm working on so to hear that
others are using it gives me a bit more confidence that I'm doing the
right thing.
A couple of questions though. Are you using it to index records in a
relational database? If so how are you keeping your index up to date?
It took my 600 MHz iBook about 6 hrs to create my index! FYI, I'm
flattening several to-many relationships into a "body" field which is
indexed but not stored. Also, I did my initial index in straight up
JDBC.
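As a rough illustration of the flattening described above, here is a sketch in plain Java. BodyFlattener and its parameters are hypothetical names; it assumes the parent's own text and the values from each to-many relationship have already been fetched (for example over JDBC) and only shows how they collapse into one "body" string to hand to the indexer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

// Hypothetical sketch: collapse a parent row and its to-many children into one text blob.
class BodyFlattener {
    // title: the parent's own text; relatedValues: one inner list per to-many relationship.
    static String flatten(String title, List<List<String>> relatedValues) {
        StringJoiner body = new StringJoiner(" ");
        if (title != null && !title.trim().isEmpty()) {
            body.add(title.trim());
        }
        for (List<String> relationship : relatedValues) {
            for (String value : relationship) {
                // Skip nulls and blanks so empty child rows add nothing to the body.
                if (value != null && !value.trim().isEmpty()) {
                    body.add(value.trim());
                }
            }
        }
        return body.toString();
    }

    public static void main(String[] args) {
        List<String> keywords = Arrays.asList("finance", "2004");
        List<String> comments = Arrays.asList("", "Looks good");
        System.out.println(flatten("Budget", Arrays.asList(keywords, comments)));
    }
}
```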
I'd love to hear a little about how each of you have approached your
Lucene-WO integration.
Thanks,
Mike
On Oct 25, 2004, at 2:12 PM, Dov Rosenberg wrote:
I agree with Chuck, we have integrated Lucene into our CMS system. It
ROCKS!!
The best part about it is that you don't have the EOF overhead of
dealing with large numbers of records. Lucene's capabilities seem
limited only by disk space.
You can also index any binary document if you can extract the text from
it. We use PDFBox to grab the text out of attached PDFs for indexing.
--
Dov Rosenberg
Conviveon Corporation
http://www.conviveon.com
On 10/25/04 5:04 PM, "Chuck Hill" <email@hidden> wrote:
I'm not certain that you need to get the content out of the database,
but (2) and Lucene will certainly boost performance and drop memory
usage. I'll strongly suggest going with something that includes Lucene
in the mix. Lucene is awesome!
Chuck
At 10:17 AM 25/10/2004 -0700, David Holt wrote:
I have just upped my test database of documents from 1500 records to
17,000. I have a WOComponent with a WODisplayGroup and the appropriate
qualifier fields for searching. If I qualify the data source by putting
a value in one of the search fields I get a list of documents as
expected. If I submit the form without information in any of the
qualifier fields (this used to return the whole data set divided into
paged results), I get the following exception from the application after
a minute or so of waiting:
Error:
com.webobjects.foundation.NSForwardException [java.lang.OutOfMemoryError] null
It is a MySQL database, WO 5.2.3, OS X Server 10.2
I have a blob field that holds the text content of the documents (for
searching) as well as a URL field pointing to the original document on
the file system. One of the qualifier fields is used to search the
content field.
Three strategies I can think of to fix the problem are:
1. Increase system memory (not a good long term solution as the
documents will grow over time)
2. Put the blob field in a separate table so it is not loaded with the
WODisplayGroup (not sure if I can still do searches on that field if I
do that)
3. Get the content out of the database and use a combination of PDFBox
and Lucene to provide the content searching separate from my database.
What would you suggest is the best strategy? Or have I misidentified the
problem?
Thanks,
David
_______________________________________________
Do not post admin requests to the list. They will be ignored.
Webobjects-dev mailing list (email@hidden)
Help/Unsubscribe/Update your Subscription:
email@hidden
This email sent to email@hidden
--
Practical WebObjects - a book for intermediate WebObjects developers
who want to increase their overall knowledge of WebObjects, or those
who are trying to solve specific application development problems.
http://www.global-village.net/products/practical_webobjects