Re: WO & Lucene

Subject: Re: WO & Lucene
From: Dov Rosenberg <email@hidden>
Date: Wed, 27 Oct 2004 00:46:46 -0400

SiteConnect Server is a web-based CMS product that uses Lucene for all of
our full-text and attribute-level search functionality. We have well over
50,000 records in our production database, and each one is reindexed any
time it is updated. It takes only a few seconds to index our database. Some
things to consider when creating your indexes:

* Lucene lets you specify how each piece of information in a Lucene
document is handled. Not all of the data you store in the document needs to
be indexed. For example, we store our primary keys and other meta
information in the Lucene document but we do not index them. Index just the
important information.

* Try to store enough metadata in the Lucene document to minimize
subsequent trips to the database after getting the search results. For
example, store a summary as a non-indexed field that can be displayed as
part of the search results.

* EOF is not a good solution for bulk processing of lots of records on a
frequent basis. I would use JDBC for the initial bulk loading.

* We use a notification when we save or delete content to trigger a reindex
of just the affected Lucene record.
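
To make the stored vs. indexed distinction and the per-record reindex
concrete, here is a rough sketch in plain Java. It is only a stand-in to show
the idea and does not use Lucene's API: the class TinyIndexSketch and its
methods reindex, delete and search are made-up names, and the "index" is just
a pair of in-memory maps. The body text is searchable but not kept, while the
primary key and summary are kept for display but never searched.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for a search index; this is NOT Lucene's API.
class TinyIndexSketch {
    // term -> primary keys of the records whose indexed text contains it
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // primary key -> stored-only values (returned with hits, never searched)
    private final Map<Integer, Map<String, String>> stored = new HashMap<>();

    // Meant to be called from a "record saved" notification:
    // replaces the entry for this one record and touches nothing else.
    void reindex(int pk, String indexedBody, Map<String, String> storedOnly) {
        delete(pk);
        for (String term : indexedBody.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new HashSet<>()).add(pk);
            }
        }
        stored.put(pk, new HashMap<>(storedOnly));
    }

    // Meant to be called from a "record deleted" notification.
    void delete(int pk) {
        stored.remove(pk);
        postings.values().forEach(pks -> pks.remove(pk));
        postings.values().removeIf(Set::isEmpty);
    }

    // Returns the stored fields of every hit, ordered by primary key, so the
    // result list can be rendered without going back to the database.
    List<Map<String, String>> search(String term) {
        Set<Integer> pks =
            postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
        List<Map<String, String>> hits = new ArrayList<>();
        for (int pk : new TreeSet<>(pks)) {
            hits.add(stored.get(pk));
        }
        return hits;
    }

    public static void main(String[] args) {
        TinyIndexSketch index = new TinyIndexSketch();
        index.reindex(1, "Lucene makes full text search fast",
                      Map.of("pk", "1", "summary", "Intro to Lucene"));
        index.reindex(2, "EOF handles object graphs",
                      Map.of("pk", "2", "summary", "EOF notes"));

        // Record 1 was edited and saved: reindex only that record.
        index.reindex(1, "Lucene indexing tips",
                      Map.of("pk", "1", "summary", "Indexing tips"));

        for (Map<String, String> hit : index.search("lucene")) {
            System.out.println(hit.get("pk") + ": " + hit.get("summary"));
        }
    }
}
```

In a real setup the two maps would be the Lucene index itself, and reindex
and delete would be driven by whatever save/delete notification your
application posts.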

HTH


--
Dov Rosenberg
Conviveon Corporation
http://www.conviveon.com

On 10/26/04 1:59 PM, "Michael Parlee" <email@hidden> wrote:

> Chuck and Dov,
>
> Very cool to see that both of you are using Lucene.  I've just begun
> integrating it into a small project I'm working on so to hear that
> others are using it gives me a bit more confidence that I'm doing the
> right thing.
>
> A couple of questions though.  Are you using it to index records in a
> relational database?  If so how are you keeping your index up to date?
> It took my 600Mhz iBook about 6 hrs to create my index!  FYI, I'm
> flattening several to-many relationships into a "body" field which is
> indexed but not stored.  Also, I did my initial index in straight up
> JDBC.
>
> I'd love to hear a little about how each of you have approached your
> Lucene-WO integration.
>
> Thanks,
>
> Mike
>
> On Oct 25, 2004, at 2:12 PM, Dov Rosenberg wrote:
>
>> I agree with Chuck, we have integrated Lucene into our CMS system. It
>> ROCKS!!
>>
>> The best part about it is that you don't have the EOF overhead of
>> dealing
>> with large numbers of records. Lucene's capabilities seem limited only
>> by
>> disk space.
>>
>> You can also index any binary document if you can extract the text
>> from it.
>> We use PDFBox to grab the text out of attached PDFs for indexing.
>>
>>
>> --
>> Dov Rosenberg
>> Conviveon Corporation
>> http://www.conviveon.com
>>
>>
>> On 10/25/04 5:04 PM, "Chuck Hill" <email@hidden> wrote:
>>
>>> I'm not certain that you need to get the content out of the database,
>>> but
>>> (2) and Lucene will certainly boost performance and drop memory usage.
>>> I'll strongly suggest going with something that includes Lucene in
>>> the mix.
>>> Lucene is awesome!
>>>
>>> Chuck
>>>
>>>
>>> At 10:17 AM 25/10/2004 -0700, David Holt wrote:
>>>>>>>
>>> I have just upped my test database of documents from 1500 records to
>>> 17,000. I have a WOComponent with a WODisplayGroup and the appropriate
>>> qualifier fields for searching. If I qualify the data source by
>>> putting a
>>> value in one of the search fields I get a list of documents as
>>> expected. If
>>> I submit the form without information in any of the qualifier fields
>>> (this
>>> used to return the whole data set divided into paged results), I get
>>> the
>>> following exception from the application after a minute or so of
>>> waiting:
>>> Error:
>>> com.webobjects.foundation.NSForwardException
>>> [java.lang.OutOfMemoryError]
>>> null
>>>
>>> It is a MySQL database, WO 5.2.3, OS X Server 10.2
>>> I have a blob field that holds the text content of the documents (for
>>> searching) as well as a URL field pointing to the original document
>>> on the
>>> file system. One of the qualifier fields is used to search the
>>> content field.
>>>
>>> Three strategies I can think of to fix the problem are:
>>> 1. Increase system memory (not a good long term solution as the
>>> documents
>>> will grow over time)
>>> 2. Put the blob field in a separate table so it is not loaded with the
>>> WODisplayGroup (not sure if I can still do searches on that field if
>>> I do
>>> that)
>>> 3. Get the content out of the database and use a combination of
>>> PDFbox and
>>> Lucene to provide the content searching separate from my database.
>>>
>>> What would you suggest is the best strategy? Or have I misidentified
>>> the
>>> problem?
>>>
>>> Thanks,
>>> David
>>
>>
>>  _______________________________________________
>> Do not post admin requests to the list. They will be ignored.
>> Webobjects-dev mailing list      (email@hidden)
>> Help/Unsubscribe/Update your Subscription:
>>
>> This email sent to email@hidden
>>
>


