lucene
Apache Lucene 2.2.0
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.
Apache Lucene is suitable for almost any application that requires full-text search, especially one that must run across platforms.
Main features:
- Scalable, High-Performance Indexing
  - over 20MB/minute on Pentium M 1.5GHz
  - small RAM requirements -- only 1MB heap
  - incremental indexing as fast as batch indexing
  - index size roughly 20-30% the size of text indexed
- Powerful, Accurate and Efficient Search Algorithms
  - ranked searching -- best results returned first
  - many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
  - fielded searching (e.g., title, author, contents)
  - date-range searching
  - sorting by any field
  - multiple-index searching with merged results
  - allows simultaneous update and searching
- Cross-Platform Solution
  - Available as Open Source software under the Apache License, which lets you use Lucene in both commercial and Open Source programs
  - 100%-pure Java
  - Index-compatible implementations available in other programming languages
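To make the fielded and ranked searching features above concrete, here is a minimal sketch in plain Java of an inverted index keyed by field and term. It is not Lucene's API or its scoring model: the class name `FieldedIndexSketch`, its methods, and the ranking by raw term frequency are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy fielded inverted index. Illustrative only; NOT Lucene's API. */
final class FieldedIndexSketch {
    // Maps "field:term" to (document id -> number of occurrences in that field).
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();

    /** Indexes one field of one document; may be called at any time. */
    void add(int docId, String field, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(field + ":" + term, k -> new HashMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    /** Returns matching document ids, highest term frequency first (assumed ranking). */
    List<Integer> search(String field, String term) {
        String key = field + ":" + term.toLowerCase(Locale.ROOT);
        Map<Integer, Integer> hits =
                postings.getOrDefault(key, Collections.<Integer, Integer>emptyMap());
        List<Integer> ids = new ArrayList<>(hits.keySet());
        ids.sort((a, b) -> {
            int byFrequency = Integer.compare(hits.get(b), hits.get(a)); // higher first
            return byFrequency != 0 ? byFrequency : Integer.compare(a, b); // ties: lower id
        });
        return ids;
    }

    /** Lower-cases the text and splits it on anything that is not a letter or digit. */
    private static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }
}
```

Because `add` can be called at any time, newly added documents become searchable immediately, which loosely mirrors the incremental indexing and simultaneous update and search listed above. A real engine layers text analysis, persistent storage and far more refined scoring on top of this idea.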
DocSearcher 3.91.0
DocSearcher 3.91.0 is a utility that uses the open source Apache Lucene and POI APIs, as well as the open source PDFBox API, to provide searching capabilities for HTML, MS Word, MS Excel, RTF, PDF, Open Office (and Star Office), and plain text documents. Other file formats are not currently supported. It runs on any system with Java support.
Enhancements:
- Refactor DocType handling
- Refactor index creating and don't store the body content
- Search in body and title together is possible again
- Replace old setting file "docSearch_prefs.txt" with "docsearcher.properties"
- Remove tinylaf layout
- First attempt at solving some problems with the servlet extension
- Remove check if last searchtext was the same, because the options can be changed
- Remove option for search in title and body, because it does not work
- Fix escaping problem in meta data report
- Fix problem with filenames that contain whitespace
- Update POI to 3.2 final
- Fix some problems in Word and Excel converter
- Add commons-io
DSpace 1.5.1
Capture, store, index, preserve and redistribute an organization's research material in digital formats.
DSpace is a digital repository system that captures, stores, indexes, preserves and redistributes an organization's research material in digital formats.
Research institutions worldwide use DSpace for a variety of digital archiving needs - from institutional repositories (IRs) to learning object repositories or electronic records management, and more.
DSpace accepts all manner of digital formats. Some examples of items that DSpace can accommodate are:
- Documents, such as articles, preprints, working papers, technical reports, conference papers
- Books
- Theses
- Data sets
- Computer programs
- Visualizations, simulations, and other models
- Multimedia publications
- Administrative records
- Published books
- Overlay journals
- Bibliographic datasets
- Images
- Audio files
- Video files
- Reformatted digital library collections
- Learning objects
- Web pages
NOTE: DSpace is distributed and licensed under the terms of the BSD License.
Enhancements:
- (Scott Phillips) Fixed bug where users could not finish registering nor reset their password because the authentication method signatures were changed.
- Jay Paz (SF#1898241) Additional fixes to patch to enable reuse of methods.
- Added the ability to manage sessions with site wide alerts to prevent users from authenticating.
- Fixes a bug where the ability to edit an item during workflow step 2 is not displayed.
- Jay Paz (SF#1898241) Add item Export from jspui and xmlui.
- Added easy support for Google Analytics statistics
- Added the ability for super admins to login as other users.
- Added an activity viewer to the Control Panel
- Fix for SF Bug #2082236 Subscription notification (sub-daily) no emails sent
- [2102580] William Hays: Duplicate Handle exception when replacing bitstreams
- [2102617] Sands Fish: X509Authentication fails to assign appropriate specialgroups
- (Sands Fish) Add "Select Primary Bitstream" functionality to submission workflow
- Guard against Community/Collection metadata having only whitespace characters and eliminate cases where null pointer exceptions would be thrown
- Improve DSIndexer logic in both branches to support removal of items from index when withdrawn from repository.
- (Sands Fish) Provides fix for AuthenticationUtil where users IDs are not properly compared.
- Fix NullPointerException caused by nullified Context object in LNI map item to new collection.
- Block Basic Authentication "details" from being exposed in dspace logs.
- (Bill Hays) Close InputStreamReaders explicitly to release any file handles back to OS.
- Correct linking on pages when xmlui is the ROOT web application
- Correct issue with sitemap redirection of mydspace URI.
- Add servlet-api to overlay wars to reduce compile time errors when adding classes
- Correct issues in feed generation
- XMLUI Adjust Advanced Search to use search properties from dspace.cfg.
- Correct bug in Body.toSAX where startElement is called instead of end element.
- Correct issue with libraries being excluded from wars
- Fix for SF bug #2090761 Statistics wrong use of dspace.dir for log location
- Fix for SF bug #2081930 xmlui hardcoded strings in EditGroupForm.java
- Fix for SF bug #2080319 jspui hardcoded strings in browse
- Fix for SF bug #2078305 xmlui hardcoded strings used in UI in xmlui-api
- Fix for SF bug #2078324 xmlui hardcoded strings used in UI in General-Handler.xsl
- SF patch #2076066 Review in jspui submission non-dc metadata
- SF Bug #1983859 added Foreign Lucene Analyzers to poms
- SF Bug #1989916: missing LDAP authentication key
- [1947036] Patch for SF Bug1896960 SWORD authentication and LDAP + [1989874] LDAPAuthentication pluggable method broken for current users
- Added copying of registration email template to 1.4 to 1.5 upgrade instructions
- Fix for SF bug #2055941 LDAP authentication fails for new users in SWORD and Manakin
- [1990660] SWORD Service Document are malformed / Corrected Atom publishing MIME types
- Updated installation and configuration documents for new statistics script, and removed references to Perl
- Fix for SF bug #2095402: Non-interactive Submission Steps don't work in JSPUI 1.5
- Fix for SF bug #2013921: Movement in Submission Workflow Causes Skipped Steps
- Fix for SF bug #2015988: Configurable Submission bug in SubmissionController
- Fix for SF bug #2034372: Resorting Search Results in JSPUI always gives no results
- Updates to Community/Collection Item Counts (i.e. strengths) for XMLUI.
- 1.5 upgrade instructions were missing Metadata Registry updates necessary to support SWORD.
- Fix various problems with resources potentially not being freed, and other minor fixes suggested by FindBugs
- Replace URLEncoder with StringEscapeUtils for better fix of escaping the hidden query field
- Fix #2034372: Resorting in JSPUI gives no results
- Fix #1714851: set eperson.subscription.onlynew in dspace.cfg to only include items that are new to the repository
- Fix issue where the browse and search indexes will not be updated correctly if you move an Item
- Fix problem with SWORD not accepting multiple concurrent submissions
- Fix #1963060 Authors listed in reverse order
- Fix #1970852: XMLUI: Browse by Issue Date "Type in Year" doesn't work
- Statistics viewer for XMLUI, based on existing DStat. Note that this generates the view from the analysis files (.dat) and does not require HTML report generation.
- Fixed incorrect downloading of bitstream on withdrawn item
- Add JSPUI compatible log messages to XMLUI transformers
- Clean up use of ThreadLocal
- Improved cleanup of database resources when web application is unloaded
- Fix bug #1931799: duplicate "FROM metadatavalue"
- Fixed Oracle bugs with ILIKE operators and LIMIT/OFFSET clauses

Jaeksoft WebSearch 0.3
Open source web search engine and crawler.
An open source web search engine built in Java. Full-featured: a powerful and scalable crawler, efficient indexing for relevant results, a fast searcher with snippets, and result rendering in HTML or XML format. Jaeksoft WebSearch is based on the best Java technologies: the Apache Lucene text search engine library, the stable and powerful Apache Tomcat server, and an ergonomic user interface powered by JavaServer Faces and JBoss RichFaces.
Jericho HTML Parser 3.0
Free and open source HTML parser for your Mac.
Jericho HTML Parser is an open source Java library allowing analysis and manipulation of parts of an HTML document, including server-side tags, while reproducing verbatim any invalid or unrecognized HTML.
Jericho HTML Parser also provides high-level HTML form manipulation functions.
Main features:
- The presence of badly formatted HTML does not interfere with the parsing of the rest of the document, which makes the library ideal for use with "real-world" HTML that chokes other parsers.
- PHP, JSP, ASP, PSP and Mason server tags are explicitly recognized by the parser. This means that normal HTML is still parsed properly even if there are server tags inside them, which is common for example when dynamically setting element attributes.
- It is neither an event nor tree based parser, but rather uses a combination of simple text search, efficient tag recognition and a tag position cache. The text of the whole source document is first loaded into memory, and then only the relevant segments searched for the relevant characters of each search operation.
- Compared to a tree based parser such as DOM, the memory and resource requirements can be far better if only small sections of the document need to be parsed or modified. Incorrect or badly formatted HTML can easily be ignored, unlike tree based parsers which must identify every node in the document from top to bottom.
- Compared to an event based parser such as SAX, the interface is on a much higher level and more intuitive, and a tree representation of the document element hierarchy is easily created if required.
- The begin and end positions in the source document of all parsed segments are accessible, allowing modification of only selected segments of the document without having to reconstruct the entire document from a tree.
- The row and column number of each position in the source document are easily accessible.
- Provides a simple but comprehensive interface for the analysis and manipulation of HTML form controls, including the extraction and population of initial values, and conversion to read-only or data display modes. Analysis of the form controls also allows data received from the form to be stored and presented in an appropriate manner.
- Custom tag types can be easily defined and registered for recognition by the parser.
- Built-in functionality to extract all text from HTML markup, suitable for feeding into a text search engine such as Apache Lucene.
- Built-in functionality to render HTML markup with simple text formatting.
- Built-in functionality to format HTML source code that indents elements according to their depth in the document element hierarchy.
- Built-in functionality to compact HTML source code by removing all unnecessary white space.
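The position-based approach in the list above (search the in-memory source text, record where each tag begins and ends, then work only with those segments) can be illustrated with a rough sketch in plain Java. This is not Jericho's API: the class `TagSpanSketch`, its `Span` type and both methods are hypothetical, and the regular expression is a deliberately crude stand-in for real tag recognition that ignores comments, scripts and `>` characters inside attribute values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Crude position-based tag scanner. Illustrative only; NOT Jericho's API. */
final class TagSpanSketch {
    /** Begin (inclusive) and end (exclusive) offsets of one tag in the source text. */
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    // Assumption: anything between '<' and the next '>' counts as a tag.
    private static final Pattern TAG = Pattern.compile("<[^<>]*>");

    /** Locates every tag-like segment and records its position in the source. */
    static List<Span> findTags(String source) {
        List<Span> spans = new ArrayList<>();
        Matcher matcher = TAG.matcher(source);
        while (matcher.find()) {
            spans.add(new Span(matcher.start(), matcher.end()));
        }
        return spans;
    }

    /** Returns the text outside the recorded tag spans, with whitespace collapsed. */
    static String extractText(String source) {
        StringBuilder text = new StringBuilder();
        int cursor = 0;
        for (Span tag : findTags(source)) {
            text.append(source, cursor, tag.begin).append(' ');
            cursor = tag.end;
        }
        text.append(source.substring(cursor));
        return text.toString().trim().replaceAll("\\s+", " ");
    }
}
```

Since no tree is built, an unclosed element such as the `<b>` in `<p>Hello <b>world</p>` does not stop the scan, and the recorded offsets could equally be used to rewrite individual segments of the source in place.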