OrcaTec LLC, a leading provider of information retrieval software and consulting, today announced the release of Version 2.0 of the OrcaTec Information Retrieval Toolkit. The Toolkit will be distributed as an rPath-based software appliance, making it ultra-simple to install and maintain.
This easy-to-deploy software appliance provides an integrated collection of information analysis and management services, including concept search, near-duplicate clustering, language identification, and an interesting-phrase finder. These services are ideal for building scalable, reliable, and effective information analysis and management applications.
“For most organizations, 20 to 30 percent of the documents in their repository may be exact or near duplicates of one another,” said Herbert Roitblat, Ph.D., Principal of OrcaTec LLC. “Near-duplicates clutter search results and place a heavy burden on analysis. This software identifies these near-duplicates and allows the system to take appropriate action.”
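To illustrate what near-duplicate identification involves in general, here is a minimal Python sketch that compares documents by their overlapping three-word sequences (shingles) and flags pairs whose Jaccard similarity meets a threshold. This is a common textbook approach and is not a description of OrcaTec's patent-pending method; the function names, the shingle size, and the 0.5 threshold are all assumptions chosen for the example.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Very short texts become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(docs, threshold=0.5):
    """Return (id, id, score) for every pair at or above the threshold.

    `docs` maps a document id to its text. The pairwise comparison is
    quadratic, which is fine for a sketch but not for a large repository.
    """
    sets = {doc_id: shingles(text) for doc_id, text in docs.items()}
    ids = sorted(sets)
    pairs = []
    for i, left in enumerate(ids):
        for right in ids[i + 1:]:
            score = jaccard(sets[left], sets[right])
            if score >= threshold:
                pairs.append((left, right, score))
    return pairs
```

Two messages that differ only in their final word would be flagged as near-duplicates here, while an unrelated message would not. A system handling millions of documents would typically replace the pairwise loop with a hashing scheme such as MinHash.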
“Software appliances present a win-win scenario for OrcaTec and its customers,” said Billy Marshall, CEO of rPath. “Customers get the OrcaTec applications without installation and maintenance hassles, and OrcaTec reduces the cost of customer service by eliminating support issues.”
About the OrcaTec Information Retrieval Toolkit
The patent-pending OrcaTec Information Retrieval Toolkit is designed to be a key component of systems for enterprise search, legal discovery, business intelligence, text data mining, content management, email archiving, knowledge management, and many other applications: anywhere finding is more important than searching.
OrcaTec Concept Searching learns the meaning of words from the documents that it reads, without having to rely on domain experts. Concept searching allows users to find information even when they do not know the exact words that a document’s author used. It provides more accurate results than can be obtained with ordinary search engines, and it is far easier to set up and maintain than systems that rely on taxonomies or ontologies. Built on top of Lucene, the Toolkit also includes the full complement of Boolean and proximity searching that users have come to expect.
Version 2.0 supports data ingestion rates as high as two million documents per day per system. These documents can be in any language and from any source.
The Toolkit is based on language modeling, which is the process of analyzing the patterns of language usage in a body of text and using these patterns to organize and retrieve that text. The Toolkit has a powerful yet easy-to-use REST-based API.
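As a rough illustration of language modeling applied to retrieval, the following Python sketch ranks documents by query likelihood under a smoothed unigram model: each document's word frequencies are blended with the frequencies of the whole collection, and documents are ordered by how probable the query is under each blend. This is one standard textbook formulation, offered only as an assumption about the general technique; it does not describe the Toolkit's actual models or API, and the function names and the smoothing weight `lam` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def rank_by_query_likelihood(query, docs, lam=0.5):
    """Rank document ids by log P(query | document), best first.

    `docs` maps a document id to its text. `lam` weights the document
    model against the collection model (Jelinek-Mercer smoothing).
    """
    doc_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}

    # Collection-wide term counts serve as the background model.
    collection = Counter()
    for counts in doc_counts.values():
        collection.update(counts)
    total = sum(collection.values())

    scores = {}
    for doc_id, counts in doc_counts.items():
        length = sum(counts.values())
        log_p = 0.0
        for term in tokenize(query):
            p_doc = counts[term] / length if length else 0.0
            p_coll = collection[term] / total if total else 0.0
            p = lam * p_doc + (1 - lam) * p_coll
            if p == 0.0:
                # The term occurs nowhere in the collection, so it cannot
                # distinguish between documents; skip it.
                continue
            log_p += math.log(p)
        scores[doc_id] = log_p

    return sorted(scores, key=scores.get, reverse=True)
```

Given a query such as "signed contract", a document that contains both words ranks above one that contains neither, because the smoothed probability of each query term is higher under its model.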