  
	
  
  
	
	  Features and System requirements
	
	
	  ht://Dig Copyright © 1995-2004 The ht://Dig Group (see THANKS.html).
	  Please see the file COPYING for
	  license information.
	
	
	
	  Features
	
	
	  Here are some of the major features of ht://Dig. They are in
	  no particular order.
	
	
	
	  
		
*		Intranet searching
	  
	  
		ht://Dig has the ability to search through many servers
		on a network by acting as a WWW browser.
	  
	  
		
*		It is free
	  
	  
		The whole system is released under the
		GNU Library General Public License (LGPL); see the file COPYING.
	  
		
*		Robot exclusion is supported
	  
	  
		The Standard for Robot Exclusion
		(http://www.robotstxt.org/wc/norobots.html) is
		supported by ht://Dig (see meta.html#robots).
	  
		
*		Boolean expression searching
	  
	  
		Searches can be arbitrarily complex using boolean
		expressions.
	  
	  
		
*		Phrase searching
	  
	  
		A phrase can be searched for by enclosing it in quotes.
		Phrase searches can be combined with word searches, as in
		
Linux and "high quality".
	  
	  
		
*		Configurable search results
	  
	  
		The output of a search can easily be tailored to your
		needs by means of providing HTML templates.
	  
	  
		
*		Fuzzy searching
	  
	  
		Searches can be performed using various configurable
		algorithms (see attrs.html#search_algorithm).
		Currently the following algorithms are
		supported (in any combination):
		
		  
			exact
		  
		  
			soundex
		  
		  
			metaphone
		  
		  
			common word endings
		  
		  
			synonyms
		  
		  
			accent stripping
		  
		  
			substring and prefix
		  
		  
			regular expressions
		  
		  
			simple spelling corrections
		  
		
	  
	  
		
*		Searching of many file formats
	  
	  
		Both HTML documents and plain text files can be
		searched directly by ht://Dig itself. There is also a
		mechanism (see attrs.html#external_parsers)
		to allow external programs ("external parsers") to be used
		while building the database so that arbitrary file formats
		can be searched.
	  
	  
		
*		Document retrieval using many transport services
	  
	  
		Several transport services can be handled by ht://Dig,
		including http://, ftp:// and file:///.
		There is also a
		mechanism (see attrs.html#external_protocols)
		to allow external programs ("external protocols") to be used
		while building the database so that arbitrary transport
		services can be used.
	  
	  
		
*		Keywords can be added to HTML documents
	   
	  
		Any number of keywords (see meta.html) can be added to
		HTML documents; they will not show up when the document
		is viewed. This is used to make a document more likely
		to be found and also to make it appear higher in the
		list of matches.
	  
	  
		
*		Email notification of expired documents
	  
	  
		Special meta information can be added to HTML documents
		which can be used to notify the maintainer of those
		documents at a certain time (see notification.html).
		It is handy to get reminded when to remove the "New"
		images from a certain page, for example.
	  
	  
		
*		A protected server can be indexed
	  
	  
		ht://Dig can be told to use a specific
		username and password (see attrs.html#authorization)
		when it retrieves documents. This can be used
		to index a server or parts of a server that are
		protected by a username and password.
	  
	  
		
*		Searches on subsections of the database
	  
	  
		It is easy to set up a search which only returns
		documents whose URL matches a certain pattern
		(see hts_form.html#restrict).
		This becomes very useful for people who want to make their
		own data searchable without having to use a separate
		search engine or database.
	  
	  
		
*		Full source code included
	  
	  
		The search engine comes with full source code. The
		whole system is released under the terms and conditions
		of the GNU Library General Public License (LGPL) version
		2.0 (see the file COPYING).
	  
	  
		
*		The depth of the search can be limited
	  
	  
		Instead of limiting the search to a set of machines, it
		can also be restricted to documents that are a certain
		number of "mouse-clicks" away from the start document
		(see attrs.html#max_hop_count).
	  
	  
		
*		Full support for the ISO-Latin-1 character set
	  
	  
		Both SGML entities like '&agrave;' and ISO-Latin-1
		characters can be indexed and searched.
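
	  As an illustration of the last feature (and of the "accent
	  stripping" algorithm in the fuzzy searching list), here is a
	  minimal Python sketch using only the standard library. It is not
	  ht://Dig's own code; the function names decode_entities and
	  strip_accents are made up for this example.

```python
import html
import unicodedata


def decode_entities(text: str) -> str:
    """Replace character entities such as '&agrave;' with the character itself."""
    return html.unescape(text)


def strip_accents(text: str) -> str:
    """Drop combining accent marks so that an accented letter matches its base letter."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


word = decode_entities("voil&agrave;")   # entity form, as it might appear in HTML
print(word)                              # the ISO-Latin-1 character form
print(strip_accents(word))               # the accent-free form
```

	  With both steps applied, the entity form, the ISO-Latin-1 form
	  and the unaccented form of a word can all be matched against
	  each other.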
	  
	
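
	  To give an idea of what a fuzzy algorithm such as "soundex" in
	  the list above does, the following Python sketch implements the
	  classic American Soundex code. It is an illustration only:
	  ht://Dig's own soundex implementation may differ in its details,
	  and the names CODES and soundex are chosen for this example.

```python
# Letter groups of the classic Soundex scheme; vowels, h, w and y have no code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str) -> str:
    """Return the four-character Soundex code of word ('' if it has no ASCII letters)."""
    letters = [ch for ch in word.lower() if ch.isascii() and ch.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue                  # h and w do not separate equal codes
        code = CODES.get(ch, "")
        if code and code != previous:
            result += code            # a new consonant sound
        previous = code               # a vowel resets the previous code
        if len(result) == 4:
            break
    return result.ljust(4, "0")       # pad short codes with zeros


print(soundex("Robert"), soundex("Rupert"))
```

	  Words that sound alike receive the same code, so a search for
	  one of them can also return documents containing the other.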
	
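
	  The depth limit described above can be pictured as a
	  breadth-first walk over the link graph that stops after a given
	  number of hops. The sketch below works on a small in-memory
	  dictionary of links instead of real documents; the function
	  urls_within_hops and the sample file names are hypothetical and
	  only illustrate the idea.

```python
from collections import deque


def urls_within_hops(links, start, max_hop_count):
    """Map every URL reachable within max_hop_count links of start to its hop distance."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if hops[url] == max_hop_count:
            continue                      # do not follow links any deeper
        for target in links.get(url, ()):
            if target not in hops:        # first visit gives the shortest distance
                hops[target] = hops[url] + 1
                queue.append(target)
    return hops


# Hypothetical site: each key links to the documents in its list.
site = {
    "index.html": ["a.html", "b.html"],
    "a.html": ["c.html"],
    "b.html": ["index.html"],
    "c.html": ["d.html"],
}
print(urls_within_hops(site, "index.html", 2))
```

	  In this example d.html is three "mouse-clicks" away from
	  index.html, so a limit of 2 leaves it out.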
	
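
	  Robot exclusion, listed among the features above, means that a
	  robot reads a site's robots.txt file before it retrieves
	  documents. The following sketch shows how such a file is
	  interpreted, using the robots.txt parser from the Python
	  standard library and not ht://Dig itself. The sample rules, the
	  host name and the user agent name "htdig" are assumptions made
	  for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: every robot is kept out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt let user_agent retrieve url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed(ROBOTS_TXT, "htdig", "http://www.example.com/public/a.html"))
print(allowed(ROBOTS_TXT, "htdig", "http://www.example.com/private/a.html"))
```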
	
	  Requirements to build ht://Dig
	
	
	  ht://Dig was developed under Unix using C++.
	
	
	  For this reason, you will need a Unix machine, a C compiler
	  and a C++ compiler. (The C compiler is needed to compile some
	  of the GNU libraries.)
	
	
	  Unfortunately, we only have access to a couple of different
	  Unix machines. ht://Dig has been tested on these machines:
	
	
	  
		FreeBSD 4.6 (using gcc 2.95.3) 
	  
	  
	        Mandrake Linux 8.2 (using gcc 3.2) 
	  
	  
		Debian, 2.2.19 kernel (using gcc 2.95.4) 
	  
	  
	        Debian on an Alpha 
	  
	  
	        RedHat 7.3, 8.0 
	  
	  
	        Sun Solaris 2.8 = SunOS 5.8 (using gcc 3.1) 
	  
	  
	        Sun Solaris 2.8 = SunOS 5.8 (using Sun's cc / g++ 3.1) 
	  
	  
	        Mac OS X 10.2 (using gcc) 
	  
 	
	There are reports of ht://Dig working on a number of other platforms.
	
	  libstdc++
	
	
	  If you plan on using g++ to compile ht://Dig, you have to make
	  sure that libstdc++ has been installed. Unfortunately, libstdc++ is a
	  separate package from gcc/g++. You can get libstdc++ from the
	  GNU software archive (ftp://ftp.gnu.org/pub/gnu/).
	
	
	
	  Disk space requirements
	
	
	  The search engine will require lots of disk space to store
	  its databases. Unfortunately, there is no exact formula to
	  compute the space requirements. It depends on the number of
	  documents you are going to index but also on the various
	  options you use.
	  
	  
	  As a temporary measure, 3.2 betas use a very inefficient
	  database structure to enable phrase searching.  This will be
	  fixed before the release of 3.2.0.  Currently, indexing a site of
	  around 10,000 documents gives a database of around 400MB using the
	  default setting for
	  maximum document size (see attrs.html#max_doc_size) and storing the
	  first 50,000 bytes of each document (see attrs.html#max_head_length)
	  to enable context to be displayed.
	  
	
	  Keep in mind that we keep at most 50,000 bytes of each
	  document. This may seem a lot, but most documents aren't very
	  big and it gives us a big enough chunk to almost always show
	  an excerpt of the matches.
	
	
	  You may find that if you store most of each document, the
	  databases are almost the same size as, or even larger than, the
	  documents themselves! Remember that if you're storing a
	  significant portion of each document (say 50,000 bytes as
	  above), you have that requirement, plus the size of the word
	  database and all the additional information about each document
	  (size, URL, date, etc.) required for searching.
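
	  Since there is no exact formula, the figures quoted above can
	  only serve as a rough starting point. The Python sketch below
	  scales the one reported data point (around 400MB for around
	  10,000 documents) linearly, and also computes the upper bound on
	  the stored excerpts. The linear scaling is an assumption, not a
	  measured property of ht://Dig, and the function names are
	  invented for this example.

```python
REFERENCE_DOCUMENTS = 10_000      # site size quoted in the text above
REFERENCE_DATABASE_MB = 400       # database size reported for that site
MAX_HEAD_LENGTH = 50_000          # bytes stored per document for excerpts


def estimate_database_mb(document_count: int) -> float:
    """Very rough estimate, assuming size grows linearly with document count."""
    return document_count * REFERENCE_DATABASE_MB / REFERENCE_DOCUMENTS


def max_excerpt_mb(document_count: int, max_head_length: int = MAX_HEAD_LENGTH) -> float:
    """Upper bound on excerpt storage: every document fills its whole excerpt."""
    return document_count * max_head_length / 1_000_000


print(estimate_database_mb(25_000))   # extrapolated from the 10,000-document figure
print(max_excerpt_mb(25_000))         # worst case for the stored excerpts alone
```

	  Real sites will deviate from both numbers, because document
	  sizes and the options in use vary, so it is worth measuring the
	  databases after a first test run.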
	
	
	Last modified: $Date: 2004/05/28 13:15:19 $
  
