Internet-Draft                                              Carlos Bueno
                                                                May 2004
                                              

             A Distributed Web Search Protocol -- Dowser/0.1
			draft-dowser-spec-00.txt 
             

Status of This Memo

   This document is an Internet-Draft and is subject to all provisions 
   of Section 10 of RFC2026.
         
   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-
   Drafts as reference material or to cite them other than as
   "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html
   
   This document Expires on November 04, 2004

Abstract
   
   Dowser is an application-level, peer-to-peer protocol for creating a
   searchable index and cache of documents. It is intended for both 
   small-scale intranets and Worldwide Web-scale indexes. This document
   describes the messages that members, or "nodes", of the network pass 
   to each other to distribute workload, request & respond to queries, 
   and maintain the index.


Bueno                                                          [Page 1]

Internet-Draft                 Dowser/0.1                       May 2004


Table of contents

	1. Introduction  ................................................. 3
	1.1  Purpose  .................................................... 3
	1.2  Definitions  ................................................ 3
	1.3  URL vs. URI vs. URN  ........................................ 4
	1.4  Requirements  ............................................... 5
	2. Some Examples  ................................................ 5
	2.1  NODEFIND  ................................................... 5
	2.1.1 Node-id & seed  ............................................ 6
	2.1.2 Last-key  .................................................. 6
	2.1.3 Ring-id  ................................................... 7 
	2.1.4 Port  ...................................................... 7
	2.1.5 Server Responses  .......................................... 7
	2.2 SEARCH  ...................................................... 8
	2.2.1 Expires  ................................................... 9
	2.2.2 Content-key  ............................................... 9
	2.2.3 Search syntax  ............................................. 9
	2.3. URL caching  ............................................... 10
	2.4. Content caching  ........................................... 11
	2.5. CRAWL  ..................................................... 11
	2.6.INDEXADD  ................................................... 12
	3. Announcing  .................................................. 13
	4. Redundancy  .................................................. 14
	5. Optional headers  ............................................ 14
	6. Error conditions  ............................................ 14
	7. Notes on implementation  ..................................... 15
	7.1 Ranking  .................................................... 15
	7.2 Proxying  ................................................... 15
	7.3 TCP vs UDP  ................................................. 16
	7.4 Ping/Pong  .................................................. 16
	7.5 MHTML  ...................................................... 16
	7.6 Spam  ....................................................... 16
	7.7 Indexing  ................................................... 16
	8. Acknowledgements  ............................................ 16
	9. References  .................................................. 17
	9.1 Informative  ................................................ 17
	9.2 Normative  .................................................. 17
	10. Security Considerations  .................................... 17
	10.1 Personal Information  ...................................... 18
	10.2 Sensitive data  ............................................ 18
	11. IPR & Copyright  ............................................ 18
	12. Contact Information  ........................................ 19


Bueno                                                          [Page 2]

Internet-Draft                 Dowser/0.1                       May 2004


1. Introduction

1.1 Purpose

   The express purpose of the Dowser protocol is to encourage open 
   research into web-scale, decentralized indexing systems. The 
   specification for the Hypertext Transfer Protocol [RFC1945] 
   has this observation:

      Practical information systems require more functionality than
      simple retrieval, including search, front-end update, and
      annotation. HTTP allows an open-ended set of methods to be used to
      indicate the purpose of a request.
      
   HTTP is in widespread use by hundreds of millions of people every 
   day, and they issue hundreds of millions of searches for information. 
   Most of these searches are served by centralized "search engines" 
   that cache copies of as many websites as possible to create their 
   indexes. The problems of scale have so far been admirably met 
   [GOOGLE], but the operating costs of web-scale engines are now out of 
   the reach of most Universities and corporations. Conversely, larger 
   and larger amounts of storage and processing power are available to 
   personal computer users every year. 

   Dowser is an extension to HTTP that can be used by itself or added to 
   existing HTTP implementations. Each node on the network claims a 
   small part of the keyspace of a distributed hash table [CHORD]. 
   Indexes of documents are keyed under the hash of the word; documents 
   themselves  are keyed by their Uniform Resource Identifier [URI]. 
   There is also a  "cache of caches", to allow popular documents or 
   indexes to be  retrieved from any node that has a copy.

   The syntax and formatting rules are inherited from HTTP/1.1 
   [RFC2616]. The required "Host" header in HTTP/1.1 does not really 
   apply to Dowser, but including it shouldn't hurt anyone. Several 
   other headers and return codes are adopted for use.


1.2 Definitions

   node
   
      In this protocol there is no difference between a "client" or a 
      "server". They are all equal "nodes" on the network, capable of 
      sending and serving requests. Each node is identified by a 
      40-digit hexadecimal number called a "node-id". They participate 
      in creating, storing and serving a distributed hash table. This 
      hash table can contain document caches, indexes of those 
      documents, and other data.

      For sanity's sake, the term "server" can be taken to mean 
      "responding node" and "client" to mean "requesting node" in the 
      context of the conversation being described.


Bueno                                                          [Page 3]

Internet-Draft                 Dowser/0.1                       May 2004


   hash, key
   
      Hash functions are used to evenly distribute data around the 
      network and to uniquely identify nodes and pieces of information. 
      Dowser uses the Secure Hash Algorithim, version 2 [SHA]. The terms 
      "hash" and "key" refer to the 40-digit hexadecimal number gotten 
      from passing some piece of information through the SHA function, 
      e.g.:
    
         SHA("foo") == 0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33

   distributed hash table
    
      A distributed hash table is a key-->value mapping that is 
      contained on many computers, where they keys are hashed.

   range
    
      The "range" of a node is defined by the first and last key of the 
      hash table it is expected to store data under.

   left-hand, right-hand
       
      The total keyspace is imagined as the perimeter of a circle, or a 
      "ring". While standing in the center, points to the left generally 
      decrease and points to the right generally increase, except at the 
      point where "fff..." meets "000...". 

      A useful analog is timezones, where West is left and East is 
      right. The time is always earlier to the left except at the 
      International Date Line.


1.3 URL vs. URI vs. URN
   
   The protocol descibed here has a focus different from most 
   filesharing protocols. Searches for files are not broadcast but 
   targeted; nodes are authoritative regarding the location and 
   checksums of data in their ranges. Searches for a particular node are 
   sent to arbitrary nodes but not broadcast per se; they are not 
   forwarded but responded to directly.

   In most ad-hoc filesharing protocols the only unique identifier is 
   the checksum. The filesystem names and metadata may be different for 
   different copies but the checksum remains the same. In HTTP jargon 
   this is a Universal Resource Name or URN , that which uniquely 
   identifies the file without necessarily referencing its location or 
   common name [RFC2396].

   The content of a cached web page may change with time. This means the 
   checksum will change while its Univeral Resource Locator, the origin 
   of the page, remains constant.

   Indexes built from these cached pages are also treated as files, one 
   index file per word or phrase. An index's Universal Resource Name is 


Bueno                                                          [Page 4]

Internet-Draft                 Dowser/0.1                       May 2004


   that word. (More correctly it would be "dowser://foo", not "foo".)

   Searching in Dowser therefore starts with a lookup for the node(s) 
   who handle the search term(s). The search terms (URN) are then sent 
   to those nodes, resulting in either a) the content of index file or 
   b) the latest checksum of the index file and a list of nodes that 
   have previously downloaded a copy. Retrieving a cache of a web page 
   is done the same way.

   
1.4 Requirements

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [34].

   An implementation is not compliant if it fails to satisfy one or more
   of the MUST or REQUIRED level requirements for the protocols it
   implements. An implementation that satisfies all the MUST or REQUIRED
   level and all the SHOULD level requirements for its protocols is said
   to be "unconditionally compliant"; one that satisfies all the MUST
   level requirements but not all the SHOULD level requirements for its
   protocols is said to be "conditionally compliant."
   
   In reality, the complications of this protocol are mostly in the  
   state information, not the ins and outs of the messages being passed.
   

2. Some Examples

   We will assume the reader is familiar with HTTP transactions. It is 
   probably most instructive to give examples using the new methods and 
   then a discussion of particulars.  Elipses ("...") are used to 
   truncate long hashes for readability.


2.1 NODEFIND

   NODEFINDs are the bread and butter of routing between nodes. The 
   client wants to find the node(s) that have items under a certain hash 
   key, in this example, "5c89379d0aa9840ac910fd8cacbde2dbf877214a". 
   This can be a search term, url, or anything. The responding node may 
   or may not have this key. If not, it tells the client about nodes it 
   knows about that are closer to the target.


      NODEFIND 5c89379d0aa9840ac910fd8cacbde2dbf877214a Dowser/0.1
      Ring-Id: deadbeef00000000000000000000000000000000
      Node-Id: 12222... a5754...
      Last-key: 13333...
      Port: 4666
      

   - Response -
   

Bueno                                                          [Page 5]

Internet-Draft                 Dowser/0.1                       May 2004


      Dowser/0.1 310 Not in my range
      Ring-Id: deadbeef00000000000000000000000000000000
      Node-Id: 48888...  de34....
      Last-key: 49999...

      192.0.2.10 8000 50000f... 599999...
      192.0.2.20 6876 43333a... 610000...
 
 
   - Or -
   
   
      Dowser/0.1 211 That's me
      Ring-Id: deadbeef00000000000000000000000000000000
      Node-Id: 4000... de34....
      Last-key: 7aaaa...
 

   There is a lot to cover here. The first line of the request is like 
   HTTP, identifying the protocol and version, the 'method' or action to 
   be done, and the 'path' or key being requested. There are three 
   additional headers which MUST be included in any message: Ring-id, 
   Node-id, and Last-key. Additionally, the Port header MUST be included 
   in every request.


2.1.1 Node-id & seed
   
   The node-id is comprised of two 40-digit hexadecimal numbers, 
   separated by whitespace. The first number serves as the FIRST key in 
   this node's range. This number MUST be generated by hashing some 
   random data iteratively. To help ensure this has occurred, the second 
   number (the "seed") MUST be the second-to-last result of the 
   iterations.

         Node-id: b274f2e2a8d2881035af5866014e9ad5510ab15d
            cdd2ae2594a83ef90c05ee6014b78631db8538d8

   This example (linewrapped to fit) is well-formed because the first 
   number is the SHA hash of the second number. Servers MUST check that 
   the node-id and its "seed" are correct. If not, they MUST send a 412 
   "Precondition failed" error.


2.1.2 Last-key
   
   Last-key is also an SHA hash, and is the LAST key in this node's 
   range. The last-key may change over time, as nodes leave or join the 
   network.

   Note: it has been brought up that a node might be able to route a 
   large portion of traffic to itself by making its Last-key equal to 
   the node-id minus 1. In Internet Protocol terms, this would be 
   similar to setting your network address to 255.255.255.255. There is 


Bueno                                                          [Page 6]

Internet-Draft                 Dowser/0.1                       May 2004


   nothing in the protocol to prevent this, but implementations should 
   come up with a way to deal with it reasonably. Our feeling that it is 
   self-correcting. If someone tries it on a network with a lot of nodes 
   one hopes he or she has good fire-supression equipment handy; but the 
   concern is the possible damage to the network routing.


2.1.3 Ring-id
   
   Ring-id is yet another SHA hash to identify the group of nodes it 
   wishes to communicate with. This allows many Dowser networks to be 
   running on an internet at once. Alternate rings are useful for 
   testing implementations of the protocol on a large scale, or for 
   partitioning networks in general.

   The "official" public ring-id:

      Dowser/0.1  -->  deadbeef00000000000000000000000000000000
 
   If the Ring-id header sent does not match the ring-id of the server, 
   it MUST send a 412 "Precondition falied" message, and may provide 
   information in the body of the response as to the reason. Note: there 
   is a security risk in actually revealing the server's ring-id in the 
   response, if the ring is intended for private data.


2.1.4 Port

   Simple enough: we need to advertise what TCP or UDP port we are 
   listening on for future reference.


2.1.5 Server Responses

   If the hash is not in the listening node's range, it returns a 310 
   code and a list of nodes that are closer to the target.  The list of 
   nodes takes this form:
   
      IP PORT NODE-ID LAST-KEY
      
      
   If it is in range, it sends a 211 "That's Me" code.

   When overlap occurs a node may choose which one to contact first in 
   an implementation-dependent way, such as IP or network distance, or 
   closeness of the desired key to the node-id.

   Note: IPv4 addresses are used in these examples, but implementations 
   SHOULD be written to handle IPv6 addresses as well, e.g.:

      FEDC:BA98:7654:3210:FEDC:BA98:7654:3210         
      1080:0:0:0:8:800:200C:4171
      3ffe:2a00:100:7031::1
      1080::8:800:200C:417A
      ::192.9.5.5


Bueno                                                          [Page 7]

Internet-Draft                 Dowser/0.1                       May 2004


      ::FFFF:129.144.52.38
      2010:836B:4179::836B:4179
      

2.2 SEARCH

   We have the basis of all conversations, and have described how nodes 
   identify themselves and each other. Now we can go to the next level: 
   content routing.


   - Request -

      SEARCH foo+bar+baz Dowser/0.1
      Ring-Id: deadbeef00000000000000000000000000000000
      Node-Id: 1d3470c... 55754...
      Last-key: 1e400cc...
      Port:  4666


   - Response -
      
      Dowser/0.1  200 OK
      Ring-Id: deadbeef00000000000000000000000000000000
      Node-Id: 48888... de34...
      Last-key: 49999...
      Expires: 864000
      Content-key: 3fdb13677b10691debb3909dd917b00ee751115a

      URL TITLE AGE SNIPPET [SNIPPET [SNIPPET [...]]]
      URL TITLE AGE SNIPPET [SNIPPET [SNIPPET [...]]]
      ...


   - Or -
   
      Dowser/0.1 300 Multiple choices
      Ring-Id: deadbeef00000000000000000000000000000000
      Node-Id: 48888... de34...
      Last-key: 49999...
      Expires: 864000
      Content-key: 3fdb13677b10691debb3909dd917b00ee751115a

      192.0.2.10 8000 50000f... 599999...
      192.0.2.20 6876 43333a... 610000...
 
 
   A SEARCH request is almost exactly like a nodefind except the "path" 
   is a list of search terms. The server determines if any of the terms 
   fall into its range. How to break up the terms may be 
   implementation-specific, but the simplest is to break on whitespace.


   If any are in-range it returns a 300 code, the "content-key" of the 
   data (a  SHA checksum of the content) a list of the nodes that have 


Bueno                                                          [Page 8]

Internet-Draft                 Dowser/0.1                       May 2004


   requested that data.

   Or, it may return the index data itself. Requesting nodes SHOULD keep 
   a copy of the data it receives for a reasonable amount of time, as 
   hinted by the Expires: header. The data should be keyed under the 
   content-key, for retrieval via a CACHE request. This helps distribute 
   the workload. The index is in the form:
   
      URL TITLE AGE SNIPPET [SNIPPET [SNIPPET [...]]]

   And is tab-delimited. AGE is the number of seconds since the record 
   was created. One or more "SNIPPET"s contain text that form an 
   abstract of the document.

   As always, the server can alternately send the standard 310 Not in 
   Range, or some error code.


2.2.1  Expires:
   
   Indexes and caches of web documents are not long-lived in the way a 
   media file in traditional peer-to-peer networks are. The Expires 
   header hints at how long a node should cache this data, expressed in 
   seconds from the present time.
   

2.2.2 Content-key:

   This is a SHA hash of the data's content, used to positively identify 
   it. This is different from the hash of the URL or search term. A page 
   whose URL is

      "http://foo.example.com"

   ... and whose content is

      "<h1>Foo!</h1>"

   ... would have a URL hash of cfab46bb7dbd11e6187360d429586e2942f2d42e 
   and a content-key of cf5ce65061218164e4148038cc3a56a9e988fe7a. The 
   URL hash is a unique identifier of the document's origin, while the 
   content-key is an identifier of a particular version of that 
   document.


2.2.3 Search syntax

   There are a few schemes for search syntax out in the wild. The most 
   common has operands for words that must or must not appear, and exact 
   phrases. Implementations MUST honor these operands to the best of 
   their ability.

      foo +bar
      

Bueno                                                          [Page 9]

Internet-Draft                 Dowser/0.1                       May 2004


   Means that the documents MUST contain "bar" and MAY contain "foo"
 
      foo -bar
      
   Means that the documents MUST contain "foo" and MUST NOT contain 
   "bar". When there is only one term without a "-" operand, the "+" is
   implied.

      "foo bar"
      'foo bar'
      
   Means that the documents MUST contain the exact phrase "foo bar", in
   that order.


   Note: in the SEARCH example in section 2.0, it may seem like the "+" 
   signs in "foo+bar+baz" are search syntax. They are not; the terms are 
   "url-encoded", meaning spaces are encoded as "+" and other non-word 
   characters are encoded as %HEX where HEX is the hexadecimal ASCII 
   code. Therefore "foo +bar" would appear as "foo+%2Bbar"

   Implementations may honor other operands as they see fit.


2.3. URL caching

   Like a SEARCH, a URLCACHE does not necessarily return the data -- it 
   can return the content-hash of the data and a list of the last X 
   nodes to request that data, where X is implementation-specific. If 
   none of the listed nodes have the data, the requesting node may then 
   request it from the original node.

   - Request - 

      URLCACHE fb9642989a2222305832a5d9c29b34... Dowser/0.1
      Ring-Id: deadbeef00000000000000000000000000000000
      Node-Id: 1d3470c... a5754......
      Last-key: 1e400...
      Url:  http://foo.example.com/some/page.html
      Port:  4666

   - Response - 
   
      Dowser/0.1  300 Multiple choices
      Ring-Id: deadbeef00000000000000000000000000000000
      Node-Id: 48888...  de34...
      Last-key: 49999...
      Expires: 864000
      Content-key: 3fdb13677b10691debb3909dd917b00ee751115a
  
      192.0.2.10 8000 50000f... 599999...
      192.0.2.20 6876 43333a... 610000...
 
   - Or -


Bueno                                                          [Page 10]

Internet-Draft                 Dowser/0.1                       May 2004


      Dowser/0.1  200 OK
      Ring-Id: deadbeef00000000000000000000000000000000
      Node-Id: 48888...  de34...
      Last-key: 49999...
      Expires: 864000
      Content-key: 3fdb13677b10691debb3909dd917b00ee751115a
 
      ... (file data) ...

  
2.4. Content caching

   This can be page caches, indexes, or really anything. A client sends 
   a request with the content-key as the path:

   - Request -
   
      CACHE 3fdb13677b10691debb3909dd917b00ee751115a Dowser/0.1
      Ring-Id: deadbeef00000000000000000000000000000000
      Node-Id: 1d3470c...  de34...
      Last-key: 1e400...
      Url:  http://foo.example.com/some/page.html
      Port:  4666

   - Response -
   
      Dowser/0.1  200 OK
      Ring-Id: deadbeef00000000000000000000000000000000
      Node-Id: 48888...  de34...
      Last-key: 49999...
      Expires: 864000
      Content-key: 3fdb13677b10691debb3909dd917b00ee751115a
 
      ... (file data) ...

   - Or -
 
      Dowser/0.1 404 not found
      Ring-Id: deadbeef00000000000000000000000000000000
      Node-Id: 48888...   de34...
      Last-key: 49999...

## todo: content-length? simultaneous downloading? the Range: & Content-length: headers?


2.5. CRAWL

   So we know how to find nodes and content. The last question is: where 
   does the content come from? And the indexes?


Bueno                                                          [Page 11]

Internet-Draft                 Dowser/0.1                       May 2004


      CRAWL http://foo.example.com/foo/bar/baz.html Dowser/0.1
      Ring-Id: deadbeef00000000000000000000000000000000
      Node-Id: 12222... a5754......
      Last-key: 13333...
      Referer: http://foo.example.com/foo/index.html
      Port: 4666
      Expires: 864000
      Content-key: 3fdb13677b10691debb3909dd917b00ee751115a


   - Response - 
   
      Dowser/0.1 202 Accepted
      Ring-Id: deadbeef00000000000000000000000000000000
      Node-Id: 48888...   de34...
      Last-key: 49999...


   - Or -
 
      Dowser/0.1 310 Not in my range
      Ring-Id: deadbeef00000000000000000000000000000000
      Node-Id: 48888...   de34...
      Last-key: 49999...
 
      192.0.2.10 8000 50000f... 599999...
      192.0.2.20 6876 43333a... 610000...


   When a node crawls a page, it usually has links on it. If the hash of 
   those second-level URLs are not in the node's range, it sends a CRAWL 
   annoucement to the nodes that do handle that range, so they can 
   continue the crawl.

   A node SHOULD first check (via URLCACHE) if a cached copy exists 
   before downloading a page directly.

   Implementations SHOULD respect the "robots.txt" protocol for nice 
   spiders.
   
   ##todo: so where do we START crawling? that is not part of the 
   protocol because it could be a network of office machines sharing 
   their spreadsheets, or whatever -- need to say that in a nice way.


2.6.INDEXADD

   An INDEXADD is an informational message to a node that a particular 
   URL contains a keyword inside the server's range.


      INDEXADD 09dd917b00ee751115a3fdb13677b10691debb39 Dowser/0.1
      Ring-Id: deadbeef00000000000000000000000000000000


Bueno                                                          [Page 12]

Internet-Draft                 Dowser/0.1                       May 2004


      Node-Id: 12222...  a5754......
      Last-key: 13333...
      Term:  foo bar baz
      Url: http://foo.example.com/foo/bar/baz.html
      Expires: 864000
      Content-key: 3fdb13677b10691debb3909dd917b00ee751115a
      Port: 4666

##todo: positions, counts, tdf*idf?

   - Response -
   
      Dowser/0.1 202 Accepted
      Ring-Id: deadbeef00000000000000000000000000000000
      Node-Id: 48888...  de34..
      Last-key: 49999...

   - Or -
      
      Dowser/0.1 310 Not in my range
      Ring-Id: deadbeef00000000000000000000000000000000
      Node-Id: 48888...  de34..
      Last-key: 49999...
 
      192.0.2.10 8000 50000f... 599999...
      192.0.2.20 6876 43333a... 610000...


   The Term: header contains what words the server may be interested in, 
   separated by tabs. An INDEXADD message implies that the requesting 
   node has a copy of the document. Listening nodes should retrieve it 
   from that node via a CONTENTCACHE message. This puts a cost on adding 
   words to the index. 


3. Announcing

   When a new node joins the network, it must first be provided with a 
   list of "well-known" nodes and a ring-id. Knowing nothing else about 
   the network topology, it generates a node-id and sends a NODEFIND 
   request on that key to those well-known nodes for its own key:


      NODEFIND 12222...0 Dowser/0.1
      Ring-Id: deadbeef00000000000000000000000000000000
      Node-Id: 12222... a5754...
      Last-key: 12222...
      Port:  4666
      

   This eventually leads the new node to its "neighbors" on the ring. 
   The new node's range SHOULD initially be of the same length as its 
   left-hand neighbor, but MAY grow or shrink according to how much 
   capacity it has at hand, but the Last-key MUST NOT be less than the 
   node-id of its closest neighbor.


Bueno                                                          [Page 13]

Internet-Draft                 Dowser/0.1                       May 2004


4. Redundancy

   All nodes may not be available at all times, and the "overlap" is not 
   guaranteed, requiring some fiat redundancy. So:

   Each node assumes a range starting with its nodeid and That defines 
   its "core" range. It SHALL also handle an "auxilary" range of equal 
   length, calculated by adding 8 to the first digit. The addtion does 
   not carry. For example:

   Node A "Core" range:
   
      9000000000000000000000000000000000000000 to
      c000000000000000000000000000000000000000

   Auxilary range:
   
      1000000000000000000000000000000000000000 to
      4000000000000000000000000000000000000000


5. Optional headers

   To save on bandwidth, nodes MAY compress their responses. A 
   requesting node MAY indicate that it can accept compressed data with 
   this optional header:

      Accept-Encoding: gzip, deflate
 
   If a server compresses a response, it MUST inidcate that with this 
   header:

      Content-encoding: gzip


   The server MAY also send this optional header to SEARCH responses to 
   help guide the user:

      Related-terms: meta, syntax, jargon


6. Error conditions

   Many error codes from HTTP/1.1 apply to Dowser.
   
   500 Internal Server error
   
   501 Not Implemented
   
   503 Server busy (Service unavailable)
   

Bueno                                                          [Page 14]

Internet-Draft                 Dowser/0.1                       May 2004


   505 HTTP Version Not Supported
   
   400 Bad request
   
   404 Not Found
   
   412 Precondition failed
   
      When a node's id and seed don't check out, for instance. This is 
      different from a mal-formed request which should be responded to 
      with a 400. Also, nodes may keep a "blacklist" of node-ids and/or 
      IPs that they do not wish to do business with (See SPAM). In the 
      case of blacklists, the server MAY simply close the connection,       
      but it's polite to give some error information.

      Implementations MAY elect to withold the Ring-id header in this 
      case, for security.


7. Notes on implementation

   A reference implementation is currently in alpha testing and should be
   released by mid-summer 2004 [IMPL]
   

7.1 Ranking

   There is not a lot of ranking data returned by SEARCH requests. The 
   idea is to return somewhat raw indexes to the client and let it sort 
   things out through intersecting multiple responses and its own 
   biases. 

   One time-saving trick may be employed by implementations: recall in 
   the SEARCH example, all of the search terms were sent, even though 
   it's unlikely one node will handle all of them. The reason is to 
   allow nodes to pick out terms that tend to accompany the terms they 
   are responsible for. The node may then keep an extra index of those 
   accompanying terms and use it to trim and/or rank the results sent 
   back.


7.2 Proxying

   If a relatively new node receives a request for a key that is in its 
   range but not its datastore, it may create a new request to its 
   upstream neighbor for it, on the theory that it may still be in that 
   node's datastore. It then keeps a copy while forwarding it to the 
   original requesting node. This allows new nodes to build on the work 
   of older ones.

   Implementations might want to work out semantics for long-lived 
   connections on the lines of HTTP/1.1 Keep-alive headers.


Bueno                                                          [Page 15]

Internet-Draft                 Dowser/0.1                       May 2004


7.3 TCP vs UDP

   Several methods (NODEFIND, CRAWL, INDEXADD) have two interesting 
   properties: they will be used often, and involve a very small amount 
   of data. It is tempting to implement those parts over UDP and the 
   rest over TCP, or to come up with some tricks to do it all over UDP.


7.4 Ping/Pong

   Neighboring nodes should be in the habit of pinging their nearest 
   neighbors, if they have not heard from them in a while. A "ping" can 
   take the form of a NODEFIND using the neighbor's node-id.


7.5 MHTML

   HTML documents of today tend to be comprised of many files including 
   graphics, stylesheets, etc. It may be useful for implementations to 
   offer "MHTML" versions of cached documents that contain these 
   supporting files in one MIME-encoded file. If a client supports 
   MHTML, it should include it in the Accept: header. Common-sense 
   restrictions apply, such as not including executable scripts or 
   plugins and files that do not come from the same domain.


7.6 Spam
   
   The network described above may one day contain thousands of nodes. 
   Thousands of nodes implies thousands of users looking for 
   information, which nowadays means spam. An implementation may elect 
   to deal with it in the way email clients do, with filters built from 
   analysis of links classified by the user as "spam" or "useful". Even 
   in the abscence of spurious traffic a good Bayesian filter may help 
   tailor results.

   A sophisticated method of injecting targeted spam involves generating 
   node-ids until one pair is "close enough" to the hash of a desired 
   keyword, where "close enough" is matching the first 4 or 5 digits. 
   Even thousands of nodes will be sparse in the keyspace, so an 
   exhaustive search is not necessary.


7.7 Indexing

   A large sample of web documents suggests that there is a fairly 
   constant number of terms, something on the close order of 14 million 
   [GOOGLE].

   An unpublished survey of a large sample of search engine queries 
   suggests that there is also a fairly constant number of unique terms 
   searched, on the close order of 1 million. This is very interesting. 


Bueno                                                          [Page 16]

Internet-Draft                 Dowser/0.1                       May 2004


   Suppose we give more "weight" to words found in the query stream over 
   those that are not. As we've seen in the adventures of Internet 
   search engines over the last few years, documents often lie about
   their contents, but users rarely lie about their desires. They are 
   vauge, perhaps, but not false.

   . . . that is, until the query stream becomes a thing of value. A 
   clever person might include a popular word and a made-up one into his 
   or her web page, then send thousands of queries with both of those 
   words to artificially boost the correlation. I have a wonderful proof 
   describing how this gambit might be defeated, but unfortunately there 
   is not enough room to include it here.


8. Acknowledgements

   This specification builds on the years of work put into designing and 
   implementing Dr. Berner-Lee's HTTP protocol by many thousands of 
   people around the world. You know who you are.
   
   Ideas and algorithims were stolen liberally from Chord, Circle, and 
   Freenet.  Lots and lots of ideas from the original distributed search 
   projects, Harvest and WebAnts. 

   Special thanks to Richard Vasquez and Thomas Lackner for their 
   devious minds and help with early implementations. 
   

9. References

9.1 Informative

   [GOOGLE] Brin & Page, "The Anatomy of a Search Engine", Stanford
      University, 1997
   
   [CHORD] Stoica, et al. "A Scalable Peer-to-peer Lookup Service for 
      Internet Applications", Sigcomm 2001
      http://www.pdos.lcs.mit.edu/papers/chord:sigcomm01/
      
   [IMPL] Dowser P2P client, http://dowser.sourceforge.net
   
9.2 Normative

   [RFC2396] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform 
      Resource Identifiers (URI): Generic Syntax and Semantics", RFC 
      2396, August 1998.
   
   [HTTP] Feilding, et al. "Hypertext Transfer Protocol -- HTTP/1.1", 
      RFC2616, June 1999
      
   [SHA] Eastlake & Jones, US Secure Hash Algorithm 1 (SHA1), RFC3174,  
      September 2001
  
  
Bueno                                                          [Page 17]

Internet-Draft                 Dowser/0.1                       May 2004


10. Security Considerations
   
10.1 Personal Information

   Dowser deals directly with people's searching habits. While spreading 
   this information around is kind of the point, implementations SHOULD 
   take care not to "leak" any more than neccessary. For instance, it 
   may be tempting to skip the NODEFIND sequence and just issue SEARCHes 
   or URLCACHEs, since they also return node-finding information. This 
   sends people's search terms and requested urls to nodes that don't 
   strictly need it, and SHOULD be avoided.
   
   Implementations can be expected to work closely with traditional HTTP 
   applications, with their own privacy & security considerations.
   
   
10.2 Sensitive data

   Dowser was originally intended to deal only with publically 
   accessible information, but it can also be used for sharing 
   sensitive data among trusted computers, say, all the word processor 
   documents on an office LAN. For these situations, implementations 
   should use some sort of IP whitelisting or subnetting and make the 
   user aware of the risks.


11. IPR & Copyright

    The IETF invites any interested party to bring to its
    attention any copyrights, patents or patent applications, or
    other proprietary rights which may cover technology that may be
    required to practice this standard.  Please address the
    information to the IETF Executive Director.

    Copyright (C) The Internet Society (2004). All Rights
    Reserved.

    This document and translations of it may be copied and
    furnished to others, and derivative works that comment on or
    otherwise explain it or assist in its implmentation may be
    prepared, copied, published and distributed, in whole or in
    part, without restriction of any kind, provided that the above
    copyright notice and this paragraph are included on all such
    copies and derivative works.  However, this document itself may
    not be modified in any way, such as by removing the copyright
    notice or references to the Internet Society or other Internet
    organizations, except as needed for the  purpose of developing
    Internet standards in which case the procedures for copyrights
    defined in the Internet Standards process must be followed, or
    as required to translate it into languages other than English.

    The limited permissions granted above are perpetual and will
    not be revoked by the Internet Society or its successors or
    assigns.


Bueno                                                          [Page 18]

Internet-Draft                 Dowser/0.1                       May 2004


    This document and the information contained herein is provided
    on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET
    ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR
    IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE
    OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY
    IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A
    PARTICULAR PURPOSE.


12. Contact Information

   
   Carlos Bueno
   6231 SW 78th St., Ste. 20
   Miami, FL, USA 33143
   Email: carlos@bueno.org
   
   
This document expires on November 4, 2004


Bueno                                                          [Page 19]