Internet-Draft Carlos Bueno May 2004 A Distributed Web Search Protocol -- Dowser/0.1 draft-dowser-spec-00.txt Status of This Memo This document is an Internet-Draft and is subject to all provisions of Section 10 of RFC2026. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet- Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/1id-abstracts.html The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html This document Expires on November 04, 2004 Abstract Dowser is an application-level, peer-to-peer protocol for creating a searchable index and cache of documents. It is intended for both small-scale intranets and Worldwide Web-scale indexes. This document describes the messages that members, or "nodes", of the network pass to each other to distribute workload, request & respond to queries, and maintain the index. Bueno [Page 1] Internet-Draft Dowser/0.1 May 2004 Table of contents 1. Introduction ................................................. 3 1.1 Purpose .................................................... 3 1.2 Definitions ................................................ 3 1.3 URL vs. URI vs. URN ........................................ 4 1.4 Requirements ............................................... 5 2. Some Examples ................................................ 5 2.1 NODEFIND ................................................... 5 2.1.1 Node-id & seed ............................................ 6 2.1.2 Last-key .................................................. 6 2.1.3 Ring-id ................................................... 7 2.1.4 Port ...................................................... 7 2.1.5 Server Responses .......................................... 7 2.2 SEARCH ...................................................... 8 2.2.1 Expires ................................................... 9 2.2.2 Content-key ............................................... 9 2.2.3 Search syntax ............................................. 9 2.3. URL caching ............................................... 10 2.4. Content caching ........................................... 11 2.5. CRAWL ..................................................... 11 2.6.INDEXADD ................................................... 12 3. Announcing .................................................. 13 4. Redundancy .................................................. 14 5. Optional headers ............................................ 14 6. Error conditions ............................................ 14 7. Notes on implementation ..................................... 15 7.1 Ranking .................................................... 15 7.2 Proxying ................................................... 15 7.3 TCP vs UDP ................................................. 16 7.4 Ping/Pong .................................................. 16 7.5 MHTML ...................................................... 16 7.6 Spam ....................................................... 16 7.7 Indexing ................................................... 16 8. Acknowledgements ............................................ 16 9. References .................................................. 17 9.1 Informative ................................................ 17 9.2 Normative .................................................. 17 10. Security Considerations .................................... 17 10.1 Personal Information ...................................... 18 10.2 Sensitive data ............................................ 18 11. IPR & Copyright ............................................ 18 12. Contact Information ........................................ 19 Bueno [Page 2] Internet-Draft Dowser/0.1 May 2004 1. Introduction 1.1 Purpose The express purpose of the Dowser protocol is to encourage open research into web-scale, decentralized indexing systems. The specification for the Hypertext Transfer Protocol [RFC1945] has this observation: Practical information systems require more functionality than simple retrieval, including search, front-end update, and annotation. HTTP allows an open-ended set of methods to be used to indicate the purpose of a request. HTTP is in widespread use by hundreds of millions of people every day, and they issue hundreds of millions of searches for information. Most of these searches are served by centralized "search engines" that cache copies of as many websites as possible to create their indexes. The problems of scale have so far been admirably met [GOOGLE], but the operating costs of web-scale engines are now out of the reach of most Universities and corporations. Conversely, larger and larger amounts of storage and processing power are available to personal computer users every year. Dowser is an extension to HTTP that can be used by itself or added to existing HTTP implementations. Each node on the network claims a small part of the keyspace of a distributed hash table [CHORD]. Indexes of documents are keyed under the hash of the word; documents themselves are keyed by their Uniform Resource Identifier [URI]. There is also a "cache of caches", to allow popular documents or indexes to be retrieved from any node that has a copy. The syntax and formatting rules are inherited from HTTP/1.1 [RFC2616]. The required "Host" header in HTTP/1.1 does not really apply to Dowser, but including it shouldn't hurt anyone. Several other headers and return codes are adopted for use. 1.2 Definitions node In this protocol there is no difference between a "client" or a "server". They are all equal "nodes" on the network, capable of sending and serving requests. Each node is identified by a 40-digit hexadecimal number called a "node-id". They participate in creating, storing and serving a distributed hash table. This hash table can contain document caches, indexes of those documents, and other data. For sanity's sake, the term "server" can be taken to mean "responding node" and "client" to mean "requesting node" in the context of the conversation being described. Bueno [Page 3] Internet-Draft Dowser/0.1 May 2004 hash, key Hash functions are used to evenly distribute data around the network and to uniquely identify nodes and pieces of information. Dowser uses the Secure Hash Algorithim, version 2 [SHA]. The terms "hash" and "key" refer to the 40-digit hexadecimal number gotten from passing some piece of information through the SHA function, e.g.: SHA("foo") == 0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33 distributed hash table A distributed hash table is a key-->value mapping that is contained on many computers, where they keys are hashed. range The "range" of a node is defined by the first and last key of the hash table it is expected to store data under. left-hand, right-hand The total keyspace is imagined as the perimeter of a circle, or a "ring". While standing in the center, points to the left generally decrease and points to the right generally increase, except at the point where "fff..." meets "000...". A useful analog is timezones, where West is left and East is right. The time is always earlier to the left except at the International Date Line. 1.3 URL vs. URI vs. URN The protocol descibed here has a focus different from most filesharing protocols. Searches for files are not broadcast but targeted; nodes are authoritative regarding the location and checksums of data in their ranges. Searches for a particular node are sent to arbitrary nodes but not broadcast per se; they are not forwarded but responded to directly. In most ad-hoc filesharing protocols the only unique identifier is the checksum. The filesystem names and metadata may be different for different copies but the checksum remains the same. In HTTP jargon this is a Universal Resource Name or URN , that which uniquely identifies the file without necessarily referencing its location or common name [RFC2396]. The content of a cached web page may change with time. This means the checksum will change while its Univeral Resource Locator, the origin of the page, remains constant. Indexes built from these cached pages are also treated as files, one index file per word or phrase. An index's Universal Resource Name is Bueno [Page 4] Internet-Draft Dowser/0.1 May 2004 that word. (More correctly it would be "dowser://foo", not "foo".) Searching in Dowser therefore starts with a lookup for the node(s) who handle the search term(s). The search terms (URN) are then sent to those nodes, resulting in either a) the content of index file or b) the latest checksum of the index file and a list of nodes that have previously downloaded a copy. Retrieving a cache of a web page is done the same way. 1.4 Requirements The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [34]. An implementation is not compliant if it fails to satisfy one or more of the MUST or REQUIRED level requirements for the protocols it implements. An implementation that satisfies all the MUST or REQUIRED level and all the SHOULD level requirements for its protocols is said to be "unconditionally compliant"; one that satisfies all the MUST level requirements but not all the SHOULD level requirements for its protocols is said to be "conditionally compliant." In reality, the complications of this protocol are mostly in the state information, not the ins and outs of the messages being passed. 2. Some Examples We will assume the reader is familiar with HTTP transactions. It is probably most instructive to give examples using the new methods and then a discussion of particulars. Elipses ("...") are used to truncate long hashes for readability. 2.1 NODEFIND NODEFINDs are the bread and butter of routing between nodes. The client wants to find the node(s) that have items under a certain hash key, in this example, "5c89379d0aa9840ac910fd8cacbde2dbf877214a". This can be a search term, url, or anything. The responding node may or may not have this key. If not, it tells the client about nodes it knows about that are closer to the target. NODEFIND 5c89379d0aa9840ac910fd8cacbde2dbf877214a Dowser/0.1 Ring-Id: deadbeef00000000000000000000000000000000 Node-Id: 12222... a5754... Last-key: 13333... Port: 4666 - Response - Bueno [Page 5] Internet-Draft Dowser/0.1 May 2004 Dowser/0.1 310 Not in my range Ring-Id: deadbeef00000000000000000000000000000000 Node-Id: 48888... de34.... Last-key: 49999... 192.0.2.10 8000 50000f... 599999... 192.0.2.20 6876 43333a... 610000... - Or - Dowser/0.1 211 That's me Ring-Id: deadbeef00000000000000000000000000000000 Node-Id: 4000... de34.... Last-key: 7aaaa... There is a lot to cover here. The first line of the request is like HTTP, identifying the protocol and version, the 'method' or action to be done, and the 'path' or key being requested. There are three additional headers which MUST be included in any message: Ring-id, Node-id, and Last-key. Additionally, the Port header MUST be included in every request. 2.1.1 Node-id & seed The node-id is comprised of two 40-digit hexadecimal numbers, separated by whitespace. The first number serves as the FIRST key in this node's range. This number MUST be generated by hashing some random data iteratively. To help ensure this has occurred, the second number (the "seed") MUST be the second-to-last result of the iterations. Node-id: b274f2e2a8d2881035af5866014e9ad5510ab15d cdd2ae2594a83ef90c05ee6014b78631db8538d8 This example (linewrapped to fit) is well-formed because the first number is the SHA hash of the second number. Servers MUST check that the node-id and its "seed" are correct. If not, they MUST send a 412 "Precondition failed" error. 2.1.2 Last-key Last-key is also an SHA hash, and is the LAST key in this node's range. The last-key may change over time, as nodes leave or join the network. Note: it has been brought up that a node might be able to route a large portion of traffic to itself by making its Last-key equal to the node-id minus 1. In Internet Protocol terms, this would be similar to setting your network address to 255.255.255.255. There is Bueno [Page 6] Internet-Draft Dowser/0.1 May 2004 nothing in the protocol to prevent this, but implementations should come up with a way to deal with it reasonably. Our feeling that it is self-correcting. If someone tries it on a network with a lot of nodes one hopes he or she has good fire-supression equipment handy; but the concern is the possible damage to the network routing. 2.1.3 Ring-id Ring-id is yet another SHA hash to identify the group of nodes it wishes to communicate with. This allows many Dowser networks to be running on an internet at once. Alternate rings are useful for testing implementations of the protocol on a large scale, or for partitioning networks in general. The "official" public ring-id: Dowser/0.1 --> deadbeef00000000000000000000000000000000 If the Ring-id header sent does not match the ring-id of the server, it MUST send a 412 "Precondition falied" message, and may provide information in the body of the response as to the reason. Note: there is a security risk in actually revealing the server's ring-id in the response, if the ring is intended for private data. 2.1.4 Port Simple enough: we need to advertise what TCP or UDP port we are listening on for future reference. 2.1.5 Server Responses If the hash is not in the listening node's range, it returns a 310 code and a list of nodes that are closer to the target. The list of nodes takes this form: IP PORT NODE-ID LAST-KEY If it is in range, it sends a 211 "That's Me" code. When overlap occurs a node may choose which one to contact first in an implementation-dependent way, such as IP or network distance, or closeness of the desired key to the node-id. Note: IPv4 addresses are used in these examples, but implementations SHOULD be written to handle IPv6 addresses as well, e.g.: FEDC:BA98:7654:3210:FEDC:BA98:7654:3210 1080:0:0:0:8:800:200C:4171 3ffe:2a00:100:7031::1 1080::8:800:200C:417A ::192.9.5.5 Bueno [Page 7] Internet-Draft Dowser/0.1 May 2004 ::FFFF:129.144.52.38 2010:836B:4179::836B:4179 2.2 SEARCH We have the basis of all conversations, and have described how nodes identify themselves and each other. Now we can go to the next level: content routing. - Request - SEARCH foo+bar+baz Dowser/0.1 Ring-Id: deadbeef00000000000000000000000000000000 Node-Id: 1d3470c... 55754... Last-key: 1e400cc... Port: 4666 - Response - Dowser/0.1 200 OK Ring-Id: deadbeef00000000000000000000000000000000 Node-Id: 48888... de34... Last-key: 49999... Expires: 864000 Content-key: 3fdb13677b10691debb3909dd917b00ee751115a URL TITLE AGE SNIPPET [SNIPPET [SNIPPET [...]]] URL TITLE AGE SNIPPET [SNIPPET [SNIPPET [...]]] ... - Or - Dowser/0.1 300 Multiple choices Ring-Id: deadbeef00000000000000000000000000000000 Node-Id: 48888... de34... Last-key: 49999... Expires: 864000 Content-key: 3fdb13677b10691debb3909dd917b00ee751115a 192.0.2.10 8000 50000f... 599999... 192.0.2.20 6876 43333a... 610000... A SEARCH request is almost exactly like a nodefind except the "path" is a list of search terms. The server determines if any of the terms fall into its range. How to break up the terms may be implementation-specific, but the simplest is to break on whitespace. If any are in-range it returns a 300 code, the "content-key" of the data (a SHA checksum of the content) a list of the nodes that have Bueno [Page 8] Internet-Draft Dowser/0.1 May 2004 requested that data. Or, it may return the index data itself. Requesting nodes SHOULD keep a copy of the data it receives for a reasonable amount of time, as hinted by the Expires: header. The data should be keyed under the content-key, for retrieval via a CACHE request. This helps distribute the workload. The index is in the form: URL TITLE AGE SNIPPET [SNIPPET [SNIPPET [...]]] And is tab-delimited. AGE is the number of seconds since the record was created. One or more "SNIPPET"s contain text that form an abstract of the document. As always, the server can alternately send the standard 310 Not in Range, or some error code. 2.2.1 Expires: Indexes and caches of web documents are not long-lived in the way a media file in traditional peer-to-peer networks are. The Expires header hints at how long a node should cache this data, expressed in seconds from the present time. 2.2.2 Content-key: This is a SHA hash of the data's content, used to positively identify it. This is different from the hash of the URL or search term. A page whose URL is "http://foo.example.com" ... and whose content is "