Lately, there has been some press from researchers about the Deep Web. Now,
this is not your Deep Throat type of source. Rather, it is the
part of the Web that does not turn up in search engines like Google
or Yahoo,
although together the major search engines can crawl perhaps
one-third of these sites. The Deep Web was once called the "Invisible
Web" (a term coined by Dr. Jill Ellsworth in 1994), which I think
is still the better term. Deep Web properly means an accessible Web
site that is not indexable or queryable by conventional search engines.
Invisible Web means just that, not visible, and it has wider coverage: it includes
both accessible Web sites (which are nonetheless not indexable or queryable) and non-accessible
Web sites (Intranets, password-protected sites, non-coded material).
Normally, indexed Web sites are what you see when you retrieve pages
from the various Web search engines. Subject directories such as
Yahoo will also display these sites. Some people call this the Surface
Web. The Deep Web (or Invisible Web) is what you cannot retrieve this way:
it never shows up in the results of your search statements, nor, of course,
do any of the URLs contained in these types of sites. There are seven
main divisions here:
1) HTTP is but one Internet protocol, and Web pages but a subset of Internet content. There
are also FTP (file transfer), E-mail, news, Telnet, and pre-Web
Gopher, none of which is searchable by the Web search engines.
2) Excluded pages: not every Web site wants to be totally included
in search engine reports. In the HTML source code there is a provision
for turning away the spider search bots, the robots meta tag (along with
the site-wide robots.txt file), so that they don't report that particular
page. In other words, the page has been turned off by its own code.
(The first sketch after this list shows how a spider checks for these signals.)
3) Databases: many Web sites offer access to thousands of
specialized searchable databases that you can query via the Web.
You can get results from these databases, but only in answer to
your one specific query. You cannot access the whole database. In
fact, the database may not even be kept online; you only wake
it up when it is needed for a response. "It is easier and cheaper
to dynamically generate the answer page for each query than to store
all the possible pages containing all the possible answers to all
the possible queries people could make to the database" (Berkeley,
see below). And thus the search devices cannot find or create these
pages (the second sketch after this list shows such a query-driven
page). Tabular formats are a bitch to display without appropriate
software to do so. Even simple layouts such as crossword puzzles
have some display component, and hence are normally not indexed
(check out www.ecrostic.com).
Databases with tables created by Access, Oracle, SQL Server, and
DB2 are accessible only by query. There's a lot of information out
there on the Web reachable only through such databases. Content on the Deep Web may be 500
times larger than the normal Google-searchable Web. The 60 largest
Deep Web sources contain 84 billion pages of content. That's about
750 terabytes of information. Top dogs are the US National Climatic
Data Center (366,000 gigabytes, 42 billion records); US NASA EOSDIS
(219,000 gigabytes, 25 billion records); and the US National Oceanographic
Data Center (a mere 33,000 gigabytes, 4 billion records). By contrast,
Google indexes only about 6 billion pages. And 95% of the Deep Web
is publicly accessible information, not subject to fees or subscriptions.
Awesome.
4) For a variety of technical reasons (easy to understand, but
long and cumbersome to explain), there are "static" pages
on the Web. These reside on servers, waiting to be retrieved when
their URL is used in an HTTP request. But no other page links to them,
and since spiders discover new pages by following links, they never
find these orphans.
5) Some sites require a password or login ID, and these sites are
closed to spiders. Passworded sites include indexing services, encyclopedias,
directories, Lexis and Nexis. In fact, any site that is not free
requires a password. There are thousands of such sites, although
some will let you in with teasers or partial content, such as the Wall
Street Journal. Yahoo in 2005 made a small part of the
Deep Web searchable by creating Yahoo Subscriptions, which searches
through a few subscription-only Web sites.
6) Non-HTML formatted pages: these are files in formats other than
HTML; the links to them can be indexed, but not the actual contents.
Search engines have a hard
time with Adobe .pdf files (although Google has a reformatting tool),
image databases, spreadsheets (.xls files), multimedia files, PostScript
(.ps), Flash, Shockwave, PowerPoint (.ppt), and even word-processing
files (Word .doc, WordPerfect .wp). There is no problem downloading
these materials once they are found; the major trick is finding
them in the first place!
7) Script-based pages with a ? (question mark) in their URL: these
are particularly devilish for spiders to handle. Most spiders will
not follow such a URL, because of script problems and, believe it or
not, spider traps (scripts that keep churning out an endless supply
of new URLs). Again, see the second sketch after this list.
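For the technically inclined, here are two small sketches of the machinery just described. Both are written in Python (my choice of language), and the URLs, file names, and database contents in them are invented for illustration only.
The first sketch shows how a well-behaved spider checks the two exclusion signals from division 2: the site-wide robots.txt file and the page-level robots meta tag. Neither signal hides a page from a person who types its URL; it merely asks the search bots not to index it.

    # Sketch: does a (hypothetical) page opt out of indexing?
    import re
    import urllib.request
    import urllib.robotparser

    page_url = "http://www.example.com/private/report.html"   # hypothetical page

    # 1) Site-wide exclusion: robots.txt lists the paths spiders should skip.
    robots = urllib.robotparser.RobotFileParser("http://www.example.com/robots.txt")
    robots.read()
    allowed = robots.can_fetch("*", page_url)

    # 2) Page-level exclusion: a robots meta tag in the HTML source,
    #    e.g. <meta name="robots" content="noindex">, turns the page off.
    noindex = False
    if allowed:
        html = urllib.request.urlopen(page_url).read().decode("utf-8", "replace")
        noindex = bool(re.search(r'<meta[^>]+name=["\']robots["\'][^>]+noindex', html, re.I))

    print("excluded from search engines" if (not allowed or noindex) else "indexable")

The second sketch illustrates the query-driven pages of divisions 3 and 7. The "answer page" is not stored anywhere; it is generated from the database each time a ?-style URL arrives, so a spider would have to guess every possible query string in order to see every possible page.

    # Sketch: an answer page generated on demand from a database.
    # A request such as http://www.example.com/lookup?species=osprey (hypothetical)
    # hands its query string to a function like this one.
    import sqlite3
    from urllib.parse import parse_qs

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE species (name TEXT, status TEXT)")
    conn.executemany("INSERT INTO species VALUES (?, ?)",
                     [("osprey", "secure"), ("condor", "endangered")])

    def answer_page(query_string):
        """Build the HTML answer for one query; no stored page exists for a spider to find."""
        term = parse_qs(query_string).get("species", [""])[0]
        rows = conn.execute("SELECT name, status FROM species WHERE name = ?", (term,)).fetchall()
        items = "".join("<li>%s: %s</li>" % row for row in rows) or "<li>no match</li>"
        return "<html><body><ul>%s</ul></body></html>" % items

    print(answer_page("species=osprey"))   # built fresh for this one query, never stored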
The basic Invisible Web, of course, consists of the various Intranets put
up by businesses, governments, and universities. These are locally
connected Web sites meant for just the organization's own use; sometimes
passwords are required. All manner of documents, many unclassified,
are posted - terabytes of information. And they are a major concern
for internal security since they can be hacked and also accessed
by rogue employees. There is no outside index to these sites, since
they are just local. All are hidden behind firewalls. I cannot tell
you how many times people have told me that a particular document
is on a Web site - "just go over and follow the links"
- only to find out that it is on their Intranet and hence inaccessible
to me. Actually, I can tell you: about a score of times.
Other major invisible content includes static online library catalogues,
hidden portions of major Web sites, schedules and maps, complete
databases, tables of statistics (especially in spreadsheets), phone
books, people finders (lists of professionals), patents, laws, dictionaries,
Web store or Web-auction products, newspaper archives, many blogs,
and multimedia and graphical files. The Invisible Web is the fastest-growing
category of new information on the Internet.
Also, "dynamically changing new information" will be
part of the Invisible Web. This includes news, job postings, travel
data (airline flights, hotels, etc.), stock market postings.
How do you find the Deep Web? One way is through academic search tools
such as Infomine,
Librarians' Index, and AcademicInfo.
You could try Direct
Search at www.freepint.com/gary/direct.htm. There are
also www.profusion.com and www.completeplanet.com.
Another way is through your usual search engine. Just type in a
short subject term together with the word "database" (e.g., biomedical
database). If the database's own pages include the word "database",
then bingo! (Bob's your uncle.) If you drill through a directory
such as Yahoo, then be sure to also use the term "database":
this will pick up additional listings. Many search engines feature
searchable databases as part of their service. Google, for example,
has separate searches for audio-visual material, images, news, and
non-HTML formats. These are just one click away from the main HTML
search.
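If you like, the same trick can be put into a few lines of Python (a sketch only; the subject term is whatever you care about, and Google is used merely as one example engine):

    # Sketch: build a "subject term + database" query for a general search engine.
    from urllib.parse import urlencode

    subject = "biomedical"                      # any short subject term
    url = "http://www.google.com/search?" + urlencode({"q": subject + " database"})
    print(url)   # -> http://www.google.com/search?q=biomedical+database

Paste the printed URL into your browser, or open it with Python's webbrowser module, and scan the results for actual searchable databases.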
Some interesting Deep Web sites include:
* AnimalSearch (animalsearch.net):
family-safe animal-related sites, search by group, type, and regions.
* Educator's Reference Desk (www.eduref.org):
contains 2,000+ lesson plans, 3,000+ links to value-added online education
information, and a 200+ question archive. It also provides access
to the ERIC database -- the world's largest source of information
on education research & practice, including free, full-text
expert digest reports, and it also links you to the Gateway to Educational
Materials (GEM), which "provides quick and easy access to over
40,000 educational resources found on various US federal, state,
university, non-profit and commercial Internet sites."
* NatureServe Explorer (www.natureserve.org/explorer):
"information on more than 65,000 plants, animals, and ecosystems
of the United States and Canada. Explorer includes particularly
in-depth coverage for rare and endangered species."
* Nuclear Explosions Database (www.ga.gov.au/oracle/nukexp_form.jsp):
Geoscience Australia's database provides location, time, & size
of explosions worldwide since 1945. Click on "databases".
* On-Line Encyclopedia of Integer Sequences (www.research.att.com/~njas/sequences):
"Type in a series of numbers and this database will complete
the sequence and provide the sequence name, along with its mathematical
formula, structure, references, and links."
* PubMed (www.ncbi.nlm.nih.gov/entrez/query.fcgi):
access to 16 million+ MEDLINE citations, including links to full
text articles & related resources. PubMed Central (PMC) is an
e-archive of free, full text articles from 200+ life sciences journals,
as well as Bookshelf, "a growing collection of [full text]
biomedical books (50+) that can be searched directly." Plus
the global NCBI 'Entrez' search engine for their many life sciences
databases. (A small Entrez query sketch appears after this list.)
* FindArticles (www.findarticles.com):
now searching 10 million+ articles from "leading academic,
industry and general interest publications."
* MagPortal.com (magportal.com):
freely available magazine articles on the Web, using keyword searching
or category browsing methods.
* Directory of Open Access Journals (www.doaj.org):
a one-stop open access directory, providing no-cost access to the
full text of over 2,000 journals, with over 500 journals searchable
at the article level (over 83,000 articles available) in the sciences
and humanities/social sciences.
* Cryptome (cryptome.org):
specializes in posting previously classified or under-publicized
US federal documents, along with similar documents from other jurisdictions.
There could be half-a-dozen posted every business day. Just go over
to the site, and the home page lists the latest docs. Typical titles
include "Expansion of the Strategic Petroleum Reserve",
"Calendar of 2,482 US Military dead in Iraqi War", "Security
Measures for Radioactive Materials", "Outer Continental
Shelf Polluters Fined", "CIA Creation Documents".
There is also an index to off-site documents, dealing with topics
such as the Israeli Lobby and US foreign policy, Al Qaeda documents,
New York City public safety.
Cryptome is a true Web site, with multiple links to other similar
document retrieval efforts. You could do worse than beginning with
Cryptome for searches involving nefarious actions of government.
The site also has a searchable data DVD of its archives of over
33,000 files (since June 1996), just under 3 GB worth.
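Since PubMed came up above, it also makes a handy illustration of reaching Deep Web content by querying the database directly rather than hoping a general-purpose spider stumbles across it. This sketch uses NCBI's public Entrez E-utilities (the ESearch service); the search term is arbitrary, and retmax simply caps the number of IDs returned.

    # Sketch: ask PubMed itself for matching citations via Entrez ESearch.
    import urllib.request
    from urllib.parse import urlencode

    params = urlencode({"db": "pubmed", "term": "invisible web", "retmax": 5})
    url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?" + params
    xml = urllib.request.urlopen(url).read().decode("utf-8")
    print(xml)   # XML containing an <IdList> of matching PubMed IDs (PMIDs)

None of those citations sit on static pages waiting to be crawled; they come straight out of the MEDLINE database in answer to the one query.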
There is actually a firm promising to locate Deep Web material.
It is BrightPlanet (www.brightplanet.com).
Their mission statement: "BrightPlanet applies unique and fully
automated technology to internet search site identification, retrieval,
qualification, classification, summarization, characterization and
publication". Currently, BrightPlanet software is configured
to query 70,000+ Deep Web sources. It'll even walk your dog.
For the immediate future, you should expect a big impact from two
sources.
One is the court system. The
Canadian Judicial Council (the organization of Canada's
top judges) has recommended that access to court records via the
Internet be restricted. Many of these records may move over to
local intranets and never be accessible via the open Internet. You'll
soon have to visit your nearest courthouse to view legal documents,
much as you have to now just to view the paper copies.
Remote access would still be available for judicial decisions and
case information, but not to affidavits, motion records, and pleadings.
All in the interests of privacy and the prevention of identity theft. It is one thing
to have publicly available documents at the courthouse (you must
first determine which courthouse and you must ask for the right
piece of paper), but it is quite another thing to have publicly
available documents floating out on the Internet where just anybody
can read them. Yes, they are public, but only in paper form and
locally disseminated.
U.S. courts are still more open. Documents
from scores of federal courts can be downloaded through PACER (Public
Access to Court Electronic Records) for a small fee, by the page.
Another is change of ownership. While most of the databases within
the Deep Web are government-owned or non-profit, there are still
vast areas such as E-mail and FTP which are in private hands. Every
time someone buys an Internet property, there are policy changes.
What should we expect with the newest batch of dot com purchases
by the media itself? How will this play out for searching for data?
NBC Universal has bought iVillage,
the top women's oriented site on the Internet, with over 30 million
unique visitors a month. News Corp (Murdoch) has bought MySpace,
the fastest growing social networking site on the Web, about 50
million unique visitors a month. News Corp also bought IGN,
a top gaming and entertainment site for young hot males, with under
20 million unique users a month. Viacom (owner of MTV and Paramount)
has bought Neopets,
a young person's community site with virtual pets. Viacom also has
bought iFilm
(where users track the film industry and post their own videos),
GameTrailers
(a competitor to IGN with more hot males), and GoCityKids
(via Nickelodeon). The New
York Times has bought About.Com,
an online advice site with over 60 million unique users a month.
Other hot properties appear to be photo- and video-sharing sites.
Murdoch still has $2 billion earmarked for these purchases, coming
up real soon. The big audiences in all the new acquisitions can
link to each other within and without their communities. And they
could be susceptible to database searching by new owners, or positioned
for a sell-off of contents to database searchers.
For more details on the Invisible Web and the Deep Web, try these
URLs:
www.lib.berkeley.edu/TeachingLib/Guides/Internet/InvisibleWeb.html
http://www.internettutorials.net/deepweb.htm
www.brightplanet.com/deepcontent/deep_web_faq.asp
Published in Sources
58, Summer 2006.