pt.tumba.links
Class WebGraph

java.lang.Object
  extended by pt.tumba.links.WebGraph

public class WebGraph
extends java.lang.Object

This class implements a memory Data Structure for storing graphs.

A large amount of research has recently focused on the graph structure (or link structure) of the World Wide Web, which has proven to be extremely useful for improving the performance of search engines and other tools for navigating the web. For example, the Pagerank algorithm of Brin and Page, used in the Google search engine, and the HITS algorithm of Kleinberg both rank pages according to the number and importance of other pages that link to them.

This class provides the methods needed to efficiently compute with graphs and to experiment with such algorithms, using main memory for storage.

Author:
Bruno Martins

Field Summary
private  java.util.Map IdentifyerToURL
          A Map storing relationships from numeric identifiers to URLs, usefull for storing Web graphs
private  java.util.Map InLinks
          A Map storing InLinks.
private  int nodeCount
          The number of nodes in the graph
private  java.util.Map OutLinks
          A Map storing OutLinks.
private  java.util.Map URLToIdentifyer
          A Map storing relationships from URLs to numeric identifiers, usefull for storing Web graphs
 
Constructor Summary
WebGraph()
          Constructor for WebGraph
WebGraph(java.io.File file)
          Constructor for WebGraph, reading data from a text file.
 
Method Summary
private  java.lang.Double addLink(java.lang.Integer fromLink, java.lang.Integer toLink, java.lang.Double weight)
          Adds an association between two given nodes in the graph.
 void addLink(java.lang.String link)
          Adds a node to the graph
 java.lang.Double addLink(java.lang.String fromLink, java.lang.String toLink, java.lang.Double weight)
          Adds an association between two given nodes in the graph.
 java.lang.String IdentifyerToURL(java.lang.Integer id)
          Returns the URL associated with a given identifyer
 java.lang.Double inLink(java.lang.Integer fromLink, java.lang.Integer toLink)
          Returns the connection strength between two nodes, assuming there is a connection from the first to the second.
 java.lang.Double inLink(java.lang.String fromLink, java.lang.String toLink)
          Returns the connection strength between two nodes, assuming there is a connection from the first to the second.
 java.util.Map inLinks(java.lang.Integer link)
          Returns a Map of the nodes that connect to a given node in the graph.
 java.util.Map inLinks(java.lang.String URL)
          Returns a Map of the nodes that connect to a given node in the graph.
 int numNodes()
          Returns the number of nodes in the graph
 java.lang.Double outLink(java.lang.Integer fromLink, java.lang.Integer toLink)
          Returns the connection strength between two nodes, assuming there is a connection from the first to the second.
 java.lang.Double outLink(java.lang.String fromLink, java.lang.String toLink)
          Returns the connection strength between two nodes, assuming there is a connection from the first to the second.
 java.util.Map outLinks(java.lang.Integer link)
          Returns a Map of the nodes that are connected from a given node in the graph.
 java.util.Map outLinks(java.lang.String URL)
          Returns a Map of the nodes that are connected from a given node in the graph.
 void removeInternalLinks()
          Remove nodes which correspond to an internal link.
 void removeNepotistic()
          Remove nodes which correspond to nepotistic links.
 void removeStopLinks(java.util.Map stopURLs)
          Remove nodes which correspond to stop URLs.
 void removeStopLinks(java.lang.String[] stopURLs)
          Remove nodes which correspond to stop URLs.
 void transformUnidirectional()
          Transforms a bi-directional graph to an uni-directional equivalent.
 java.lang.Integer URLToIdentifyer(java.lang.String URL)
          Returns the identifyer associated with a given URL
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

IdentifyerToURL

private java.util.Map IdentifyerToURL
A Map storing relationships from numeric identifiers to URLs, usefull for storing Web graphs


URLToIdentifyer

private java.util.Map URLToIdentifyer
A Map storing relationships from URLs to numeric identifiers, usefull for storing Web graphs


InLinks

private java.util.Map InLinks
A Map storing InLinks. For each identifyer (the key), another Map is stored, containing for each inlink an associated "connection weight"


OutLinks

private java.util.Map OutLinks
A Map storing OutLinks. For each identifyer (the key), another Map is stored, containing for each inlink an associated "connection weight"


nodeCount

private int nodeCount
The number of nodes in the graph

Constructor Detail

WebGraph

public WebGraph()
Constructor for WebGraph


WebGraph

public WebGraph(java.io.File file)
         throws java.io.IOException,
                java.io.FileNotFoundException
Constructor for WebGraph, reading data from a text file. Each line of the file contains an association in the form: http://url1.com -> http://url2.com 1.0 Stating that "http://url1.com" contains an outlink to "http://url2.com", with an associated connection strength of 1.0

Parameters:
aux - The name of the file
Throws:
java.io.IOException - An error occured while reading the file
java.io.FileNotFoundException - An error occured while reading the file
Method Detail

URLToIdentifyer

public java.lang.Integer URLToIdentifyer(java.lang.String URL)
Returns the identifyer associated with a given URL

Parameters:
URL - The URL
Returns:
The identifyer associated with the given URL

IdentifyerToURL

public java.lang.String IdentifyerToURL(java.lang.Integer id)
Returns the URL associated with a given identifyer

Parameters:
id - The identifyer
Returns:
The URL associated with the given identifyer

addLink

public void addLink(java.lang.String link)
Adds a node to the graph

Parameters:
link - The URL associated with the added node

addLink

public java.lang.Double addLink(java.lang.String fromLink,
                                java.lang.String toLink,
                                java.lang.Double weight)
Adds an association between two given nodes in the graph. If the corresponding nodes do not exists, this method creates them. If the connection already exists, the strength value is updated.

Parameters:
fromLink - The URL for the source node in the graph
fromLink - The URL for the target node in the graph
fromLink - The strength to associate with the connection
Returns:
The strength associated with the connection

addLink

private java.lang.Double addLink(java.lang.Integer fromLink,
                                 java.lang.Integer toLink,
                                 java.lang.Double weight)
Adds an association between two given nodes in the graph. If the corresponding nodes do not exists, this method creates them. If the connection already exists, the strength value is updated.

Parameters:
fromLink - The identifyer for the source node in the graph
fromLink - The identifyer for the target node in the graph
fromLink - The strength to associate with the connection
Returns:
The strength associated with the connection

inLinks

public java.util.Map inLinks(java.lang.String URL)
Returns a Map of the nodes that connect to a given node in the graph. Each mapping contains the identifyer for a node and the associated connection strength.

Parameters:
URL - The URL for the node in the graph
Returns:
A Map of the nodes that connect to the given node in the graph.

inLinks

public java.util.Map inLinks(java.lang.Integer link)
Returns a Map of the nodes that connect to a given node in the graph. Each mapping contains the identifyer for a node and the associated connection strength.

Parameters:
link - The identifyer for the node in the graph
Returns:
A Map of the nodes that connect to the given node in the graph.

outLinks

public java.util.Map outLinks(java.lang.String URL)
Returns a Map of the nodes that are connected from a given node in the graph. Each mapping contains the identifyer for a node and the associated connection strength.

Parameters:
URL - The URL for the node in the graph
Returns:
A Map of the nodes that are connected from the given node in the graph.

outLinks

public java.util.Map outLinks(java.lang.Integer link)
Returns a Map of the nodes that are connected from a given node in the graph. Each mapping contains the identifyer for a node and the associated connection strength.

Parameters:
link - The URL for the node in the graph
Returns:
A Map of the nodes that are connected from the given node in the graph.

inLink

public java.lang.Double inLink(java.lang.String fromLink,
                               java.lang.String toLink)
Returns the connection strength between two nodes, assuming there is a connection from the first to the second. If no connection exists, a link strength of zero is returned.

Parameters:
fromLink - The source link
toLink - The target link
Returns:
The strenght for the connection between fromLink and toLink ( fromLink -> toLink )
See Also:
inLink

outLink

public java.lang.Double outLink(java.lang.String fromLink,
                                java.lang.String toLink)
Returns the connection strength between two nodes, assuming there is a connection from the first to the second. If no connection exists, a link strength of zero is returned.

Parameters:
fromLink - The source link
toLink - The target link
Returns:
The strenght for the connection between fromLink and toLink ( fromLink -> toLink )
See Also:
outLink

inLink

public java.lang.Double inLink(java.lang.Integer fromLink,
                               java.lang.Integer toLink)
Returns the connection strength between two nodes, assuming there is a connection from the first to the second. If no connection exists, a link strength of zero is returned.

Parameters:
fromLink - An identifyer for the source link
toLink - An identifyer for the target link
Returns:
The strenght for the connection between fromLink and toLink ( fromLink -> toLink )
See Also:
outLink

outLink

public java.lang.Double outLink(java.lang.Integer fromLink,
                                java.lang.Integer toLink)
Returns the connection strength between two nodes, assuming there is a connection from the first to the second. If no connection exists, a link strength of zero is returned.

Parameters:
fromLink - An identifyer for the source link
toLink - An identifyer for the target link
Returns:
The strenght for the connection between fromLink and toLink ( fromLink -> toLink )
See Also:
inLink

transformUnidirectional

public void transformUnidirectional()
Transforms a bi-directional graph to an uni-directional equivalent. The connection strenght between two nodes A and B that are inter-connected in the bi-directional graph is transformed into MAX(weight_inlink(A,B),weight_outlink(A,B))


removeInternalLinks

public void removeInternalLinks()
Remove nodes which correspond to an internal link. In a Web Graph, internal links are those made to pages that are situated on the same host.


removeNepotistic

public void removeNepotistic()
Remove nodes which correspond to nepotistic links. In a Web Graph, nepotistic links are tipically those made to pages that are situated on the same host, correspondig to links made for hypertext navigational purposes rather than semantic similarity. See the paper "Recognizing Nepotistic Links on the Web" by Brian Davison, presented at the AAAI-2000 Workshop on Artificial Intelligence for Web Search, Austin, TX, July 30, and published in Artificial Intelligence for Web Search, Technical Report WS-00-01, pp. 23-28, AAAI Press.


removeStopLinks

public void removeStopLinks(java.lang.String[] stopURLs)
Remove nodes which correspond to stop URLs.

Parameters:
stopURLs - An array of Strings with the Stop URLs

removeStopLinks

public void removeStopLinks(java.util.Map stopURLs)
Remove nodes which correspond to stop URLs. In a Web Graph, stop URLs correspond to very frequent pages. A link from/to such an URLs usually does not imply semantic similarity.

Parameters:
stopURLs - A Map where keys are the Stop URLs

numNodes

public int numNodes()
Returns the number of nodes in the graph

Returns:
The number of nodes in the graph