-
public class URLUtil
Utility class for URL analysis. TODO: merge with ai.platon.pulsar.common.url.Urls
-
-
Nested Class Summary
Nested Classes (Modifier and Type, Class, Description):
public enum URLUtil.GroupMode
-
Method Summary
Methods (Modifier and Type, Method, Description):
final String getHost(String url)
final String getHost(String url, URLUtil.GroupMode groupMode)
final String getHost(String url, String defaultHost, URLUtil.GroupMode groupMode)
final String getHost(URL url, String defaultHost, URLUtil.GroupMode groupMode)
final String getHost(URL url, URLUtil.GroupMode groupMode)
final String getDomainName(URL url)
    Returns the domain name of the url.
final String getDomainName(String url)
    Returns the domain name of the url.
final String getDomainName(String url, String defaultDomain)
final Boolean isSameDomainName(URL url1, URL url2)
    Returns whether the given urls have the same domain name.
final Boolean isSameDomainName(String url1, String url2)
    Returns whether the given urls have the same domain name.
final DomainSuffix getDomainSuffix(URL url)
    Returns the DomainSuffix corresponding to the last public part of the hostname
final DomainSuffix getDomainSuffix(DomainSuffixes tlds, URL url)
final DomainSuffix getDomainSuffix(DomainSuffixes tlds, String url)
    Returns the DomainSuffix corresponding to the last public part of the hostname
final List<String> getHostBatches(URL url)
    Partitions of the hostname of the url by "."
final List<String> getHostBatches(String url)
    Partitions of the hostname of the url by "."
final static String chooseRepr(String src, String dst, Boolean temp)
    Given two urls, a src and a destination of a redirect, it returns the representative url.
final String getHostName(String url)
    Returns the lowercased hostname for the url or null if the url is not well formed.
final String getHostName(String url, String defaultValue)
final String getQuery(String url)
    Returns the path for the url.
final static String toASCII(String url)
final static String toUNICODE(String url)
-
Method Detail
-
getHost
final String getHost(String url, URLUtil.GroupMode groupMode)
-
getHost
final String getHost(String url, String defaultHost, URLUtil.GroupMode groupMode)
-
getHost
final String getHost(URL url, String defaultHost, URLUtil.GroupMode groupMode)
-
getHost
final String getHost(URL url, URLUtil.GroupMode groupMode)
-
getDomainName
final String getDomainName(URL url)
Returns the domain name of the url. The domain name of a url is the substring of the url's hostname, without subdomain names. As an example,
getDomainName(new URL("http://lucene.apache.org/"))
will return
apache.org
-
getDomainName
final String getDomainName(String url)
Returns the domain name of the url. The domain name of a url is the substring of the url's hostname, without subdomain names. As an example,
getDomainName("http://lucene.apache.org/")
will return
apache.org
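The idea of "hostname without subdomain names" can be illustrated with a minimal sketch. This is not the URLUtil implementation: the class DomainNameSketch and the method naiveDomainName are hypothetical, and the sketch assumes the public suffix is always a single label (such as "org"), so it would give wrong answers for suffixes like "co.uk" and for IP addresses. The real method presumably consults DomainSuffix data instead.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

class DomainNameSketch {
    // Hypothetical helper: keeps the last two labels of the host.
    // Assumption: the public suffix is exactly one label long.
    static String naiveDomainName(String url) {
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                return null; // no host component, e.g. a relative reference
            }
            String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
            if (labels.length <= 2) {
                return String.join(".", labels);
            }
            return labels[labels.length - 2] + "." + labels[labels.length - 1];
        } catch (URISyntaxException e) {
            return null; // the url could not be parsed
        }
    }

    public static void main(String[] args) {
        System.out.println(naiveDomainName("http://lucene.apache.org/"));
    }
}
```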
-
getDomainName
final String getDomainName(String url, String defaultDomain)
-
isSameDomainName
final Boolean isSameDomainName(URL url1, URL url2)
Returns whether the given urls have the same domain name. As an example,
isSameDomainName(new URL("http://lucene.apache.org"), new URL("http://people.apache.org/"))
will return true.
-
isSameDomainName
final Boolean isSameDomainName(String url1, String url2)
Returns whether the given urls have the same domain name. As an example,
isSameDomainName("http://lucene.apache.org", "http://people.apache.org/")
will return true.
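A same-domain check of this kind can be sketched as a comparison of two extracted domain names. The class SameDomainSketch and both of its methods are hypothetical names for illustration, and the domain extraction is a naive "last two host labels" rule (an assumption that ignores multi-label public suffixes), not the logic URLUtil actually uses.

```java
import java.net.URI;
import java.util.Locale;

class SameDomainSketch {
    // Hypothetical helper: last two labels of the lowercased host, or null.
    static String naiveDomain(String url) {
        try {
            String host = URI.create(url).getHost();
            if (host == null) {
                return null;
            }
            String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
            if (labels.length <= 2) {
                return String.join(".", labels);
            }
            return labels[labels.length - 2] + "." + labels[labels.length - 1];
        } catch (IllegalArgumentException e) {
            return null; // URI.create rejects syntactically invalid input
        }
    }

    // Two urls share a domain when both parse and their domains are equal.
    static boolean sameDomain(String url1, String url2) {
        String d1 = naiveDomain(url1);
        String d2 = naiveDomain(url2);
        return d1 != null && d1.equals(d2);
    }

    public static void main(String[] args) {
        System.out.println(sameDomain("http://lucene.apache.org", "http://people.apache.org/"));
    }
}
```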
-
getDomainSuffix
final DomainSuffix getDomainSuffix(URL url)
Returns the DomainSuffix corresponding to the last public part of the hostname
-
getDomainSuffix
final DomainSuffix getDomainSuffix(DomainSuffixes tlds, URL url)
-
getDomainSuffix
final DomainSuffix getDomainSuffix(DomainSuffixes tlds, String url)
Returns the DomainSuffix corresponding to the last public part of the hostname
-
getHostBatches
final List<String> getHostBatches(URL url)
Returns the parts of the url's hostname, split on ".".
-
getHostBatches
final List<String> getHostBatches(String url)
Returns the parts of the url's hostname, split on ".".
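As a rough illustration of splitting a hostname on ".", the following sketch uses only java.net.URI. The class HostPartsSketch and the method hostParts are hypothetical, and returning an empty list for a url without a parsable host is an assumption; the behavior of getHostBatches in that case is not documented here.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class HostPartsSketch {
    // Hypothetical helper: splits the host of the url on "." characters.
    static List<String> hostParts(String url) {
        try {
            String host = URI.create(url).getHost();
            if (host == null) {
                return Collections.emptyList(); // assumption: no host, no parts
            }
            return Arrays.asList(host.split("\\."));
        } catch (IllegalArgumentException e) {
            return Collections.emptyList(); // assumption: malformed url, no parts
        }
    }

    public static void main(String[] args) {
        System.out.println(hostParts("http://lucene.apache.org/"));
    }
}
```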
-
chooseRepr
final static String chooseRepr(String src, String dst, Boolean temp)
Given two urls, a src and a destination of a redirect, it returns the representative url.
This method implements an extended version of the algorithm used by the Yahoo! Slurp crawler, described in "How does the Yahoo! webcrawler handle redirects?". In the examples below, * marks the url that is kept.
Choose the target (destination) url if either url is malformed.
If the domains are different, then keep the destination, whether the redirect is temporary or permanent.
    a.com -> b.com*
If the redirect is permanent and the source is root, keep the source.
    *a.com -> a.com?y=1 || *a.com -> a.com/xyz/index.html
If the redirect is permanent, the source is not root and the destination is root, keep the destination.
    a.com/xyz/index.html -> a.com*
If the redirect is permanent and neither the source nor the destination is root, keep the destination.
    a.com/xyz/index.html -> a.com/abc/page.html*
If the redirect is temporary, the source is root and the destination is not root, keep the source.
    *a.com -> a.com/xyz/index.html
If the redirect is temporary, the source is not root and the destination is root, keep the destination.
    a.com/xyz/index.html -> a.com*
If the redirect is temporary and neither the source nor the destination is root, keep the shortest url. First check for the shortest host; if the hosts are equal, compare the paths, first by length and then by the number of / path separators.
    a.com/xyz/index.html -> a.com/abc/page.html*
    *www.a.com/xyz/index.html -> www.news.a.com/xyz/index.html
If the redirect is temporary and both the source and the destination are root, keep the shortest sub-domain.
    *www.a.com -> www.news.a.com
Although it is not part of this logic, a further piece of representative url logic runs during indexing, after scoring. When the basic fields are created before indexing, if a url has a representative url stored, both the url and its representative url (which should never be the same) are checked against their linkrank scores; the higher-scoring one is kept as the url and the lower-scoring one is held as the orig url inside the index.
- Parameters:
src - The source url.
dst - The destination url.
temp - Is the redirect a temporary redirect.
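The redirect rules described for chooseRepr can be sketched as a single decision function. This is an illustrative reading of the documented rules, not the URLUtil source. Several points are assumptions: the class ReprSketch and its helpers are hypothetical, domains are approximated by the last two host labels rather than by a suffix lookup, a url counts as root only when it has an empty or "/" path and no query string, and ties (equal lengths and counts) keep the source.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

class ReprSketch {
    // Parsed view of a url: lowercased host, raw path, raw query (may be null).
    private static final class Parts {
        final String host;
        final String path;
        final String query;

        Parts(String host, String path, String query) {
            this.host = host;
            this.path = path;
            this.query = query;
        }

        // Assumption: root means an empty or "/" path and no query string.
        boolean isRoot() {
            return (path.isEmpty() || path.equals("/")) && query == null;
        }
    }

    private static Parts parse(String url) {
        try {
            URI u = new URI(url);
            if (u.getHost() == null) {
                return null; // treated as malformed
            }
            String path = u.getRawPath() == null ? "" : u.getRawPath();
            return new Parts(u.getHost().toLowerCase(Locale.ROOT), path, u.getRawQuery());
        } catch (URISyntaxException e) {
            return null;
        }
    }

    // Hypothetical stand-in for a domain lookup: last two host labels.
    private static String naiveDomain(String host) {
        String[] labels = host.split("\\.");
        return labels.length <= 2 ? host : labels[labels.length - 2] + "." + labels[labels.length - 1];
    }

    private static int slashes(String s) {
        return (int) s.chars().filter(ch -> ch == '/').count();
    }

    static String chooseRepr(String src, String dst, boolean temp) {
        Parts s = parse(src);
        Parts d = parse(dst);
        if (s == null || d == null) {
            return dst; // either url malformed: keep the destination
        }
        if (!naiveDomain(s.host).equals(naiveDomain(d.host))) {
            return dst; // different domains: keep the destination
        }
        boolean srcRoot = s.isRoot();
        boolean dstRoot = d.isRoot();
        if (!temp) {
            return srcRoot ? src : dst; // permanent: source only if it is root
        }
        if (srcRoot && !dstRoot) {
            return src;
        }
        if (!srcRoot && dstRoot) {
            return dst;
        }
        if (srcRoot) {
            // Temporary, both root: fewest host labels wins; ties keep the source.
            return d.host.split("\\.").length < s.host.split("\\.").length ? dst : src;
        }
        // Temporary, neither root: shortest host, then shortest path, then fewest '/'.
        if (s.host.length() != d.host.length()) {
            return d.host.length() < s.host.length() ? dst : src;
        }
        if (s.path.length() != d.path.length()) {
            return d.path.length() < s.path.length() ? dst : src;
        }
        return slashes(d.path) < slashes(s.path) ? dst : src;
    }

    public static void main(String[] args) {
        System.out.println(chooseRepr("http://a.com", "http://a.com/xyz/index.html", true));
    }
}
```

The examples in the rules omit the scheme; this sketch needs absolute urls such as http://a.com, because a bare a.com has no host component when parsed as a URI and would be treated as malformed.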
-
getHostName
final String getHostName(String url)
Returns the lowercased hostname for the url or null if the url is not well formed.
- Parameters:
url - The url to check.
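The documented contract (lowercased hostname, or null for a url that is not well formed) can be approximated with java.net.URI. The class HostNameSketch and the method hostName are hypothetical, and what counts as "well formed" here is whatever URI parsing accepts, which may differ from the checks URLUtil performs.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

class HostNameSketch {
    // Hypothetical helper: lowercased host, or null when the url has no parsable host.
    static String hostName(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null; // not well formed
        }
    }

    public static void main(String[] args) {
        System.out.println(hostName("http://Lucene.Apache.ORG/Path"));
    }
}
```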
-
getHostName
final String getHostName(String url, String defaultValue)
-
getQuery
final String getQuery(String url)
Returns the path for the url. The path consists of the protocol, host, and path, but does not include the query string. The host is lowercased but the path is not.
- Parameters:
url - The url to check.
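The description above (protocol, host, and path without the query string, with only the host lowercased) can be sketched as follows. The class PageSketch and the method withoutQuery are hypothetical, and this is only a reading of the description, not the behavior of getQuery itself. Following the description literally, the sketch also drops any port number; returning null for an unparsable url is an assumption.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

class PageSketch {
    // Hypothetical helper: scheme://host/path with the host lowercased and
    // the query string (and, as an assumption, any port) left out.
    static String withoutQuery(String url) {
        try {
            URI u = new URI(url);
            if (u.getScheme() == null || u.getHost() == null) {
                return null; // assumption: no result for urls without scheme or host
            }
            String path = u.getRawPath() == null ? "" : u.getRawPath();
            return u.getScheme() + "://" + u.getHost().toLowerCase(Locale.ROOT) + path;
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(withoutQuery("http://Lucene.Apache.ORG/Core/Index.html?x=1"));
    }
}
```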
-