jsoup

jsoup is an open source Java library for working with real-world HTML. It allows you to:

  • scrape and parse HTML from a URL, file, or string
  • find and extract data, using DOM traversal or CSS selectors
  • manipulate the HTML elements, attributes, and text
  • clean user-submitted content against a safe whitelist to prevent XSS attacks
  • output tidy HTML

jsoup is distributed under the MIT license.

The source code is available on GitHub.

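import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Fetch the Wikipedia home page and parse it into a DOM.
Document doc = Jsoup.connect("https://en.wikipedia.org/").get();
System.out.println(doc.title());

// Select the links in the "In the news" section.
Elements newsHeadlines = doc.select("#mp-itn b a");
for (Element headline : newsHeadlines) {
  System.out.printf("%s%n\t%s%n",
    headline.attr("title"), headline.absUrl("href"));
}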
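
The same API also works on HTML held in a string, which covers the parse, select, and manipulate steps from the feature list without any network access. The following is a minimal sketch; the class name ManipulateExample, the method markLinksNoFollow, and the sample markup are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ManipulateExample {
    // Hypothetical helper: adds rel="nofollow" to every link that has an
    // href attribute and returns the resulting body HTML.
    static String markLinksNoFollow(String html) {
        Document doc = Jsoup.parse(html);               // parse from a string
        for (Element link : doc.select("a[href]")) {    // CSS selector
            link.attr("rel", "nofollow");               // set an attribute
        }
        return doc.body().html();                       // serialized output
    }

    public static void main(String[] args) {
        String html = "<p><a href='/a'>A</a> and <a>no href</a></p>";
        System.out.println(markLinksNoFollow(html));
    }
}
```

Only the first link is changed here, because the selector a[href] skips anchors that have no href attribute.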

install

You can download the jar file or add jsoup as a dependency in your Gradle build script.

implementation 'org.jsoup:jsoup:1.13.1'

url

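jsoup lets you get the absolute URL of a link using the abs: attribute prefix (for example, abs:href) or the Node.absUrl(key) method.

Both forms resolve the link against the document's base URL. When you fetch a document from a URL, jsoup sets the base URL implicitly; when you parse HTML from a string, you need to provide it yourself.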

Document doc = Jsoup.connect("http://jsoup.org").get();

Element link = doc.select("a").first();
String relHref = link.attr("href"); // == "/"
String absHref = link.attr("abs:href"); // "http://jsoup.org/"
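
When the HTML comes from a string rather than a connection, the base URL can be passed as the second argument to Jsoup.parse. The sketch below shows one way to do that; the class name BaseUriExample, the helper firstAbsoluteLink, and the example.com addresses are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BaseUriExample {
    // Hypothetical helper: parses an HTML string against a caller-supplied
    // base URL and returns the absolute URL of the first link, or an empty
    // string if there is no link or the URL cannot be made absolute.
    static String firstAbsoluteLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a").first();
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<p><a href='/docs/intro.html'>Intro</a></p>";

        // With a base URL, the relative href is resolved against it.
        System.out.println(firstAbsoluteLink(html, "https://example.com/blog/"));

        // With an empty base URL, absUrl has nothing to resolve against
        // and returns an empty string.
        System.out.println(firstAbsoluteLink(html, ""));
    }
}
```

Checking for an empty result is a simple way to detect links that could not be resolved because no base URL was available.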

sanitize untrusted HTML

Sometimes you need to clean up HTML to avoid cross-site scripting (XSS) attacks, for example when you allow untrusted users to submit HTML, such as comments, that will be displayed on your website.

String unsafe = 
  "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
String safe = Jsoup.clean(unsafe, Whitelist.basic());
// now: <p><a href="http://example.com/" rel="nofollow">Link</a></p>
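
Besides cleaning, it can be useful to check whether submitted HTML already conforms to the whitelist, for instance to show a validation message instead of silently stripping markup. The sketch below assumes jsoup 1.13.1 as in the install section (releases from 1.14.1 onward rename Whitelist to Safelist); the class name SanitizeExample and its helper methods are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class SanitizeExample {
    // Hypothetical helper: strips everything the basic whitelist does not allow.
    static String cleanComment(String html) {
        return Jsoup.clean(html, Whitelist.basic());
    }

    // Hypothetical helper: true only if cleaning would not discard anything.
    static boolean isAlreadySafe(String html) {
        return Jsoup.isValid(html, Whitelist.basic());
    }

    public static void main(String[] args) {
        String comment = "<p>Hello <script>alert(1)</script><b>world</b></p>";

        System.out.println(isAlreadySafe(comment));         // script is not whitelisted
        System.out.println(cleanComment(comment));          // script element is dropped
        System.out.println(isAlreadySafe("<p>Hello</p>"));  // only whitelisted tags
    }
}
```

Jsoup.isValid expects a body fragment rather than a full document, which matches the typical case of a comment field.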