jsoup
jsoup is an open source Java library for working with real-world HTML. It lets you:
- scrape and parse HTML from a URL, file, or string
- find and extract data, using DOM traversal or CSS selectors
- manipulate the HTML elements, attributes, and text
- clean user-submitted content against a safe whitelist to prevent XSS attacks
- output tidy HTML
jsoup is distributed under the MIT license.
The source code is available on GitHub.
Document doc = Jsoup.connect("https://en.wikipedia.org/").get();
log(doc.title());
Elements newsHeadlines = doc.select("#mp-itn b a");
for (Element headline : newsHeadlines) {
    log("%s\n\t%s",
        headline.attr("title"), headline.absUrl("href"));
}
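The same selector API works on HTML that is already in memory, which covers the "string" case from the feature list above. The following is a minimal sketch, not part of the original README: the class name `ParseStringExample`, the helper `headlineTexts`, and the sample markup are all made up for illustration.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class ParseStringExample {

    // Hypothetical helper: returns the text of every <a> inside an
    // element with class "news".
    static List<String> headlineTexts(String html) {
        // Parse from a string; no network access is involved.
        Document doc = Jsoup.parse(html);
        Elements links = doc.select(".news a");
        return links.eachText();
    }

    public static void main(String[] args) {
        // Sample markup, invented for this example.
        String html = "<html><head><title>Sample</title></head><body>"
                + "<div class='news'><a href='/a'>First</a> <a href='/b'>Second</a></div>"
                + "</body></html>";
        System.out.println(headlineTexts(html)); // [First, Second]
    }
}
```

Because nothing is downloaded, a sketch like this is also a convenient way to try out selectors before pointing them at a live page.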
install
You can download the jar file or add jsoup as a dependency in your Gradle build script:
implementation 'org.jsoup:jsoup:1.13.1'
url
jsoup can resolve a relative link to an absolute URL, either with the abs: attribute prefix (for example abs:href) or with the Node.absUrl(key) method.
In both cases the document needs a base URL. When you fetch a document from a URL, the base URL is set implicitly.
Document doc = Jsoup.connect("http://jsoup.org").get();
Element link = doc.select("a").first();
String relHref = link.attr("href"); // == "/"
String absHref = link.attr("abs:href"); // "http://jsoup.org/"
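When the HTML comes from a string rather than a download, the base URL has to be passed explicitly. The sketch below assumes the two-argument `Jsoup.parse(html, baseUri)` overload; the class name `BaseUriExample`, the helper `firstLinkAbsUrl`, and the `example.com` URLs are placeholders chosen for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BaseUriExample {

    // Hypothetical helper: resolves the first link in the HTML against baseUri.
    static String firstLinkAbsUrl(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a").first();
        return link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<p><a href='../about.html'>About</a></p>";

        // With a base URL, the relative href is resolved against it.
        System.out.println(firstLinkAbsUrl(html, "https://example.com/docs/"));
        // https://example.com/about.html

        // Without a base URL, a relative href cannot be resolved and
        // absUrl returns an empty string.
        System.out.println(firstLinkAbsUrl(html, "").isEmpty()); // true
    }
}
```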
sanitize untrusted HTML
If you let untrusted users submit HTML that your website later displays, for example in comments, you need to clean it first to prevent cross-site scripting (XSS) attacks.
String unsafe =
    "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
String safe = Jsoup.clean(unsafe, Whitelist.basic());
// now: <p><a href="http://example.com/" rel="nofollow">Link</a></p>
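Sometimes it is more useful to reject unsafe input than to silently rewrite it. The sketch below uses `Jsoup.isValid` with the same `Whitelist.basic()` as above and assumes the jsoup 1.13.1 API shown in the install section (newer releases rename `Whitelist` to `Safelist`). The class name `ValidateHtmlExample` and the helper `isSafe` are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class ValidateHtmlExample {

    // Hypothetical helper: true when the fragment already satisfies
    // Whitelist.basic(), i.e. cleaning would not discard anything.
    static boolean isSafe(String bodyHtml) {
        return Jsoup.isValid(bodyHtml, Whitelist.basic());
    }

    public static void main(String[] args) {
        String unsafe =
            "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";

        // The onclick attribute is not on the whitelist, so validation fails.
        System.out.println(isSafe(unsafe)); // false

        // Plain paragraph and bold tags are allowed by Whitelist.basic().
        System.out.println(isSafe("<p>Hello <b>world</b></p>")); // true
    }
}
```

A check like this could run when a comment is submitted, with `Jsoup.clean` still applied before output as a second line of defence.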