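Jsoup is an HTML parser for Java that can parse HTML directly from a URL or from HTML text. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.
It is useful in web crawling: after a crawler fetches a page, the page has to be parsed, and Jsoup is a library built specifically for parsing HTML pages.
GitHub repository: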
https://github.com/jhy/jsoup/
1. Parse HTML from a URL, a file, or a string
2. Find and extract data using DOM traversal or CSS selectors
3. Manipulate HTML elements, attributes, and text
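As a rough illustration of the three capabilities listed above, the sketch below parses an inline HTML string, finds an element with a CSS selector, and then changes its text and one attribute. The class name JsoupQuickTour, the sample markup, and the data-seen attribute are made up for this example; only the jsoup calls (Jsoup.parse, selectFirst, text, attr) come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupQuickTour {

    // Hypothetical sample markup, not the article's test.html.
    static final String HTML = "<div id=\"box\"><p class=\"msg\">hello</p></div>";

    static Element rewriteMessage() {
        // 1. Parse HTML from a string.
        Document doc = Jsoup.parse(HTML);
        // 2. Find an element with a CSS selector.
        Element p = doc.selectFirst("p.msg");
        // 3. Manipulate its text and attributes.
        p.text("hi");
        p.attr("data-seen", "true");
        return p;
    }

    public static void main(String[] args) {
        Element p = rewriteMessage();
        System.out.println(p.text());
        System.out.println(p.attr("data-seen"));
    }
}
```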
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.15.2</version>
</dependency>
Creating an HTML test file
Add the Commons IO package, which is used to read the HTML file
<dependency>
<groupId>commons-io</groupId>
<artifactId>commons-io</artifactId>
<version>2.11.0</version>
</dependency>
Create a test.html file to use for the parsing tests.
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
<div id="myDiv" class="myDivClass1 myDivClass2" myTag="1">
    <h2 id="title">根据id查询元素</h2>
    <span>根据标签获取元素</span>
    <div class="myClass">根据class获取元素</div>
    <span myTag="1">根据属性获取元素1</span>
    <span myTag="2">根据属性获取元素2</span>
</div>
</body>
</html>
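Parsing a URL
Send a request, fetch the data, and wrap it in a Document object
// Imports needed by this snippet (the test class is assumed to expose an SLF4J logger named log)
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.net.URL;

@Test
public void parseUrl() throws Exception {
    /*
     * Parse a URL.
     * URL url: the URL to request
     * int timeoutMillis: the request timeout in milliseconds
     */
    Document doc = Jsoup.parse(new URL("http://www.baidu.com"), 5000);
    String title = doc.getElementsByTag("title").first().text();
    log.info("title={}", title);
}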
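Besides Jsoup.parse(URL, int), jsoup also offers Jsoup.connect, which returns a Connection on which request settings such as the timeout and user agent can be set before the request is sent. The following is only a sketch: the class name ConnectSketch, the helper names, and the user-agent string are assumptions made for illustration, and fetchTitle needs network access to run.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class ConnectSketch {

    // Hypothetical user-agent string, chosen only for this example.
    static final String USER_AGENT = "Mozilla/5.0 (compatible; demo-crawler)";

    // Builds the request without sending it, so the settings can be inspected.
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .userAgent(USER_AGENT)
                .timeout(5000); // milliseconds, same value as in the article's test
    }

    // Sends a GET request and returns the page title (requires network access).
    static String fetchTitle(String url) throws IOException {
        Document doc = prepare(url).get();
        return doc.title();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(fetchTitle("http://www.baidu.com"));
    }
}
```

Splitting prepare from fetchTitle keeps the configuration step free of I/O, which makes it easy to check in a unit test.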
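Parsing a string
Read an HTML file into a string, pass the string in directly, and wrap it in a Document object
@Test
public void parseString() throws IOException {
    // Read the whole file into a string with Commons IO
    String content = FileUtils.readFileToString(new File("D:\\test.html"), "utf8");
    // Parse the string into a Document
    Document doc = Jsoup.parse(content);
    String title = doc.getElementsByTag("title").first().text();
    log.info("title={}", title);
}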
Parse an HTML file directly and wrap it in a Document object
@Test
public void parseFile() throws Exception {
    // Parse the file using the given charset
    Document doc = Jsoup.parse(new File("D:\\test.html"), "utf8");
    String title = doc.getElementsByTag("title").first().text();
    log.info("title={}", title);
}
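The tests above read D:\test.html, a path that only exists on one machine. A portable variant might write the HTML to a temporary file first and parse that instead. The sketch below assumes this approach; the class FileParseSketch and its helpers are hypothetical names, and it uses java.nio.file rather than Commons IO.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class FileParseSketch {

    // Parses the file as UTF-8 and returns the text of its <title>.
    static String titleOf(Path htmlFile) throws IOException {
        Document doc = Jsoup.parse(htmlFile.toFile(), "UTF-8");
        return doc.title();
    }

    // Writes a tiny page with the given title to a temp file, parses it, and cleans up.
    static String roundTrip(String title) {
        try {
            Path tmp = Files.createTempFile("jsoup-demo", ".html");
            try {
                String html = "<html><head><title>" + title
                        + "</title></head><body></body></html>";
                Files.write(tmp, html.getBytes(StandardCharsets.UTF_8));
                return titleOf(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("Title"));
    }
}
```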
Traversing the document with the DOM
1. Getting elements
1. Find an element by id: getElementById
2. Find elements by tag: getElementsByTag
3. Find elements by class: getElementsByClass
4. Find elements by attribute: getElementsByAttribute
@Test
public void parseDOM() throws Exception {
    Document doc = Jsoup.parse(new File("D:\\test.html"), "utf8");
    // By id
    Element element1 = doc.getElementById("title");
    log.info("element1={}", element1);
    // By tag name
    Element element2 = doc.getElementsByTag("span").first();
    log.info("element2={}", element2);
    // By class name
    Element element3 = doc.getElementsByClass("myClass").first();
    log.info("element3={}", element3);
    // By attribute name; matches every element that carries myTag, in document order
    Element element4 = doc.getElementsByAttribute("myTag").first();
    log.info("element4={}", element4);
    // By attribute name and value
    Element element5 = doc.getElementsByAttributeValue("myTag", "2").first();
    log.info("element5={}", element5);
}
20:08:37.918 [main] INFO com.example.demo.test.JsoupFirstTest - element1=<h2 id="title">根据id查询元素</h2>
20:08:37.923 [main] INFO com.example.demo.test.JsoupFirstTest - element2=<span>根据标签获取元素</span>
20:08:37.923 [main] INFO com.example.demo.test.JsoupFirstTest - element3=<div class="myClass">
 根据class获取元素
</div>
20:08:37.924 [main] INFO com.example.demo.test.JsoupFirstTest - element4=<span mytag="1">根据属性获取元素1</span>
20:08:37.924 [main] INFO com.example.demo.test.JsoupFirstTest - element5=<span mytag="2">根据属性获取元素2</span>
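The lookups above use the result of getElementById and first() directly. In jsoup, getElementById returns null when no element has the given id, and Elements.first() returns null on an empty collection, so a missing element would surface as a NullPointerException. A defensive sketch, with LookupSketch, its helper names, and the shortened markup all being assumptions for illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class LookupSketch {

    // Shortened, hypothetical stand-in for the article's test.html.
    static final String HTML =
            "<div id=\"myDiv\"><h2 id=\"title\">heading</h2><span>first span</span></div>";

    // Returns the element's text, or an empty string if the id does not exist.
    static String textById(Document doc, String id) {
        Element el = doc.getElementById(id);
        return el == null ? "" : el.text();
    }

    // Returns the text of the first element with this tag, or an empty string if none match.
    static String firstTextByTag(Document doc, String tag) {
        Elements found = doc.getElementsByTag(tag);
        return found.isEmpty() ? "" : found.first().text();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println(textById(doc, "title"));
        System.out.println(textById(doc, "no-such-id"));
        System.out.println(firstTextByTag(doc, "span"));
        System.out.println(firstTextByTag(doc, "table"));
    }
}
```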
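2. Getting data from an element
1. Get the id from an element
2. Get the className from an element
3. Get an attribute value from an element: attr
4. Get all attributes from an element: attributes
5. Get the text content from an element: text
@Test
public void parseData() throws Exception {
    Document doc = Jsoup.parse(new File("D:\\test.html"), "utf8");
    Element element = doc.getElementById("myDiv");
    // The id attribute
    String id = element.id();
    log.info("id={}", id);
    // The class attribute as a single string
    String className = element.className();
    log.info("className={}", className);
    // The class attribute split into individual class names
    Set<String> classSet = element.classNames();
    for (String s : classSet) {
        log.info("className={}", s);
    }
    // The value of a single attribute
    String attr = element.attr("myTag");
    log.info("attr={}", attr);
    // All attributes of the element
    Attributes attributes = element.attributes();
    List<Attribute> attributesList = attributes.asList();
    for (Attribute attribute : attributesList) {
        log.info("attribute={}", attribute);
    }
    // The combined text of the element and its descendants
    String text = element.text();
    log.info("text={}", text);
}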
20:15:06.628 [main] INFO com.example.demo.test.JsoupFirstTest - id=myDiv
20:15:06.632 [main] INFO com.example.demo.test.JsoupFirstTest - className=myDivClass1 myDivClass2
20:15:06.632 [main] INFO com.example.demo.test.JsoupFirstTest - className=myDivClass1
20:15:06.632 [main] INFO com.example.demo.test.JsoupFirstTest - className=myDivClass2
20:15:06.632 [main] INFO com.example.demo.test.JsoupFirstTest - attr=1
20:15:06.633 [main] INFO com.example.demo.test.JsoupFirstTest - attribute=id="myDiv"
20:15:06.634 [main] INFO com.example.demo.test.JsoupFirstTest - attribute=class="myDivClass1 myDivClass2"
20:15:06.634 [main] INFO com.example.demo.test.JsoupFirstTest - attribute=mytag="1"
20:15:06.635 [main] INFO com.example.demo.test.JsoupFirstTest - text=根据id查询元素 根据标签获取元素 根据class获取元素 根据属性获取元素1 根据属性获取元素2
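The output above prints the attribute as mytag even though the HTML declares myTag. With jsoup's default HTML parser settings, attribute names are normalized to lowercase, while attr looks the name up case-insensitively. The sketch below illustrates this along with a few related accessors (hasAttr, hasClass, ownText). The class ElementDataSketch and its simplified markup are assumptions for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class ElementDataSketch {

    // Simplified, hypothetical variant of the article's myDiv element.
    static final String HTML =
            "<div id=\"myDiv\" class=\"myDivClass1 myDivClass2\" myTag=\"1\">"
                    + "outer <span>inner</span></div>";

    static Element myDiv() {
        return Jsoup.parse(HTML).getElementById("myDiv");
    }

    public static void main(String[] args) {
        Element div = myDiv();
        // attr ignores case when looking up the attribute name.
        System.out.println(div.attr("myTag"));
        // The stored key is the lowercased name.
        System.out.println(div.attributes().hasKey("mytag"));
        // A missing attribute yields an empty string rather than null.
        System.out.println(div.attr("missing").isEmpty());
        System.out.println(div.hasAttr("missing"));
        // hasClass tests a single class name.
        System.out.println(div.hasClass("myDivClass2"));
        // text() includes descendants; ownText() only the element's own text nodes.
        System.out.println(div.text());
        System.out.println(div.ownText());
    }
}
```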
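Finding elements with selectors
jsoup elements support a selector syntax similar to CSS (or jQuery), which allows very powerful and flexible lookups.
The select method is available on Document, Element, and Elements objects. It is context-sensitive, so it can be used to filter within a specific element or to chain selections.
The select method returns an Elements collection, which provides a set of methods for extracting and processing the results.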
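To make the context-sensitive behaviour concrete, the sketch below runs the same span selector in three ways: on the whole document, on a single element, and as a chained call on an Elements collection. The class ScopedSelectSketch and its two-div markup are invented for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.List;

public class ScopedSelectSketch {

    // Hypothetical markup with spans in two separate containers.
    static final String HTML =
            "<div id=\"a\"><span>1</span></div>"
                    + "<div id=\"b\"><span>2</span><span>3</span></div>";

    // select on the Document searches the whole tree.
    static List<String> allSpans() {
        Document doc = Jsoup.parse(HTML);
        return doc.select("span").eachText();
    }

    // select on an Element only searches inside that element.
    static List<String> spansInsideB() {
        Document doc = Jsoup.parse(HTML);
        Element b = doc.getElementById("b");
        return b.select("span").eachText();
    }

    // select on an Elements collection searches inside each member, so calls can be chained.
    static List<String> chained() {
        Document doc = Jsoup.parse(HTML);
        Elements divs = doc.select("div");
        return divs.select("span").eachText();
    }

    public static void main(String[] args) {
        System.out.println(allSpans());
        System.out.println(spansInsideB());
        System.out.println(chained());
    }
}
```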
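| Selector | Description | Example |
|---|---|---|
| tagname | Find elements by tag name | span |
| #id | Find an element by id | #myId |
| .class | Find elements by class name | .myClass |
| [attribute] | Find elements that have an attribute | [myTag] |
| [attribute=value] | Find elements by attribute value | [myTag=1] |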
@Test
public void parseSelector() throws Exception {
    Document doc = Jsoup.parse(new File("D:\\test.html"), "utf8");
    // By tag name; the separator is logged after each match
    Elements elements = doc.select("span");
    for (Element element : elements) {
        log.info("element={}", element);
        log.info("----------------------------------------------------");
    }
    // By id
    Element id = doc.select("#title").first();
    log.info("id={}", id);
    log.info("----------------------------------------------------");
    // By class name
    Element classs = doc.select(".myClass").first();
    log.info("classs={}", classs);
    log.info("----------------------------------------------------");
    // By attribute name
    Element myTag = doc.select("[myTag]").first();
    log.info("myTag={}", myTag);
    log.info("----------------------------------------------------");
    // By attribute name and value
    Elements myTag1 = doc.select("[myTag=1]");
    for (Element element : myTag1) {
        log.info("element={}", element);
    }
}
20:19:25.247 [main] INFO com.example.demo.test.JsoupFirstTest - element=<span>根据标签获取元素</span>
20:19:25.251 [main] INFO com.example.demo.test.JsoupFirstTest - ----------------------------------------------------
20:19:25.251 [main] INFO com.example.demo.test.JsoupFirstTest - element=<span mytag="1">根据属性获取元素1</span>
20:19:25.252 [main] INFO com.example.demo.test.JsoupFirstTest - ----------------------------------------------------
20:19:25.252 [main] INFO com.example.demo.test.JsoupFirstTest - element=<span mytag="2">根据属性获取元素2</span>
20:19:25.252 [main] INFO com.example.demo.test.JsoupFirstTest - ----------------------------------------------------
20:19:25.252 [main] INFO com.example.demo.test.JsoupFirstTest - id=<h2 id="title">根据id查询元素</h2>
20:19:25.253 [main] INFO com.example.demo.test.JsoupFirstTest - ----------------------------------------------------
20:19:25.253 [main] INFO com.example.demo.test.JsoupFirstTest - classs=<div class="myClass">
 根据class获取元素
</div>
20:19:25.253 [main] INFO com.example.demo.test.JsoupFirstTest - ----------------------------------------------------
20:19:25.253 [main] INFO com.example.demo.test.JsoupFirstTest - myTag=<div id="myDiv" class="myDivClass1 myDivClass2" mytag="1">
 <h2 id="title">根据id查询元素</h2> <span>根据标签获取元素</span>
 <div class="myClass">
  根据class获取元素
 </div> <span mytag="1">根据属性获取元素1</span> <span mytag="2">根据属性获取元素2</span>
</div>
20:19:25.253 [main] INFO com.example.demo.test.JsoupFirstTest - ----------------------------------------------------
20:19:25.253 [main] INFO com.example.demo.test.JsoupFirstTest - element=<div id="myDiv" class="myDivClass1 myDivClass2" mytag="1">
 <h2 id="title">根据id查询元素</h2> <span>根据标签获取元素</span>
 <div class="myClass">
  根据class获取元素
 </div> <span mytag="1">根据属性获取元素1</span> <span mytag="2">根据属性获取元素2</span>
</div>
20:19:25.253 [main] INFO com.example.demo.test.JsoupFirstTest - element=<span mytag="1">根据属性获取元素1</span>
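Combining selectors
| Selector | Description | Example |
|---|---|---|
| el#id | Element plus id | h1#myId |
| el.class | Element plus class | li.myClass |
| el[attr] | Element plus attribute name | span[myTag] |
| ancestor child | Find descendant elements under an element | #myId p |
| parent > child | Find direct child elements of a parent element | #myId>p |
| parent > * | Find all direct child elements of a parent element | #myId > * |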
@Test
public void parseSelector2() throws Exception {
    Document doc = Jsoup.parse(new File("D:\\test.html"), "utf8");
    // Element plus id
    Element element1 = doc.select("div#myDiv").first();
    log.info("element1={}", element1);
    log.info("----------------------------------------------------");
    // Element plus class
    Element element2 = doc.select("div.myClass").first();
    log.info("element2={}", element2);
    log.info("----------------------------------------------------");
    // Element plus attribute name
    Element element3 = doc.select("span[myTag]").first();
    log.info("element3={}", element3);
    log.info("----------------------------------------------------");
    // All span descendants of #myDiv
    Elements elements1 = doc.select("#myDiv span");
    for (Element element : elements1) {
        log.info("element={}", element);
    }
    log.info("----------------------------------------------------");
    // h2 elements that are direct children of #myDiv
    Elements elements2 = doc.select("#myDiv>h2");
    for (Element element : elements2) {
        log.info("element={}", element);
    }
    log.info("----------------------------------------------------");
    // All direct children of #myDiv
    Elements elements3 = doc.select("#myDiv>*");
    for (Element element : elements3) {
        log.info("element={}", element);
    }
}
20:23:17.938 [main] INFO com.example.demo.test.JsoupFirstTest - element1=<div id="myDiv" class="myDivClass1 myDivClass2" mytag="1">
 <h2 id="title">根据id查询元素</h2> <span>根据标签获取元素</span>
 <div class="myClass">
  根据class获取元素
 </div> <span mytag="1">根据属性获取元素1</span> <span mytag="2">根据属性获取元素2</span>
</div>
20:23:17.944 [main] INFO com.example.demo.test.JsoupFirstTest - ----------------------------------------------------
20:23:17.945 [main] INFO com.example.demo.test.JsoupFirstTest - element2=<div class="myClass">
 根据class获取元素
</div>
20:23:17.945 [main] INFO com.example.demo.test.JsoupFirstTest - ----------------------------------------------------
20:23:17.945 [main] INFO com.example.demo.test.JsoupFirstTest - element3=<span mytag="1">根据属性获取元素1</span>
20:23:17.945 [main] INFO com.example.demo.test.JsoupFirstTest - ----------------------------------------------------
20:23:17.946 [main] INFO com.example.demo.test.JsoupFirstTest - element=<span>根据标签获取元素</span>
20:23:17.946 [main] INFO com.example.demo.test.JsoupFirstTest - element=<span mytag="1">根据属性获取元素1</span>
20:23:17.946 [main] INFO com.example.demo.test.JsoupFirstTest - element=<span mytag="2">根据属性获取元素2</span>
20:23:17.946 [main] INFO com.example.demo.test.JsoupFirstTest - ----------------------------------------------------
20:23:17.947 [main] INFO com.example.demo.test.JsoupFirstTest - element=<h2 id="title">根据id查询元素</h2>
20:24:22.333 [main] INFO com.example.demo.test.JsoupFirstTest - ----------------------------------------------------
20:24:22.333 [main] INFO com.example.demo.test.JsoupFirstTest - element=<h2 id="title">根据id查询元素</h2>
20:24:22.333 [main] INFO com.example.demo.test.JsoupFirstTest - element=<span>根据标签获取元素</span>
20:24:22.333 [main] INFO com.example.demo.test.JsoupFirstTest - element=<div class="myClass">
 根据class获取元素
</div>
20:24:22.333 [main] INFO com.example.demo.test.JsoupFirstTest - element=<span mytag="1">根据属性获取元素1</span>
20:24:22.333 [main] INFO com.example.demo.test.JsoupFirstTest - element=<span mytag="2">根据属性获取元素2</span>
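In the test.html used above, every span is a direct child of #myDiv, so the descendant and child combinators return the same spans. The difference only shows up with nested markup. The sketch below uses a hypothetical variant in which one span sits inside div.myClass; the class CombinatorSketch and its HTML are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.util.List;

public class CombinatorSketch {

    // Hypothetical nested markup: one span directly under #myDiv, one nested deeper.
    static final String HTML =
            "<div id=\"myDiv\">"
                    + "<span>direct</span>"
                    + "<div class=\"myClass\"><span>nested</span></div>"
                    + "</div>";

    // Runs a selector against the sample markup and returns the text of each match.
    static List<String> texts(String selector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(selector).eachText();
    }

    // Runs a selector against the sample markup and returns the number of matches.
    static int count(String selector) {
        return Jsoup.parse(HTML).select(selector).size();
    }

    public static void main(String[] args) {
        // Descendant combinator: spans at any depth below #myDiv.
        System.out.println(texts("#myDiv span"));
        // Child combinator: only spans directly under #myDiv.
        System.out.println(texts("#myDiv > span"));
        // All direct children of #myDiv (the span and the inner div).
        System.out.println(count("#myDiv > *"));
    }
}
```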