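I recently used Jsoup, a Java library for parsing HTML, so this article explains how to parse the content of an HTML page in Java and store it in a database. The example code is walked through in detail and may serve as a reference for anyone who needs it.
jsoup is an HTML parser for Java that can parse a URL or a string of HTML text directly. It provides a very convenient API for extracting and manipulating data through DOM methods, CSS selectors, and jQuery-like operations. It is released under the MIT license.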
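// Fetch the page and get its Document object.
// Jsoup.parse(String) treats its argument as HTML text, so a URL has to be loaded with Jsoup.connect(url).get().
Document doc = Jsoup.connect("http://www.dangdang.com").get();
// Get the element with id="content"
Element content = doc.getElementById("content");
// Get the <a> tags inside that element
Elements links = content.getElementsByTag("a");
for (Element link : links) {
    // Read the href attribute of the <a> tag
    String linkHref = link.attr("href");
    // Read the text content of the <a> tag
    String linkText = link.text();
}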
Element objects provide a set of DOM-like methods for finding elements and for extracting and manipulating their data, as listed below:
1. Finding elements
getElementById(String id)
getElementsByTag(String tag)
getElementsByClass(String className)
getElementsByAttribute(String key) (and related methods)
Element siblings: siblingElements(), firstElementSibling(), lastElementSibling(), nextElementSibling(), previousElementSibling()
Graph: parent(), children(), child(int index)
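The lookup methods above can be tried on a small in-memory document. The following is a minimal sketch: the class name FindElementsDemo and the sample markup are made up for illustration, and only the jsoup calls listed above are used.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class FindElementsDemo {
    // Hypothetical sample markup, used only for illustration.
    static final String HTML =
            "<div id=\"content\">"
            + "<p class=\"intro\">Hello</p>"
            + "<a href=\"/a\" title=\"first\">A</a>"
            + "<a href=\"/b\">B</a>"
            + "</div>";

    static Document parse() {
        // Jsoup.parse(String) parses HTML text; it does not fetch anything.
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document doc = parse();
        Element content = doc.getElementById("content");
        Elements links = content.getElementsByTag("a");
        Elements intro = doc.getElementsByClass("intro");
        Elements titled = doc.getElementsByAttribute("title");
        Element firstLink = links.first();

        System.out.println(links.size());                                    // 2
        System.out.println(intro.text());                                    // Hello
        System.out.println(titled.size());                                   // 1
        System.out.println(firstLink.nextElementSibling().text());           // B
        System.out.println(firstLink.previousElementSibling().className());  // intro
        System.out.println(firstLink.parent().id());                         // content
        System.out.println(content.children().size());                       // 3
        System.out.println(content.child(0).tagName());                      // p
    }
}
```

Note that getElementById returns null when nothing matches, so real pages need a null check before chaining further calls.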
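2. Element data
attr(String key): get an attribute
attr(String key, String value): set an attribute
attributes(): get all attributes
id(), className() and classNames()
text(): get the text content
text(String value): set the text content
html(): get the element's inner HTML
html(String value): set the element's inner HTML
outerHtml(): get the element's outer HTML
data(): get the data content (for example, of script and style tags)
tag() and tagName()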
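To make the difference between these accessors concrete, here is a small sketch. The class name ElementDataDemo, the sample markup, and the data-level attribute are assumptions for illustration only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ElementDataDemo {
    // Hypothetical sample markup, used only for illustration.
    static final String HTML =
            "<div id=\"box\" class=\"a b\"><p>Hi <b>there</b></p></div>"
            + "<script>var x = 1;</script>";

    static Document parse() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document doc = parse();
        Element box = doc.getElementById("box");

        box.attr("data-level", "1");                  // set an attribute
        System.out.println(box.attr("data-level"));   // 1
        System.out.println(box.attributes().size());  // 3 (id, class, data-level)
        System.out.println(box.id());                 // box
        System.out.println(box.className());          // a b
        System.out.println(box.classNames().size());  // 2
        System.out.println(box.tagName());            // div
        System.out.println(box.text());               // Hi there
        System.out.println(box.html());               // inner HTML: the <p> element
        System.out.println(box.outerHtml());          // the <div> element including its own tag

        // Script content is exposed through data(), not text().
        Element script = doc.getElementsByTag("script").first();
        System.out.println(script.data());            // var x = 1;
    }
}
```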
3. Manipulating HTML and text
append(String html), prepend(String html)
appendText(String text), prependText(String text)
appendElement(String tagName), prependElement(String tagName)
html(String value)
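The manipulation methods can be sketched the same way. The class name ManipulateDemo and the markup are hypothetical; the point is only to show where each call places its content relative to the existing children.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ManipulateDemo {
    // Builds a small element and applies each manipulation method once.
    static Element build() {
        Document doc = Jsoup.parse("<div id=\"box\"><p>middle</p></div>");
        Element box = doc.getElementById("box");

        box.prepend("<h1>Title</h1>");          // HTML added before the existing children
        box.append("<p>end</p>");               // HTML added after the existing children
        box.appendText(" tail");                // plain text node at the end
        box.prependText("head ");               // plain text node at the start
        box.appendElement("span").text("new");  // new empty <span> at the end, then its text is set
        box.prependElement("em").text("first"); // new empty <em> at the very start
        return box;
    }

    public static void main(String[] args) {
        Element box = build();
        System.out.println(box.children().size());        // 5 element children: em, h1, p, p, span
        System.out.println(box.child(0).tagName());       // em
        System.out.println(box.children().last().text()); // new

        // html(String) replaces everything inside the element.
        box.html("<p>replaced</p>");
        System.out.println(box.children().size());        // 1
        System.out.println(box.text());                   // replaced
    }
}
```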
III. Scraping the website data
Straight to the code:
Test.java:
import java.util.List;
import lombok.extern.slf4j.Slf4j;
import org.springframework.boot.test.context.SpringBootTest;
// Also needed: javax.annotation.Resource (jakarta.annotation.Resource on Spring Boot 3) and the project's own classes.

@Slf4j
@SpringBootTest
class Test {

    @Resource
    private PositionService positionService;

    /**
     * Crawl the province/city/district website.
     */
    // The class itself is named Test, so the JUnit annotation is fully qualified (JUnit 5 assumed).
    @org.junit.jupiter.api.Test
    public void test2() throws InterruptedException {
        // Five levels in total
        for (int i = 0; i < 5; i++) {
            if (i == 0) {
                // Top level: request the start page, there is no parent
                List<PositionEntity> positionEntities = PositionUtils.reqPosition(PositionUtils.URL_HEAD);
                savePosition(positionEntities, null, i);
                continue;
            }
            // Lower levels: request the child page of every address saved at level i
            List<Position> positions = positionService.findListByLevel(i);
            for (Position parentPosition : positions) {
                List<PositionEntity> positionEntities = PositionUtils.reqPosition(String.format("%s%s%s", PositionUtils.URL_HEAD, parentPosition.getSn(), PositionUtils.URL_TAIL));
                savePosition(positionEntities, parentPosition, i);
            }
        }
    }

    /**
     * Save the address information.
     */
    private void savePosition(List<PositionEntity> positionEntities, Position parentPosition, int i) {
        for (PositionEntity entity : positionEntities) {
            // The full name is the parent's full name followed by this entry's name
            String fullName = (parentPosition != null ? parentPosition.getFullName() : "") + entity.getName();
            Position position = new Position();
            position.setSn(entity.getCode());
            position.setFullInitials(PinyinUtils.strFirst2Pinyin(fullName));
            position.setFullName(fullName);
            position.setLevel(i + 1);
            position.setName(entity.getName());
            position.setOrderNumber(0);
            position.setPsn(parentPosition != null ? parentPosition.getSn() : 0L);
            // Insert only if this address code has not been stored yet
            long count = positionService.countBySn(position.getSn());
            if (count == 0) {
                positionService.savePosition(position);
            }
        }
    }
}
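The level-by-level flow of the test above (fetch the top level, then fetch the children of everything saved at the previous level, skipping codes that already exist) can be tried without a database or network. The sketch below is an in-memory stand-in, not the article's implementation: LevelCrawlSketch, Region, sampleFetcher, and the sample codes and names are all hypothetical, and a map keyed by code replaces the countBySn check.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class LevelCrawlSketch {
    // Hypothetical in-memory counterpart of the Position entity.
    static final class Region {
        final long code;
        final String name;
        final long parentCode;
        final int level;
        final String fullName;

        Region(long code, String name, long parentCode, int level, String fullName) {
            this.code = code;
            this.name = name;
            this.parentCode = parentCode;
            this.level = level;
            this.fullName = fullName;
        }
    }

    // fetch maps a parent code (0 for the top level) to its children as (code, name) pairs.
    static Map<Long, Region> crawl(Function<Long, List<Map.Entry<Long, String>>> fetch, int maxLevels) {
        Map<Long, Region> saved = new LinkedHashMap<>();
        List<Region> parents = Collections.singletonList(new Region(0L, "", 0L, 0, ""));
        for (int level = 1; level <= maxLevels && !parents.isEmpty(); level++) {
            List<Region> next = new ArrayList<>();
            for (Region parent : parents) {
                for (Map.Entry<Long, String> child : fetch.apply(parent.code)) {
                    Region region = new Region(child.getKey(), child.getValue(),
                            parent.code, level, parent.fullName + child.getValue());
                    // Store each code only once, like the count == 0 check before insert.
                    if (saved.putIfAbsent(region.code, region) == null) {
                        next.add(region);
                    }
                }
            }
            parents = next;
        }
        return saved;
    }

    // Made-up sample data standing in for the real pages.
    static Function<Long, List<Map.Entry<Long, String>>> sampleFetcher() {
        Map<Long, List<Map.Entry<Long, String>>> pages = new HashMap<>();
        pages.put(0L, Arrays.asList(entry(1L, "North")));
        pages.put(1L, Arrays.asList(entry(11L, "Harbor"), entry(12L, "Hill"), entry(11L, "Harbor")));
        return code -> pages.getOrDefault(code, Collections.emptyList());
    }

    static Map.Entry<Long, String> entry(long code, String name) {
        return new AbstractMap.SimpleEntry<>(code, name);
    }

    public static void main(String[] args) {
        Map<Long, Region> saved = crawl(sampleFetcher(), 5);
        System.out.println(saved.size());             // 3
        System.out.println(saved.get(11L).fullName);  // NorthHarbor
        System.out.println(saved.get(12L).level);     // 2
    }
}
```

Carrying the previous level's results forward in memory avoids the findListByLevel query, at the cost of not being resumable after a crash; the database-backed version in the test can be restarted at any level.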
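PositionService.java:
public interface PositionService {
    // Insert one address record
    void savePosition(Position position);
    // Count the records that have the given address code
    long countBySn(Long sn);
    // List the addresses at the given level
    List<Position> findListByLevel(Integer level);
}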
PositionServiceImpl.java:
@Service
public class PositionServiceImpl extends ServiceImpl<PositionMapper, Position> implements PositionService {

    @Override
    public void savePosition(Position position) {
        baseMapper.insert(position);
    }

    @Override
    public long countBySn(Long sn) {
        return baseMapper.selectCount(new QueryWrapper<Position>().lambda().eq(Position::getSn, sn));
    }

    @Override
    public List<Position> findListByLevel(Integer level) {
        return baseMapper.selectList(new QueryWrapper<Position>().lambda().eq(Position::getLevel, level));
    }
}
PositionMapper.java:
@Repository
public interface PositionMapper extends BaseMapper<Position> {
}
Position.java:
@Data
@TableName("position")
@EqualsAndHashCode()
public class Position implements Serializable {

    @TableId(type = IdType.AUTO)
    private Integer id;

    private Long sn;

    /** Parent address code */
    private Long psn;

    private String name;

    private String shortName;

    private Integer level;

    private String code;

    /** Postal code */
    private String zip;

    private String spell;

    /** Pinyin initials */
    private String spellFirst;

    /** Full address name */
    private String fullName;

    /** Pinyin initials of the full address name */
    private String fullInitials;

    private Integer orderNumber;
}
PositionMapper.xml:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE mapper PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN" "http://mybatis.org/dtd/mybatis-3-mapper.dtd">
<mapper namespace="com.wkf.workrecord.dao.PositionMapper">
</mapper>
PositionUtils.java:
public class PositionUtils {

    public final static String URL_HEAD = "https://xingzhengquhua.bmcx.com/";
    public final static String URL_TAIL = "__xingzhengquhua/";

    public static List<PositionEntity> reqPosition(String url) throws InterruptedException {
        // HttpUtils and CheckUtils are project helper classes
        String htmlStr = HttpUtils.getRequest(url);
        // Parse the string into a Document object
        Document doc = Jsoup.parse(htmlStr);
        List<PositionEntity> result = new ArrayList<>();
        // From the body, get the table element with bgcolor="#C5D5C5"
        Elements table = doc.body().getElementsByAttributeValue("bgcolor", "#C5D5C5");
        if (table.isEmpty()) {
            // No matching table on this page, so there is nothing to parse
            return result;
        }
        // Get the tbody element
        Elements children = table.first().children();
        // Get the collection of tr elements
        Elements tr = children.get(0).getElementsByTag("tr");
        // Walk the tr elements from the fourth row on and read their td cells
        for (int i = 3; i < tr.size(); i++) {
            Element e1 = tr.get(i);
            Elements td = e1.getElementsByTag("td");
            if (td.size() < 2) {
                break;
            }
            // The first cell holds the name link, the second cell holds the code link
            String name = td.get(0).getElementsByTag("a").text();
            String code = td.get(1).getElementsByTag("a").text();
            if (CheckUtils.isEmpty(name) || CheckUtils.isEmpty(code)) {
                continue;
            }
            result.add(new PositionEntity(name, Long.parseLong(code)));
        }
        // Pause after each request so the IP does not get banned
        Thread.sleep(10000);
        return result;
    }
}
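The row-parsing step can be checked offline against a hand-written table. The sketch below assumes a much simpler layout than the real page (one header row, then one name cell and one code cell per row); TableParseSketch, the sample HTML, and the sample names and codes are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class TableParseSketch {
    // Hypothetical markup that imitates a name/code table; not copied from the real site.
    static final String HTML =
            "<table bgcolor=\"#C5D5C5\"><tbody>"
            + "<tr><td>Name</td><td>Code</td></tr>"
            + "<tr><td><a href=\"/1\">North</a></td><td><a href=\"/1\">110000</a></td></tr>"
            + "<tr><td><a href=\"/2\">South</a></td><td><a href=\"/2\">120000</a></td></tr>"
            + "</tbody></table>";

    // Returns name -> code for every row whose first two cells both contain a link.
    static Map<String, Long> parseRows(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, Long> result = new LinkedHashMap<>();
        Element table = doc.getElementsByAttributeValue("bgcolor", "#C5D5C5").first();
        if (table == null) {
            return result; // no matching table
        }
        for (Element row : table.getElementsByTag("tr")) {
            Elements cells = row.getElementsByTag("td");
            if (cells.size() < 2) {
                continue;
            }
            String name = cells.get(0).getElementsByTag("a").text();
            String code = cells.get(1).getElementsByTag("a").text();
            if (name.isEmpty() || code.isEmpty()) {
                continue; // header rows have no links and are skipped here
            }
            result.put(name, Long.parseLong(code));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parseRows(HTML)); // {North=110000, South=120000}
    }
}
```

Testing the parser on a fixed string like this keeps the selector logic separate from the HTTP request and the 10-second pause, which makes layout changes on the site easier to diagnose.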
PinyinUtils.java: