Published: 2021-11-23 11:28:11 | Author: iii
Source: Yisu Cloud (亿速云) | Views: 531

# How to Scrape Bilibili Video Danmaku with Python

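## Preface

On Bilibili (often called "B站"), danmaku, the viewer comments that scroll across the video in real time, are one of the platform's most distinctive features. They make videos more interactive and also carry a large amount of viewer feedback. For data analysts, content creators, and hobbyists, collecting this data helps with analyzing audience sentiment, trending topics, and similar questions. This article walks through scraping the danmaku of a Bilibili video with Python.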

---

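## 1. Preparation

### 1.1 Understanding How Bilibili Danmaku Works
Danmaku are not looked up by the video ID directly. Each part (分P) of a video has its own `cid`, and the danmaku file is keyed by that `cid`. The workflow is therefore: use the video's `bvid` or `aid` to obtain the `cid`, then use the `cid` to fetch the danmaku. The endpoint used in this article returns the danmaku as XML.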

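### 1.2 Installing the Required Python Libraries
```bash
pip install requests beautifulsoup4 lxml
```

The storage, async, and analysis sections later on additionally use `pandas`, `pymysql`, `aiohttp`, `wordcloud`, `jieba`, and `matplotlib`; install them only if you need those sections.

### 1.3 Target Analysis

Using the Bilibili video BV1GJ411x7h7 as an example, the task breaks down into three steps:

1. Get the video's `cid`.
2. Request the danmaku endpoint with that `cid`.
3. Parse and store the danmaku data.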

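## 2. Getting the Video CID

### 2.1 Getting the CID Through the API

Bilibili exposes a public API that returns the list of parts for a video, and each entry includes its `cid`. The request can be built as follows: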

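```python
import requests

def get_cid(bvid):
    """Return the cid of the first part (P1) of the given video."""
    url = "https://api.bilibili.com/x/player/pagelist"
    response = requests.get(url, params={"bvid": bvid}, timeout=10)
    response.raise_for_status()  # HTTP-level failure
    data = response.json()
    if data.get("code") != 0:  # API-level failure is reported in the JSON body
        raise RuntimeError(f"Failed to get CID: {data.get('message')}")
    return data["data"][0]["cid"]

bvid = "BV1GJ411x7h7"
cid = get_cid(bvid)
print(f"CID: {cid}")
```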
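To make the shape of the response concrete, here is a minimal offline sketch. The sample payload is hand-written for illustration (the cid values and part titles are made up, and real responses carry more fields), and `pick_cid` is a hypothetical helper, not part of any library.

```python
# Hand-written sample mimicking the pagelist response shape (values are made up).
SAMPLE_PAGELIST = {
    "code": 0,
    "message": "0",
    "data": [
        {"cid": 1001, "page": 1, "part": "Part one"},
        {"cid": 1002, "page": 2, "part": "Part two"},
    ],
}

def pick_cid(payload, page=1):
    """Return the cid for a 1-based page number, or raise if it is absent."""
    if payload.get("code") != 0:
        raise ValueError(f"API error: {payload.get('message')}")
    for entry in payload.get("data") or []:
        if entry.get("page") == page:
            return entry["cid"]
    raise LookupError(f"page {page} not found")

print(pick_cid(SAMPLE_PAGELIST))          # 1001
print(pick_cid(SAMPLE_PAGELIST, page=2))  # 1002
```

Selecting by the `page` field rather than by list position keeps the lookup correct even if the entries arrive in a different order.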

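### 2.2 Fallback: Extracting the CID from the Page Source

If the API is unavailable, the `cid` can be extracted from the video page itself:

```python
import re
import requests

def get_cid_from_html(bvid):
    """Fallback: pull the first "cid" value out of the page's embedded JSON."""
    url = f"https://www.bilibili.com/video/{bvid}"
    headers = {"User-Agent": "Mozilla/5.0"}  # look like a browser
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    # The page embeds its state as JSON inside <script> tags. Search the raw
    # HTML instead of depending on one specific script variable, and fail
    # clearly if nothing matches (the page layout can change at any time).
    match = re.search(r'"cid":(\d+)', response.text)
    if match is None:
        raise RuntimeError("cid not found in page source")
    return int(match.group(1))
```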

## 3. Fetching the Danmaku Data

### 3.1 The Danmaku Endpoint

The danmaku endpoint is:

```
https://api.bilibili.com/x/v1/dm/list.so?oid={cid}
```

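### 3.2 Requesting and Parsing the XML

```python
import requests
from bs4 import BeautifulSoup

def get_danmaku(cid):
    """Fetch the danmaku XML for a cid and return the comment texts."""
    url = f"https://api.bilibili.com/x/v1/dm/list.so?oid={cid}"
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    # Pass the raw bytes so the parser reads the encoding from the XML header.
    # 'xml' selects lxml's XML parser; plain 'lxml' would parse the body as HTML.
    soup = BeautifulSoup(response.content, 'xml')
    return [d.text for d in soup.find_all('d')]

danmaku_list = get_danmaku(cid)
print(f"Fetched {len(danmaku_list)} danmaku")
```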

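### 3.3 Parsing the Danmaku Attributes

Each danmaku in the XML has the following form:

```xml
<d p="appearance time,mode,font size,color,send timestamp,pool,user hash,database ID">danmaku text</d>
```

The fields in `p` are, in order: the time at which the danmaku appears in the video (in seconds), the danmaku mode (for example scrolling, top, or bottom), the font size, the color as a decimal RGB integer, the Unix timestamp at which it was sent, the danmaku pool, a hash of the sender's user ID, and the database row ID. These attributes can be parsed further: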

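```python
import html
import re
import requests

def parse_danmaku(danmu):
    """Parse raw danmaku XML (anything whose str() is the XML) into dicts."""
    pattern = r'<d p="(.*?)">(.*?)</d>'
    matches = re.findall(pattern, str(danmu), flags=re.S)
    result = []
    for p, content in matches:
        attrs = p.split(',')
        result.append({
            'time': float(attrs[0]),     # seconds into the video
            'type': int(attrs[1]),       # danmaku mode
            'size': int(attrs[2]),       # font size
            'color': int(attrs[3]),      # decimal RGB
            'timestamp': int(attrs[4]),  # Unix time when sent
            'content': html.unescape(content),  # decode &amp; and similar
        })
    return result

# The regex needs the raw XML, not the text-only list from get_danmaku().
url = f"https://api.bilibili.com/x/v1/dm/list.so?oid={cid}"
raw_xml = requests.get(url, timeout=10).content.decode('utf-8')
parsed_danmaku = parse_danmaku(raw_xml)
```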
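As an illustration of what the raw attribute values mean, the following sketch parses a hand-written sample document with the standard library's `xml.etree.ElementTree` instead of a regex and decodes a few fields into a friendlier form. The sample XML, the `MODE_NAMES` table (an assumed subset covering only common modes), and the `decode_danmaku` helper are all made up for this example.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the same shape as the danmaku XML (values are made up).
SAMPLE_XML = """<i>
  <d p="12.5,1,25,16777215,1637638091,0,abcd1234,100001">hello &amp; welcome</d>
  <d p="75.0,5,25,16711680,1637638100,0,ef567890,100002">nice</d>
</i>"""

MODE_NAMES = {1: "scroll", 4: "bottom", 5: "top"}  # assumed subset of modes

def decode_danmaku(xml_text):
    """Turn each <d> element into a dict with human-readable fields."""
    rows = []
    for d in ET.fromstring(xml_text).iter("d"):
        fields = d.get("p").split(",")
        minutes, secs = divmod(int(float(fields[0])), 60)
        rows.append({
            "at": f"{minutes:02d}:{secs:02d}",             # mm:ss in the video
            "mode": MODE_NAMES.get(int(fields[1]), "other"),
            "color": f"#{int(fields[3]):06X}",             # decimal RGB to hex
            "text": d.text or "",                          # entities already decoded
        })
    return rows

for row in decode_danmaku(SAMPLE_XML):
    print(row)
```

An XML parser decodes entities such as `&amp;` on its own, which is one reason to prefer it over a regex when the input is well-formed.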

## 4. Storing the Data

### 4.1 Saving as a TXT File

```python
def save_as_txt(danmaku_list, filename):
    """Write one danmaku per line."""
    with open(filename, 'w', encoding='utf-8') as f:
        for danmu in danmaku_list:
            f.write(danmu + '\n')

save_as_txt(danmaku_list, 'danmaku.txt')
```

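### 4.2 Saving as CSV

```python
import pandas as pd

def save_as_csv(parsed_danmaku, filename):
    """Write the parsed danmaku (a list of dicts) to a CSV file."""
    df = pd.DataFrame(parsed_danmaku)
    # utf-8-sig adds a BOM so that Excel displays Chinese text correctly
    df.to_csv(filename, index=False, encoding='utf-8-sig')

# parsed_danmaku is the list of dicts returned by parse_danmaku(); the
# text-only danmaku_list carries no attributes and cannot be parsed.
save_as_csv(parsed_danmaku, 'danmaku.csv')
```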

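### 4.3 Saving to a Database (MySQL Example)

```python
import pymysql

def save_to_mysql(parsed_danmaku):
    """Insert parsed danmaku into an existing `danmaku` table."""
    conn = pymysql.connect(host='localhost',
                           user='root',
                           password='password',
                           database='bilibili',
                           charset='utf8mb4')  # full Unicode, including emoji
    # Column names are backquoted because several of them (such as
    # `timestamp`) are also MySQL keywords.
    sql = """INSERT INTO danmaku
             (`time`, `type`, `size`, `color`, `timestamp`, `content`)
             VALUES (%s, %s, %s, %s, %s, %s)"""
    rows = [(item['time'], item['type'], item['size'],
             item['color'], item['timestamp'], item['content'])
            for item in parsed_danmaku]
    try:
        with conn.cursor() as cursor:
            cursor.executemany(sql, rows)  # one batched insert
        conn.commit()
    finally:
        conn.close()  # close the connection even if the insert fails
```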
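For trying out the storage step without a MySQL server, the same idea can be sketched with Python's built-in `sqlite3` module. This is a swapped-in substitute, not the MySQL approach itself: the table layout is an assumption inferred from the column names in the INSERT statement, and `save_to_sqlite` is a hypothetical helper.

```python
import sqlite3

def save_to_sqlite(parsed_danmaku, path=":memory:"):
    """Store parsed danmaku in SQLite and return the open connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS danmaku (
               time REAL, type INTEGER, size INTEGER,
               color INTEGER, timestamp INTEGER, content TEXT)"""
    )
    # Named placeholders let each dict be passed as-is.
    conn.executemany(
        "INSERT INTO danmaku VALUES "
        "(:time, :type, :size, :color, :timestamp, :content)",
        parsed_danmaku,
    )
    conn.commit()
    return conn

sample = [
    {"time": 12.5, "type": 1, "size": 25, "color": 16777215,
     "timestamp": 1637638091, "content": "hello"},
    {"time": 75.0, "type": 5, "size": 25, "color": 16711680,
     "timestamp": 1637638100, "content": "nice"},
]
conn = save_to_sqlite(sample)
print(conn.execute("SELECT COUNT(*) FROM danmaku").fetchone()[0])  # 2
conn.close()
```

Passing a file path instead of `":memory:"` persists the data to disk.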

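## 5. Advanced Techniques

### 5.1 Handling Multi-Part Videos

For a video with multiple parts (分P), iterate over the `cid` of every part:

```python
def get_all_cids(bvid):
    """Return the cid of every part of the video, in the order the API lists them."""
    url = "https://api.bilibili.com/x/player/pagelist"
    response = requests.get(url, params={"bvid": bvid}, timeout=10)
    response.raise_for_status()
    return [item['cid'] for item in response.json()['data']]
```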

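### 5.2 Asynchronous Scraping for Better Throughput

`aiohttp` lets several requests run concurrently, which pays off when fetching many `cid`s:

```python
import aiohttp
import asyncio
from bs4 import BeautifulSoup

async def async_get_danmaku(session, cid):
    """Fetch and parse the danmaku of one cid using a shared session."""
    url = f"https://api.bilibili.com/x/v1/dm/list.so?oid={cid}"
    async with session.get(url) as response:
        raw = await response.read()  # bytes; the XML parser picks the encoding
    soup = BeautifulSoup(raw, 'xml')
    return [d.text for d in soup.find_all('d')]

async def fetch_all(cids):
    """Fetch several cids concurrently over a single ClientSession."""
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(async_get_danmaku(session, c) for c in cids))

# Example: results = asyncio.run(fetch_all(get_all_cids(bvid)))
```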
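Unbounded concurrency can overload the server and may get the client blocked, so it is common to cap the number of in-flight requests. The sketch below shows one way to do that with `asyncio.Semaphore`. To stay runnable offline it uses a stand-in coroutine, `fake_fetch`, instead of a real HTTP call; that function, `fetch_limited`, and the limit of 2 are all assumptions made for illustration.

```python
import asyncio

async def fake_fetch(cid):
    """Stand-in for a network request: waits briefly, then returns fake data."""
    await asyncio.sleep(0.01)
    return [f"danmaku for {cid}"]

async def fetch_limited(cids, fetch, limit=2):
    """Run fetch(cid) for every cid with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(cid):
        async with semaphore:  # waits here while `limit` fetches are active
            return await fetch(cid)

    # gather() returns results in the same order as the input cids.
    return await asyncio.gather(*(guarded(c) for c in cids))

results = asyncio.run(fetch_limited([101, 102, 103], fake_fetch))
print(results)
```

In a real scraper, `fetch` would be a coroutine that performs the HTTP request with a shared `aiohttp.ClientSession`.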

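### 5.3 Working Around Anti-Scraping Measures

- Add headers that mimic a browser
- Use a proxy IP
- Add a delay between requests

```python
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'https://www.bilibili.com/'
}
# The API is served over HTTPS, so the proxy must be set for 'https' as well
proxy = 'http://127.0.0.1:1080'
proxies = {'http': proxy, 'https': proxy}
response = requests.get(url, headers=headers, proxies=proxies, timeout=10)  # url as in the earlier examples
```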
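The third point, spacing out requests, can be sketched as a small helper that enforces a minimum gap between calls. `Throttle` is a hypothetical class written for this article, not part of `requests`; the clock and sleep functions are injectable only so that the behavior can be checked without actually waiting.

```python
import time

class Throttle:
    """Block as needed so that calls to wait() are at least min_interval apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call; None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now

throttle = Throttle(min_interval=0.05)
start = time.monotonic()
for _ in range(3):
    throttle.wait()  # a real scraper would issue its request right after this
print(f"elapsed: {time.monotonic() - start:.2f}s")
```

The first call returns immediately; only subsequent calls are delayed, so a single request pays no penalty.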

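## 6. Data Analysis Examples

### 6.1 Generating a Danmaku Word Cloud

```python
from wordcloud import WordCloud
import jieba

text = ' '.join(danmaku_list)
# jieba segments Chinese text into words; WordCloud expects space-separated tokens
wordlist = ' '.join(jieba.cut(text))
# font_path must point to a font with Chinese glyphs (SimHei here), otherwise
# the characters render as empty boxes
wc = WordCloud(font_path='simhei.ttf').generate(wordlist)
wc.to_file('danmaku_cloud.png')
```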
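Before rendering a word cloud it can be useful to look at the raw frequencies. The sketch below counts identical danmaku with `collections.Counter` from the standard library; it does no word segmentation (that is what `jieba` is for), and the sample list and the `top_danmaku` helper are made up for illustration.

```python
from collections import Counter

def top_danmaku(danmaku_list, n=3):
    """Return the n most frequent danmaku as (text, count) pairs."""
    # Strip whitespace and skip empty entries so near-identical lines match.
    cleaned = [d.strip() for d in danmaku_list if d and d.strip()]
    return Counter(cleaned).most_common(n)

sample = ["哈哈哈", "awsl", "哈哈哈 ", "前方高能", "awsl", "哈哈哈", ""]
print(top_danmaku(sample, n=2))  # [('哈哈哈', 3), ('awsl', 2)]
```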

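### 6.2 Danmaku Time Distribution

```python
import matplotlib.pyplot as plt

# parsed_danmaku is the list of dicts returned by parse_danmaku()
times = [d['time'] for d in parsed_danmaku]
plt.hist(times, bins=50)
plt.xlabel('Video Time (s)')
plt.ylabel('Danmaku Count')
plt.show()
```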
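The same distribution can be inspected numerically, for example to find the densest stretch of the video. The sketch below groups appearance times into fixed-width buckets; the 30-second width, the sample times, and the `busiest_bucket` helper are assumptions chosen for illustration.

```python
from collections import Counter

def busiest_bucket(times, width=30):
    """Return (start_second, count) for the bucket with the most danmaku."""
    if not times:
        return None
    # Map each time to the start of its bucket: 44.9 -> 30 when width is 30.
    buckets = Counter(int(t // width) * width for t in times)
    # Highest count wins; ties go to the earliest bucket.
    start, count = max(buckets.items(), key=lambda kv: (kv[1], -kv[0]))
    return start, count

sample_times = [3.2, 12.5, 31.0, 44.9, 45.0, 59.9, 75.0]
print(busiest_bucket(sample_times))  # (30, 4)
```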

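## 7. Notes

- Respect the robots protocol: Bilibili's robots.txt restricts some paths.
- Limit the request rate: frequent requests can get your IP blocked.
- Data usage: for personal study only, not for commercial purposes.
- Copyright and privacy: danmaku are user-generated content, so handle them with privacy in mind.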
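The first point can be checked programmatically with the standard library's `urllib.robotparser`. The rules in the sketch below are invented for illustration and are not Bilibili's actual robots.txt; in practice the live file would be loaded with `set_url()` and `read()` instead of being parsed from a string. `is_allowed` is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

# Invented rules for illustration only; NOT the real robots.txt of any site.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Allow: /
"""

def is_allowed(robots_text, url, user_agent="*"):
    """Check a URL against robots.txt rules supplied as text."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)

print(is_allowed(SAMPLE_ROBOTS, "https://example.com/video/abc"))     # True
print(is_allowed(SAMPLE_ROBOTS, "https://example.com/private/data"))  # False
```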

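## Conclusion

This article covered the complete workflow for scraping danmaku from a Bilibili video:

- Two ways to obtain a video's CID
- Requesting and parsing the danmaku XML
- Several storage options
- Advanced techniques and data analysis examples

Scraping Bilibili danmaku with Python helps in understanding the structure of Bilibili's API and lays the groundwork for later data analysis. Hopefully this tutorial is useful to you!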

Note: all code examples in this article are for learning and reference only; follow Bilibili's rules when using them in practice.
