# How to Use a GitHub User Data Crawler in Python

Published: 2021-10-09 16:19:34 | Author: 柒染 | Source: 亿速云

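## Introduction

In a data-driven era, collecting and analyzing data from open-source platforms is valuable to developers, researchers, and companies alike. GitHub is the world's largest code hosting platform; the article cites more than 100 million developers and 320 million repositories, and that data reflects technology trends, developer behavior patterns, and the state of project ecosystems. This article walks through building a GitHub user data crawler in Python, from basic API usage to more advanced collection techniques, so that readers can obtain the data they need legally and in line with GitHub's rules.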

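## 1. Preparation

### 1.1 GitHub API Overview

GitHub offers two interfaces, a REST API and a GraphQL API:
- REST API: the traditional request/response model, organized into many resource endpoints
- GraphQL API: flexible queries that return exactly the fields you ask for

API rate limits:
- Unauthenticated requests: 60 per hour
- Requests authenticated with a token: 5,000 per hour
- Multiple tokens: the article suggests rotating several tokens. Tokens owned by the same account share a single quota, and GitHub's API terms prohibit sharing tokens to exceed the rate limit, so rotation does not lift the per-account limit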
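
To make the limits above concrete, the following sketch computes how long a client should pause, based on the `X-RateLimit-Remaining` and `X-RateLimit-Reset` headers that GitHub's REST API sends with each response. The function name `seconds_until_reset` and the `min_remaining` threshold are illustrative choices, not part of any library, and the header values in the example are made up.

```python
import time


def seconds_until_reset(headers, now=None, min_remaining=1):
    """Return how many seconds to wait before sending the next request.

    `headers` is a mapping of response headers. GitHub reports the remaining
    quota in X-RateLimit-Remaining and the reset time (UTC epoch seconds)
    in X-RateLimit-Reset.
    """
    # Normalize header names, since HTTP header names are case-insensitive.
    lowered = {k.lower(): v for k, v in headers.items()}
    remaining = int(lowered.get("x-ratelimit-remaining", min_remaining))
    reset_at = int(lowered.get("x-ratelimit-reset", 0))
    now = time.time() if now is None else now

    if remaining >= min_remaining:
        return 0.0  # quota left, no need to wait
    return max(0.0, float(reset_at - now))


# Example with made-up header values: quota exhausted, reset 90 seconds away.
sample = {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1000090"}
print(seconds_until_reset(sample, now=1000000))
```

In a crawler, the result would typically be passed to `time.sleep()` before the next request.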

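### 1.2 Development Environment Setup

Recommended stack:
```python
# Core dependencies
import requests  # HTTP requests to the API
import pandas as pd  # data processing
from datetime import datetime  # timestamp handling
import time  # rate control
import json  # JSON serialization
```

Authentication setup example:

```python
# Store tokens in config.py (placeholder values; never commit real tokens)
GITHUB_TOKENS = ["ghp_abc123", "ghp_def456"]
current_token_idx = 0

def get_token():
    """Return the next token in round-robin order."""
    global current_token_idx
    token = GITHUB_TOKENS[current_token_idx]
    current_token_idx = (current_token_idx + 1) % len(GITHUB_TOKENS)
    return token
```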
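
Hardcoding tokens in a source file makes them easy to leak. As one possible alternative, the sketch below reads them from an environment variable and hands them out in the same round-robin order. The variable name `GITHUB_TOKENS` and the helpers `load_tokens` and `make_token_getter` are assumptions made for this example, and the token values are placeholders.

```python
import itertools
import os


def load_tokens(env_var="GITHUB_TOKENS"):
    """Read a comma-separated token list from the environment."""
    raw = os.environ.get(env_var, "")
    tokens = [t.strip() for t in raw.split(",") if t.strip()]
    if not tokens:
        raise RuntimeError(f"Set {env_var} to one or more comma-separated tokens")
    return tokens


def make_token_getter(tokens):
    """Return a function that hands out tokens in round-robin order."""
    pool = itertools.cycle(tokens)
    return lambda: next(pool)


# Usage with placeholder values (not real tokens).
os.environ["GITHUB_TOKENS"] = "ghp_example1, ghp_example2"
get_next_token = make_token_getter(load_tokens())
print(get_next_token(), get_next_token(), get_next_token())
```

In real use the variable would be set in the shell or a secrets manager, not inside the script.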

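## 2. Basic Data Collection

### 2.1 Fetching Basic User Information

Build a basic request function:

```python
def get_user_info(username):
    """Fetch a user's public profile; return None if the request fails."""
    url = f"https://api.github.com/users/{username}"
    headers = {"Authorization": f"token {get_token()}"}

    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # raise on 4xx/5xx responses
        return response.json()
    except requests.exceptions.RequestException as err:
        # Covers HTTP errors as well as timeouts and connection failures
        print(f"Error fetching {username}: {err}")
        return None
```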

Parsing the key fields:

```python
GITHUB_TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

def _parse_time(value):
    """Convert a GitHub timestamp string to datetime; pass None through."""
    return datetime.strptime(value, GITHUB_TIME_FORMAT) if value else None

def parse_basic_info(raw_data):
    return {
        "login": raw_data.get("login"),
        "id": raw_data.get("id"),
        "name": raw_data.get("name"),
        "company": raw_data.get("company"),
        "location": raw_data.get("location"),
        "public_repos": raw_data.get("public_repos"),
        "followers": raw_data.get("followers"),
        "created_at": _parse_time(raw_data.get("created_at")),
        "updated_at": _parse_time(raw_data.get("updated_at")),
    }
```

### 2.2 Handling Paginated Data

The GitHub API returns results in pages. A standard way to handle this:

```python
def get_all_repos(username, per_page=100):
    url = f"https://api.github.com/users/{username}/repos"
    headers = {"Authorization": f"token {get_token()}"}
    params = {"per_page": per_page, "page": 1}

    all_repos = []
    while True:
        response = requests.get(url, headers=headers, params=params, timeout=10)
        response.raise_for_status()  # an error body is a dict, not a list of repos
        repos = response.json()
        if not repos:
            break
        all_repos.extend(repos)

        # Continue only if the Link header advertises a next page
        if "next" in response.links:
            params["page"] += 1
        else:
            break

        time.sleep(1)  # stay within the rate limit

    return all_repos
```
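
The `response.links` check above relies on the `Link` response header, which `requests` parses for you. To show what that header contains, here is a minimal standard-library sketch that does roughly the same parsing. The helper name `parse_link_header` is hypothetical and the URLs in the example are illustrative; in practice `response.links` is the simpler choice.

```python
import re

# Matches one entry such as: <https://api.github.com/...&page=2>; rel="next"
_LINK_PATTERN = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"')


def parse_link_header(header_value):
    """Map each rel name (next, prev, first, last) to its URL."""
    if not header_value:
        return {}
    return {rel: url for url, rel in _LINK_PATTERN.findall(header_value)}


# Example header in the shape GitHub uses (URL values are illustrative).
example = (
    '<https://api.github.com/user/1/repos?per_page=100&page=2>; rel="next", '
    '<https://api.github.com/user/1/repos?per_page=100&page=5>; rel="last"'
)
links = parse_link_header(example)
print(links.get("next"))
print(links.get("last"))
```

When no `next` entry is present, the current page is the last one, which is the stop condition used in the loop above.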

## 3. Advanced Data Collection Techniques

### 3.1 GraphQL Queries

Example of a more complex query:

```python
query = """
query ($login: String!) {
  user(login: $login) {
    name
    contributionsCollection {
      totalCommitContributions
      totalPullRequestContributions
      totalRepositoriesWithContributedCommits
    }
    repositories(first: 100, orderBy: {field: STARGAZERS, direction: DESC}) {
      nodes {
        name
        stargazers {
          totalCount
        }
      }
    }
  }
}
"""

def graphql_query(username):
    headers = {
        "Authorization": f"Bearer {get_token()}",
        "Content-Type": "application/json"
    }
    variables = {"login": username}
    payload = {"query": query, "variables": variables}

    response = requests.post(
        "https://api.github.com/graphql",
        headers=headers,
        json=payload,
        timeout=10,
    )
    response.raise_for_status()
    # Query-level errors arrive in an "errors" list in the body, so check for it
    return response.json()
```
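
The GraphQL response is nested, so it usually needs flattening before it can go into a table. The sketch below shows one way to do that for the query above. The function name `flatten_user_response` and the output field names are assumptions for illustration, and the sample payload is hand-written with made-up values.

```python
def flatten_user_response(payload):
    """Turn the GraphQL response for the query above into a flat record.

    Raises ValueError if the response carries an "errors" list or no user.
    """
    if payload.get("errors"):
        messages = "; ".join(
            e.get("message", "unknown error") for e in payload["errors"]
        )
        raise ValueError(f"GraphQL query failed: {messages}")

    user = (payload.get("data") or {}).get("user")
    if user is None:
        raise ValueError("No user in response")

    contributions = user.get("contributionsCollection") or {}
    repos = (user.get("repositories") or {}).get("nodes") or []
    return {
        "name": user.get("name"),
        "commits": contributions.get("totalCommitContributions", 0),
        "pull_requests": contributions.get("totalPullRequestContributions", 0),
        "total_stars": sum(r["stargazers"]["totalCount"] for r in repos),
        # The query sorts by stars descending, so the first node is the top repo
        "top_repo": repos[0]["name"] if repos else None,
    }


# Hand-written sample in the shape the query returns (values are made up).
sample = {
    "data": {
        "user": {
            "name": "Example User",
            "contributionsCollection": {
                "totalCommitContributions": 120,
                "totalPullRequestContributions": 8,
                "totalRepositoriesWithContributedCommits": 5,
            },
            "repositories": {
                "nodes": [
                    {"name": "alpha", "stargazers": {"totalCount": 30}},
                    {"name": "beta", "stargazers": {"totalCount": 12}},
                ]
            },
        }
    }
}
print(flatten_user_response(sample))
```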

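### 3.2 Incremental Collection

The `since` parameter of `GET /users` is a user ID cursor: the endpoint returns accounts whose ID is greater than the given value. Saving the last ID you processed lets a later run resume from that point and pick up newly registered accounts:

```python
def get_updated_users(since_user_id):
    """Return one page of users whose ID is greater than since_user_id."""
    url = "https://api.github.com/users"
    params = {"since": since_user_id, "per_page": 100}
    headers = {"Authorization": f"token {get_token()}"}

    response = requests.get(url, headers=headers, params=params, timeout=10)
    response.raise_for_status()
    return response.json()
```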
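
A resumable crawl built on this cursor can be sketched as follows. The page-fetching function is passed in as a parameter so the loop can be tried offline; `crawl_user_ids` and `fake_fetch` are hypothetical names, and `fake_fetch` only imitates the API with generated data.

```python
def crawl_user_ids(fetch_page, start_id=0, max_pages=3):
    """Walk the user list page by page, resuming from `start_id`.

    `fetch_page(since_id)` must return a list of dicts that each have an "id"
    key, for example the get_updated_users function above.
    Returns (collected_users, last_seen_id).
    """
    collected = []
    last_id = start_id
    for _ in range(max_pages):
        page = fetch_page(last_id)
        if not page:
            break  # no accounts beyond last_id
        collected.extend(page)
        last_id = max(user["id"] for user in page)  # cursor for the next call
    return collected, last_id


# Offline stand-in for the API: pretends user IDs 1..250 exist, 100 per page.
def fake_fetch(since_id, total=250, per_page=100):
    first = since_id + 1
    last = min(since_id + per_page, total)
    return [{"id": i, "login": f"user{i}"} for i in range(first, last + 1)]


users, cursor = crawl_user_ids(fake_fetch, start_id=0, max_pages=5)
print(len(users), cursor)
```

Persisting `cursor` between runs (in a file or database) is what makes the collection incremental.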

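## 4. Data Storage and Analysis

### 4.1 Storage Options

Examples for several storage formats:

```python
import os
import sqlite3

os.makedirs("data", exist_ok=True)  # the functions below write into data/

# JSON storage
def save_to_json(data, filename):
    with open(f"data/{filename}.json", "w", encoding="utf-8") as f:
        # default=str serializes datetime objects as strings
        json.dump(data, f, indent=2, default=str, ensure_ascii=False)

# CSV storage
def save_to_csv(dataframe, filename):
    dataframe.to_csv(f"data/{filename}.csv", index=False)

# Database storage (SQLite example)
def save_to_db(data, table_name):
    conn = sqlite3.connect("github_data.db")
    try:
        df = pd.DataFrame(data)
        df.to_sql(table_name, conn, if_exists="append", index=False)
    finally:
        conn.close()  # close the connection even if the write fails
```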
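
Because `if_exists="append"` adds a new row each time the same user is saved again, repeated crawls can produce duplicates. One possible alternative, shown here with only the standard library, is to key the table on the GitHub user ID and use SQLite's `INSERT OR REPLACE`. The table layout and the function name `upsert_users` are assumptions made for this sketch.

```python
import json
import sqlite3


def upsert_users(conn, users):
    """Insert or replace user rows keyed by GitHub user id."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS users (
               id INTEGER PRIMARY KEY,
               login TEXT NOT NULL,
               followers INTEGER,
               raw_json TEXT
           )"""
    )
    rows = [
        # Keep the full record as JSON text so nested fields are not lost
        (u["id"], u["login"], u.get("followers"), json.dumps(u, default=str))
        for u in users
    ]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO users (id, login, followers, raw_json) "
            "VALUES (?, ?, ?, ?)",
            rows,
        )


conn = sqlite3.connect(":memory:")  # in-memory database for the demo
upsert_users(conn, [{"id": 1, "login": "alice", "followers": 10}])
upsert_users(conn, [{"id": 1, "login": "alice", "followers": 12}])  # re-crawl
print(conn.execute("SELECT id, login, followers FROM users").fetchall())
```

After the second call the table still holds a single row for user 1, now carrying the newer follower count.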

### 4.2 Basic Data Analysis

Analyzing the data with pandas:

```python
def analyze_user_data():
    df = pd.read_json("data/users.json")

    # Basic statistics
    print(f"Total users: {len(df)}")
    print(f"Average followers: {df['followers'].mean():.1f}")

    # Time series: number of accounts created per year
    df['created_year'] = pd.to_datetime(df['created_at']).dt.year
    yearly_users = df.groupby('created_year').size()
    yearly_users.plot(kind='bar', title='Yearly User Growth')  # needs matplotlib

    # Company distribution
    company_dist = df['company'].value_counts().head(10)
    print("\nTop 10 Companies:")
    print(company_dist)
```

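## 5. Compliance and Optimization

### 5.1 Following GitHub's Policies

Key compliance requirements:
- Every API request must include a valid `User-Agent` header
- Stay strictly within the rate limit (5,000 requests per hour per authenticated account)
- Do not automatically crawl data you are not permitted to collect
- Cache responses for at least one hour instead of re-requesting unchanged data

Example of compliant request headers:

```python
headers = {
    "Authorization": f"token {get_token()}",
    "User-Agent": "ResearchBot/1.0 (https://yourdomain.com)",
    "Accept": "application/vnd.github.v3+json"
}
```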

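### 5.2 Performance Optimization Tips

Parallel processing (within the rate limit):

```python
from concurrent.futures import ThreadPoolExecutor

def batch_get_users(usernames, max_workers=3):
    # A small worker pool keeps the number of concurrent requests low
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(get_user_info, usernames))
    return [r for r in results if r is not None]
```

Caching:

```python
from diskcache import Cache

cache = Cache("api_cache")

@cache.memoize(expire=3600)  # cache for one hour
def cached_api_call(url):
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # avoid caching error responses
    return response.json()
```

Graceful degradation:

```python
def resilient_request(url, retries=3):
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response.json()
        except (requests.exceptions.RequestException, ValueError) as e:
            # ValueError covers a body that is not valid JSON
            print(f"Attempt {attempt+1} failed: {e}")
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s
    return None
```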
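
A thread pool alone does not cap the request rate, since each worker sends requests as fast as it can. The sketch below adds a simple thread-safe throttle that spaces calls at least `interval` seconds apart. The class name `MinIntervalLimiter` is hypothetical, and this is one possible approach, not a method prescribed by the article.

```python
import threading
import time


class MinIntervalLimiter:
    """Space out calls to wait() by at least `interval` seconds."""

    def __init__(self, interval):
        self.interval = interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        # Reserve the next free time slot under the lock, then sleep outside it
        with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_allowed)
            self._next_allowed = slot + self.interval
        delay = slot - now
        if delay > 0:
            time.sleep(delay)


limiter = MinIntervalLimiter(interval=0.05)  # at most about 20 calls per second
start = time.monotonic()
for _ in range(4):
    limiter.wait()  # in a crawler, call this right before requests.get(...)
elapsed = time.monotonic() - start
print(f"4 calls took {elapsed:.2f}s")
```

Sharing one limiter instance across all worker threads keeps the combined rate bounded regardless of `max_workers`.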

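## 6. Case Studies

### 6.1 Developer Influence Analysis

Building a developer scoring model:

```python
def calculate_influence_score(user):
    score = 0
    score += user['followers'] * 2
    score += user['public_repos'] * 1.5
    # Capped bonus: followers / 1000, at most 10 points (linear, not logarithmic)
    score += min(user['followers'] / 1000, 10)

    # Add contribution factors when available
    if 'contributions' in user:
        score += user['contributions']['commit'] * 0.1
        score += user['contributions']['pr'] * 0.5

    return round(score, 2)
```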
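
Because the score above grows linearly with the follower count, a few very popular accounts can dominate any ranking. One possible alternative, not taken from the original article, is to compress the counts with a logarithm. The function name `log_scaled_score` and both weights are arbitrary choices for illustration.

```python
import math


def log_scaled_score(user, follower_weight=10.0, repo_weight=5.0):
    """Score that grows with the logarithm of followers and repositories.

    log1p(x) computes log(1 + x), so a count of 0 contributes 0.
    """
    score = follower_weight * math.log1p(user["followers"])
    score += repo_weight * math.log1p(user["public_repos"])
    return round(score, 2)


# Two made-up users whose follower counts differ by a factor of 1000.
small = {"followers": 100, "public_repos": 20}
large = {"followers": 100_000, "public_repos": 20}
print(log_scaled_score(small), log_scaled_score(large))
```

With this variant, a thousandfold difference in followers changes the score by a small constant factor instead of a thousandfold.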

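### 6.2 Technology Trend Analysis

Analyzing language popularity:

```python
from collections import Counter

def analyze_language_trends(users):
    # `users` is a list of user dicts with a 'login' key, e.g. the top 1000
    # active users. The original called get_top_active_users(limit=1000),
    # which the article never defines, so the caller supplies the list.
    language_counter = Counter()
    for user in users:
        repos = get_all_repos(user['login'])  # defined in section 2.2
        for repo in repos:
            if repo['language']:
                language_counter[repo['language']] += 1

    return language_counter.most_common(10)
```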

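## Conclusion

This article covered GitHub data collection techniques from basic to advanced. A few points are worth keeping in mind:
1. Always comply with GitHub's Terms of Service
2. Design the crawler around the principle of data minimization
3. For complex analyses, consider public datasets such as GH Archive
4. Consider existing open-source libraries such as PyGithub

Compliant data collection and analysis can yield valuable insights while keeping the developer ecosystem healthy. Because the GitHub API keeps evolving, check the official documentation regularly and update your collection strategy accordingly.

## Appendix: Useful Resources
1. GitHub official API documentation
2. PyGithub documentation
3. GitHub data science tools
4. GraphQL learning resources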
