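name: erp-curl-workflow
description: "cew: generic curl data-fetching framework for the 欧普V8 ERP: login → query → logout, plus the himalaya email-attachment workflow and a multi-dimension mode reference. Short name: cew."
version: 1.0.0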

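cew: ERP curl data workflow

A generic curl data-fetching framework for the 欧普V8 移动报表 (mobile reports) system at http://182.61.44.242:16888. It covers the login/query/logout trio, HTML parsing, the multi-dimension mode reference, and the himalaya email-attachment workflow.

This is the base framework: the first step in adding any new ERP data dimension is to read this document.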

System information

Item	Value
URL	http://182.61.44.242:16888
Tenant code	st
Account	888
Password	123456
System	欧普V8移动报表 V3.1

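0. Full lifecycle: login → query → logout

This is a hard rule. Every curl access to the ERP must go through all three steps, the same way a user works in the browser. Skipping logout leaves the session behind on the server and puts unnecessary load on it.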

# Step 1: log in
curl -s -c /tmp/erp_cookies.txt \
  -X POST http://182.61.44.242:16888/login \
  -d "tenant_code=st&username=888&password=123456" \
  -o /dev/null -w "%{http_code}"
# Expected: 303

# Step 2: query (may be repeated within the same session)
curl -s -b /tmp/erp_cookies.txt \
  "http://182.61.44.242:16888/retail/summary?...参数..."

# Step 3: log out (releases the session)
curl -s -b /tmp/erp_cookies.txt \
  "http://182.61.44.242:16888/logout" \
  -o /dev/null -w "%{http_code}"
# Expected: 200

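⚠️ Why logout is mandatory

	Browser	curl
User action	Clicks "Log out" after viewing the data	⚠️ Logout is easy to forget
Session management	Logout destroys the session explicitly	No logout means waiting for the timeout (~30 minutes)
Server load	A single session	Zombie sessions accumulate

Verified behavior: after logout the session is invalid immediately (a query returns a 307 redirect to the login page and no data). Logout is a real server-side session teardown, not just a client-side cookie cleanup.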
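The observation above suggests a cheap way for a script to notice a dead session. This is a minimal sketch, assuming that a logged-out or expired session always answers with a 3xx redirect (307 is what was observed) and that a live one answers 200 with a non-trivial body. The names `split_curl_output` and `session_looks_valid` are hypothetical, and the 500-byte threshold is only a heuristic, not something the ERP defines.

```python
def split_curl_output(raw: str) -> tuple:
    r"""Split the output of `curl -s -w '\n%{http_code}' URL` into (body, http_code)."""
    body, _, code = raw.rpartition("\n")
    return body, code.strip()


def session_looks_valid(http_code: str, body: str, min_bytes: int = 500) -> bool:
    """Heuristic: True if the response looks like real data, not a login redirect."""
    code = http_code.strip()
    if code.startswith("3"):   # e.g. 307: redirected to the login page
        return False
    if code != "200":          # any other status is treated as a failure
        return False
    return len(body) >= min_bytes   # assumption: real report pages are not tiny
```

A caller would append `-w '\n%{http_code}'` to the query command, split the output, and log in again (or abort) when the check fails.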

1. Curl login and session management

Python wrappers for the three steps

import subprocess

ERP_URL = "http://182.61.44.242:16888"
COOKIE_FILE = "/tmp/erp_cookies.txt"

def erp_login():
    """Log in; return True on success, False otherwise."""
    out, _, _ = run_cmd(
        f'curl -s -c {COOKIE_FILE} -X POST {ERP_URL}/login '
        f'-d "tenant_code=st&username=888&password=123456" '
        f'-o /dev/null -w "%{{http_code}}"'
    )
    return out.strip() in ('200', '303')

def erp_logout():
    """Log out and release the server-side session."""
    code, _, _ = run_cmd(
        f'curl -s -b {COOKIE_FILE} "{ERP_URL}/logout" '
        f'-o /dev/null -w "%{{http_code}}"'
    )
    return code.strip() == '200'

def erp_query(url_suffix, timeout=15):
    """Run a query; return the response HTML as a string."""
    html, _, _ = run_cmd(
        f'curl -s -b {COOKIE_FILE} "{ERP_URL}{url_suffix}"',
        timeout=timeout
    )
    return html

def run_cmd(cmd, timeout=15):
    r = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=timeout)
    return r.stdout, r.stderr, r.returncode
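One way to make the logout step hard to forget is to wrap the three steps in a context manager. The sketch below is an illustration, not part of the existing erp_fetch.py: `erp_session` and its `run` parameter are hypothetical names, and `run` is injectable only so the flow can be exercised without touching the server.

```python
import subprocess
from contextlib import contextmanager

ERP_URL = "http://182.61.44.242:16888"
COOKIE_FILE = "/tmp/erp_cookies.txt"


def run_cmd(cmd, timeout=15):
    r = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=timeout)
    return r.stdout, r.stderr, r.returncode


@contextmanager
def erp_session(run=run_cmd, cookie_file=COOKIE_FILE):
    """Log in, yield a query function, and always log out afterwards."""
    code, _, _ = run(
        f'curl -s -c {cookie_file} -X POST {ERP_URL}/login '
        f'-d "tenant_code=st&username=888&password=123456" '
        f'-o /dev/null -w "%{{http_code}}"'
    )
    if code.strip() not in ("200", "303"):
        # No session was created, so there is nothing to log out of.
        raise RuntimeError(f"Login failed: HTTP {code.strip()}")

    def query(url_suffix):
        html, _, _ = run(f'curl -s -b {cookie_file} "{ERP_URL}{url_suffix}"')
        return html

    try:
        yield query
    finally:
        # Runs on normal exit and when the with-block raises.
        run(f'curl -s -b {cookie_file} "{ERP_URL}/logout" -o /dev/null')
```

Usage would look like `with erp_session() as query: html = query("/retail/summary?...")`, with any number of queries inside the block sharing one session.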

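⚠️ Do not use the requests library

Use subprocess + curl rather than Python requests:
- Zero dependencies (curl is already on the system)
- The cookie engine (-c/-b) is simple and reliable
- It stays consistent with the existing erp_fetch.py

⚠️ Cookie file conflicts

If several scripts run at the same time (for example cron plus a manual run), sharing /tmp/erp_cookies.txt makes them overwrite each other's cookies. → For long-running or parallel jobs, use a separate cookie file: /tmp/erp_cookies_${TIMESTAMP}.txt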
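A timestamp can still collide when two jobs start within the same second. As an alternative sketch, the standard library's `tempfile.mkstemp` hands out a name that is guaranteed unique and creates the file readable only by the current user. The helper names `make_cookie_file` and `remove_cookie_file` are hypothetical.

```python
import os
import tempfile


def make_cookie_file(prefix="erp_cookies_", directory="/tmp"):
    """Create an empty, uniquely named cookie file and return its path."""
    fd, path = tempfile.mkstemp(prefix=prefix, suffix=".txt", dir=directory)
    os.close(fd)  # curl opens the path itself; only the unique name is needed
    return path


def remove_cookie_file(path):
    """Delete the cookie file after logout; do nothing if it is already gone."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass
```

The returned path is then passed to curl's `-c`/`-b` options in place of the shared file, and removed once the logout step has run.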

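2. Available ERP modules

Path	Purpose	Status
`/retail/summary`	Retail statistics (sales data)	✅ In use
`/retail/...`	Other retail subpages (to be explored)	⚠️ To be discovered
`/inventory`	Inventory statistics	⚠️ To be explored
`/subscription`	Subscription management	⚠️ To be explored
`/logout`	Log out	✅ Verified

How to discover new modules

# After logging in, open the landing page and list the menu links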
curl -s -b /tmp/erp_cookies.txt http://182.61.44.242:16888/retail | \
  grep -oP 'href="[^"]*"' | sort -u
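The same link listing can be done in Python when `grep -P` is not available (it is a GNU grep feature). This sketch uses the standard-library `html.parser`; unlike the grep one-liner it only looks at `<a>` tags, so stylesheet and icon links are left out. `LinkCollector` and `list_links` are hypothetical names.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.add(value)


def list_links(html):
    """Return the unique link targets found in the page, sorted."""
    parser = LinkCollector()
    parser.feed(html)
    return sorted(parser.links)
```

Feeding it the HTML returned by a logged-in request to `/retail` gives the candidate paths to probe next.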

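3. Complete mode parameter reference

/retail/summary uses the mode parameter to select the aggregation dimension. Known modes:

mode	Dimension	Notes
0	Product summary	Aggregated by product code
1	Product + color summary	Product code + color code
2	⚠️ To be discovered	-
3	Salesperson summary	Aggregated by salesperson
4	⚠️ To be discovered	-
5	⚠️ To be discovered	-
6	⚠️ To be discovered	-
7	Store summary	Aggregated by store
8	Store + product summary	Store + product
9	⚠️ To be discovered	-
10	Store + product + color summary	✅ Real-time sales table
11	Store + product + color (with sizes)	Adds size columns compared with mode=10
12	Date summary	Aggregated by date
13	⚠️ To be discovered	-
14	Month summary	Aggregated by month
15	⚠️ To be discovered	-
16	Discount summary	Aggregated by discount
17-23	⚠️ To be discovered	-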
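For scripts it can help to keep the table above as data. The following sketch mirrors the known rows; the English labels are this document's translations, and the upper bound of 23 is taken from the last table row rather than from any confirmed server limit. `KNOWN_MODES`, `unknown_modes` and `describe_mode` are hypothetical names.

```python
# Known /retail/summary modes, copied from the reference table.
KNOWN_MODES = {
    0: "product summary",
    1: "product + color summary",
    3: "salesperson summary",
    7: "store summary",
    8: "store + product summary",
    10: "store + product + color summary",
    11: "store + product + color (with sizes)",
    12: "date summary",
    14: "month summary",
    16: "discount summary",
}


def unknown_modes(max_mode=23):
    """Modes in 0..max_mode that the table does not describe yet."""
    return [m for m in range(max_mode + 1) if m not in KNOWN_MODES]


def describe_mode(mode):
    """Human-readable label for a mode, or a placeholder if undocumented."""
    return KNOWN_MODES.get(mode, "unknown (to be discovered)")
```

`unknown_modes()` yields exactly the list of modes that still need exploring, so the table and any exploration loop cannot drift apart.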

Exploring unknown modes

for mode in 2 4 5 6 9 13 15 17 18 19 20 21 22 23; do
  html=$(curl -s -b /tmp/erp_cookies.txt \
    "http://182.61.44.242:16888/retail/summary?start_date=2026-05-27&end_date=2026-05-27&mode=$mode&is_pos=1")
  cols=$(echo "$html" | grep -oP '(?<=<th>)[^<]+' | head -10 | tr '\n' '|')
  echo "mode=$mode: $cols"
done
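The lookbehind in the loop above only matches a bare `<th>`. If the page ever emits header cells with attributes or nested tags, a slightly more tolerant extraction may be needed. This is a sketch under that assumption; `extract_headers` is a hypothetical helper, and the pattern is written so that `<thead>` is not mistaken for a header cell.

```python
import re

# <th> optionally followed by attributes; "<thead>" does not match.
TH_PATTERN = re.compile(r"<th(?:\s[^>]*)?>(.*?)</th>", re.DOTALL | re.IGNORECASE)


def extract_headers(html, limit=10):
    """Return up to `limit` header texts with tags and sort markers removed."""
    headers = []
    for cell in TH_PATTERN.findall(html):
        text = re.sub(r"<[^>]+>", "", cell)      # drop nested tags
        text = text.replace("↑↓", "").strip()    # drop the sort marker
        if text:
            headers.append(text)
    return headers[:limit]
```

Printing `"|".join(extract_headers(html))` for each mode gives the same kind of one-line summary as the shell loop.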

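Query parameters

Parameter	Example	Description
`start_date`	2026-05-27	Start date
`end_date`	2026-05-27	End date
`mode`	10	Aggregation dimension
`is_pos`	1	Include real-time POS data
`price_type`	CKJJ	Reference purchase price
`spid`	(empty)	Product filter
`cdid`	(empty)	Color filter
`ywyid`	(empty)	Salesperson filter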
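Building the query string by hand invites quoting mistakes once a filter value contains special characters. A small sketch using the standard `urllib.parse.urlencode` follows; `build_summary_url` is a hypothetical helper, and it only knows the parameters listed in the table (any others can be passed as keyword arguments).

```python
from urllib.parse import urlencode


def build_summary_url(mode, start_date, end_date=None, is_pos=1, **filters):
    """Return the /retail/summary path and query string for one request."""
    params = {
        "start_date": start_date,
        "end_date": end_date or start_date,   # single-day query by default
        "spid": filters.pop("spid", ""),      # product filter, empty = all
        "cdid": filters.pop("cdid", ""),      # color filter, empty = all
        "ywyid": filters.pop("ywyid", ""),    # salesperson filter, empty = all
        "mode": mode,
        "is_pos": is_pos,
    }
    params.update(filters)                    # e.g. price_type="CKJJ"
    return "/retail/summary?" + urlencode(params)
```

For example, `build_summary_url(10, "2026-05-27", price_type="CKJJ")` produces a suffix that can be appended to the base URL or passed to a query wrapper.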

4. HTML parsing pattern

ERP data pages use a standard <table> structure with <tr>/<td> rows:

import re

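def parse_erp_table(html):
    """Generic HTML table parser: returns (header, list of row dicts)."""
    rows = re.findall(r'<tr[^>]*>(.*?)</tr>', html, re.DOTALL)
    header = None
    data = []

    for row_html in rows:
        cells = re.findall(r'<t[dh][^>]*>(.*?)</t[dh]>', row_html, re.DOTALL)
        if not cells:
            continue
        # Strip inner tags and &nbsp; entities from every cell
        clean = [re.sub(r'<[^>]+>', '', c).replace('&nbsp;', '').strip() for c in cells]
        if 'item.id' in row_html:
            continue  # JS template row, not data
        if not any(clean):
            continue  # row whose cells are all empty
        # The first row containing one of these column keywords is the header
        if header is None and any(kw in ' '.join(clean) for kw in ['编码', '名称', '数量', '金额']):
            header = [h.replace('↑↓', '').strip() for h in clean]
            continue
        if header:
            # Pad short rows with '' so every dict has all header keys
            d = {header[i]: clean[i] if i < len(clean) else '' for i in range(len(header))}
            data.append(d)

    return header, data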

Header differences between modes

mode=10 (store + product + color): 店铺编码, 店铺名称, 商品编码, 商品名称, 颜色编码, 颜色名称, 数量, 金额, 选择价格金额, 折扣, 均价...
mode=7 (store summary): 店铺编码, 店铺名称, 数量, 金额, 选择价格金额, 折扣, 均价
mode=3 (salesperson summary): 营业员编码, 营业员名称, 数量, 金额, 选择价格金额, 折扣, 均价

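⚠️ Parsing pitfalls

↑↓ sort markers: strip them from headers with .replace('↑↓', '')
Empty rows: <tr> rows whose <td> cells are all empty must be skipped
JS template rows: rows containing item.id are templates and must be skipped
HTML entities: replace &nbsp; with an empty string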
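The four pitfalls above can be folded into two small helpers applied to every row. This is a sketch, assuming the HTML entity in question is `&nbsp;` and that other entities (such as `&amp;`) should be decoded normally; `clean_cell` and `is_data_row` are hypothetical names.

```python
import html as html_lib
import re


def clean_cell(raw):
    """Turn the inner HTML of one <td>/<th> into plain text."""
    text = re.sub(r"<[^>]+>", "", raw)      # drop nested tags
    text = text.replace("&nbsp;", "")       # padding entity carries no data
    text = html_lib.unescape(text)          # decode &amp;, &lt;, ...
    return text.replace("↑↓", "").strip()   # drop the sort marker


def is_data_row(row_html, cells):
    """False for JS template rows and for rows whose cells are all empty."""
    if "item.id" in row_html:
        return False
    return any(cells)
```

A parser would call `clean_cell` on each captured cell and keep the row only when `is_data_row` returns True.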

5. Complete Python fetch template

Every run must include the login → query → logout trio:

#!/usr/bin/env python3
"""Generic ERP data-fetching template, including the full session lifecycle."""
import subprocess, json, re, sys
from datetime import datetime

ERP_URL = "http://182.61.44.242:16888"
COOKIE_FILE = "/tmp/erp_cookies.txt"

def run_cmd(cmd, timeout=15):
    r = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=timeout)
    return r.stdout, r.stderr, r.returncode

def fetch(mode, date_str=None, extra_params=""):
    """Full flow: login → query → logout."""
    if date_str is None:
        date_str = datetime.now().strftime("%Y-%m-%d")

    # Step 1: log in
    code = run_cmd(
        f'curl -s -c {COOKIE_FILE} -X POST {ERP_URL}/login '
        f'-d "tenant_code=st&username=888&password=123456" '
        f'-o /dev/null -w "%{{http_code}}"'
    )[0].strip()
    if code not in ('200', '303'):
        raise RuntimeError(f"Login failed: HTTP {code}")

    try:
        # Step 2: query
        url = (f"/retail/summary?"
               f"start_date={date_str}&end_date={date_str}"
               f"&spid=&cdid=&ztstate=&ywyid="
               f"&mode={mode}&is_pos=1{extra_params}")

        html = run_cmd(f'curl -s -b {COOKIE_FILE} "{ERP_URL}{url}"')[0]

        if not html or len(html) < 500:
            raise ValueError(f"Empty/short response ({len(html)} bytes)")

        # Parse (customize this part as needed)
        rows = re.findall(r'<tr[^>]*>(.*?)</tr>', html, re.DOTALL)
        header = None
        data = []
        for row_html in rows:
            cells = re.findall(r'<t[dh][^>]*>(.*?)</t[dh]>', row_html, re.DOTALL)
            if not cells:
                continue
            clean = [re.sub(r'<[^>]+>', '', c).strip() for c in cells]
            if 'item.id' in row_html:
                continue
            if header is None and any(kw in ' '.join(clean) for kw in ['编码', '数量', '金额']):
                header = [h.replace('↑↓', '').strip() for h in clean]
                continue
            if header and clean[0]:
                d = {header[i]: clean[i] if i < len(clean) else '' for i in range(len(header))}
                data.append(d)

        return {'header': header, 'rows': data, 'date': date_str}

    finally:
        # Step 3: log out (always runs, whatever happened above)
        run_cmd(f'curl -s -b {COOKIE_FILE} "{ERP_URL}/logout" -o /dev/null')

if __name__ == '__main__':
    mode = int(sys.argv[1]) if len(sys.argv) > 1 else 10
    result = fetch(mode)
    print(f"mode={mode}: {len(result['rows'])} rows, header={result['header']}")

Key point: try...finally guarantees that logout runs even when the query fails, so no zombie session is left behind.

6. Himalaya email workflow

Verifying the configuration

himalaya envelope list -a foxmail -s 5  # confirm the connection works

ERP-related email types

Subject	Source	Attachments	Trigger
`【Mobile BI】自动订阅报表 (1份)`	Pushed automatically by 欧普	4 xlsx files	cron 22:30
`【数据中台】零售汇总统计报表`	数据中台 (data platform)	2 xlsx files	Manual

Generic email handling

# 1. List emails
himalaya envelope list -a foxmail -f INBOX -s 20

# 2. Filter
himalaya envelope list -a foxmail -f INBOX -s 50 2>/dev/null | \
  grep -E "Mobile BI|数据中台"

# 3. Read the body (confirm the attachments)
himalaya message read <id>

# 4. Download attachments (⚠️ they land in /tmp/)
cd /tmp && himalaya attachment download <id>
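When the filtering step runs from a Python script, the grep can be replaced by a few lines of standard-library code. The sketch below works on the plain-text listing as captured from the command's stdout. `filter_erp_lines` simply mirrors the grep pattern, while `guess_id` rests on an assumption about the listing layout (that the first number on a line is the envelope ID) which should be checked against real output before relying on it. Both names are hypothetical.

```python
import re

# Same alternation as the grep -E filter.
ERP_SUBJECT = re.compile(r"Mobile BI|数据中台")


def filter_erp_lines(listing):
    """Keep only the listing lines whose text mentions an ERP report subject."""
    return [line for line in listing.splitlines() if ERP_SUBJECT.search(line)]


def guess_id(line):
    """Assumption: the first run of digits on a listing line is the email ID."""
    match = re.search(r"\d+", line)
    return match.group(0) if match else None
```

The IDs found this way would then be passed to the read and download commands one at a time.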

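⚠️ Key pitfalls

Attachment path: always /tmp/; there is no --dir option
Duplicate files: a file whose name already exists gets a _1 suffix
Email IDs: -s N only limits the number of rows listed; IDs may have gaps
-a foxmail is required: otherwise the default account is used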
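Because attachments always land in /tmp/ and duplicates are silently renamed, a script cannot know the final file names in advance. One workaround, sketched here with hypothetical helpers `snapshot` and `new_files`, is to compare the directory listing before and after the download.

```python
import os


def snapshot(directory="/tmp", suffix=".xlsx"):
    """Names of the matching files currently present in the directory."""
    return {name for name in os.listdir(directory) if name.lower().endswith(suffix)}


def new_files(before, directory="/tmp", suffix=".xlsx"):
    """Full paths of matching files that appeared since `before` was taken."""
    after = snapshot(directory, suffix)
    return sorted(os.path.join(directory, name) for name in after - before)
```

The intended order is `before = snapshot()`, then the download command, then `new_files(before)`; a renamed duplicate such as `report_1.xlsx` shows up in the result like any other new file. This assumes no other job writes xlsx files to /tmp at the same moment.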

7. Generic workflow templates

Template A: curl only (real-time data)

erp_login() → erp_query(mode=N) → parse → process → erp_logout()

Template B: email only (batch/historical data)

himalaya list → message read → attachment download → pandas read_excel → import DB

Template C: hybrid

Fetch real-time data with curl + collect historical data with himalaya → merge / cross-validate
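The cross-validation step in Template C could be as simple as comparing per-key totals from the two sources. The sketch below assumes both sources have already been loaded into lists of dicts and that they share the column names 店铺编码 and 数量, which holds for the curl side in mode=7 but is unverified for the xlsx attachments. `total_by_key` and `cross_check` are hypothetical names.

```python
def total_by_key(rows, key, field):
    """Sum `field` per `key`, skipping blank or non-numeric cells."""
    totals = {}
    for row in rows:
        raw = str(row.get(field, "")).replace(",", "").strip()
        try:
            value = float(raw)
        except ValueError:
            continue
        k = row.get(key, "")
        totals[k] = totals.get(k, 0.0) + value
    return totals


def cross_check(curl_rows, mail_rows, key="店铺编码", field="数量", tolerance=0.01):
    """Return {key: (curl_total, mail_total)} for every key whose totals differ."""
    a = total_by_key(curl_rows, key, field)
    b = total_by_key(mail_rows, key, field)
    diffs = {}
    for k in sorted(set(a) | set(b)):
        x, y = a.get(k, 0.0), b.get(k, 0.0)
        if abs(x - y) > tolerance:
            diffs[k] = (x, y)
    return diffs
```

An empty result means the two sources agree within the tolerance; anything else lists the keys worth inspecting by hand.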

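8. Standard process for adding a new dimension

Pick the mode: consult the reference table; explore unknown modes with curl
Write the script: copy the template from section 5 and change the mode
Check the header: run it once to confirm the <th> column names
Write the logic: map the fields for that mode
Choose the output: JSON / DB / HTML / 飞书 (Feishu)
Add email: if an email trigger is needed, hook up himalaya
Schedule it: add a cron job if needed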
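The field-mapping step in the list above might look like the following for mode=7. The source column names are the ones listed for mode=7 in section 4; the English target keys and the number formats handled (thousands separators, a trailing percent sign) are assumptions to adjust after looking at real rows. `FIELD_MAP_MODE7`, `to_number` and `map_row` are hypothetical names.

```python
# Source header (as returned by the page) -> key used downstream.
FIELD_MAP_MODE7 = {
    "店铺编码": "store_code",
    "店铺名称": "store_name",
    "数量": "qty",
    "金额": "amount",
    "选择价格金额": "selected_price_amount",
    "折扣": "discount",
    "均价": "avg_price",
}
TEXT_FIELDS = {"store_code", "store_name"}


def to_number(text):
    """Parse a numeric cell; return None for blanks or unparseable text."""
    cleaned = text.replace(",", "").replace("%", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def map_row(row, field_map=FIELD_MAP_MODE7):
    """Rename the columns of one parsed row and convert numeric cells."""
    out = {}
    for src, dst in field_map.items():
        value = row.get(src, "")
        out[dst] = value if dst in TEXT_FIELDS else to_number(value)
    return out
```

Each new mode then needs only its own mapping dict, while `to_number` and `map_row` stay shared.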

9. Related skills

Skill	Relationship
`erp-retail-report`	Implementation of the mode=10 real-time sales table
`sietadata-pipeline`	Daily himalaya purchase-sales-inventory pipeline
`himalaya`	Email CLI documentation

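10. To explore

[ ] Actual dimensions of modes 2, 4, 5, 6, 9, 13, 15, 17-23
[ ] /inventory module (querying inventory directly with curl)
[ ] /subscription module
[ ] Unknown hidden pages/endpoints
[ ] Whether curl can replace the emailed xlsx files for inventory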