extractor

feishu.attachments.extractor ¶

附件内容提取：把图片 / PDF / Office / 文本附件的字节安全地抽取为「可交给多模态模型」的中性内容。

feishu.attachments.extractor.AttachmentExtractor 抽象「字节 -> feishu.attachments.extractor.ExtractedContent」，默认实现 feishu.attachments.extractor.SandboxedAttachmentExtractor 在一个可被 SIGKILL 终止的子进程中运行提取，并施加体积、像素、ZIP（防 zip bomb）、页数、字符等上限，从而即便面对恶意构造的文件也不会拖垮或撑爆进程。抽取结果是中性的（文本 + 图片 + 元信息，措辞为英文），不含任何产品提示词；产品侧再用 feishu.attachments.extractor.to_openai_content_parts 配上自己的分析提示词构造模型输入。

PDF / Office 解析依赖可选库（PyMuPDF、python-docx、openpyxl、python-pptx、Pillow），按需惰性导入；缺失时对应格式优雅降级为「无法提取」而非报错。

ExtractLimits `dataclass` ¶

附件提取的各项上限；传给 feishu.attachments.extractor.SandboxedAttachmentExtractor 以覆盖默认值。

源代码位于： feishu/attachments/extractor.py

Python
@dataclass(frozen=True)
class ExtractLimits:
    r"""附件提取的各项上限；传给 [feishu.attachments.extractor.SandboxedAttachmentExtractor][] 以覆盖默认值。"""

    image_max_pixels: int = IMAGE_MAX_PIXELS
    image_max_bytes: int = 5 * 1024 * 1024
    inline_text_max_chars: int = INLINE_TEXT_MAX_CHARS
    extract_text_max_chars: int = EXTRACT_TEXT_MAX_CHARS
    pdf_extract_max_pages: int = PDF_EXTRACT_MAX_PAGES
    pdf_render_max_pages: int = PDF_RENDER_MAX_PAGES
    pdf_render_zoom: float = PDF_RENDER_ZOOM
    pdf_render_max_pixels: int = PDF_RENDER_MAX_PIXELS
    pdf_render_max_image_bytes: int = PDF_RENDER_MAX_IMAGE_BYTES
    office_zip_max_entries: int = OFFICE_ZIP_MAX_ENTRIES
    office_zip_max_uncompressed_bytes: int = OFFICE_ZIP_MAX_UNCOMPRESSED_BYTES
    docx_max_tables: int = DOCX_MAX_TABLES
    docx_max_table_rows: int = DOCX_MAX_TABLE_ROWS
    xlsx_max_sheets: int = XLSX_MAX_SHEETS
    xlsx_max_rows_per_sheet: int = XLSX_MAX_ROWS_PER_SHEET
    pptx_max_slides: int = PPTX_MAX_SLIDES

ExtractedImage `dataclass` ¶

一张可交给多模态模型的图片：原始字节、媒体类型与（若可得）像素尺寸。

源代码位于： feishu/attachments/extractor.py

Python
@dataclass
class ExtractedImage:
    r"""一张可交给多模态模型的图片：原始字节、媒体类型与（若可得）像素尺寸。"""

    data: bytes
    media_type: str
    width: int | None = None
    height: int | None = None

ExtractedContent `dataclass` ¶

附件提取结果：分类、媒体类型、抽取文本、图片与元信息。

note 为中性英文说明（如体积/像素超限、提取超时），不含任何产品提示词。

示例：

Python Console Session
>>> ExtractedContent(kind="text", media_type="text/plain", text="hi").kind
'text'

源代码位于： feishu/attachments/extractor.py

Python
@dataclass
class ExtractedContent:
    r"""
    附件提取结果：分类、媒体类型、抽取文本、图片与元信息。

    `note` 为中性英文说明（如体积/像素超限、提取超时），不含任何产品提示词。

    Examples:
        >>> ExtractedContent(kind="text", media_type="text/plain", text="hi").kind
        'text'
    """

    kind: str
    media_type: str
    text: str | None = None
    images: list[ExtractedImage] = field(default_factory=list)
    total_pages: int | None = None
    truncated: bool = False
    note: str | None = None
    size_bytes: int = 0

AttachmentExtractor ¶

Bases: Protocol

附件提取协议：把附件字节抽取为 feishu.attachments.extractor.ExtractedContent。

内置实现为 feishu.attachments.extractor.SandboxedAttachmentExtractor。该协议标注了 runtime_checkable。

源代码位于： feishu/attachments/extractor.py

Python
@runtime_checkable
class AttachmentExtractor(Protocol):
    r"""
    附件提取协议：把附件字节抽取为 [feishu.attachments.extractor.ExtractedContent][]。

    内置实现为 [feishu.attachments.extractor.SandboxedAttachmentExtractor][]。该协议标注了 `runtime_checkable`。
    """

    async def extract(
        self, data: bytes, *, file_metadata: Mapping[str, Any], media_type: str | None = None
    ) -> ExtractedContent:
        r"""抽取附件内容；`media_type` 省略时按魔数 / 文件名推断。"""
        ...

extract `async` ¶

Python

extract(data: bytes, *, file_metadata: Mapping[str, Any], media_type: str | None = None) -> ExtractedContent

抽取附件内容；media_type 省略时按魔数 / 文件名推断。

源代码位于： feishu/attachments/extractor.py

Python
async def extract(
    self, data: bytes, *, file_metadata: Mapping[str, Any], media_type: str | None = None
) -> ExtractedContent:
    r"""抽取附件内容；`media_type` 省略时按魔数 / 文件名推断。"""
    ...

SandboxedAttachmentExtractor ¶

在可被 SIGKILL 终止的子进程中运行附件提取的默认 feishu.attachments.extractor.AttachmentExtractor。

超过 max_bytes 直接拒绝；其余在子进程中按 feishu.attachments.extractor.ExtractLimits 提取，并以 timeout_seconds 硬超时（子进程会被 terminate/kill），以信号量限制并发，从而抵御 zip bomb / 像素炸弹 / 解析挂死等拒绝服务向量。任何异常都降级为带 note 的 unknown 结果，绝不抛出。

参数：

名称	类型	描述	默认
`timeout_seconds` ¶	`float`	单个附件的提取硬超时秒数。默认为 `20`。	`20.0`
`max_bytes` ¶	`int`	允许提取的最大字节数。默认为 `16 MiB`。	`16 * 1024 * 1024`
`max_concurrency` ¶	`int`	同时进行的提取数上限。默认为 `2`。	`2`
`limits` ¶	`ExtractLimits \| None`	各项提取上限 feishu.attachments.extractor.ExtractLimits。	`None`

示例：

Python Console Session
>>> isinstance(SandboxedAttachmentExtractor(), AttachmentExtractor)
True

源代码位于： feishu/attachments/extractor.py

Python
class SandboxedAttachmentExtractor:
    r"""
    在可被 SIGKILL 终止的子进程中运行附件提取的默认 [feishu.attachments.extractor.AttachmentExtractor][]。

    超过 `max_bytes` 直接拒绝；其余在子进程中按 [feishu.attachments.extractor.ExtractLimits][] 提取，并以
    `timeout_seconds` 硬超时（子进程会被 terminate/kill），以信号量限制并发，从而抵御 zip bomb / 像素炸弹 /
    解析挂死等拒绝服务向量。任何异常都降级为带 `note` 的 `unknown` 结果，绝不抛出。

    Args:
        timeout_seconds: 单个附件的提取硬超时秒数。默认为 `20`。
        max_bytes: 允许提取的最大字节数。默认为 `16 MiB`。
        max_concurrency: 同时进行的提取数上限。默认为 `2`。
        limits: 各项提取上限 [feishu.attachments.extractor.ExtractLimits][]。

    Examples:
        >>> isinstance(SandboxedAttachmentExtractor(), AttachmentExtractor)
        True
    """

    def __init__(
        self,
        *,
        timeout_seconds: float = 20.0,
        max_bytes: int = 16 * 1024 * 1024,
        max_concurrency: int = 2,
        limits: ExtractLimits | None = None,
    ) -> None:
        self.timeout_seconds = timeout_seconds
        self.max_bytes = max_bytes
        self.limits = limits or ExtractLimits()
        self._semaphore = asyncio.Semaphore(max_concurrency)

    async def extract(
        self, data: bytes, *, file_metadata: Mapping[str, Any], media_type: str | None = None
    ) -> ExtractedContent:
        resolved = media_type or detect_media_type(data) or media_type_from_metadata(dict(file_metadata))
        if len(data) > self.max_bytes:
            return ExtractedContent(
                kind="unknown",
                media_type=resolved,
                size_bytes=len(data),
                note=f"attachment too large ({len(data)} bytes > {self.max_bytes} limit)",
            )
        name = _attachment_name(dict(file_metadata))
        async with self._semaphore:
            try:
                return await asyncio.wait_for(
                    asyncio.to_thread(_run_killable_extract, data, resolved, name, self.limits, self.timeout_seconds),
                    timeout=self.timeout_seconds + 10,  # backstop strictly exceeds the inner kill-and-cleanup budget
                )
            except asyncio.TimeoutError:
                return ExtractedContent(
                    kind="unknown", media_type=resolved, size_bytes=len(data), note="extraction timed out"
                )
            except Exception as exc:  # noqa: BLE001 — extraction must never raise into the caller
                return ExtractedContent(
                    kind="unknown",
                    media_type=resolved,
                    size_bytes=len(data),
                    note=f"extraction failed: {type(exc).__name__}",
                )

extract_content ¶

Python

extract_content(data: bytes, *, media_type: str, name: str | None, limits: ExtractLimits) -> ExtractedContent

同步抽取附件内容（在沙箱子进程内运行）；按媒体类型分派到图片 / PDF / 文本 / Office 提取。

源代码位于： feishu/attachments/extractor.py

Python
def extract_content(data: bytes, *, media_type: str, name: str | None, limits: ExtractLimits) -> ExtractedContent:
    r"""同步抽取附件内容（在沙箱子进程内运行）；按媒体类型分派到图片 / PDF / 文本 / Office 提取。"""
    size = len(data)
    if media_type in _IMAGE_MIME_TYPES:
        if len(data) > limits.image_max_bytes:
            return ExtractedContent(
                kind="image", media_type=media_type, size_bytes=size, note="image exceeds the byte limit; omitted"
            )
        dims = image_dimensions(data, media_type) or _pil_dimensions(data, limits)
        if dims is None:  # fail closed: never forward an image whose dimensions we could not verify
            return ExtractedContent(
                kind="image",
                media_type=media_type,
                size_bytes=size,
                note="image dimensions could not be verified; omitted",
            )
        width, height = dims
        if width <= 0 or height <= 0 or width * height > limits.image_max_pixels:
            return ExtractedContent(
                kind="image", media_type=media_type, size_bytes=size, note="image exceeds the pixel limit; omitted"
            )
        return ExtractedContent(
            kind="image",
            media_type=media_type,
            size_bytes=size,
            images=[ExtractedImage(data, media_type, width, height)],
        )

    if _is_pdf(media_type, name):
        text, images, total_pages, truncated = extract_pdf_content(data, limits)
        return ExtractedContent(
            kind="pdf",
            media_type=media_type,
            size_bytes=size,
            text=text,
            total_pages=total_pages,
            truncated=truncated,
            images=[ExtractedImage(image, "image/png") for image in images],
            note=None if (text or images) else "no text or page images could be extracted",
        )

    text, kind, truncated = extract_text_like(data, media_type=media_type, name=name, limits=limits)
    if text is not None:
        return ExtractedContent(kind=kind, media_type=media_type, size_bytes=size, text=text, truncated=truncated)
    return ExtractedContent(
        kind="unknown",
        media_type=media_type,
        size_bytes=size,
        note="unsupported binary format with no extractable text",
    )

detect_media_type ¶

Python

detect_media_type(data: bytes) -> str | None

按魔数（magic bytes）识别常见图片 / PDF 类型；无法识别返回 None。

源代码位于： feishu/attachments/extractor.py

Python
def detect_media_type(data: bytes) -> str | None:
    r"""按魔数（magic bytes）识别常见图片 / PDF 类型；无法识别返回 `None`。"""
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"GIF87a") or data.startswith(b"GIF89a"):
        return "image/gif"
    if data.startswith(b"%PDF"):
        return "application/pdf"
    if len(data) >= 12 and data[0:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    return None

media_type_from_metadata ¶

Python

media_type_from_metadata(file_metadata: Mapping[str, Any]) -> str

从飞书附件元信息（mime_type / 文件名）推断媒体类型；未知时返回 application/octet-stream。

源代码位于： feishu/attachments/extractor.py

Python
def media_type_from_metadata(file_metadata: Mapping[str, Any]) -> str:
    r"""从飞书附件元信息（`mime_type` / 文件名）推断媒体类型；未知时返回 `application/octet-stream`。"""
    mime_type = file_metadata.get("mime_type")
    if isinstance(mime_type, str) and "/" in mime_type:
        return mime_type
    name = _attachment_name(file_metadata)
    if name:
        guessed, _ = mimetypes.guess_type(name)
        if guessed:
            return guessed
    return "application/octet-stream"

image_dimensions ¶

Python

image_dimensions(data: bytes, media_type: str) -> tuple[int, int] | None

仅从文件头解析图片像素尺寸（不整图解码），用于像素炸弹防护。

源代码位于： feishu/attachments/extractor.py

Python
def image_dimensions(data: bytes, media_type: str) -> tuple[int, int] | None:
    r"""仅从文件头解析图片像素尺寸（不整图解码），用于像素炸弹防护。"""
    if media_type == "image/png" and len(data) >= 24 and data.startswith(b"\x89PNG\r\n\x1a\n"):
        return int.from_bytes(data[16:20], "big"), int.from_bytes(data[20:24], "big")
    if media_type == "image/gif" and len(data) >= 10 and (data.startswith(b"GIF87a") or data.startswith(b"GIF89a")):
        return int.from_bytes(data[6:8], "little"), int.from_bytes(data[8:10], "little")
    if media_type == "image/jpeg" and data.startswith(b"\xff\xd8"):
        return _jpeg_dimensions(data)
    if media_type == "image/webp" and len(data) >= 30 and data[0:4] == b"RIFF" and data[8:12] == b"WEBP":
        return _webp_dimensions(data)
    return None

to_openai_content_parts ¶

Python

to_openai_content_parts(content: ExtractedContent, *, prompt: str | None = None, text_label: str = 'Attachment text') -> list[dict[str, Any]]

把 feishu.attachments.extractor.ExtractedContent 转为 OpenAI Chat Completions 的多模态 content 列表。

prompt（产品提供的分析提示词）作为首个文本块；随后是中性的元信息、抽取文本与图片（data URL）。SDK 不内置任何分析提示词，措辞由调用方决定。

参数：

名称	类型	描述	默认
`content` ¶	`ExtractedContent`	提取结果。	必需
`prompt` ¶	`str \| None`	置于最前的分析提示词（产品文案）。	`None`
`text_label` ¶	`str`	抽取文本块的标签。默认为 `"Attachment text"`。	`'Attachment text'`

返回：

类型	描述
`list[dict[str, Any]]`	可作为 `messages[].content` 传给 OpenAI 兼容多模态接口的部件列表。

源代码位于： feishu/attachments/extractor.py

Python
def to_openai_content_parts(
    content: ExtractedContent, *, prompt: str | None = None, text_label: str = "Attachment text"
) -> list[dict[str, Any]]:
    r"""
    把 [feishu.attachments.extractor.ExtractedContent][] 转为 OpenAI Chat Completions 的多模态 `content` 列表。

    `prompt`（产品提供的分析提示词）作为首个文本块；随后是中性的元信息、抽取文本与图片（data URL）。SDK 不
    内置任何分析提示词，措辞由调用方决定。

    Args:
        content: 提取结果。
        prompt: 置于最前的分析提示词（产品文案）。
        text_label: 抽取文本块的标签。默认为 `"Attachment text"`。

    Returns:
        可作为 `messages[].content` 传给 OpenAI 兼容多模态接口的部件列表。
    """
    parts: list[dict[str, Any]] = []
    if prompt:
        parts.append({"type": "text", "text": prompt})
    header = f"kind={content.kind}; media_type={content.media_type}; size_bytes={content.size_bytes}"
    if content.total_pages is not None:
        header += f"; total_pages={content.total_pages}"
    if content.truncated:
        header += "; text_truncated=true"
    if content.note:
        header += f"; note={content.note}"
    parts.append({"type": "text", "text": header})
    if content.text:
        parts.append({"type": "text", "text": f"{text_label}:\n{content.text}"})
    for image in content.images:
        parts.append({"type": "image_url", "image_url": {"url": _data_url(image.data, image.media_type)}})
    return parts

extractor

feishu.attachments.extractor ¶

ExtractLimits `dataclass` ¶

ExtractedImage `dataclass` ¶

ExtractedContent `dataclass` ¶

AttachmentExtractor ¶

extract `async` ¶

SandboxedAttachmentExtractor ¶

`timeout_seconds` ¶

`max_bytes` ¶

`max_concurrency` ¶

`limits` ¶

extract_content ¶

detect_media_type ¶

media_type_from_metadata ¶

image_dimensions ¶

to_openai_content_parts ¶

`content` ¶

`prompt` ¶

`text_label` ¶

extractor

feishu.attachments.extractor ¶

ExtractLimits dataclass ¶

ExtractedImage dataclass ¶

ExtractedContent dataclass ¶

AttachmentExtractor ¶

extract async ¶

SandboxedAttachmentExtractor ¶

timeout_seconds ¶

max_bytes ¶

max_concurrency ¶

limits ¶

extract_content ¶

detect_media_type ¶

media_type_from_metadata ¶

image_dimensions ¶

to_openai_content_parts ¶

content ¶

prompt ¶

text_label ¶

ExtractLimits `dataclass` ¶

ExtractedImage `dataclass` ¶

ExtractedContent `dataclass` ¶

extract `async` ¶

`timeout_seconds` ¶

`max_bytes` ¶

`max_concurrency` ¶

`limits` ¶

`content` ¶

`prompt` ¶

`text_label` ¶