跳转至

attachments

feishu.attachments

AttachmentExtractor

Bases: Protocol

附件提取协议:把附件字节抽取为 feishu.attachments.extractor.ExtractedContent

内置实现为 feishu.attachments.extractor.SandboxedAttachmentExtractor。该协议标注了 runtime_checkable

源代码位于: feishu/attachments/extractor.py
Python
@runtime_checkable
class AttachmentExtractor(Protocol):
    r"""
    附件提取协议:把附件字节抽取为 [feishu.attachments.extractor.ExtractedContent][]。

    内置实现为 [feishu.attachments.extractor.SandboxedAttachmentExtractor][]。该协议标注了 `runtime_checkable`。
    """

    async def extract(
        self, data: bytes, *, file_metadata: Mapping[str, Any], media_type: str | None = None
    ) -> ExtractedContent:
        r"""抽取附件内容;`media_type` 省略时按魔数 / 文件名推断。"""
        ...

extract async

Python
extract(data: bytes, *, file_metadata: Mapping[str, Any], media_type: str | None = None) -> ExtractedContent

抽取附件内容;media_type 省略时按魔数 / 文件名推断。

源代码位于: feishu/attachments/extractor.py
Python
async def extract(
    self, data: bytes, *, file_metadata: Mapping[str, Any], media_type: str | None = None
) -> ExtractedContent:
    r"""抽取附件内容;`media_type` 省略时按魔数 / 文件名推断。"""
    ...

ExtractedContent dataclass

附件提取结果:分类、媒体类型、抽取文本、图片与元信息。

note 为中性英文说明(如体积/像素超限、提取超时),不含任何产品提示词。

示例:

Python Console Session
>>> ExtractedContent(kind="text", media_type="text/plain", text="hi").kind
'text'
源代码位于: feishu/attachments/extractor.py
Python
@dataclass
class ExtractedContent:
    r"""
    附件提取结果:分类、媒体类型、抽取文本、图片与元信息。

    `note` 为中性英文说明(如体积/像素超限、提取超时),不含任何产品提示词。

    Examples:
        >>> ExtractedContent(kind="text", media_type="text/plain", text="hi").kind
        'text'
    """

    kind: str
    media_type: str
    text: str | None = None
    images: list[ExtractedImage] = field(default_factory=list)
    total_pages: int | None = None
    truncated: bool = False
    note: str | None = None
    size_bytes: int = 0

ExtractedImage dataclass

一张可交给多模态模型的图片:原始字节、媒体类型与(若可得)像素尺寸。

源代码位于: feishu/attachments/extractor.py
Python
@dataclass
class ExtractedImage:
    r"""一张可交给多模态模型的图片:原始字节、媒体类型与(若可得)像素尺寸。"""

    data: bytes
    media_type: str
    width: int | None = None
    height: int | None = None

ExtractLimits dataclass

附件提取的各项上限;传给 feishu.attachments.extractor.SandboxedAttachmentExtractor 以覆盖默认值。

源代码位于: feishu/attachments/extractor.py
Python
@dataclass(frozen=True)
class ExtractLimits:
    r"""附件提取的各项上限;传给 [feishu.attachments.extractor.SandboxedAttachmentExtractor][] 以覆盖默认值。"""

    image_max_pixels: int = IMAGE_MAX_PIXELS
    image_max_bytes: int = 5 * 1024 * 1024
    inline_text_max_chars: int = INLINE_TEXT_MAX_CHARS
    extract_text_max_chars: int = EXTRACT_TEXT_MAX_CHARS
    pdf_extract_max_pages: int = PDF_EXTRACT_MAX_PAGES
    pdf_render_max_pages: int = PDF_RENDER_MAX_PAGES
    pdf_render_zoom: float = PDF_RENDER_ZOOM
    pdf_render_max_pixels: int = PDF_RENDER_MAX_PIXELS
    pdf_render_max_image_bytes: int = PDF_RENDER_MAX_IMAGE_BYTES
    office_zip_max_entries: int = OFFICE_ZIP_MAX_ENTRIES
    office_zip_max_uncompressed_bytes: int = OFFICE_ZIP_MAX_UNCOMPRESSED_BYTES
    docx_max_tables: int = DOCX_MAX_TABLES
    docx_max_table_rows: int = DOCX_MAX_TABLE_ROWS
    xlsx_max_sheets: int = XLSX_MAX_SHEETS
    xlsx_max_rows_per_sheet: int = XLSX_MAX_ROWS_PER_SHEET
    pptx_max_slides: int = PPTX_MAX_SLIDES

SandboxedAttachmentExtractor

在可被 SIGKILL 终止的子进程中运行附件提取的默认 feishu.attachments.extractor.AttachmentExtractor

超过 max_bytes 直接拒绝;其余在子进程中按 feishu.attachments.extractor.ExtractLimits 提取,并以 timeout_seconds 硬超时(子进程会被 terminate/kill),以信号量限制并发,从而抵御 zip bomb / 像素炸弹 / 解析挂死等拒绝服务向量。任何异常都降级为带 noteunknown 结果,绝不抛出。

参数:

名称 类型 描述 默认

timeout_seconds

float

单个附件的提取硬超时秒数。默认为 20

20.0

max_bytes

int

允许提取的最大字节数。默认为 16 MiB

16 * 1024 * 1024

max_concurrency

int

同时进行的提取数上限。默认为 2

2

limits

ExtractLimits | None None

示例:

Python Console Session
>>> isinstance(SandboxedAttachmentExtractor(), AttachmentExtractor)
True
源代码位于: feishu/attachments/extractor.py
Python
class SandboxedAttachmentExtractor:
    r"""
    在可被 SIGKILL 终止的子进程中运行附件提取的默认 [feishu.attachments.extractor.AttachmentExtractor][]。

    超过 `max_bytes` 直接拒绝;其余在子进程中按 [feishu.attachments.extractor.ExtractLimits][] 提取,并以
    `timeout_seconds` 硬超时(子进程会被 terminate/kill),以信号量限制并发,从而抵御 zip bomb / 像素炸弹 /
    解析挂死等拒绝服务向量。任何异常都降级为带 `note` 的 `unknown` 结果,绝不抛出。

    Args:
        timeout_seconds: 单个附件的提取硬超时秒数。默认为 `20`。
        max_bytes: 允许提取的最大字节数。默认为 `16 MiB`。
        max_concurrency: 同时进行的提取数上限。默认为 `2`。
        limits: 各项提取上限 [feishu.attachments.extractor.ExtractLimits][]。

    Examples:
        >>> isinstance(SandboxedAttachmentExtractor(), AttachmentExtractor)
        True
    """

    def __init__(
        self,
        *,
        timeout_seconds: float = 20.0,
        max_bytes: int = 16 * 1024 * 1024,
        max_concurrency: int = 2,
        limits: ExtractLimits | None = None,
    ) -> None:
        self.timeout_seconds = timeout_seconds
        self.max_bytes = max_bytes
        self.limits = limits or ExtractLimits()
        self._semaphore = asyncio.Semaphore(max_concurrency)

    async def extract(
        self, data: bytes, *, file_metadata: Mapping[str, Any], media_type: str | None = None
    ) -> ExtractedContent:
        resolved = media_type or detect_media_type(data) or media_type_from_metadata(dict(file_metadata))
        if len(data) > self.max_bytes:
            return ExtractedContent(
                kind="unknown",
                media_type=resolved,
                size_bytes=len(data),
                note=f"attachment too large ({len(data)} bytes > {self.max_bytes} limit)",
            )
        name = _attachment_name(dict(file_metadata))
        async with self._semaphore:
            try:
                return await asyncio.wait_for(
                    asyncio.to_thread(_run_killable_extract, data, resolved, name, self.limits, self.timeout_seconds),
                    timeout=self.timeout_seconds + 10,  # backstop strictly exceeds the inner kill-and-cleanup budget
                )
            except asyncio.TimeoutError:
                return ExtractedContent(
                    kind="unknown", media_type=resolved, size_bytes=len(data), note="extraction timed out"
                )
            except Exception as exc:  # noqa: BLE001 — extraction must never raise into the caller
                return ExtractedContent(
                    kind="unknown",
                    media_type=resolved,
                    size_bytes=len(data),
                    note=f"extraction failed: {type(exc).__name__}",
                )

detect_media_type

Python
detect_media_type(data: bytes) -> str | None

按魔数(magic bytes)识别常见图片 / PDF 类型;无法识别返回 None

源代码位于: feishu/attachments/extractor.py
Python
def detect_media_type(data: bytes) -> str | None:
    r"""按魔数(magic bytes)识别常见图片 / PDF 类型;无法识别返回 `None`。"""
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"GIF87a") or data.startswith(b"GIF89a"):
        return "image/gif"
    if data.startswith(b"%PDF"):
        return "application/pdf"
    if len(data) >= 12 and data[0:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    return None

extract_content

Python
extract_content(data: bytes, *, media_type: str, name: str | None, limits: ExtractLimits) -> ExtractedContent

同步抽取附件内容(在沙箱子进程内运行);按媒体类型分派到图片 / PDF / 文本 / Office 提取。

源代码位于: feishu/attachments/extractor.py
Python
def extract_content(data: bytes, *, media_type: str, name: str | None, limits: ExtractLimits) -> ExtractedContent:
    r"""同步抽取附件内容(在沙箱子进程内运行);按媒体类型分派到图片 / PDF / 文本 / Office 提取。"""
    size = len(data)
    if media_type in _IMAGE_MIME_TYPES:
        if len(data) > limits.image_max_bytes:
            return ExtractedContent(
                kind="image", media_type=media_type, size_bytes=size, note="image exceeds the byte limit; omitted"
            )
        dims = image_dimensions(data, media_type) or _pil_dimensions(data, limits)
        if dims is None:  # fail closed: never forward an image whose dimensions we could not verify
            return ExtractedContent(
                kind="image",
                media_type=media_type,
                size_bytes=size,
                note="image dimensions could not be verified; omitted",
            )
        width, height = dims
        if width <= 0 or height <= 0 or width * height > limits.image_max_pixels:
            return ExtractedContent(
                kind="image", media_type=media_type, size_bytes=size, note="image exceeds the pixel limit; omitted"
            )
        return ExtractedContent(
            kind="image",
            media_type=media_type,
            size_bytes=size,
            images=[ExtractedImage(data, media_type, width, height)],
        )

    if _is_pdf(media_type, name):
        text, images, total_pages, truncated = extract_pdf_content(data, limits)
        return ExtractedContent(
            kind="pdf",
            media_type=media_type,
            size_bytes=size,
            text=text,
            total_pages=total_pages,
            truncated=truncated,
            images=[ExtractedImage(image, "image/png") for image in images],
            note=None if (text or images) else "no text or page images could be extracted",
        )

    text, kind, truncated = extract_text_like(data, media_type=media_type, name=name, limits=limits)
    if text is not None:
        return ExtractedContent(kind=kind, media_type=media_type, size_bytes=size, text=text, truncated=truncated)
    return ExtractedContent(
        kind="unknown",
        media_type=media_type,
        size_bytes=size,
        note="unsupported binary format with no extractable text",
    )

image_dimensions

Python
image_dimensions(data: bytes, media_type: str) -> tuple[int, int] | None

仅从文件头解析图片像素尺寸(不整图解码),用于像素炸弹防护。

源代码位于: feishu/attachments/extractor.py
Python
def image_dimensions(data: bytes, media_type: str) -> tuple[int, int] | None:
    r"""仅从文件头解析图片像素尺寸(不整图解码),用于像素炸弹防护。"""
    if media_type == "image/png" and len(data) >= 24 and data.startswith(b"\x89PNG\r\n\x1a\n"):
        return int.from_bytes(data[16:20], "big"), int.from_bytes(data[20:24], "big")
    if media_type == "image/gif" and len(data) >= 10 and (data.startswith(b"GIF87a") or data.startswith(b"GIF89a")):
        return int.from_bytes(data[6:8], "little"), int.from_bytes(data[8:10], "little")
    if media_type == "image/jpeg" and data.startswith(b"\xff\xd8"):
        return _jpeg_dimensions(data)
    if media_type == "image/webp" and len(data) >= 30 and data[0:4] == b"RIFF" and data[8:12] == b"WEBP":
        return _webp_dimensions(data)
    return None

media_type_from_metadata

Python
media_type_from_metadata(file_metadata: Mapping[str, Any]) -> str

从飞书附件元信息(mime_type / 文件名)推断媒体类型;未知时返回 application/octet-stream

源代码位于: feishu/attachments/extractor.py
Python
def media_type_from_metadata(file_metadata: Mapping[str, Any]) -> str:
    r"""从飞书附件元信息(`mime_type` / 文件名)推断媒体类型;未知时返回 `application/octet-stream`。"""
    mime_type = file_metadata.get("mime_type")
    if isinstance(mime_type, str) and "/" in mime_type:
        return mime_type
    name = _attachment_name(file_metadata)
    if name:
        guessed, _ = mimetypes.guess_type(name)
        if guessed:
            return guessed
    return "application/octet-stream"

to_openai_content_parts

Python
to_openai_content_parts(content: ExtractedContent, *, prompt: str | None = None, text_label: str = 'Attachment text') -> list[dict[str, Any]]

feishu.attachments.extractor.ExtractedContent 转为 OpenAI Chat Completions 的多模态 content 列表。

prompt(产品提供的分析提示词)作为首个文本块;随后是中性的元信息、抽取文本与图片(data URL)。SDK 不 内置任何分析提示词,措辞由调用方决定。

参数:

名称 类型 描述 默认

content

ExtractedContent

提取结果。

必需

prompt

str | None

置于最前的分析提示词(产品文案)。

None

text_label

str

抽取文本块的标签。默认为 "Attachment text"

'Attachment text'

返回:

类型 描述
list[dict[str, Any]]

可作为 messages[].content 传给 OpenAI 兼容多模态接口的部件列表。

源代码位于: feishu/attachments/extractor.py
Python
def to_openai_content_parts(
    content: ExtractedContent, *, prompt: str | None = None, text_label: str = "Attachment text"
) -> list[dict[str, Any]]:
    r"""
    把 [feishu.attachments.extractor.ExtractedContent][] 转为 OpenAI Chat Completions 的多模态 `content` 列表。

    `prompt`(产品提供的分析提示词)作为首个文本块;随后是中性的元信息、抽取文本与图片(data URL)。SDK 不
    内置任何分析提示词,措辞由调用方决定。

    Args:
        content: 提取结果。
        prompt: 置于最前的分析提示词(产品文案)。
        text_label: 抽取文本块的标签。默认为 `"Attachment text"`。

    Returns:
        可作为 `messages[].content` 传给 OpenAI 兼容多模态接口的部件列表。
    """
    parts: list[dict[str, Any]] = []
    if prompt:
        parts.append({"type": "text", "text": prompt})
    header = f"kind={content.kind}; media_type={content.media_type}; size_bytes={content.size_bytes}"
    if content.total_pages is not None:
        header += f"; total_pages={content.total_pages}"
    if content.truncated:
        header += "; text_truncated=true"
    if content.note:
        header += f"; note={content.note}"
    parts.append({"type": "text", "text": header})
    if content.text:
        parts.append({"type": "text", "text": f"{text_label}:\n{content.text}"})
    for image in content.images:
        parts.append({"type": "image_url", "image_url": {"url": _data_url(image.data, image.media_type)}})
    return parts