How to upload a large file (≥3GB) to a FastAPI backend?

2024-03-08

I am trying to upload a large file (≥3GB) to my FastAPI server without loading the entire file into memory, as my server has only 2GB of free memory.

Server side:

async def uploadfiles(upload_file: UploadFile = File(...)):
    ...  # endpoint body not shown in the question

Client side:

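import requests
from requests_toolbelt import MultipartEncoder

# No filename and no Content-Type header are passed here, which is what the answer below addresses
m = MultipartEncoder(fields={"upload_file": open(file_name, 'rb')})
prefix = "http://xxx:5000"
url = "{}/v1/uploadfiles".format(prefix)
try:
    req = requests.post(url, data=m, verify=False)
except requests.RequestException as e:
    print(e)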

HTTP 422 {"detail":[{"loc":["body","upload_file"],"msg":"field required","type":"value_error.missing"}]}

I am not sure what MultipartEncoder actually sends to the server, such that the request does not match. Any ideas?

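With the requests-toolbelt library, you have to pass the filename as well when declaring the field for upload_file, and also set the Content-Type header. The missing header is the main reason for the error you get: you are sending the request without setting the Content-Type header to multipart/form-data, followed by the necessary boundary https://stackoverflow.com/questions/3508338/what-is-the-boundary-in-multipart-form-data string, as shown in the documentation https://toolbelt.readthedocs.io/en/latest/uploading-data.html. Example: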

filename = 'my_file.txt'
m = MultipartEncoder(fields={'upload_file': (filename, open(filename, 'rb'))})
r = requests.post(url, data=m, headers={'Content-Type': m.content_type})
print(r.request.headers)  # confirm that the 'Content-Type' header has been set
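
To make it clearer what a multipart encoder puts on the wire, and why the server needs both the filename and the boundary from the Content-Type header, here is a minimal hand-rolled sketch using only the standard library. The helper name encode_multipart and the fixed boundary are assumptions made for illustration; real encoders such as MultipartEncoder generate a random boundary and stream the file instead of building the whole body in memory.

```python
def encode_multipart(field: str, filename: str, content: bytes,
                     boundary: str = "example-boundary"):
    """Build a one-part multipart/form-data body by hand (illustration only)."""
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        # The filename parameter tells the server that this part is a file.
        f'Content-Disposition: form-data; name="{field}"; '
        f'filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n",
        b"\r\n",
        content,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    # The server can only split the body if it is told the boundary here.
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return body, headers


body, headers = encode_multipart("upload_file", "my_file.txt", b"hello")
print(headers["Content-Type"])
print(body.decode())
```

Without the header, the server has no way to know the body is multipart at all, so the upload_file field is never found, which matches the "field required" error in the question.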

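However, I wouldn't recommend using a library (i.e., requests-toolbelt https://github.com/requests/toolbelt) that had not seen a new release for over three years at the time of writing. I would suggest using Python requests instead, as demonstrated in this answer https://stackoverflow.com/a/70657621/17865804 and that answer https://stackoverflow.com/a/70641755/17865804 (also see Streaming Uploads and Chunk-Encoded Requests https://requests.readthedocs.io/en/latest/user/advanced/#streaming-uploads), or, preferably, the HTTPX https://github.com/encode/httpx/ library, which supports async requests (useful if you have to send multiple requests simultaneously) and streams File uploads by default, meaning that only one chunk at a time is loaded into memory (see the documentation https://www.python-httpx.org/advanced/#multipart-file-encoding). Examples are given below.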
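
The idea behind a streaming upload, where only one chunk is held in memory at a time, can be sketched with a plain generator. This is a standard-library illustration, not how requests or HTTPX implement it internally; the name read_in_chunks and the 1 MiB default chunk size are assumptions. An iterator like this can be given to requests as the data argument for a chunk-encoded request, but note that it sends the raw bytes, not a multipart/form-data body.

```python
import os
import tempfile


def read_in_chunks(path: str, chunk_size: int = 1024 * 1024):
    """Yield the file's bytes one chunk at a time instead of all at once."""
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk


# Small self-check with a throwaway 2.5 MiB file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * (5 * 512 * 1024))
    path = tmp.name

sizes = [len(chunk) for chunk in read_in_chunks(path)]
os.remove(path)
print(sizes)  # two full 1 MiB chunks followed by a 0.5 MiB remainder
```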

Option 1 (Fast) - Upload `File` and `Form` data using `.stream()`

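As previously explained in detail in this answer https://stackoverflow.com/a/70667530/17865804, when you declare an UploadFile https://fastapi.tiangolo.com/tutorial/request-files/#uploadfile object, FastAPI/Starlette, under the hood, uses a SpooledTemporaryFile with the max_size attribute set to 1MB. This means that the file data is spooled in memory until the file size exceeds max_size, at which point the contents are written to disk; more specifically, to a temporary file in your OS's temporary directory (see this answer https://stackoverflow.com/a/71377044/17865804 on how to find/change the default temporary directory), from which you later need to read the data using the .read() method. Hence, this whole process makes uploading a file quite slow, especially if it is a large file (as you will see in Option 2 further below).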
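
The spooling behaviour described above can be observed directly with the standard library. The sketch below assumes CPython, since it peeks at the private _rolled attribute of SpooledTemporaryFile, which is an implementation detail and not a public API; it is meant only to show when the data moves from memory to disk.

```python
import tempfile

MAX_SIZE = 1024 * 1024  # 1 MB, the threshold mentioned above

with tempfile.SpooledTemporaryFile(max_size=MAX_SIZE) as f:
    f.write(b"a" * MAX_SIZE)          # exactly at the limit: still in memory
    in_memory_first = not f._rolled   # _rolled is a private CPython attribute
    f.write(b"b")                     # one byte over the limit: rolled to disk
    on_disk_after = f._rolled

print(in_memory_first, on_disk_after)
```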

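To avoid that and speed up the process, as the linked answer above suggested, one can access the request body as a stream. As per the Starlette documentation https://www.starlette.io/requests/#body, if you use the .stream() https://github.com/encode/starlette/blob/b8ea367b4304a98653ec8ce9c794ad0ba6dcaf4b/starlette/requests.py#L208 method, the (request) byte chunks are provided without storing the entire body in memory (and later in a temporary file, if the body size exceeds 1MB). This method allows you to read and process the byte chunks as they arrive.

The below takes the suggested solution a step further by using the streaming-form-data https://github.com/siddhantgoel/streaming-form-data library, which provides a Python parser for parsing streaming multipart/form-data input chunks. This means that not only can you upload Form data along with File(s), but you also don't have to wait for the entire request body to be received before you start parsing the data. The way it is done is that you initialise the main parser class (passing the HTTP request headers, which help to determine the input Content-Type and hence the boundary https://stackoverflow.com/questions/3508338/what-is-the-boundary-in-multipart-form-data string used to separate each body part in the multipart payload, etc.), and associate one of the Target https://streaming-form-data.readthedocs.io/en/latest/#target-classes classes with each field to define what should be done with the field once it has been extracted from the request body. For instance, FileTarget https://streaming-form-data.readthedocs.io/en/latest/#filetarget streams the data to a file on disk, whereas ValueTarget https://streaming-form-data.readthedocs.io/en/latest/#valuetarget holds the data in memory (this class can be used for either Form or File data, if you don't need the file(s) saved to disk). It is also possible to define your own custom Target classes https://streaming-form-data.readthedocs.io/en/latest/#custom-target-classes.

I have to mention that the streaming-form-data https://github.com/siddhantgoel/streaming-form-data library does not currently support async calls for I/O operations, meaning that the writing of chunks happens synchronously (within a def function). However, as the endpoint below uses .stream() https://github.com/encode/starlette/blob/b8ea367b4304a98653ec8ce9c794ad0ba6dcaf4b/starlette/requests.py#L208 (which is an async function), it gives up control for other tasks/requests to run on the event loop while waiting for data to become available from the stream. You could also run the function that parses the received data in a separate thread and await it, using Starlette's run_in_threadpool() https://github.com/encode/starlette/blob/b8ea367b4304a98653ec8ce9c794ad0ba6dcaf4b/starlette/concurrency.py#L35, e.g., await run_in_threadpool(parser.data_received, chunk). FastAPI uses the same function internally when you call the async methods of UploadFile, as shown here https://github.com/encode/starlette/blob/f6ea760a80d8b109fb6afd1c03e9a33754e6bb5f/starlette/datastructures.py#L456. For more details on def vs async def, please have a look at this answer https://stackoverflow.com/a/71517830/17865804.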
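
The pattern of consuming chunks as they arrive while pushing the blocking write to a worker thread can be tried out without a server. In this standard-library sketch, fake_stream is a hypothetical stand-in for request.stream(), and asyncio.to_thread (Python 3.9+) plays the role that run_in_threadpool plays in Starlette; it does not parse multipart data and only shows the control flow.

```python
import asyncio
import os
import tempfile


async def fake_stream(chunks):
    """Stand-in for request.stream(): yields byte chunks as they 'arrive'."""
    for chunk in chunks:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield chunk


async def save_stream(stream, path: str) -> int:
    """Write each chunk as it arrives, doing the blocking write in a thread."""
    total = 0
    with open(path, "wb") as f:
        async for chunk in stream:
            # The synchronous write runs in a worker thread and is awaited.
            await asyncio.to_thread(f.write, chunk)
            total += len(chunk)
    return total


fd, path = tempfile.mkstemp()
os.close(fd)
written = asyncio.run(save_stream(fake_stream([b"abc", b"de", b"f"]), path))
with open(path, "rb") as f:
    saved = f.read()
os.remove(path)
print(written, saved)
```

At no point is more than one chunk held by the coroutine, which is the property that keeps memory usage flat for multi-gigabyte uploads.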

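You can also perform certain validation tasks, e.g., ensuring that the input size does not exceed a certain value. This can be done using the MaxSizeValidator https://github.com/siddhantgoel/streaming-form-data/blob/d900c1f750896e7221d7896aab4ff892b91730a2/streaming_form_data/validators.py#L5. However, as this is only applied to the fields you defined, it does not prevent a malicious user from sending an extremely large request body, which could consume server resources to the point that the application ends up crashing. For that reason, the below incorporates a custom MaxBodySizeValidator class that makes sure the request body size does not exceed a pre-defined value. Together, the two validators limit the upload file size (as well as the entire request body size) in a likely better way than the one described here https://github.com/tiangolo/fastapi/issues/362#issuecomment-584104025, which uses UploadFile and hence requires the file to be entirely received and saved to the temporary directory before the check is performed (not to mention that that approach does not take the request body size into account at all). Using an ASGI middleware such as this https://github.com/steinnes/content-size-limit-asgi would be an alternative solution for limiting the request body.

Also, in case you are using Gunicorn with Uvicorn https://fastapi.tiangolo.com/deployment/server-workers/#gunicorn-with-uvicorn-workers, you can define limits with regard to, for example, the number of HTTP header fields in a request, the size of an HTTP request header field, and so on (see the documentation https://docs.gunicorn.org/en/stable/settings.html?highlight=limit#security). Similar limits can be applied when using reverse proxy servers, such as Nginx (which also allows you to set the maximum request body size using the client_max_body_size http://nginx.org/en/docs/http/ngx_http_core_module.html#client_max_body_size directive).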

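A few notes on the example below. Since it uses the Request object directly, and not UploadFile and Form objects, the endpoint won't be properly documented in the auto-generated docs at /docs (if that matters for your app). This also means that you have to perform some checks yourself, such as whether the required fields for the endpoint were received and whether they were in the expected format. For instance, for the data field, you could check whether data.value is empty (empty means that the user either did not include that field in the multipart/form-data or sent an empty value), as well as whether the value has the expected type and format (data.value holds bytes, which is why the example calls data.value.decode() before using it). As for the file(s), you can check whether file_.multipart_filename is not empty; however, since some clients may not include a filename in the Content-Disposition https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Disposition#as_a_header_for_a_multipart_body header, you may also want to check whether the file exists in the filesystem, using os.path.isfile(filepath) (Note: you need to make sure there is no pre-existing file with the same name in the specified location; otherwise, that function would always return True, even when the user did not send the file).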
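
One possible way to group these manual checks is a small helper that is called after parsing has finished. The function name check_upload and its ValueError messages are hypothetical, and the rule of falling back to os.path.isfile only when no filename was sent is one interpretation of the notes above; in the endpoint you would pass data.value, file_.multipart_filename and filepath, and translate the ValueError into an HTTPException.

```python
import os
from typing import Optional


def check_upload(data_value: bytes, multipart_filename: Optional[str],
                 filepath: str) -> str:
    """Return the decoded form value, or raise ValueError describing the problem."""
    if not data_value:
        raise ValueError("'data' field is missing or empty")
    try:
        text = data_value.decode()
    except UnicodeDecodeError:
        raise ValueError("'data' field is not valid UTF-8 text") from None
    # The filename may be absent from Content-Disposition, so also check the disk.
    if not multipart_filename and not os.path.isfile(filepath):
        raise ValueError("file is missing")
    return text


print(check_upload(b"Hello World!", "bigFile.zip", "./bigFile.zip"))
```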

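Regarding the applied size limits, the MAX_REQUEST_BODY_SIZE below must be larger than the MAX_FILE_SIZE (plus the size of all the Form values) you expect to receive, because the raw request body (which you get by using the .stream() method) includes a few more bytes for the --boundary and the Content-Disposition header of each field in the body. Hence, you should add a few more bytes, depending on the Form values and the number of files you expect to receive (hence the MAX_FILE_SIZE + 1024 below).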
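
To get a feel for how many extra bytes the multipart framing adds, the overhead for one form field and one file can be estimated by hand. This is a rough sketch: the 32-character boundary and the exact header lines are assumptions, and real clients may send slightly different headers, so treat the result as an order of magnitude and not as an exact figure.

```python
def part_overhead(boundary: str, name: str, filename: str = "") -> int:
    """Bytes of framing that one multipart part adds around its payload."""
    header = f'Content-Disposition: form-data; name="{name}"'
    if filename:
        header += f'; filename="{filename}"\r\nContent-Type: application/octet-stream'
    # "--boundary\r\n" + header line(s) + blank line, then "\r\n" after the payload
    return len(f"--{boundary}\r\n{header}\r\n\r\n") + len("\r\n")


boundary = "0123456789abcdef0123456789abcdef"  # 32 chars, an assumed length
overhead = (
    part_overhead(boundary, "data")
    + part_overhead(boundary, "file", "bigFile.zip")
    + len(f"--{boundary}--\r\n")  # closing delimiter
)
print(overhead)
```

For one field and one file this comes to a few hundred bytes, which fits within the 1024-byte margin; with many fields or files the margin should grow accordingly.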

app.py

from fastapi import FastAPI, Request, HTTPException, status
from streaming_form_data import StreamingFormDataParser
from streaming_form_data.targets import FileTarget, ValueTarget
from streaming_form_data.validators import MaxSizeValidator
import streaming_form_data
from starlette.requests import ClientDisconnect
import os

MAX_FILE_SIZE = 1024 * 1024 * 1024 * 4  # = 4GB
MAX_REQUEST_BODY_SIZE = MAX_FILE_SIZE + 1024

app = FastAPI()

class MaxBodySizeException(Exception):
    def __init__(self, body_len: int):
        self.body_len = body_len

class MaxBodySizeValidator:
    def __init__(self, max_size: int):
        self.body_len = 0
        self.max_size = max_size

    def __call__(self, chunk: bytes):
        self.body_len += len(chunk)
        if self.body_len > self.max_size:
            raise MaxBodySizeException(body_len=self.body_len)
 
@app.post('/upload')
async def upload(request: Request):
    body_validator = MaxBodySizeValidator(MAX_REQUEST_BODY_SIZE)
    filename = request.headers.get('Filename')
    
    if not filename:
        raise HTTPException(status_code=status.HTTP_422_UNPROCESSABLE_ENTITY, 
            detail='Filename header is missing')
    try:
        filepath = os.path.join('./', os.path.basename(filename)) 
        file_ = FileTarget(filepath, validator=MaxSizeValidator(MAX_FILE_SIZE))
        data = ValueTarget()
        parser = StreamingFormDataParser(headers=request.headers)
        parser.register('file', file_)
        parser.register('data', data)
        
        async for chunk in request.stream():
            body_validator(chunk)
            parser.data_received(chunk)
    except ClientDisconnect:
        print("Client Disconnected")
    except MaxBodySizeException as e:
        raise HTTPException(status_code=status.HTTP_413_REQUEST_ENTITY_TOO_LARGE, 
           detail=f'Maximum request body size limit ({MAX_REQUEST_BODY_SIZE} bytes) exceeded ({e.body_len} bytes read)')
    except streaming_form_data.validators.ValidationError:
        raise HTTPException(status_code=status.HTTP_413_REQUEST_ENTITY_TOO_LARGE, 
            detail=f'Maximum file size limit ({MAX_FILE_SIZE} bytes) exceeded') 
    except Exception:
        raise HTTPException(status_code=status.HTTP_500_INTERNAL_SERVER_ERROR, 
            detail='There was an error uploading the file') 
   
    if not file_.multipart_filename:
        raise HTTPException(status_code=status.HTTP_422_UNPROCESSABLE_ENTITY, detail='File is missing')

    print(data.value.decode())
    print(file_.multipart_filename)
        
    return {"message": f"Successfully uploaded {filename}"}

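As mentioned earlier, to upload the data (on the client side), you can use the HTTPX library, which supports streaming file uploads by default and thus allows you to send large streams/files without loading them entirely into memory. You can pass additional Form data as well, using the data argument. Below, a custom header, i.e., Filename, is used to pass the filename to the server, so that the server instantiates the FileTarget class with that name (you could use the X- prefix for custom headers if you wish; however, it is not officially recommended anymore https://stackoverflow.com/questions/3561381/custom-http-headers-naming-conventions).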

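To upload multiple files, use a header for each file (or use random names on the server side and, once a file has been fully uploaded, optionally rename it using the file_.multipart_filename attribute), pass a list of files, as described in the documentation https://www.python-httpx.org/advanced/#multipart-file-encoding, and finally define the Target classes on the server side accordingly. Note: use a different field name for each file, so that they don't overlap when they are parsed on the server side, e.g., files = [('file', open('bigFile.zip', 'rb')),('file_2', open('bigFile2.zip', 'rb'))].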

test.py

import httpx
import time

url ='http://127.0.0.1:8000/upload'
files = {'file': open('bigFile.zip', 'rb')}
headers={'Filename': 'bigFile.zip'}
data = {'data': 'Hello World!'}

with httpx.Client() as client:
    start = time.time()
    r = client.post(url, data=data, files=files, headers=headers)
    end = time.time()
    print(f'Time elapsed: {end - start}s')
    print(r.status_code, r.json(), sep=' ')
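
For the multiple-file case mentioned above, the "random name first, rename afterwards" idea on the server side could look roughly like the following. The helper names temp_upload_path and finalize_upload are hypothetical; in the endpoint, the random path would be given to FileTarget, and file_.multipart_filename would be passed as client_filename once parsing has finished.

```python
import os
import tempfile
import uuid


def temp_upload_path(directory: str) -> str:
    """Pick a random name to stream the upload into."""
    return os.path.join(directory, uuid.uuid4().hex)


def finalize_upload(temp_path: str, client_filename: str) -> str:
    """Rename the finished upload, keeping only the base name the client sent."""
    final_path = os.path.join(os.path.dirname(temp_path),
                              os.path.basename(client_filename))
    os.replace(temp_path, final_path)
    return final_path


# Self-check in a throwaway directory.
upload_dir = tempfile.mkdtemp()
tmp_path = temp_upload_path(upload_dir)
with open(tmp_path, "wb") as f:
    f.write(b"payload")  # stands in for the streamed file data
final = finalize_upload(tmp_path, "../bigFile.zip")
print(os.path.basename(final))
```

Taking os.path.basename of the client-supplied name, as the endpoint above also does, keeps a crafted filename from escaping the upload directory. Note that os.replace silently overwrites an existing file with the same name, so a real application may want to check for that first.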

Upload both `File` and `JSON` body

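In case you would like to upload both file(s) and JSON instead of Form data, you can use the approach described in Method 3 of this answer https://stackoverflow.com/a/70640522/17865804, which also saves you from performing manual checks on the received Form fields, as explained earlier (see the linked answer for more details). To do that, make the following changes to the code above.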

app.py

#...
from fastapi import Form
from pydantic import BaseModel, ValidationError
from typing import Optional
from fastapi.encoders import jsonable_encoder

#...

class Base(BaseModel):
    name: str
    point: Optional[float] = None
    is_accepted: Optional[bool] = False
  
def checker(data: str = Form(...)):
    try:
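        # parse_raw() is the Pydantic v1 API; on Pydantic v2 the equivalent is Base.model_validate_json(data)
        return Base.parse_raw(data)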
    except ValidationError as e:
        raise HTTPException(detail=jsonable_encoder(e.errors()), status_code=status.HTTP_422_UNPROCESSABLE_ENTITY)
        

@app.post('/upload')
async def upload(request: Request):
    #...
    
    # place the below after the try-except block in the example given earlier
    model = checker(data.value.decode())
    print(dict(model))

test.py

#...
import json

data = {'data': json.dumps({"name": "foo", "point": 0.13, "is_accepted": False})}
#...

Option 2 (Slow) - Upload `File` and `Form` data using `UploadFile` and `Form`

If you would like to use a normal def endpoint instead, see this answer https://stackoverflow.com/questions/63048825/how-to-upload-file-using-fastapi/70657621#70657621.

app.py

from fastapi import FastAPI, File, UploadFile, Form, HTTPException, status
import aiofiles
import os

CHUNK_SIZE = 1024 * 1024  # adjust the chunk size as desired
app = FastAPI()

@app.post("/upload")
async def upload(file: UploadFile = File(...), data: str = Form(...)):
    try:
        filepath = os.path.join('./', os.path.basename(file.filename))
        async with aiofiles.open(filepath, 'wb') as f:
            while chunk := await file.read(CHUNK_SIZE):
                await f.write(chunk)
    except Exception:
        raise HTTPException(status_code=status.HTTP_500_INTERNAL_SERVER_ERROR, 
            detail='There was an error uploading the file')
    finally:
        await file.close()

    return {"message": f"Successfully uploaded {file.filename}"}

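As mentioned earlier, with this option the file upload takes longer to complete, and since HTTPX uses a default timeout of 5 seconds, you will most likely get a ReadTimeout exception (as the server needs some time to read the SpooledTemporaryFile in chunks and write the contents to a permanent location on disk). Thus, you can configure the timeout https://www.python-httpx.org/advanced/#timeout-configuration (see also the Timeout https://github.com/encode/httpx/blob/9baf3a6cd2fa9ebeb17dba5a3e5c6e9e0af83a96/httpx/_config.py#L189 class in the source code), and more specifically the read timeout, which "specifies the maximum duration to wait for a chunk of data to be received (for example, a chunk of the response body)". If it is set to None instead of some positive numerical value, there will be no timeout on read.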

test.py

import httpx
import time

url ='http://127.0.0.1:8000/upload'
files = {'file': open('bigFile.zip', 'rb')}
headers={'Filename': 'bigFile.zip'}
data = {'data': 'Hello World!'}
timeout = httpx.Timeout(None, read=180.0)

with httpx.Client(timeout=timeout) as client:
    start = time.time()
    r = client.post(url, data=data, files=files, headers=headers)
    end = time.time()
    print(f'Time elapsed: {end - start}s')
    print(r.status_code, r.json(), sep=' ')
