
Reading a single large Parquet file into multiple partitions using dask / dask-cudf



I am trying to read a single large Parquet file (size > gpu_size) using dask_cudf / dask, but it is currently being read into a single partition. I am guessing this is the expected behavior, judging from the docstring:

dask.dataframe.read_parquet(path, columns=None, filters=None, categories=None, index=None, storage_options=None, engine='auto', gather_statistics=None, **kwargs):

    Read a Parquet file into a Dask DataFrame
    This reads a directory of Parquet data into a Dask.dataframe, one file per partition. 
    It selects the index among the sorted columns if any exist.


Is there a workaround I can use to read it into multiple partitions?



Parquet datasets can be saved into separate files. Each file may contain separate row groups. Dask DataFrame reads each Parquet row group into a separate partition.


Based on what you're saying it sounds like your dataset has only a single row group. If that is the case then unfortunately there is nothing that Dask can really do here.


You might want to go back to the source of the data to see how it was saved and verify that whatever process is saving this dataset does it in a way where it is not creating very large row groups.
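As a rough sketch of that check, the snippet below uses pyarrow (an assumption; the original thread does not name a tool) to count the row groups in a file and, if there is only one, to rewrite the file with smaller ones. The helper names `rows_per_row_group`, `describe_row_groups`, and `rewrite_with_smaller_row_groups` are hypothetical, and the rewrite assumes the whole table fits in host (CPU) memory, which may not hold for every dataset.

```python
import math


def rows_per_row_group(num_rows, file_bytes, target_bytes):
    """Pick a row count per row group so each group is roughly target_bytes.

    file_bytes is the on-disk (compressed) size, so the in-memory size of
    each group will usually be larger; treat the result as a starting point.
    """
    if num_rows <= 0 or file_bytes <= 0 or target_bytes <= 0:
        raise ValueError("all arguments must be positive")
    n_groups = math.ceil(file_bytes / target_bytes)
    return max(1, math.ceil(num_rows / n_groups))


def describe_row_groups(path):
    """Return (num_rows, num_row_groups), read from the Parquet footer only."""
    import pyarrow.parquet as pq  # lazy import: requires pyarrow to be installed

    meta = pq.ParquetFile(path).metadata
    return meta.num_rows, meta.num_row_groups


def rewrite_with_smaller_row_groups(src, dst, row_group_size):
    """Rewrite src to dst with at most row_group_size rows per row group.

    Assumption: the full table fits in host memory.
    """
    import pyarrow.parquet as pq  # lazy import: requires pyarrow to be installed

    table = pq.read_table(src)
    pq.write_table(table, dst, row_group_size=row_group_size)
```

For example, a 10 GiB file with 1,000,000 rows and a 1 GiB target per group gives `rows_per_row_group(1_000_000, 10 * 1024**3, 1024**3)`, i.e. 100,000 rows per row group. After rewriting, `describe_row_groups` on the new file should report more than one row group, which gives Dask something to split on.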

