当前位置: 首页 > news >正文

Stop Using CSVs for Storage – Here Are the Top 5 Alternatives

CSVs cost you time, disk space, and money. Here are five alternatives every data scientist must know.

Dario Radečić

Sep 20, 2021

7 min read

Share

Photo by jaikishan patel on Unsplash

Everyone and their grandmother know what a CSV file is.But is it the optimal way to store data?Heck no. It’s probably the worst storage format if you don’t plan to view or edit data on the fly.

If you’re storing large volumes of data, opting for CSVs will cost you both time and money.

Today you’ll learn about five CSV alternatives. Each provides an advantage, either in read/write time or in file size. Some are even better in all areas.

Let’s set up the environment before going over the file formats.


Getting started – Environment setup

You’ll need a couple of libraries to follow along. The best practice is to install them inside a virtual environment, so that’s exactly what you’ll do. The following code snippet creates a new virtual environment through Anaconda and installs every required library:

conda create --name file_formats python=3.8 conda activate file_formats conda install -c conda forge numpy pandas fastavro pyarrow feather-format jupyter jupyterlab

Once the installation finishes, you can execute the following command to start a JupyterLab session:

jupyter lab

The next step is to import the libraries and create an arbitrary dataset. You’ll make one with 5 columns and 10M rows:

import numpy as np import pandas as pd import feather import pickle import pyarrow as pa import pyarrow.orc as orc from fastavro import writer, reader, parse_schema np.random.seed = 42 df_size = 10_000_000 df = pd.DataFrame({ 'a': np.random.rand(df_size), 'b': np.random.rand(df_size), 'c': np.random.rand(df_size), 'd': np.random.rand(df_size), 'e': np.random.rand(df_size) }) df.head()

Here’s how it looks like:

Image 1 – Random dummy dataset (image by author)

You now have everything needed to start experimenting with different data formats. Let’s cover ORC first.


ORC

ORC stands forOptimized Row Columnar. It’s a data format optimized for reads and writes in Hive. As Hive is painfully slow, folks at Hortonworks decided to develop the ORC file format to speed it up.

In Python, you can use theread_orc()function from Pandas to read ORC files. Unfortunately, there’s no alternative function for writing ORC files, so you’ll have to use PyArrow.

Here’s an example of writing Pandas DataFrames:

table = pa.Table.from_pandas(df, preserve_index=False) orc.write_table(table, '10M.orc')

And here’s the command for reading ORC files:

df = pd.read_orc('10M.orc')

You can learn more about ORC here:

Stop Using CSVs for Storage – This File Format is Faster and Lighter


Avro

Avro is an open-source project which provides services of data serialization and exchange for Apache Hadoop. It stores a JSON-like schema with the data, so the correct data types are known in advance. That’s where the compression happens.

Avro has an API for every major programming language, but it doesn’t support Pandas by default.

Here’s the set of commands for saving a Pandas DataFrame to an Avro file:

# 1. Define the schema schema = { 'doc': 'Float data', 'name': 'Data', 'namespace': 'data', 'type': 'record', 'fields': [ {'name': 'a', 'type': 'float'}, {'name': 'b', 'type': 'float'}, {'name': 'c', 'type': 'float'}, {'name': 'd', 'type': 'float'}, {'name': 'e', 'type': 'float'}, ] } parsed_schema = parse_schema(schema) # 2. Convert pd.DataFrame to records - list of dictionaries records = df.to_dict('records') # 3. Write to Avro file with open('10M.avro', 'wb') as out: writer(out, parsed_schema, records)

Reading Avro files is no picnic either:

# 1. List to store the records avro_records = [] # 2. Read the Avro file with open('10M.avro', 'rb') as fo: avro_reader = reader(fo) for record in avro_reader: avro_records.append(record) # 3. Convert to pd.DataFrame df = pd.DataFrame(avro_records)

You can learn more about Avro here:

CSV Files for Storage? Absolutely Not. Use Apache Avro Instead


Parquet

Apache Parquet is a data storage format designed for efficiency. The reason behind this is the column storage architecture, as it allows you to skip data that isn’t relevant quickly. This way, both queries and aggregations are faster, resulting in hardware savings.

The best news is – Pandas has full support for Parquet files.

Here’s the command for writing a Pandas DataFrame to a Parquet file:

df.to_parquet('10M.parquet')

And here’s the equivalent for reading:

df = pd.read_parquet('10M.parquet')

You can learn more about Parquet here:

CSV Files for Storage? No Thanks. There’s a Better Option


Pickle

You can use thepicklemodule to serialize objects and save them to a file. Likewise, you can then deserialize the serialized file to load them back when needed. Pickle has one major advantage over other formats – you can use it to store any Python object. One of the most widely used functionalities is saving machine learning models after the training is complete.

The biggest downside is that Pickle is Python-specific, so cross-language support isn’t guaranteed. It could be a deal-breaker for any project requiring data communication between Python and R, for example.

Here’s how to write a Pandas DataFrame to a Pickle file:

with open('10M.pkl', 'wb') as f: pickle.dump(df, f)

You’ll only have to change the file mode when reading a Pickle file:

with open('10M.pkl', 'rb') as f: df = pickle.load(f)

You can learn more about Pickle here:

Stop Using CSVs for Storage – Pickle is an 80 Times Faster Alternative


Feather

Feather is a data format for storing data frames. It’s designed around a simple premise – to push data frames in and out of memory as efficiently as possible. It was initially designed for fast communication between Python and R, but you’re not limited to this use case.

You can use thefeatherlibrary to work with Feather files in Python. It’s the fastest available option currently.

Here’s the command for saving Pandas DataFrames to a Feather file:

feather.write_dataframe(df, '10M.feather')

And here’s the command for reading:

df = feather.read_dataframe('10M.feather')

You can learn more about Feather here:

Stop Using CSVs for Storage – This File Format Is 150 Times Faster


Comparison time – Which data format should you use?

Many highly optimized file formats are useless if you need to change or even view the data on the fly. If that’s not the case, you should generally avoid CSVs.

Below is a comparison between CSV and every other mentioned data format in write time. The goal was to save previously created 10Mx5 dataset locally:

Image 2 – Write time comparison in seconds (CSV: 34.7; ORC: 9.66; Avro: 9.58; Parquet: 2.06; Pickle: 0.5; Feather: 0.304) (image by author)

The differences are astronomical. Feather is about 115 times faster than CSV for storing identical datasets. Even if you decide to go with something more compatible such as Parquet, there’s still a 17 times decrease in write time.

Let’s talk about the read times next. The goal is to compare how much time it takes to read identical datasets in different formats:

Image 3 – Read time comparison in seconds (CSV: 3.83; ORC: 3.02, Avro: 27.6; Parquet: 1.23; Pickle: 0.193; Feather: 1.16) (image by author)

CSVs aren’t so terrible here. Apache Avro is the absolute worse due to required parsing. Pickle is the fastest one, so it looks like the most promising option if you’re working only in Python.

And finally, let’s compare the file sizes on the disk:

Image 4 – File size comparison in MB (CSV: 963.5; ORC: 379.5; Avro: 200.3; Parquet: 401.4; Pickle: 400; Feather: 400.1) (image by author)

Things don’t look good for CSVs. The file size reduction goes from 2.4x to 4.8x, depending on the file format.

To summarize, if you store gigabytes of data daily, choosing the correct file format is crucial. If you’re working only in Python, you can’t go wrong with Pickle. If you need something a bit more versatile, go with any other mentioned format.


What are your thoughts on these CSV alternatives? Which one(s) do you use if viewing and editing data on the fly isn’t required?Let me know in the comments below.

http://www.cnnetsun.cn/news/138394.html

相关文章:

  • 2025年战略咨询在行业标准演进中的推动力
  • 【电商API接口】电商平台价格监控行业全景:数据驱动的定价革命
  • java计算机毕业设计蔬菜配送系统 生鲜直配平台的设计与实现 社区蔬菜一站式采购与配送管理系统
  • dubbo源码之一次RPC请求的生死之旅(基于Dubbo 2.7.8)
  • 基于SpringBoot+Vue的web城乡居民基本医疗信息管理系统管理系统设计与实现【Java+MySQL+MyBatis完整源码】
  • 【完整源码+数据集+部署教程】手势与标志识别检测系统源码[一条龙教学YOLOV8标注好的数据集一键训练_70+全套改进创新点发刊_Web前端展示]
  • 03.统计学机器学习
  • [Poi2011]Lightning Conductor题解
  • 一文读懂大模型:收藏级教程,助你从入门到精通
  • Nginx云计算大数据——安装AND版本升级(普通升级+平滑升级+失败回滚)
  • GPT-5.2 实测数据流出:逻辑推理性能翻倍,大模型“幻觉”真的被终结了吗?
  • SQL SERVER——通过计划任务方式每月对配置数据、审计数据等进行备份
  • 前端——跨平台桌面应用开发实践
  • OpenAI 的反击!GPT-5.2 强行拉开代差,Gemini 3 和 Claude 4 还有机会吗?
  • 零售打工人加薪难?靠这张证,我在激烈竞争里站稳了脚跟
  • 基于springboot的多媒体素材库的开发与应用毕业论文+PPT(附源代码+演示视频)
  • 从离线语音到多模态智能体四博智联 AI 硬件整体解决方案全景解析
  • 我发现跨医院联合训练让诊断准确率飙升后来才知道是横向联邦学习在数据孤岛中的绝招
  • 性能压测工具:wrk
  • 论文引用标注工具排名2025:6大平台+自动规范推荐
  • Kotaemon AWS EC2部署实例:国际业务首选
  • 实在没货,简历(软件测试)咋写?
  • 网约车服务端线上流量巡检与测试验收技术
  • 公考日记7
  • 火电一次调频、自抗扰调频及群智能算法智能调频在MATLAB/Simulink中的应用
  • 科研实验室温湿度监控新范式:以太网 POE 技术全场景解决方案
  • RV1126 NO.57:ROCKX+RV1126人脸识别推流项目之读取人脸图片并把特征值保存到sqlite3数据库
  • 探索SAR ADC:45nm工艺下的高速高精度设计
  • 【小增长技术团队东哥分享】Electron vs Electron-Vite vs Electron-Egg:桌面端开发到底该选谁?
  • 测试价值的量化评估:从成本中心到价值证明的路径探索