ParaFS: A Log-Structured File System
to Exploit the Internal Parallelism of Flash Devices
Jiacheng Zhang Jiwu Shu
Youyou Lu
Department of Computer Science and Technology, Tsinghua University
Tsinghua National Laboratory for Information Science and Technology
[email protected] {shujw, luyouyou}@tsinghua.edu.cn
Abstract
File system designs are undergoing rapid evolution to
exploit the potentials of flash memory. However, the
internal parallelism, a key feature of flash devices, is
hard to leverage at the file system level, due to
the semantic gap caused by the flash translation layer
(FTL). We observe that even flash-optimized file systems
have serious garbage collection problems, which lead to
significant performance degradation under write-intensive
workloads on multi-channel flash devices.
In this paper, we propose ParaFS to exploit the internal
parallelism while ensuring efficient garbage collection.
ParaFS is a log-structured file system over a simpli-
fied block-level FTL that exposes the physical layout.
With the knowledge of device information, ParaFS first
proposes 2-D data allocation, to maintain the hot/cold
data grouping in flash memory while exploiting channel-
level parallelism. ParaFS then coordinates the garbage
collection in both FS and FTL levels, to make garbage
collection more efficient. In addition, ParaFS sched-
ules read/write/erase requests over multiple channels
to achieve consistent performance. Evaluations show
that ParaFS effectively improves system performance for
write-intensive workloads by 1.6× to 3.1×, compared to
the flash-optimized F2FS file system.
1 Introduction
Flash memory has been widely adopted, from embedded
systems to data centers, in the past few years. At the
device level, flash devices outperform hard disk drives
(HDDs) by orders of magnitude in terms of both latency
and bandwidth. At the system level, the latency benefit
has been made visible to the software by redesigning
the I/O stack [8, 43, 22, 11]. But unfortunately, the
bandwidth benefit, which is mostly contributed by the
internal parallelism of flash devices [13, 18, 10, 15], is
underutilized in system software.
(Corresponding author: Jiwu Shu, [email protected].)
Researchers have made great efforts in designing a file
system for flash storage. New file system architectures
have been proposed, to either improve data allocation
performance, by removing redundant functions between
FS and FTL [20, 45], or improve flash memory en-
durance, by using an object-based FTL to facilitate hard-
ware/software co-designs [33]. Features of flash memory
have also been leveraged to redesign efficient metadata
mechanisms. For instance, the imbalanced read/write
feature has been studied to design a persistence-efficient
file system directory tree [32]. F2FS, a less aggressive
design than the above-mentioned designs, is a flash-
optimized file system and has been merged into the Linux
kernel [28]. However, the internal parallelism has not
been well studied in these file systems.
Internal parallelism, however, is a key design is-
sue for improving file system performance on flash storage.
Even though it has been well studied at the FTL level,
file systems cannot always gain the benefits. We observe
that, even though the flash-optimized F2FS file system
outperforms the legacy file system Ext4 when write
traffic is light, its performance does not scale well with
the internal parallelism when write traffic is heavy (48%
of Ext4 in the worst case, more details in Section 4.3 and
Figure 6).
After investigating file systems, we have the fol-
lowing three observations. First, optimizations in both
FS and FTL are made but may collide. For example, data
pages (in this paper, we use the term data pages rather than
data blocks, to avoid confusion with flash blocks) are
grouped into hot/cold groups in some file systems,
so as to reduce garbage collection (GC) overhead.
FTL stripes data pages over different parallel units (e.g.,
channels) to achieve parallel performance, but breaks up
the hot/cold groups. Second, duplicated log-structured
data management in both FS and FTL leads to inefficient
garbage collection. FS-level garbage collection is unable
to erase physical flash blocks, while FTL-level garbage
collection is unable to select the right victim blocks due
to the lack of semantics. Third, isolated I/O scheduling
in either FS or FTL results in unpredictable I/O latencies,
which is a new challenge in storage systems on low-
latency flash devices.
To exploit the internal parallelism while keeping low
garbage collection overhead, we propose ParaFS, a log-
structured file system over a simplified block-level FTL.
The simplified FTL exposes the device information to
the file system. With the knowledge of the physical
layout, ParaFS takes the following three approaches to
exploit the internal parallelism. First, it fully exploits the
channel-level parallelism of flash memory by page-unit
striping, while ensuring data grouping physically. Sec-
ond, it coordinates the garbage collection processes in
the FS and FTL levels, to make GC more efficient. Third,
it optimizes the request scheduling on read, write and
erase requests, and gains more consistent performance.
Our major contributions are summarized as follows:
• We observe that the internal parallelism of flash devices
is underutilized when write traffic is heavy, and pro-
pose ParaFS, a parallelism-aware file system over a
simplified FTL.
• We propose parallelism-aware mechanisms in the
ParaFS file system, to mitigate the parallelism’s
conflicts with hot/cold grouping, garbage collec-
tion, and I/O performance consistency.
• We implement ParaFS in the Linux kernel. Eval-
uations using various workloads show that ParaFS
has higher and more consistent performance with
significantly lower garbage collection overhead, com-
pared to legacy file systems including the flash-
optimized F2FS.
The rest of this paper is organized as follows. Sec-
tion 2 analyses the parallelism challenges and mo-
tivates the ParaFS design. Section 3 describes the
ParaFS design, including the 2-D allocation, coordi-
nated garbage collection and parallelism-aware schedul-
ing mechanisms. Evaluations of ParaFS are shown in
Section 4. Related work is given in Section 5 and the
conclusion is made in Section 6.
2 Motivation
2.1 Flash File System Architectures
Flash devices are seamlessly integrated into legacy stor-
age systems by introducing a flash translation layer
(FTL). Though NAND flash has low access latency
and high I/O bandwidth, it has several limitations, e.g.,
erase-before-write and write endurance (i.e., limited pro-
gram/erase cycles). The log-structured design is considered
a good fit for flash memory. It is adopted in both
the file system level (e.g., F2FS [28], and NILFS [27])
and the FTL level [10, 36]. The FTL hides the flash
limitations by updating data in a log-structured way
using an address mapping table. It exports the same I/O
interface as HDDs. With the use of the FTL, legacy file
systems can directly run on these flash devices, without
any changes. This architecture is shown in Figure 1(a).
[Figure 1: FS Architectures on Flash Storage. (a) Block Interface: a log-structured file system (namespace, allocation, GC) runs over an FTL (allocation, mapping, GC, WL, ECC) through the READ/WRITE/TRIM interface. (b) Object Interface: an object-based file system (namespace, allocation) runs over an object-based FTL (mapping, GC, WL, ECC) through GET/PUT. In both, the FTL manages flash memory over Channels 0 to N.]
While FTL is good at abstracting the flash memory
as a block device, it widens the semantic gap between
file systems and flash memory. As shown in Figure 1(a),
the FS and FTL run independently without knowing
each other’s behaviours, resulting in inefficient storage
management. Even worse, the two levels have redundant
functions, e.g., space allocation, address mapping, and
garbage collection, which further induce performance
and endurance overhead.
Object-based FTL (OFTL) [33] is a recently proposed
architecture to bridge the semantic gap between file
systems and FTLs, as shown in Figure 1(b). However,
it requires dramatic changes to the current I/O stack, and
is difficult to adopt at present, either in the device
due to complexity or in the operating system due to the rich
interfaces required in the drivers.
2.2 Challenges of Internal Parallelism
In a flash device, a number of flash chips are connected
to the NAND flash controller through multiple chan-
nels. I/O requests are distributed to different channels
to achieve high parallel performance, and this is known
as the internal parallelism. Unfortunately, this feature
has not been well leveraged in the file systems for the
following three challenges.
Hot/Cold Grouping vs. Internal Parallelism. With
the use of the FTL, file systems allocate data pages in a
one-dimensional linear space. The linear space works fine
for HDDs, because a hard disk accesses sectors serially
due to the mechanical structure, but not for flash devices.
In flash-based storage systems, a fine hot/cold grouping
reduces the number of valid pages in victim flash blocks
during garbage collection. Some file systems separate
data into groups with different hotness for GC efficiency.
Meanwhile, the FTL tries to allocate flash pages from
different parallel units, aiming at high bandwidth. Data
pages that belong to the same group in the file system
may be written to different parallel units. Parallel space
allocation in FTL breaks hot/cold data groups. As such,
the hot/cold data grouping in the file system collides with
the internal parallelism in the FTL.
Garbage Collection vs. Internal Parallelism. Log-
structured file systems are considered to be flash friendly.
However, we observe that F2FS, a flash-optimized log-
structured file system, performs worse than Ext4 on
multi-channel flash devices when write traffic is heavy.
To further analyse this inefficiency, we implement a
page-level FTL and collect the GC statistics in the FTL,
including the number of erased flash blocks and the
percentage of invalid pages in each erased block (i.e.,
GC efficiency). Details of the page-level FTL and eval-
uation settings are introduced in Section 4. As shown in
Figure 2, for the random update workload in YCSB [14],
the number of flash blocks erased by F2FS increases and its
GC efficiency decreases as the number of channels
increases.
The reason lies in the uncoordinated garbage collections
in FS and FTL. When the log-structured file system
cleans a segment in the FS level, the invalid flash pages
are actually scattered over multiple flash parallel units.
The more channels the device has, the more diverse the
invalid pages are. This degrades the GC efficiency in the
FTL. At the same time, the FTL learns of page invalidations
only after receiving trim commands from the log-
structured file system, since there are no in-place updates.
As we observed in the experiments, a large number of
pages which are already invalidated in the file system are
migrated to clean flash blocks during the FTL’s erase
process. Therefore, garbage collections in FS and FTL
levels collide and damage performance.
[Figure 2: GC Statistics under Heavy Traffic. (a) # of Recycled Blocks; (b) GC Efficiency.]
Consistent Performance vs. Internal Parallelism.
Read or write requests may be blocked by erase opera-
tions in the FTL. Since erase latency is much higher than
read/write latency, this causes latency spikes, making I/O
latency unpredictable. This is known as inconsistent
performance, a serious issue for low-latency
storage devices. Parallel units offer an opportunity to
remove latency spikes by carefully scheduling I/O re-
quests. Once a file system controls read, write and erase
operations, it can dynamically schedule requests to make
performance more consistent.
To address the above-mentioned challenges, we pro-
pose a new architecture (as shown in Figure 3) to ex-
ploit the internal parallelism in the file system. In this
architecture, we use a simplified FTL to expose the
physical layout to the file system. Our designed ParaFS
maximizes the parallel performance in the file system
level while achieving physical hot/cold data grouping,
lower garbage collection overhead and more consistent
performance.
3 Design
ParaFS is designed to exploit the internal parallelism
without compromising other mechanisms. To achieve
this goal, ParaFS uses three key techniques:
• 2-Dimension Data Allocation to stripe data pages
over multiple flash channels at page granularity
while maintaining hot/cold data separation physi-
cally.
• Coordinated Garbage Collection to coordinate
garbage collections in FS and FTL levels and lower
the garbage collection overhead.
• Parallelism-Aware Scheduling to schedule the
read/write/erase requests to multiple flash channels
for more consistent system performance.
In this section, we introduce the ParaFS architecture first,
followed by the description of the above-mentioned three
techniques.
3.1 The ParaFS Architecture
ParaFS reorganizes the functionalities between the file
system and the FTL, as shown in Figure 3. In the ParaFS
architecture, ParaFS relies on a simplified FTL (anno-
tated as S-FTL). S-FTL is different from traditional FTLs
in three aspects. First, S-FTL uses a static block-level
mapping table. Second, it performs garbage collection
by simply erasing the victim flash blocks without moving
any pages, while the valid pages in the victim blocks
are migrated by ParaFS in the FS level. Third, S-FTL
exposes the physical layout to the file system using three
values, i.e., the number of flash channels, the flash page
size, and the flash block size. (Currently, ParaFS exploits
only channel-level parallelism, while finer-level parallelism,
e.g., die-level or plane-level, can be exploited in the FTL
by emulating large pages or blocks.) These three values are
passed to the ParaFS file system through the ioctl interface
when the device is formatted.
[Figure 3: The ParaFS Architecture. ParaFS (namespace, 2-D allocation over Regions 0 to N, coordinated GC threads 0 to N, and a parallelism-aware scheduler with per-channel request queues holding read/write/erase requests) issues READ/WRITE/ERASE commands to S-FTL (block mapping, WL, ECC), which manages the flash memory over Channels 0 to N.]
With the physical layout
information, ParaFS manages data allocation, garbage
collection, and read/write/erase scheduling in the file
system level. The ParaFS file system uses an I/O inter-
face that consists of read/write/erase commands, which
remains the same as the legacy block I/O interface. In
ParaFS, the erase command reuses the trim command in
the legacy protocol.
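To make the above concrete, the following user-space C sketch shows one way a formatting tool could query the three layout values from an S-FTL-style device through ioctl. The structure layout, ioctl command number, and device path are assumptions for illustration, not ParaFS's actual interface.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical geometry structure: the three values S-FTL exposes. */
struct sftl_geometry {
    uint32_t nr_channels;   /* number of flash channels  */
    uint32_t page_size;     /* flash page size in bytes  */
    uint32_t block_size;    /* flash block size in bytes */
};

/* Hypothetical ioctl command for querying the geometry. */
#define SFTL_IOC_GET_GEOMETRY _IOR('S', 0x01, struct sftl_geometry)

int main(void)
{
    struct sftl_geometry geo;
    int fd = open("/dev/sftl0", O_RDONLY);   /* hypothetical device node */

    if (fd < 0 || ioctl(fd, SFTL_IOC_GET_GEOMETRY, &geo) < 0) {
        perror("query S-FTL geometry");
        return 1;
    }
    /* mkfs would align segments to geo.block_size and create
     * one allocation region per channel (geo.nr_channels). */
    printf("channels=%u page=%uB block=%uB\n",
           geo.nr_channels, geo.page_size, geo.block_size);
    close(fd);
    return 0;
}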
In S-FTL, the block mapping is different from that
in legacy block-based FTLs. S-FTL uses static block-
level mapping. For normal writes, the ParaFS file system
updates data pages in a log-structured way. There is no
in-place update in the FS level. For these writes, S-
FTL doesn’t need to remap the block. S-FTL updates
its mapping entries only for wear leveling or bad block
remapping. In addition, ParaFS tracks the valid/invalid
statuses of pages in each segment, whose address is
aligned to the flash block in flash memory. The file sys-
tem migrates valid pages in the victim segments during
garbage collection, and the S-FTL only needs to erase
the corresponding flash blocks afterwards. As such, the
simplified block mapping, simplified garbage collection
and reduced functionalities make S-FTL a lightweight
implementation, which has nearly zero overhead.
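As a rough illustration of this static mapping (a sketch with assumed constants and field names, not the firmware implementation), address translation stays an identity mapping at block granularity unless a block has been remapped for wear leveling or bad-block management:

#include <stdint.h>

#define PAGES_PER_BLOCK 256u   /* assumed: 2 MB block / 8 KB page */

/* Logical block -> physical block. Initially block_map[i] == i;
 * entries change only for wear leveling or bad-block remapping. */
struct sftl {
    uint32_t *block_map;
};

/* Translate a logical page number to a physical page number. */
static uint64_t sftl_translate(const struct sftl *ftl, uint64_t lpn)
{
    uint64_t lbn    = lpn / PAGES_PER_BLOCK;   /* logical block number */
    uint64_t offset = lpn % PAGES_PER_BLOCK;   /* page offset in block */
    return (uint64_t)ftl->block_map[lbn] * PAGES_PER_BLOCK + offset;
}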
The ParaFS file system is a log-structured file system.
To avoid the “wandering tree” problem [38], file system
metadata is updated in place, by introducing a small
page-level FTL. Since we focus on the data management,
we omit discussions of this page-level FTL for the rest
of the paper. In contrast, file system data is updated
in a log-structured way and sent to the S-FTL. The
ParaFS file system is able to perform more effective
data management with the knowledge of the internal
physical layout of flash devices. Since the file system
is aware of the channel layout, it allocates space in
different channels to exploit the internal parallelism and
meanwhile keeps the hot/cold data separation physically
(details in Section 3.2). ParaFS aligns the segment of
log-structured file system to the flash block of the device.
Since it knows exactly the valid/invalid statuses of data
pages, it performs garbage collection in the file system
with the coordination of the FTL erase operations (details
in Section 3.3). With the channel information, ParaFS is
also able to optimize the scheduling on read/write/erase
requests in the file system and make system performance
more consistent (details in Section 3.4).
3.2 2-D Data Allocation
The first challenge as described in Section 2.2 is the con-
flict between data grouping in the file system and internal
parallelism in the flash device. Intuitively, the easiest
way for file system data groups to be aligned to flash
blocks, is using block granularity striping in the FTL.
The block striping maintains data grouping while paral-
lelizing data groups to different channels in block units.
However, the block-unit striping fails to fully exploit
the internal parallelism, since the data within one group
needs to be accessed sequentially in the same flash block.
We observe in our systems that block-unit striping has
significantly lower performance than page-unit striping
(i.e., striping in page granularity), especially in small
synchronous write situations, such as in mail servers. Some co-
designs employ a super block unit which contains flash
blocks from different flash channels [42, 12, 29]. The
writes can be striped in a page granularity within the
super block and different data groups are maintained in
different super blocks. However, the large recycle size of
the super block incurs higher overhead in the garbage
collection. Therefore, we propose 2-D allocation in
ParaFS that uses small allocation units to fully exploit
channel-level parallelism while keeping effective data
grouping. Table 1 summarizes the characteristics of these
data allocation schemes and compares them with the 2-D
allocation used in ParaFS.
Figure 4 shows the 2-D allocation in ParaFS. With the
device information from S-FTL, ParaFS is able to allo-
cate and recycle space that is aligned to the flash memory
below. The data space is divided into multiple regions. A
region is an abstraction of a flash channel in the device.
The number and size of the regions equal those of the
flash channels. Each region contains multiple segments,
and each segment contains multiple data pages. The
segment is the unit of allocation and garbage collection
in ParaFS.
Table 1: Data Allocation Schemes

Scheme         | Stripe Granularity | Parallelism Level | GC Granularity  | Grouping Effect | GC Overhead
Page Stripe    | Page               | High              | Block           | No              | High
Block Stripe   | Block              | Low               | Block           | Yes             | Low
Super Block    | Page               | High              | Multiple Blocks | Yes             | Medium
2-D Allocation | Page               | High              | Block           | Yes             | Low
[Figure 4: 2-D Allocation in ParaFS. Each region (Region 0 to Region N, one per flash channel) holds multiple allocation heads (Alloc. Head 0 to m) that append written data pages to log-structured segments in the allocated space, plus a free segment list in the free space.]
Its size and address are aligned to the physical
flash block. The data page in the segments represents
the I/O unit to the flash and is aligned to the flash page.
There are multiple allocator heads and a free segment
list in each region. The allocator heads point to data
groups with different hotness within the region. They
are assigned a number of free segments and they allocate
free pages to the written data with different hotness, in a
log-structured way. The free segments of the region are
maintained in the free segment list.
The 2-D allocation consists of channel-level dimen-
sion and hotness-level dimension. ParaFS delays space
allocation until data persistence. Updated data pages
are first buffered in memory. When a write request is
going to be dispatched to the device, ParaFS starts the
2-D scheme to allocate free space for the request.
First, in the channel-level dimension, ParaFS divides
the write request into pages, and then stripes these pages
over different regions. Since the regions match the flash
channels one-to-one, the data in the write request is sent
to different flash channels in a page-size granularity, so
as to exploit the channel-level parallelism in the FS level.
After dividing and striping the data of the write request
over regions, the allocation process goes to the second
dimension.
Second, in the hotness-level dimension, ParaFS
groups data pages into groups with different hotness in
a region, and sends the divided write requests to their
proper groups. There are multiple allocator heads with
different hotness in the region. The allocator head,
which has similar hotness to the written data, is selected.
Finally, the selected allocator head allocates a free page
to the written data. The data is sent to the device with
the address of the allocated page. Since the segments
of allocator heads are aligned to the flash blocks, the
data pages with different hotness are assigned to different
allocator heads, thereby placed in separated flash blocks.
The hotness-level dimension ensures effective data
grouping physically.
In the implementation, ParaFS uses six allocator heads in
each region. The data is divided into six hotness classes,
i.e., hot, warm, and cold, respectively for metadata and
regular data. The hotness classification uses the same
hints as in F2FS [28]. Each allocator is assigned ten
segments each time. The address of the segments and
the offset of the allocator heads are recorded in the FS
checkpoint in case of system crash.
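A minimal sketch of the two allocation dimensions is given below. The structures and function names are invented for illustration; channel selection is shown as simple round-robin here, whereas Section 3.4 describes the least-busy-channel choice actually used for dispatching.

#include <stdint.h>

enum hotness { META_HOT, META_WARM, META_COLD,
               DATA_HOT, DATA_WARM, DATA_COLD, NR_HOTNESS };

struct alloc_head { uint64_t next_page; };   /* current log head of a segment */

struct region {                              /* one region per flash channel  */
    struct alloc_head heads[NR_HOTNESS];
};

struct parafs {
    struct region *regions;
    uint32_t nr_regions;                     /* equals the number of channels   */
    uint32_t next_region;                    /* round-robin cursor (simplified) */
};

/* Dimension 1 (channel level): pick a region for the next data page. */
static uint32_t pick_region(struct parafs *fs)
{
    uint32_t r = fs->next_region;
    fs->next_region = (r + 1) % fs->nr_regions;
    return r;
}

/* Dimension 2 (hotness level): within the region, use the allocator head
 * whose hotness matches the data, and append one page in log order. */
static uint64_t alloc_page(struct parafs *fs, enum hotness h)
{
    struct region *rg = &fs->regions[pick_region(fs)];
    return rg->heads[h].next_page++;
}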
Crash Recovery. After a crash, legacy log-
structured file systems recover themselves to the last
checkpoint, which maintains the latest consistent state
of the file system. Some file systems [33, 28] also do
roll-forward recovery to restore changes made after the last
checkpoint. ParaFS differs slightly during recovery.
Since ParaFS directly accesses and erases flash memory
through the statically mapped S-FTL, it has to detect the
pages written after the last checkpoint, so that the file
system does not overwrite them, which would otherwise
cause overwrite errors in flash memory.
To solve this problem, ParaFS employs an update
window similar to that in OFSS [33]. An update window
consists of segments from each region that are assigned
to allocator heads after last checkpoint. Before a set of
segments are assigned to allocator heads, their addresses
are recorded first. During recovery, ParaFS first recovers
itself to the last checkpoint. Then, ParaFS does the
roll-forward recovery with the segments in the update
window in each region, since all written pages after last
checkpoint fall in this window. After recovering the valid
pages in the window, other pages will be considered as
invalid, and they will be erased by the GC threads in the
future.
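The recovery flow can be sketched in a toy form as follows; the update-window record and page-validity bitmap are simplified stand-ins for the real on-flash structures.

#include <stdint.h>
#include <stdio.h>

#define PAGES_PER_SEG 256          /* assumed: segment == flash block */

/* One update-window segment: its address and a validity bitmap for
 * pages written after the last checkpoint. */
struct win_seg {
    uint64_t seg_addr;
    uint8_t  valid[PAGES_PER_SEG]; /* 1 = valid, roll forward */
};

/* Roll-forward step of recovery. The caller first restores the last
 * checkpoint; only segments recorded in the update window are scanned,
 * and pages not rolled forward stay invalid for later GC erases. */
static void parafs_roll_forward(const struct win_seg *win, int n)
{
    for (int i = 0; i < n; i++)
        for (int p = 0; p < PAGES_PER_SEG; p++)
            if (win[i].valid[p])
                printf("roll forward: segment %llu, page %d\n",
                       (unsigned long long)win[i].seg_addr, p);
            /* else: left invalid; erased by the GC threads later */
}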
3.3 Coordinated Garbage Collection
The second challenge as described in Section 2.2 is
the mismatch of the garbage collection processes in FS
and FTL levels. Efforts in either side are ineffective
in reclaiming free space. ParaFS coordinates garbage
collection in the two levels and employs multiple GC
threads to leverage the internal parallelism. This tech-
nique improves both the GC efficiency (i.e., the percent
of invalid pages in victim blocks) and the space reclaim
efficiency (i.e., time used for GC).
The coordinated garbage collection involves GC
processes at both the FS level and the FTL level. We use the
terms “paraGC” and “flashGC” to distinguish them. In
the FS level, the paraGC is triggered when free space
drops below a threshold (i.e., foreground paraGC) or the file
system is idle (i.e., background paraGC). The unit of
garbage collection is the segment, as described in
Section 3.2. The foreground paraGC employs a greedy
algorithm to recycle segments quickly and minimize
the latency of I/O threads. The background paraGC uses
the cost-benefit algorithm [38], which selects victim segments
not only based on the number of invalid data pages, but
also their “age”. When the paraGC thread is triggered, it
first selects victim segments using the algorithms above.
Then, the paraGC thread migrates the valid pages in
the victim segments to the free space. If the migration
is conducted by foreground paraGC, these valid pages
are striped to the allocator heads with similar hotness
in different regions. If the migration is conducted by
background paraGC, these valid pages are considered
to be cold, and striped to the allocator heads with low
hotness. After the valid pages are written to the device,
the victim segments are marked erasable and they will be
erased after checkpoint. The background paraGC exits
after the migration while the foreground paraGC does
the checkpoint and sends the erase requests to S-FTL by
trim.
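For reference, a compact sketch of the two victim-selection scores follows. Field names are illustrative; the cost-benefit expression is the classic log-structured form cited as [38], not necessarily ParaFS's exact tuning.

#include <stdint.h>

struct seg_info {
    uint32_t invalid_pages;    /* pages already invalidated          */
    uint32_t total_pages;      /* pages per segment (= flash block)  */
    uint64_t age;              /* time since the last modification   */
};

/* Foreground paraGC: greedy, prefer the segment with the most invalid
 * pages, i.e., the least migration work before it can be erased. */
static uint64_t greedy_score(const struct seg_info *s)
{
    return s->invalid_pages;
}

/* Background paraGC: cost-benefit, weigh reclaimed space and age
 * against the cost of migrating the still-valid pages. */
static double cost_benefit_score(const struct seg_info *s)
{
    double u = 1.0 - (double)s->invalid_pages / s->total_pages; /* utilization */
    return (1.0 - u) * (double)s->age / (1.0 + u);
}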
In the FTL level, the block recycling is simplified,
since the space management and garbage collection are
moved from FTL level to FS level. The segments in
the FS level match the flash blocks one to one. After
paraGC migrates valid pages in the victim segments,
the corresponding flash blocks can be erased without
additional copies. When S-FTL receives trim commands
from ParaFS, flashGC locates the flash blocks that the
victim segments map to, and directly erases them.
After the flash blocks are erased, S-FTL informs
ParaFS through the callback functions of the requests. The
coordinated GC migrates the valid pages in FS level,
and invalidates the whole flash block to the S-FTL. No
migration overhead is involved during the erase process
in the S-FTL.
Coordinated GC also reduces the over-provisioning
space of the device and brings more user-available capac-
ity. Traditional FTLs need to move the valid pages from
the victim blocks before erasing. They keep large over-
provisioning space to reduce the overhead of garbage
collection. This spared space in the FTL decreases the
user-visible capacity. Since the valid pages are already
moved by ParaFS, the flashGC in S-FTL can directly
erase the victim blocks without any migration. S-FTL
need not spare space for garbage collection, so its over-
provisioning space is much smaller than that of
traditional FTLs. The spared blocks are only used for bad
block remapping and block mapping table storage. S-
FTL also need not track flash page statuses
and flash block utilization, which further reduces the size of
the spared space.
ParaFS optimizes the foreground GC by employing
multiple GC threads. Legacy log-structured file systems
perform foreground GC in one thread [28], or none [27],
because all operations are blocked during checkpointing,
and multiple threads would cause frequent checkpoints and
decrease performance severely. However, a single fore-
ground GC thread (or none) is not enough under write-intensive
workloads, which consume the free segments quickly.
ParaFS assigns one GC thread to each region and em-
ploys an additional manager thread. The manager checks
the utilization of each region, and wakes up the GC
thread of the region when necessary. After the GC
threads migrate the valid data pages, the manager does
the checkpoint, and sends the erase requests asyn-
chronously. This optimization avoids multiple check-
points, and accelerates segment recycling under
heavy writes.
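A condensed pthread-based sketch of this manager/worker split is shown below; the helper functions and the 10% threshold are assumptions for illustration and do not reflect the in-kernel implementation.

#include <pthread.h>
#include <stdbool.h>

#define NR_REGIONS    32
#define GC_THRESHOLD  0.10          /* assumed: wake a region's GC below 10% free */

struct region_gc {
    pthread_t       tid;
    pthread_mutex_t lock;
    pthread_cond_t  wake;
    bool            need_gc;
    int             region_id;
};

static struct region_gc gcs[NR_REGIONS];

/* Assumed helpers standing in for ParaFS internals. */
extern double region_free_ratio(int region_id);
extern void   migrate_valid_pages(int region_id);  /* FS-level page migration    */
extern void   wait_for_migrations(int nr_woken);   /* block until workers finish */
extern void   checkpoint_and_send_erases(void);    /* one checkpoint, then trims */

static void *gc_worker(void *arg)                   /* one thread per region */
{
    struct region_gc *g = arg;
    for (;;) {
        pthread_mutex_lock(&g->lock);
        while (!g->need_gc)
            pthread_cond_wait(&g->wake, &g->lock);
        g->need_gc = false;
        pthread_mutex_unlock(&g->lock);
        migrate_valid_pages(g->region_id);          /* paraGC migration */
    }
    return NULL;
}

static void manager_round(void)                     /* body of the manager thread */
{
    int woken = 0;
    for (int i = 0; i < NR_REGIONS; i++)
        if (region_free_ratio(i) < GC_THRESHOLD) {
            pthread_mutex_lock(&gcs[i].lock);
            gcs[i].need_gc = true;
            pthread_cond_signal(&gcs[i].wake);
            pthread_mutex_unlock(&gcs[i].lock);
            woken++;
        }
    if (woken) {
        wait_for_migrations(woken);                 /* migrations complete first */
        checkpoint_and_send_erases();               /* then one checkpoint and
                                                       asynchronous erase requests */
    }
}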
3.4 Parallelism-Aware Scheduling
The third challenge as described in Section 2.2 is the
performance spikes. To address this challenge, ParaFS
proposes to schedule the read/write/erase requests in the
file system level while exploiting the internal parallelism.
In ParaFS, the 2-D allocation selects the target flash
channels for the write requests, and the coordinated GC
manages garbage collection in the FS level. These two
techniques offer an opportunity for ParaFS to optimize
the scheduling on read, write and erase requests. ParaFS
employs parallelism-aware scheduling to provide more
consistent performance under heavy write traffic.
Parallelism-aware scheduling consists of a request dis-
patching phase and a request scheduling phase. In the
dispatching phase, it optimizes the write requests. The
scheduler maintains a request queue for each flash chan-
nel of the device, as shown in Figure 3. The target flash
channels of read and erase requests are fixed, while
the target channel of a write can be adjusted due to the
late allocation. In the channel-level dimension of 2-
D allocation, ParaFS splits the write requests into data
pages, and selects a target region for each page. The
regions of ParaFS correspond one-to-one to the flash
channels below. Instead of a round-robin se-
lection, ParaFS selects the region to write whose corre-
sponding flash channel is the least busy in the scheduler.
Due to the asymmetric read and write performance of
the flash drive, the scheduler assigns different weights
(W_read, W_write) to read and write requests in the request
queues. The weights are obtained by measuring the
corresponding latencies of the read and write requests.
Since the write latency is 8× the read latency on
our device, W_read and W_write are set to 1 and 8 in
the evaluation. The channel with the smallest weight
calculated by Formula 1 is selected. Size_read
and Size_write represent the total sizes of the read and write
requests in the request queue. The erase requests in the request
queue are not considered in the formula, because the
time when they are sent to the device is uncertain.

W_channel = W_read × Size_read + W_write × Size_write    (1)
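A small sketch of this weighted channel selection follows; the data-structure names are illustrative, and the weights 1 and 8 mirror the measured latency ratio mentioned above.

#include <stdint.h>

#define W_READ  1
#define W_WRITE 8            /* write latency is about 8x the read latency here */

struct chan_queue {
    uint64_t read_bytes;     /* total size of queued read requests  */
    uint64_t write_bytes;    /* total size of queued write requests */
};

/* Formula (1): pick the channel with the smallest weighted queue load.
 * Queued erase requests are ignored, since their issue time is uncertain. */
static int pick_channel(const struct chan_queue *q, int nr_channels)
{
    int best = 0;
    uint64_t best_w = UINT64_MAX;

    for (int c = 0; c < nr_channels; c++) {
        uint64_t w = W_READ * q[c].read_bytes + W_WRITE * q[c].write_bytes;
        if (w < best_w) {
            best_w = w;
            best = c;
        }
    }
    return best;
}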
In the scheduling phase, the scheduler optimizes the
scheduling of erase requests. For fairness, the
parallelism-aware scheduler assigns a time slice for read
requests and a same-sized slice for write/erase requests.
The scheduler works on each request queue individually.
In the read slice, the scheduler schedules read requests
to the channel. When the slice ends or no read request
is left in the queue, the scheduler determines whether to schedule
a write or an erase request in the next slice using Formula 2,
where f is the percentage of free blocks in the flash channel,
N_e is the percentage of flash channels that are processing
erase requests at this moment, and a and b are the
weights of these two parameters.

e = a × f + b × N_e    (2)
If e is higher than 1, the scheduler sends write requests
in the time slice. Otherwise, the scheduler sends erase
requests in that time slice. After the write/erase slice,
the scheduler turns to the read slice again. In the current im-
plementation, a and b are respectively set to 2 and 1,
which implies that erase requests are scheduled only
after the free space drops below 50%. This scheme
gives higher priority to erase requests when the free
space is not enough. Meanwhile, it prevents too many
channels from erasing at the same time, which helps to smooth
the performance fluctuations under heavy write traffic.
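The slice decision itself reduces to a few lines; the sketch below uses a = 2 and b = 1 as in the current implementation, with names chosen for illustration.

#define A_WEIGHT 2.0    /* weight of the free-block ratio f           */
#define B_WEIGHT 1.0    /* weight of the fraction of channels erasing */

/* Formula (2): decide what the next write/erase slice does.
 *   f:   fraction of free blocks in this flash channel (0..1)
 *   n_e: fraction of channels currently processing erases (0..1)
 * Returns 1 to schedule writes in the slice, 0 to schedule erases. */
static int next_slice_is_write(double f, double n_e)
{
    double e = A_WEIGHT * f + B_WEIGHT * n_e;
    return e > 1.0;     /* e.g., f > 0.5 always favors writes */
}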
With the employment of these two optimizations, the
parallelism-aware scheduler helps to provide both high
and consistent performance.
4 Evaluation
In this section, we evaluate ParaFS to answer the follow-
ing three questions:
1. How does ParaFS perform compared to other file
systems under light write traffic?
2. How does ParaFS perform under heavy write traf-
fic? And, what are the causes behind?
3. What are the benefits respectively from the pro-
posed optimizations in ParaFS?
4.1 Experimental Setup
In the evaluation, we compare ParaFS with Ext4 [2],
BtrFS [1] and F2FS [28], which respectively represent
the in-place-update, copy-on-write, and flash-optimized
log-structured file systems. ParaFS is also compared to
a revised F2FS (annotated as F2FS_SB) in addition to
conventional F2FS (annotated as F2FS). F2FS organizes
data into segments; a segment is the GC unit in the FS and is the
same size as a flash block. F2FS_SB organizes data into
super blocks. A super block consists of multiple adjacent
segments, and the number of these segments equals the
number of channels in the flash device. In F2FS_SB, a
super block is the unit of allocation and garbage collec-
tion, and can be accessed in parallel.
We customize a raw flash device to support pro-
grammable FTLs. Parameters of the flash device are
listed in Table 2. To support the file systems above,
we implement a page-level FTL, named PFTL, based on
DFTL [17] with lazy indexing technique [33]. PFTL
stripes updates over different channels in a page size
unit, to fully exploit the internal parallelism. S-FTL is
implemented based on PFTL. It removes the allocation
function, replaces the page-level mapping with static
block-level mapping, and simplifies the GC process, as
described in Section 3.1. In both FTLs, the number
of flash channels and the capacity of the device are
configurable. With the help of these FTLs, we collect
the information about flash memory operations, like the
number of erase operations and the number of pages
migrated during garbage collection. The low-level in-
formation is helpful for comprehensive analysis of file
system implications.
Table 2: Parameters of the Customized Flash Device

Host Interface:               PCIe 2.0 x8
Number of Flash Channels:     34
Capacity per Channel:         32 GB
NAND Type:                    25nm MLC
Page Size:                    8 KB
Block Size:                   2 MB
Read Bandwidth per Channel:   49.84 MB/s
Write Bandwidth per Channel:  6.55 MB/s
The experiments are conducted on an X86 server
with Intel Xeon E5-2620 processor, clocked at 2.10GHz,
and 8G memory of 1333MHz. The server runs with
Linux kernel 2.6.32, which is required by the customized
flash device. For the target system, we back-port F2FS
from the 3.15-rc1 main-line kernel to the 2.6.32 kernel.
ParaFS is implemented as a kernel module based on
F2FS, but differs in data allocation, garbage collection,
and I/O scheduling.
Workloads. Table 3 summarizes the four workloads
used in the experiments. Two of them run directly on file
Table 3: Workload Characteristics

Name       | Description                                                  | # of Files | I/O size | Threads | R/W   | fsync
Fileserver | File server workload: random read and write files           | 60,000     | 1 MB     | 50      | 33/66 | N
Postmark   | Mail server workload: create, delete, read and append files | 10,000     | 512 B    | 1       | 20/80 | Y
MobiBench  | SQLite workload: random update database records             | N/A        | 4 KB     | 10      | 1/99  | Y
YCSB       | MySQL workload: read and update database records            | N/A        | 1 KB     | 50      | 50/50 | Y
[Figure 5: Performance Evaluation (Light Write Traffic). (a) channel = 8; (b) channel = 32.]
systems and the other two run on databases. Fileserver
is a typical pre-defined workload in Filebench [3] to
emulate the I/O behaviors in file servers. It creates,
deletes and randomly accesses files in multiple threads.
Postmark [26] emulates the behavior of mail servers.
Transactions, including create, delete, read and append
operations, are performed to the file systems. Mo-
bibench [19] is a benchmark tool for measuring the per-
formance of file IO and DB operations. In the evaluation,
it issues random update operations to SQLite [9],
which runs with FULL synchronous and WAL journal
mode. YCSB [14] is a framework to stress many popular
data serving systems. We use workload A of YCSB
on MySQL [7], which consists of 50% random reads and
50% random updates.
4.2 Evaluation with Light Write Traffic
In this section, we compare ParaFS with other file sys-
tems under light write traffic. We choose 8 and 32
channels for this evaluation, which respectively represent
SATA SSDs [6, 4, 5] and PCIe SSDs [5, 41]. The device
capacity is set to 128GB.
Figure 5 shows the throughput of evaluated file sys-
tems, and results are normalized against F2FS’s perfor-
mance in 8-channel case. From the figure, we have two
observations.
(1) ParaFS outperforms other file systems in all cases,
and achieves 13% higher throughput than F2FS for the postmark work-
load in the 32-channel case. Among the evaluated file systems,
BtrFS performs poorly, especially for database bench-
marks that involve frequent syncs. The update prop-
agation (i.e., “wandering tree” problem [38]) of copy-
on-write brings intensive data writes in BtrFS during
the sync calls (7.2× for mobibench, 3.0× for YCSB).
F2FS mitigates this problem by updating the metadata
in place [28]. Except for BtrFS, the other file systems perform
roughly similarly under light write traffic. F2FS only
outperforms Ext4 by 3.5% in fileserver with 32 channels,
which is consistent with the F2FS evaluations on PCIe
SSDs [28]. The performance bottleneck appears to have
moved, due to the fast command processing and high
random access ability of the PCIe drive. F2FS_SB
shows nearly the same performance as F2FS. Since PFTL
stripes the requests over all flash channels in a page-
size unit, the larger allocation unit in F2FS_SB doesn’t gain
more benefit in parallelism. The GC impact of larger
block recycling is also minimized due to the light write
pressure. The impact will be seen under heavy write traf-
fic evaluations in Section 4.3. ParaFS uses cooperative
designs in both FS and FTL levels, eliminates the dupli-
cate functions, and achieves the highest performance.
(2) Performance gains in ParaFS grow when the num-
ber of channels is increased. Comparing the two figures
in Figure 5, all file systems have their performance
improved when the number of channels is increased.
Among them, ParaFS achieves more. It outperforms
F2FS by 1.53% on average in the 8-channel cases and by 6.29%
in the 32-channel cases. This also shows that ParaFS
spends more effort in exploiting the internal parallelism.
In all, ParaFS has comparable or better performance
than the other evaluated file systems when the write
traffic is light.
4.3 Evaluation with Heavy Write Traffic
Since ParaFS is designed to address the problems of
data grouping and garbage collection while exploiting
[Figure 6: Performance Evaluation (Heavy Write Traffic). (a) Fileserver; (b) Postmark; (c) MobiBench; (d) YCSB.]
the internal parallelism, evaluations using heavy write
traffic are much more important. In this evaluation, we
limit the capacity of flash device to 16GB and increase
the benchmark sizes. The write traffic sizes in the four
evaluated benchmarks are set to 2×∼3× of the flash
device capacity, to trigger the garbage collection actions.
4.3.1 Performance
Figure 6 shows the throughput of evaluated file systems
for the heavy write traffic cases, with the number of
channels varied from 1 to 32. Results are normalized
against F2FS’s throughput in 1-channel case. From the
figure, we have three observations.
(1) Ext4 outperforms F2FS for three out of the four
evaluated workloads, which is different from that in
the light write traffic evaluation. The performance gap
between Ext4 and F2FS tends to be wider with more
flash channels. The reason why the flash-optimized F2FS
file system has worse performance than Ext4 is the side
effects of internal parallelism. In F2FS, the hot/cold data
grouping and the aligned segments are broken when data
pages are distributed to multiple channels in the FTL.
Also, the invalidation of a page is known to the FTL
only after the page is recycled in F2FS, because there are no in-
place updates. Unfortunately, a lot of invalid pages have
been migrated during garbage collection before their
statuses are passed to the FTL. Both reasons lead to
high GC overhead in FTL, and the problem gets more
serious with increased parallelism. In Ext4, the in-
place update pattern informs the FTL of page invalidations
more accurately than F2FS does. The exceptional case is the
postmark workload, which contains a lot of create and
delete operations. Ext4 spreads the inode allocation in
all of the block groups for load balance. This causes the
invalid pages distributed evenly in the flash blocks and
results in higher garbage collection overhead than F2FS
(18.5% higher on average). In general, the observation
that a flash-optimized file system is not good at exploiting
internal parallelism under heavy write traffic motivates
our ParaFS design.
(2) F2FS_SB shows better performance than F2FS
for the four evaluated workloads, and the improvement
grows with more channels. This is also different from
the results in the light write traffic evaluation. The perfor-
mance of F2FS improves quickly when the number of
channels is increased from 1 to 8, but the improvement
slows down afterward. For the fileserver, postmark and
YCSB workloads, F2FS gains little improvement in the
32-channel case over the 16-channel case. The main
cause is the increasing GC overhead, which will be seen
in the next section. In contrast, allocation and recycling
in super block units in F2FS_SB ease the GC overhead
caused by unaligned segments. The trim commands sent
by F2FS_SB cover a larger address space, and are more
effective in telling the FTL which pages are invalid. However,
selecting victims in larger units also decreases the GC
efficiency. As such, exploiting the internal parallelism using super
block methods [12, 42] is still not effective.
(3) ParaFS outperforms other file systems in all cases.
[Figure 7: Garbage Collection and Write Traffic Evaluation (Fileserver). (a) Recycled Block Count; (b) GC Efficiency; (c) Write Traffic from FS; (d) Write Traffic from FTL.]
ParaFS outperforms Ext4 from 1.0× to 1.7× in the 8-
channel case, and to 2.5× in the 32-channel case. ParaFS
outperforms F2FS from 1.6× to 1.9× in the 8-channel
case, and from 1.7× to 3.1× in the 32-channel case.
ParaFS outperforms F2FS_SB from 1.2× to 1.5× in
both cases. ParaFS keeps the aligned flash block erase
units while using page-unit striping in 2-D allocation.
The coordinated multi-threaded GC is also helpful in
reducing the GC overhead. Thus, ParaFS is effective
in exploiting the internal parallelism under heavy write
traffic.
4.3.2 Write Traffic and Garbage Collection
To further understand the performance gains in ParaFS,
we collect the statistics of garbage collection and write
traffic from both FS and FTL levels. We select the
fileserver workload as an example due to space limita-
tion. The other workloads have similar patterns and are
omitted. For fairness, we revise the fileserver benchmark
to write fixed-size data. In the evaluation, the write traffic
size in fileserver is set to 48GB, and the device capacity
is 16GB.
Figure 7(a) and Figure 7(b) respectively give the re-
cycled block count and the garbage collection efficiency
of evaluated file systems with varied number of flash
channels. The recycled block count is the number of
flash blocks that are erased in flash memory. The GC
efficiency is measured using the average percentage of
invalid pages in a victim flash block.
ParaFS has the lowest garbage collection overhead and
highest GC efficiency among all evaluated file systems
under all four benchmarks. For the fileserver evaluation, it
achieves the lowest recycled block count (62.5% of F2FS
on average) and the highest GC efficiency (91.3% on av-
erage). As the number of channels increases, the number
of recycled blocks in F2FS increases quickly. This is
due to the unaligned segments and uncoordinated GC
processes of both sides (as analyzed in Section 4.3.1).
This is also reflected in the GC efficiency degradation
in Figure 7(b). The GC efficiency of Ext4, BtrFS, and
ParaFS tends to drop a little with more flash channels,
because adjacent pages, which tend to be invalidated together,
are more scattered when the device’s internal parallelism
increases. F2FS_SB acts differently from the other
file systems. In F2FS_SB, when the number of channels
is increased from 8 to 32, the number of recycled blocks
decreases and the GC efficiency increases. The reason
is that the super block has better data grouping and
alignment, and this advantage becomes increasingly
evident with higher degrees of parallelism. F2FS_SB also
triggers FS-level GC threads more frequently with its larger
allocation and GC unit. More trim commands covering a larger
address space help to decrease the number of invalid
pages migrated by the GC process at the FTL level. ParaFS
further utilizes fine-grained data grouping and GC unit,
and has the lowest garbage collection overhead.
Figure 7(c) shows the write traffic that file systems
write to FTLs. Figure 7(d) shows the write traffic
[Figure 8: Performance Consistency Evaluation]
that FTLs write to the flash memory. Write traffic in
either level comes not only from the file system data
and metadata but also from pages migrated during garbage
collection at the FS or FTL level. The write traffic from
the FS in Ext4, BtrFS and F2FS is stable across different
parallelism levels, because the increase in flash chan-
nels in the device is transparent to them. ParaFS writes
more in the file system than Ext4, F2FS and F2FS_SB,
but less than them in the FTL, because all the page
migration during garbage collection is done at the
FS level. Similarly, F2FS_SB has higher write traffic
from the FS and lower from the FTL than F2FS, as the number
of channels is increased from 8 to 32. This results
from the improved GC efficiency mentioned above.
The FTL write traffic in F2FS is higher than in F2FS_SB,
which explains why F2FS_SB has better performance
than F2FS in Figure 6. ParaFS coordinates garbage
collection in the two levels, and is more effective in
space recycling. Compared to F2FS, ParaFS decreases
the write traffic to the flash memory by 31.7%∼54.7%
in the 8-channel case, and by 37.1%∼58.1% in the 32-channel
case. Compared to F2FS_SB, ParaFS decreases it by
14.9%∼48.4% in the 8-channel case, and by 15.7%∼32.5%
in the 32-channel case.
4.4 Performance Consistency Evaluation
To evaluate the performance consistency in ParaFS, we
monitor the throughput wave during each run of exper-
iments. ParaFS aims at more consistent performance
using multi-threaded GC and parallelism-aware schedul-
ing. In this evaluation, we use four versions of ParaFS.
The baseline (annotated as ParaFS_Base) is the ParaFS
version without the above-mentioned two optimizations.
ParaFS_PS and ParaFS_MGC respectively stand for the
versions with parallelism-aware scheduling and multi-
threaded GC. ParaFS is the fully-functional version.
The fileserver and postmark have different phases
in the evaluation, which also cause fluctuation in the
aggregate throughput (in terms of IOPS). We choose
mobibench as the candidate for performance consistency
evaluation. Mobibench performs random asynchronous
reads and writes to pre-allocated files. In this evaluation,
the write traffic of mobibench is set to be 2× of the
device capacity.
Figure 8 shows the process of each run using four
versions of ParaFS. We compare ParaFS_MGC and
ParaFS_Base to analyse the impact of multi-threaded
GC. ParaFS_MGC and ParaFS_Base have similar perfor-
mance in the first 1000 seconds, during which no garbage
collection is involved. After that, the performance of
ParaFS_MGC fluctuates. The performance peaks appear
after the GC starts in multiple threads. ParaFS_MGC
finishes the experiment earlier than ParaFS_Base by
18.5%.
The parallelism-aware scheduling contains two major
methods, the write request dispatching and the erase
request scheduling. The effectiveness of write request
dispatching can be seen by comparing ParaFS_PS and
ParaFS_Base. For the first 1000 seconds, when there is no
garbage collection, ParaFS_PS outperforms ParaFS_Base
by nearly 20%. This benefit comes from the write
dispatching in the parallelism-aware scheduling tech-
nique, which allocates pages and sends requests to
the least busy channels. The effectiveness of erase
request scheduling can be observed by comparing ParaFS
and ParaFS_MGC. In the latter part of each run, when
the GC processes are frequently triggered, ParaFS, us-
ing parallelism-aware scheduling, performs more consis-
tently than ParaFS_MGC.
In conclusion, the FS-level multi-threaded garbage
collection as implemented in ParaFS is more effective
in reclaiming free space, and the FS-level parallelism-
aware scheduling makes performance more consistent.
5 Related Work
Data Allocation. In the FTL or flash controller, in-
ternal parallelism has been extensively studied. Recent
studies [13, 18, 21] have conducted extensive ex-
periments on page allocation with different levels (i.e.,
channel-level, chip-level, die-level and plane-level) of in-
ternal parallelism. Gordon [12] introduces a 2-D striping
to leverage both channel-level and die-level parallelism.
Note that 2-D striping in Gordon is different from
2-D data allocation in ParaFS. 2-D striping is designed
in the flash controller, which places data to leverage
multiple levels of parallelism. 2-D data allocation is
designed in file system, which organizes data into dif-
ferent groups using metrics of parallelism and hotness.
In addition, aggressive parallelism in the device level
scatters data addresses, breaking up the data organization
in the system level. P-OFTL [42] has pointed out this
problem and found that increased parallelism leads to
higher garbage collection overhead, which in turn can decrease the
overall performance.
In the file system level, DirectFS [20] and Nameless
Write [45] propose to remove data allocation functions
from file systems and leverage the data allocations in
the FTL or storage device, which can make better de-
cisions with detailed knowledge of hardware internals.
OFSS [33] proposes an object-based FTL, which enables
hardware/software codesign with both knowledge of file
semantics and hardware internals. However, these file
systems pay little attention to internal parallelism, which
is the focus of ParaFS in this paper.
Garbage Collection. Garbage collection has a strong
impact on system performance for log-structured de-
signs. Researchers are trying to pass more file semantics
to FTLs to improve GC efficiency. For instance, trim is a
useful interface to inform FTLs the data invalidation, in
order to reduce GC overhead in migrating invalid pages.
Also, Kang et al. [25] found that FTLs can have more
efficient hot/cold data grouping, which further reduces
GC overhead, if the expected lifetime of written data is
passed from file systems to FTLs. In addition, Yang et
al. [44] found log-structured designs in both levels of
system software and FTLs have semantic gaps, which
make garbage collection in both levels inefficient. In
contrast, ParaFS proposes to bridge the semantic gap and
coordinate garbage collection in the two levels.
In the file system level, SFS [34] uses a sophisticated
hot/cold data grouping algorithm using both access count
and age of the block. F2FS [28] uses a static data group-
ing method according to the file and data types. How-
ever, these grouping algorithms suffer when grouped
data are spread out in the FTL. Our proposed ParaFS
aims to solve this problem and keep physical hot/cold
grouping while exploiting the internal parallelism.
A series of research works use large write units to
align with flash blocks and decrease the GC overhead [40,
31, 30, 35]. RIPQ [40] and Pannier [31] aggregate small
random writes in memory, divide them into groups ac-
cording to hotness, and evict the groups in flash block
size. Nitro [30] deduplicates and compresses the writes in
RAM and evicts them in flash block size. Nitro proposes
to modify the FTL to support block-unit striping, which
ensures the effectiveness of the block-size write optimization.
SDF [35] employs block-unit striping, which is tightly
coupled with key-value workloads. ParaFS uses a page-
size I/O unit and aims at file system workloads.
I/O Scheduling. In flash storage, new I/O scheduling
policies have been proposed to improve utilization of
internal parallelism [24, 23, 16] or fairness [37, 39].
These scheduling policies are designed in the controller,
the FTL or the block layer. In these levels, addresses
of requests are already determined. In comparison, system-
level scheduling can schedule write requests before data
allocation, which is more flexible.
LOCS [41] is a key-value store that schedules I/O
requests in the system level upon open-channel SSDs.
With the use of log-structured merge tree (LSM-tree),
data is organized into data blocks aligned to flash blocks.
LOCS schedules the read, write and erase operations to
minimize the response time.
Our proposed ParaFS is a file system that schedules
I/O requests in the system level. Different from key-
value stores, file systems have irregular reads and writes.
ParaFS exploits the channel-level parallelism with page-
unit striping. Moreover, its goal is in making perfor-
mance more consistent.
6 Conclusion
ParaFS is effective in exploiting the internal parallelism
of flash storage, while keeping physical hot/cold data
grouping and low garbage collection overhead. It also
takes the parallelism opportunity to schedule read, write
and erase requests to make system performance more
consistent. ParaFS’s design relies on flash devices with a
customized FTL that exposes the physical layout, which can
be represented by three values. The proposed design
bridges the semantic gap between file systems and FTLs,
by simplifying FTL and coordinating functions of the
two levels. We implement ParaFS on a customized
flash device. Evaluations show that ParaFS outperforms
the flash-optimized F2FS by up to 3.1×, and has more
consistent performance.
Acknowledgments
We thank our shepherd Haryadi Gunawi and anonymous
reviewers for their feedback and suggestions. This work
is supported by the National Natural Science Foundation
of China (Grant No. 61232003, 61433008, 61502266),
the Beijing Municipal Science and Technology Com-
mission of China (Grant No. D151100000815003), the
National High Technology Research and Development
Program of China (Grant No. 2013AA013201), and
the China Postdoctoral Science Foundation (Grant No.
2015M580098).
References
[1] Btrfs. http://btrfs.wiki.kernel.org.
[2] Ext4. https://ext4.wiki.kernel.org/.
[3] Filebench benchmark. http://sourceforge.net/apps/mediawiki/filebench.
[4] Intel dc s3500 480gb enterprise ssd review.
http://www.tweaktown.com/reviews/5534/intel-dc-s3500-
480gb-enterprise-ssd-review/index.html.
[5] Intel ssd 750 pcie ssd review.
http://www.anandtech.com/show/9090/intel-ssd-750-pcie-
ssd-review-nvme-for-the-client.
[6] Intel x25-m and x18-m mainstream sata solid-state drives.
ftp://download.intel.com/newsroom/kits/ssd/pdfs/X25-M_34nm_ProductBrief.pdf.
[7] Mysql. https://www.mysql.com/.
[8] FusionIO Virtual Storage Layer.
http://www.fusionio.com/products/vsl, 2013.
[9] SQLite. http://www.sqlite.org/, 2014.
[10] N. Agrawal, V. Prabhakaran, T. Wobber, J. D. Davis, M. Man-
asse, and R. Panigrahy. Design tradeoffs for SSD performance.
In Proceedings of 2008 USENIX Annual Technical Conference
(USENIX ATC), Berkeley, CA, 2008. USENIX.
[11] M. Bjørling, J. Axboe, D. Nellans, and P. Bonnet. Linux block
io: introducing multi-queue ssd access on multi-core systems.
In Proceedings of the 6th International Systems and Storage
Conference (SYSTOR), page 22. ACM, 2013.
[12] A. M. Caulfield, L. M. Grupp, and S. Swanson. Gordon:
Using flash memory to build fast, power-efficient clusters for
data-intensive applications. In Proceedings of the 14th Inter-
national Conference on Architectural Support for Programming
Languages and Operating Systems (ASPLOS), pages 217–228,
New York, NY, USA, 2009. ACM.
[13] F. Chen, R. Lee, and X. Zhang. Essential roles of exploiting
internal parallelism of flash memory based solid state drives in
high-speed data processing. In Proceedings of the 17th IEEE
International Symposium on High Performance Computer Archi-
tecture (HPCA), pages 266–277. IEEE, 2011.
[14] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and
R. Sears. Benchmarking cloud serving systems with ycsb. In
Proceedings of the 1st ACM symposium on Cloud computing,
pages 143–154. ACM, 2010.
[15] C. Dirik and B. Jacob. The performance of PC solid-state
disks (SSDs) as a function of bandwidth, concurrency, device
architecture, and system organization. In Proceedings of the
36th annual International Symposium on Computer Architecture
(ISCA). ACM, 2009.
[16] C. Gao, L. Shi, M. Zhao, C. J. Xue, K. Wu, and E. H. Sha.
Exploiting parallelism in i/o scheduling for access conflict min-
imization in flash-based solid state drives. In Mass Storage
Systems and Technologies (MSST), 2014 30th Symposium on,
pages 1–11. IEEE, 2014.
[17] A. Gupta, Y. Kim, and B. Urgaonkar. DFTL: A flash translation
layer employing demand-based selective caching of page-level
address mappings. In Proceedings of the 14th International Con-
ference on Architectural Support for Programming Languages
and Operating Systems (ASPLOS), pages 229–240, New York,
NY, USA, 2009. ACM.
[18] Y. Hu, H. Jiang, D. Feng, L. Tian, H. Luo, and S. Zhang.
Performance impact and interplay of ssd parallelism through
advanced commands, allocation strategy and data granularity. In
Proceedings of the International Conference on Supercomputing
(ICS), pages 96–107. ACM, 2011.
[19] S. Jeong, K. Lee, J. Hwang, S. Lee, and Y. Won. Framework
for analyzing android i/o stack behavior: from generating the
workload to analyzing the trace. Future Internet, 5(4):591–610,
2013.
[20] W. K. Josephson, L. A. Bongo, D. Flynn, and K. Li. DFS: A file
system for virtualized flash storage. In Proceedings of the 8th
USENIX Conference on File and Storage Technologies (FAST),
Berkeley, CA, 2010. USENIX.
[21] M. Jung and M. Kandemir. An evaluation of different page
allocation strategies on high-speed ssds. In Proceedings of the 4th
USENIX conference on Hot Topics in Storage and File Systems,
pages 9–9. USENIX Association, 2012.
[22] M. Jung and M. Kandemir. Revisiting widely held SSD expecta-
tions and rethinking system-level implications. In Proceedings of
the fifteenth international joint conference on Measurement and
modeling of computer systems (SIGMETRICS), pages 203–216.
ACM, 2013.
[23] M. Jung and M. T. Kandemir. Sprinkler: Maximizing resource
utilization in many-chip solid state disks. In High Performance
Computer Architecture (HPCA), 2014 IEEE 20th International
Symposium on, pages 524–535. IEEE, 2014.
[24] M. Jung, E. H. Wilson III, and M. Kandemir. Physically
addressed queueing (paq): improving parallelism in solid state
disks. In Proceedings of the 39th ACM/IEEE International
Symposium on Computer Architecture (ISCA), pages 404–415,
2012.
[25] J.-U. Kang, J. Hyun, H. Maeng, and S. Cho. The multi-
streamed solid-state drive. In Proceedings of the 6th USENIX
conference on Hot Topics in Storage and File Systems, pages 13–
13. USENIX Association, 2014.
[26] J. Katcher. Postmark: A new file system benchmark. Technical
report, Technical Report TR3022, Network Appliance, 1997.
[27] R. Konishi, Y. Amagai, K. Sato, H. Hifumi, S. Kihara, and
S. Moriai. The linux implementation of a log-structured file
system. ACM SIGOPS Operating Systems Review, 40(3):102–
107, 2006.
[28] C. Lee, D. Sim, J. Hwang, and S. Cho. F2FS: A new file system
for flash storage. In Proceedings of the 13th USENIX Conference
on File and Storage Technologies (FAST), Santa Clara, CA, Feb.
2015. USENIX.
[29] S. Lee, M. Liu, S. Jun, S. Xu, J. Kim, and Arvind. Application-
managed flash. In Proceedings of the 14th Usenix Conference
on File and Storage Technologies (FAST), pages 339–353, Santa
Clara, CA, 2016. USENIX Association.
[30] C. Li, P. Shilane, F. Douglis, H. Shim, S. Smaldone, and G. Wal-
lace. Nitro: A capacity-optimized ssd cache for primary storage.
In Proceedings of 2014 USENIX Annual Technical Conference
(USENIX ATC), pages 501–512, Philadelphia, PA, June 2014.
USENIX Association.
[31] C. Li, P. Shilane, F. Douglis, and G. Wallace. Pannier: A
container-based flash cache for compound objects. In Proceed-
ings of the 16th Annual Middleware Conference, pages 50–62,
Vancouver, Canada, 2015. ACM.
[32] Y. Lu, J. Shu, and W. Wang. ReconFS: A reconstructable file
system on flash storage. In Proceedings of the 12th USENIX
Conference on File and Storage Technologies (FAST), pages 75–
88, Berkeley, CA, 2014. USENIX.
[33] Y. Lu, J. Shu, and W. Zheng. Extending the lifetime of flash-based
storage through reducing write amplification from file systems. In
Proceedings of the 11th USENIX Conference on File and Storage
Technologies (FAST), Berkeley, CA, 2013. USENIX.
[34] C. Min, K. Kim, H. Cho, S.-W. Lee, and Y. I. Eom. SFS: random
write considered harmful in solid state drives. In Proceedings of
the 10th USENIX Conference on File and Storage Technologies
(FAST), Berkeley, CA, 2012. USENIX.
[35] J. Ouyang, S. Lin, S. Jiang, Z. Hou, Y. Wang, and Y. Wang. SDF:
Software-defined flash for web-scale internet storage systems.
In Proceedings of the 19th International Conference on Archi-
tectural Support for Programming Languages and Operating
Systems (ASPLOS), pages 471–484, New York, NY, USA, 2014.
ACM.
[36] X. Ouyang, D. Nellans, R. Wipfel, D. Flynn, and D. K. Panda.
Beyond block I/O: Rethinking traditional storage primitives. In
Proceedings of the 17th IEEE International Symposium on High
Performance Computer Architecture (HPCA), pages 301–311.
IEEE, 2011.
[37] S. Park and K. Shen. FIOS: A fair, efficient flash I/O scheduler. In
Proceedings of the 10th USENIX Conference on File and Storage
Technologies (FAST), Berkeley, CA, 2012. USENIX.
[38] M. Rosenblum and J. K. Ousterhout. The design and imple-
mentation of a log-structured file system. ACM Transactions on
Computer Systems (TOCS), 10(1):26–52, 1992.
[39] K. Shen and S. Park. FlashFQ: A fair queueing I/O scheduler
for flash-based ssds. In Proceedings of 2013 USENIX Annual
Technical Conference (USENIX ATC), pages 67–78, Berkeley,
CA, 2013. USENIX.
[40] L. Tang, Q. Huang, W. Lloyd, S. Kumar, and K. Li. RIPQ:
Advanced photo caching on flash for Facebook. In Proceedings of
the 13th USENIX Conference on File and Storage Technologies
(FAST), pages 373–386, Santa Clara, CA, Feb. 2015. USENIX
Association.
[41] P. Wang, G. Sun, S. Jiang, J. Ouyang, S. Lin, C. Zhang, and
J. Cong. An efficient design and implementation of LSM-tree
based key-value store on open-channel SSD. In Proceedings of
the Ninth European Conference on Computer Systems (EuroSys),
pages 16:1–16:14, New York, NY, USA, 2014. ACM.
[42] W. Wang, Y. Lu, and J. Shu. p-OFTL: an object-based semantic-
aware parallel flash translation layer. In Proceedings of the
Conference on Design, Automation and Test in Europe (DATE),
page 157. European Design and Automation Association, 2014.
[43] J. Yang, D. B. Minturn, and F. Hady. When poll is better than
interrupt. In Proceedings of the 10th USENIX Conference on File
and Storage Technologies (FAST), Berkeley, CA, 2012. USENIX.
[44] J. Yang, N. Plasson, G. Gillis, N. Talagala, and S. Sundararaman.
Don't stack your log on my log. In USENIX Workshop on In-
teractions of NVM/Flash with Operating Systems and Workloads
(INFLOW), 2014.
[45] Y. Zhang, L. P. Arulraj, A. C. Arpaci-Dusseau, and R. H. Arpaci-
Dusseau. De-indirection for flash-based SSDs with nameless
writes. In Proceedings of the 10th USENIX Conference on File
and Storage Technologies (FAST), Berkeley, CA, 2012. USENIX.