GEOLOGY & GEOPHYSICS
Unstructured data management shifts industry to unified IT infrastructure
Managing the unstructured data generated by seismic imaging is a major investment of money, resources, and equipment. Moreover, as ever-higher resolution comes online, in-field survey technologies and processing techniques stretch decade-old computing constructs. Raw data sizes are also growing exponentially. Advances in seismic data acquisition technology account for part of this growth; however, post-processing, compression, high-fidelity master archiving, and data redundancy schemes all act as multipliers on the raw data ingest rate. For many organizations, this compounding problem quickly translates to 20 to 50 petabytes per annum, with any given dataset saved onto one hard drive or tape among thousands.
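The arithmetic behind that figure is straightforward. The sketch below, with openly assumed values for survey activity and for each multiplier, shows how a modest daily ingest rate compounds into tens of petabytes retained per year; none of the specific input values come from the article.

```python
# Back-of-the-envelope sketch of how processing, archiving, and redundancy
# multiply raw seismic ingest into an annual storage footprint.
# All input values below are illustrative assumptions.

TB_PER_PB = 1000.0

raw_tb_per_day = 100.0     # assumed raw acquisition rate on active survey days
acquisition_days = 60      # assumed number of acquisition days per year

# Assumed multipliers layered on top of the raw ingest:
processing_products = 1.5  # intermediate and post-processed volumes
master_archive = 1.0       # high-fidelity master archive copy
redundancy = 1.0           # mirrored / backup copies

raw_pb = raw_tb_per_day * acquisition_days / TB_PER_PB
total_pb = raw_pb * (1.0 + processing_products + master_archive + redundancy)

print(f"raw acquisition : {raw_pb:5.1f} PB per year")    # 6.0 PB with these inputs
print(f"total retained  : {total_pb:5.1f} PB per year")  # 27.0 PB, inside the 20-50 PB range cited
```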
This rapidly increasing accumulation of data will further challenge an industry that already struggles to give scientists and researchers ready access to data and to the management tools that enable advanced analytics.
Unlike structured data, unstructured data often loses value when it is moved or migrated. Moving unstructured data means, by definition, renaming it, which makes finding that data again very difficult. Stated differently, because of the way today's dominant traditional storage systems are architected, hardware upgrades often orphan data from core analytics processes and from the geoscientists who need it to inform interpretations and decision-making.
At multi-petabyte ingest rates, seismic analytical tools, many of which were designed decades ago, are quickly showing their age. The industry is at the doorstep of a shift to more flexible, scalable constructs designed specifically to solve the problems of data access, data longevity, data management, and storage.
While traditional storage technologies often perform effectively over the near term, traditional seismic architectures and monolithic data constructs leave a good deal of unstructured data value untapped. Generally, these systems were not designed to span multiple technology refreshes or to facilitate data access over longer periods of time.
Multi-national offshore drilling companies, oil and gas companies, and others involved in seismic pursuits are starting to realize that the inability to re-harvest this data over decades is not just a technical issue, but also a business problem that can affect future competitiveness. Companies that treat this data as a business-critical asset are becoming aware of the shortcomings of the architectural constructs on which organizations have relied for the past 30 years. Finding and accessing data readily are now must-have capabilities, and they threaten traditional storage approaches that for decades have focused on getting data into the system. Until now, getting the data out again has been a tertiary or, at best, a secondary consideration, but not for much longer.
Mission-critical data
At a basic level, seismic surveys collect
very large amounts of raw data from sen-
sors that are then filtered by supercomput-
ers or other computational constructs to
extract useful information to be analyzed
by geoscientists. When a seismic survey is
under way, the initial concern centers on
the massive and unstructured nature of the
raw data produced, which can range in size
from 100 to 400 TB a day. These files are
subject to further processes that can distill
useful information from the sea of noise. To-
day’s practice is to put this information into
a physical storage construct of some sort, a
process that will eventually fail to maximize
the availability of data to teams spread over
time and space. The problem is that once the
data is dropped into this “storage bucket,”
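The "distilling" step can be illustrated in miniature. The sketch below uses purely synthetic data, not anything from a real survey, to show how stacking repeated noisy recordings of the same reflection suppresses random noise, which is one reason processed products are far more compact and useful than the raw records.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_samples, n_repeats = 500, 64

t = np.linspace(0.0, 1.0, n_samples)
signal = np.sin(2 * np.pi * 25 * t) * np.exp(-5 * t)               # idealized reflection wavelet
raw = signal + rng.normal(0.0, 2.0, size=(n_repeats, n_samples))   # noisy synthetic field records

stacked = raw.mean(axis=0)   # stacking: average the repeated records

def snr_db(estimate):
    """Signal-to-noise ratio of an estimate relative to the known wavelet."""
    noise = estimate - signal
    return 10 * np.log10(np.sum(signal**2) / np.sum(noise**2))

print(f"single raw trace: {snr_db(raw[0]):6.1f} dB")
print(f"stacked trace   : {snr_db(stacked):6.1f} dB")   # ~10*log10(64), about 18 dB better
```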
To maximize the production levels of reservoirs, companies need to be able to compare surveys of the same reservoirs taken five, 10, or 30 years apart. While this is technically possible with systems currently in place, as a practical matter traditional data architectures require too much pre-processing or specialized technical expertise to readily facilitate that kind of comparison. In many cases, decisions about where datasets should be stored are driven by tactical considerations such as cost and dwindling storage space; end users' access needs have little bearing on where and how data is stored. This makes things challenging enough, but the ever-present IT technology refresh cycle presents an even more formidable challenge for engineers trying to keep track of and access this data.
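What interpreters need is the access pattern sketched below: every vintage of a survey over the same reservoir, retrievable by a logical key rather than by knowledge of where each copy physically sits. The catalog structure, field names, and storage URIs here are hypothetical illustrations, not a real product interface.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class SurveyVintage:
    reservoir: str      # logical key the interpreter cares about
    acquired: date      # acquisition date of this vintage
    location: str       # where the volume currently lives (tape, NAS, object store)

# Hypothetical catalog entries for three vintages over the same reservoir.
catalog = [
    SurveyVintage("block-17", date(1994, 6, 1), "tape://vault-a/rack12/T0431"),
    SurveyVintage("block-17", date(2004, 9, 1), "nas://site-b/seismic/2004/b17"),
    SurveyVintage("block-17", date(2014, 3, 1), "s3://survey-archive/b17-2014"),
]

def vintages(reservoir: str):
    """All survey vintages for a reservoir, oldest first, with no need for
    the caller to know where each copy is physically stored."""
    return sorted((v for v in catalog if v.reservoir == reservoir),
                  key=lambda v: v.acquired)

for v in vintages("block-17"):
    print(v.acquired, "->", v.location)
```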
Lost data
The main problem is that every time data is moved, pathnames change, links get broken, and the data map, the record of where particular datasets are stored and the knowledge of how to access and use them, becomes increasingly fragmented. This approach relies on so-called tribal knowledge of that map in order to remain effective over time, but when decades of employee turnover are factored in, that knowledge is almost certain to be lost. When it inevitably breaks down, data can no longer be reasonably located and is, in effect, gone.
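The failure mode is easy to see in miniature. In the hypothetical sketch below, an analysis script pinned to a physical pathname silently loses its data after a hardware refresh, while a lookup through a stable logical identifier keeps resolving for as long as the mapping layer is maintained; the identifiers and paths are invented for illustration.

```python
# A decade-old script pinned to the storage layout of its era.
hardcoded_path = "/mnt/filer03/seismic/2004/block17/stack_final.segy"

# After a hardware refresh the same dataset lives somewhere else. A data
# management layer keeps a mapping from stable logical identifiers to the
# current physical location.
id_to_location = {
    "survey:block17:2004:final-stack": "/mnt/objectstore/b17/2004/stack_final.segy",
}

def resolve(dataset_id: str):
    """Return the dataset's current location, or None if it is unknown."""
    return id_to_location.get(dataset_id)

# The hard-coded path still points at hardware that no longer exists, so any
# tool built around it is effectively orphaned from the data.
print("legacy script expects :", hardcoded_path)
print("catalog now resolves  :", resolve("survey:block17:2004:final-stack"))
```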
Manuel Terranova
Peaxy Hyperfiler is a data management system that allows companies to create a petabyte-scale "dataplane" that logically combines disparate datasets. (Photo courtesy Peaxy)