# 3

UNDERSTANDING NETWORK FAILURES IN DATA CENTERS: MEASUREMENT, ANALYSIS, AND IMPLICATIONS

Phillipa Gill, University of Toronto, phillipa@cs.toronto.edu

Navendu Jain, Microsoft Research, navendu@microsoft.com

Nachiappan Nagappan, Microsoft Research, nachin@microsoft.com

1. INTRODUCTION

Demand for dynamic scaling and benefits from economies of scale are driving the creation of mega data centers to host a broad range of services such as Web search, e-commerce, storage backup, video streaming, high-performance computing, and data analytics. To host these applications, data center networks need to be scalable, efficient, fault tolerant, and easy-to-manage. Recognizing this need, the research community has proposed several architectures to improve scalability and performance of data center networks [2, 3, 12–14, 17, 21]. However, the issue of reliability has remained unaddressed, mainly due to a dearth of available empirical data on failures in these networks.

In this paper, we study data center network reliability by analyzing network error logs collected for over a year from thousands of network devices across tens of geographically distributed data centers. Our goals for this analysis are two-fold. First, we seek to characterize network failure patterns in data centers and understand overall reliability of the network. Second, we want to leverage lessons learned from this study to guide the design of future data center networks. Motivated by issues encountered by network operators, we study network reliability along three dimensions:

• Characterizing the most failure-prone network elements. To achieve high availability amidst multiple failure sources such as hardware, software, and human errors, operators need to focus on fixing the most unreliable devices and links in the network. To this end, we characterize failures to identify network elements with high impact on network reliability, e.g., those that fail with high frequency or that incur high downtime.

• Estimating the impact of failures. Given limited resources at hand, operators need to prioritize severe incidents for troubleshooting based on their impact to end users and applications. In general, however, it is difficult to accurately quantify a failure’s impact from error logs, and annotations provided by operators in trouble tickets tend to be ambiguous. Thus, as a first step, we estimate failure impact by correlating event logs with recent network traffic observed on links involved in the event. Note that logged events do not necessarily result in a service outage because of failure-mitigation techniques such as network redundancy [1] and replication of compute and data [11, 27], typically deployed in data centers.


• Analyzing the effectiveness of network redundancy. Ideally, operators want to mask all failures before applications experience any disruption. Current data center networks typically provide 1:1 redundancy to allow traffic to flow along an alternate route when a device or link becomes unavailable [1]. However, this redundancy comes at a high cost – both monetary expenses and management overheads – to maintain a large number of network devices and links in the multi-rooted tree topology. To analyze its effectiveness, we compare traffic on a per-link basis during failure events to traffic across all links in the network redundancy group where the failure occurred.

For our study, we leverage multiple monitoring tools put in place by our network operators. We utilize data sources that provide both a static view (e.g., router configuration files, device procurement data) and a dynamic view (e.g., SNMP polling, syslog, trouble tickets) of the network. Analyzing these data sources, however, poses several challenges. First, since these logs track low-level network events, they do not necessarily imply application performance impact or service outage. Second, we need to separate failures that potentially impact network connectivity from high-volume and often noisy network logs, e.g., warnings and error messages even when the device is functional. Finally, analyzing the effectiveness of network redundancy requires correlating multiple data sources across redundant devices and links. Through our analysis, we aim to address these challenges to characterize network failures, estimate the failure impact, and analyze the effectiveness of network redundancy in data centers.
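The comparison of per-link traffic with redundancy-group traffic described in the text can be illustrated with a small sketch. It is not the authors' implementation: the function names, the use of the median, and the sample byte counts are all assumptions chosen for illustration. The idea is to divide traffic observed during a failure by traffic observed before it, first for the failed link alone and then for the whole redundancy group.

```python
from statistics import median


def normalized_traffic(before, during):
    """Return median traffic during a failure divided by median traffic before it.

    Returns None when there was no traffic before the failure (ratio undefined).
    """
    baseline = median(before)
    if baseline == 0:
        return None
    return median(during) / baseline


def group_totals(samples_per_link):
    """Sum per-link samples position by position to get group-wide traffic."""
    return [sum(values) for values in zip(*samples_per_link)]


# Hypothetical byte counts for two links in one redundancy group.
failed_before, failed_during = [100, 120, 110], [10, 0, 5]
backup_before, backup_during = [100, 90, 110], [190, 200, 205]

# Ratio for the failed link alone.
link_ratio = normalized_traffic(failed_before, failed_during)

# Ratio for the whole redundancy group (failed link plus backup link).
group_ratio = normalized_traffic(
    group_totals([failed_before, backup_before]),
    group_totals([failed_during, backup_during]),
)
```

In this made-up example the failed link carries almost no traffic during the event, while the group-wide ratio stays close to 1, which is what one would expect if the backup link absorbed most of the load.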


1a. Arrange the following words into a sentence, paying attention to word order. Note that the predicate must be expressed by a verb in the active voice.

are, demand, scaling, for, benefits, and, dynamic, economies, from, scale, of, creation, the, driving, data, of, centers, mega.

paper, study, in, we, this, network, center, data, by, reliability, analyzing, error, network, logs, for, collected, a, thousands, over, year, network, from, devices, of.

this, goals, for, our, two-fold, are, analysis.

first, seek, network, to, failure, we, characterize, patterns, centers, in, data, and, overall, understand, the, reliability, network, of.

this, we, to, failures, end, to, characterize, network, with, identify, high, elements, on, network, impact, reliability.

reactions, since, simultaneously, all, it, occur, to, difficult, is, kinetics, investigate, the, particular, of, a, reaction.

1b. Arrange the following words into a sentence, paying attention to word order. Note that the predicate must be expressed by a verb in the passive voice.

polymer, by, to, applied, hot, is, conductor, the, temperature, at, extrusion, the.

equations, into, form, the, were, dimensionless, transformed, the.

partial, equations, solved, the, differential, were, with, numerically, method, differences, the, finite, with.

predicted, process, will, optimal, be, parameters, evaluated, and.

capacities, will, process, be, temperatures, speed, and, evaluated.

crosslinking, the, conductor, in, vicinity, a, degree, attained, the, of, prescribed, is, the.

affected, of, the, not, temperature, other, such, final, the, should, limits, be, conductor, as.

process, field, production, must, the, known, the, stationary, temperature, for, be.

surface, heat, to, transferred, cable, by, from, is, nitrogen, convection.

chemical, the, heat, reaction, released, by, reaction, is, the.

will, reaction, of, kinetics, be, the, autocatalytic, used.

simplify, the, made, to, several, were, assumptions, analysis.

range, a, assumed, properties, are, constant, in, temperature, thermodynamic.

procedure, as, reviewed, system, an, power, assessment, reliability, integrated, is.

phase, under, a, each, presented, perspective, is.

controversial, the, are, existing, highlighted, aspects.

development, support, application, to, and, the, framework, is, a, practical, established, of, tasks.


2. Translate the following sentences into English, putting the predicate verb in the passive voice.

Силовые кабели изолируют сшитым полиэтиленом.

Эти уравнения решаются методом конечных разниц.

Такие параметры, как температура и скорость процесса, будут спрогнозированы.

Ряд важных параметров рассчитывается после моделирования.

Нужная/предписанная степень сшивки достигается в конце экструзионной линии.

Значение температурного поля должно быть известно для того, чтобы получить желаемую степень сшивки.

Горячий полимер наносится на проводник методом экструзии.

Переменные процесса изменились во втором эксперименте.

Тепло переносится конвекцией от азота на поверхность кабеля вследствие/ за счет движения кабеля.

В этой статье будет использована кинетика автокаталитической реакции.

Было сделано несколько предположений для того, чтобы упростить данный анализ.

Тепловой и материальный баланс связаны со скоростью реакции.

Система трех уравнений будет численно решаться для выбранных граничных условий.

Особо выделены существующие противоречивые аспекты этого процесса.

Каждая фаза процесса описана детально.

Создана практическая основа для того, чтобы сравнить программы надежности энергосистем.

3. Rewrite the sentences, putting the predicate verb in the passive voice.

We investigated the kinetics of certain groups of reactions.

We applied the hot polymer to the conductor by extrusion.

The performance of the insulating compounds may determine the maximum output rates of manufacturing facility.

A mathematical model describes the curing process in the vulcanization tube.

Many authors simplified problem of dicumyl peroxide decomposition with first order kinetics.

Changes of process variables may cause alterations in the physical properties.

The heat balance describes all four impacts.


4a. Which of the linking expressions below can be used to join the following pairs of sentences or clauses of a complex sentence?

Therefore, furthermore, although, despite, however, first, second, nevertheless

Самым распространенным способом изоляции является сшивание полиэтиленом низкой плотности <...> этиленовый/пропиленовый каучук и полиэтилен высокой плотности стали широко использоваться в последнее время.

Полиэтилен может принимать различные слоистые формы и текстуру <...> его явный простой химический состав.

Используя различные методы <...> можно изучить кинетику различных реакций во время процесса сшивания.

Исследователи предложили несколько способов улучшения работы сетей центра обработки данных <...> проблема надежности остается нерешенной в полном объеме.

Цель проведения нами такого анализа имеет двойственный характер: <...> мы пытаемся дать характеристику шаблонов сетевых сбоев в центре обработки данных и понять то, насколько надежна вся сеть. <...> мы хотим узнать, как конструировать/строить сети центра обработки данных в будущем.

<...> идея применения методов обеспечения надежности к анализу энергосистем возникла еще в 1934 году, приняли ее не сразу.

<...> за последние 60 лет оценка надежности энергосистемы стала техникой, включающей различные методы, начиная от сбора данных до прогнозирования надежности.

4b. Using a dictionary, translate all the sentences into English.

5. Replace the italicized words with expressions that convey the same logical and semantic relations.

Nonetheless, at the same time, though, also, what is more, apart from, as well as, besides, hence, so

Using different methods, however, it is possible to investigate the kinetics of certain groups of reactions during crosslinking.

In addition to the crosslinking degree, other plant limits should not be adversely affected.

Each phase is presented under an extended perspective and the existing controversial aspects are highlighted. Therefore, a practical framework is established to support the tasks of research and design of power systems reliability programs.


However, it is difficult to accurately quantify a failure’s impact from error logs, and annotations provided by operators in trouble tickets tend to be ambiguous.

Thus, as a first step, we estimate failure impact by correlating event logs with recent network traffic observed on links involved in the event.


6. Fill in the blanks with words that fit the meaning.

However, although, nevertheless, thus, furthermore, therefore, as a result

a) <...> ethylene/propylene rubbers and high-density polyethylene have become popular in certain quarters, the most commonly used insulation system is crosslinked low-density polyethylene (PE-XL).

b) <...>, in the last 60 years power system reliability engineering has matured into a full-blown technology encompassing myriads of techniques.

c) <...>, the issue of reliability has remained unaddressed, mainly due to a dearth of available empirical data on failures in these networks.

d) Current data center networks typically provide 1:1 redundancy to allow traffic to flow along an alternate route when a device or link becomes unavailable. <...>, this redundancy comes at a high cost to maintain a large number of network devices and links in the multi-rooted tree topology.

e) <...> only quite recently was it possible to ascertain that a consistent set of probabilistic criteria for generation planning should have general acceptance in electric utilities.

7. Translate the following attributive phrases consisting of nouns into Russian.

1) insulation system; 2) cable industry; 3) power cable; 4) vulcanization agent; 5) vulcanization tube; 6) heat resistance; 7) cable insulation; 8) cable surface; 9) energy dissipation; 10) reaction heat; 11) heat balance; 12) reaction kinetics; 13) temperature gradient; 14) heat conduction; 15) temperature range; 16) boundary condition; 17) extruder screw; 18) melt flow; 19) reliability techniques; 20) power system; 21) reliability prediction; 22) data centres; 23) research community; 24) network devices; 25) network operators; 26) failure source; 27) network element; 28) end user; 29) trouble ticket; 30) failure impact; 31) event log; 32) network redundancy; 33) error message; 34) network connectivity.

8. Do the same task with phrases consisting of three or more nouns.

1) extruder screw speed; 2) extruder screw speed rate; 3) power system reliability; 4) power system reliability engineering; 5) power system reliability assessment; 6) power system reliability programs; 7) data center networks; 8) data center network reliability; 9) network error log; 10) network failure pattern; 11) failure-mitigation techniques; 12) router configuration file; 13) process temperature range.

9. Translate the following sentences, paying attention to the highlighted words.

Despite its apparent chemical simplicity, polyethylene is capable of exhibiting a wide range of different lamellar forms and textures.

Through crosslinking, thermoplastic polyethylene becomes a thermoset material, with no loss to its electrical properties.

Using different methods, however, it is possible to investigate the kinetics of certain groups of reactions during crosslinking.

These sets of partial differential equations (PDE) were transformed into the dimensionless form and solved numerically with the finite differences method for solving systems of PDE, named method of lines.

The planning of crosslinking plants for a large number of specific types of cables requires deciding on the production speed at which a prescribed degree of crosslinking is attained.

In this paper, we study data center network reliability by analyzing network error logs collected for over a year from thousands of network devices.

Thus, as a first step, we estimate failure impact by correlating event logs with recent network traffic observed on links involved in the event.

Operators need to focus on fixing the most unreliable devices and links in the network.

Although the idea of applying reliability techniques to power system analysis has been around as early as 1934, the pace of acceptance has been arduous.

10. Translate the sentences back into English without consulting the original sentences, and compare your translation with the original.

ANSWER KEYS TO TASKS AND EXERCISES

# 2

Power cables are insulated with crosslinkable polyethylene. These equations are solved with the finite differences method.

Such parameters as process temperature and speed will be predicted. A number of important parameters are calculated after simulation.

A prescribed degree of crosslinking is attained at the end of the extrusion line. The value of the temperature field must be known to attain the prescribed degree of crosslinking.

The hot polymer is applied to the conductor by extrusion. Process variables were changed in the second experiment.

Heat is transferred by convection from nitrogen to cable surface due to the cable moving.

In this paper the kinetics of autocatalytic reaction will be used. Several assumptions were made to simplify the analysis.

Heat and material balance are connected with reaction rate.

The system of three equations will be numerically solved for the chosen boundary conditions.

The existing controversial aspects of this procedure are highlighted. Each phase of the process is described in detail.

A practical framework is established to compare the power systems reliability programs.

# 6

a)although

b)nevertheless

c)however

d)however

e)furthermore

# 7

1) система изоляции; 2) кабельная промышленность; 3) силовой кабель; 4) вулканизирующий агент; 5) вулканизационная труба; 6) теплостойкость, термическое сопротивление; 7) изоляция кабеля; 8) поверхность кабеля; 9) диссипация, рассеивание энергии; 10) теплота реакции; 11) тепловой баланс; 12) кинетика реакции; 13) температурный градиент; 14) теплопроводность, теплопередача; 15) температурный диапазон/режим; 16) граничные условия; 17) шнек экструдера; 18) поток расплава; 19) методы (обеспечения) надежности; 20) энергосистема; 21) прогнозирование надежности; 22) центры обработки данных; 23) научное сообщество/круги; 24) сетевое оборудование; 25) операторы сети; 26) источник сбоя/отказа; 27) сетевой элемент; 28) конечный пользователь; 29) инцидент, листок неисправностей, трабл-тикет, ТК: заявка с уникальным номером на устранение технической неисправности; 30) последствия сбоя/отказа; 31) журнал регистрации событий; 32) избыточность сети; 33) сообщение об ошибке; 34) подключение к сети, сетевое подключение.

# 8

1) скорость шнека экструдера; 2) скоростной режим шнека экструдера; 3) надежность энергосистемы; 4) техника (обеспечения) надежности энергосистемы; 5) оценка надежности энергосистемы; 6) программы обеспечения надежности энергосистемы; 7) сети центра обработки данных; 8) надежность сетей центра обработки данных; 9) журнал регистрации ошибок сети; 10) шаблон сбоев/отказов в сети; 11) методы смягчения последствий сбоев; 12) файл конфигурации маршрутизатора; 13) температурный диапазон/ режим процесса.

SECTION III

Materials and Methods (also Methodology): this section contains information that attests to the reliability and representativeness of the source data, describes the experimental procedure and/or the calculation method, and justifies their choice. Depending on the problems and tasks addressed in the study, the section includes information on the methods used to analyze the collected data and the calculation procedures, the measurement technique, and the research tools used, such as software packages. On the whole, the exposition is descriptive and calls for the language typical of scientific style. For Russian-speaking authors, the main difficulties lie in using the passive voice, which often requires changing their habitual word order; using the infinitive instead of a noun with a preposition to express a purpose or task; using the gerund to denote actions and processes; and correctly using the linguistic means that make the exposition logical, consistent, and coherent.

TASKS AND EXERCISES

[Based on the articles: 1. An ontological approach to chemical engineering curriculum development (from Computers and Chemical Engineering, 106, 2017), A semiautomated approach for generating natural language requirements documents based on business process models (from Information and Software Technology, 93, 2018).

2. Cycles, Cells and Platters: An Empirical Analysis of Hardware Failures on a Million Consumer PCs (from EuroSys’11, April 10-13, 2011, Salzburg, Austria.)].

1. Translate the following passage, changing the word order where necessary. Pay particular attention to the highlighted parts of the sentences.

The curriculum was modeled using knowledge modeling through the development of Chemical Engineering Education Ontology (ChEEdO) in the Protégé 3.5 environment. ChEEdO models topics, taught modules and the learning outcomes of the modules within the domain of chemical engineering. The learning outcomes were related to the topics using verb properties from Bloom’s taxonomy and the context of each learning outcome. The functionality of semantic reasoning via the ontology was demonstrated with the case study. The modeling results show that the ontology could be successfully utilized for curriculum development.

2. Compare your translations with those suggested by your colleagues.