
4. Summarize the text using the words from Vocabulary Exercises.

B. Vocabulary

5. Give English-Russian equivalents of the following words and expressions:

apparent; нечто целое; divergence; десятичный; assume; очевидный, явный, открытый; integer; предполагать; roughly; скорость, быстрота; conversely; объединение, соединение; rudimentary; доход, выручка; validation; грубо, приблизительно, примерно; overt; исходные данные; deconvolution; получать, извлекать; decimal; наоборот; raw data; выводить, прослеживать; query; видимый, несомненный, очевидный; capture; проверка данных; tabulation; запрос (критерий поиска объектов в базе данных), вопрос; derive; нахождение оригинала функции; deduce; долговременное хранение; perennially; добыча; mining; сведение в таблицы; warehousing; несоответствие, расхождение; aggregation; элементарный; revenue; всегда, постоянно; velocity; собирать (данные).

6. Find the word belonging to the given synonymic group among the words and word combinations from the previous exercise:

  1. basic, elementary, simple, undeveloped;

  2. suppose, believe, presume, take for granted, imagine, think, guess;

  3. justification, confirmation, examination, control, verification, testing;

  4. approximately, about, more or less, generally, almost, something like, just about, in the region of;

  5. collect, secure, attain, gain, acquire, obtain (data);

  6. income, profits, returns, proceeds, takings;

  7. obvious, clear, evident, noticeable, perceptible, visible;

  8. get, receive, draw from, take, gain;

  9. inquiry, question;

  10. unconcealed, explicit, open, plain, obvious, clear;

  11. trace, monitor, observe, figure out, work out;

  12. the whole, total, unit;

  13. speed, rate, rapidity, swiftness, pace, haste, quickness;

  14. on the other hand, on the contrary, in opposition;

  15. output, extraction, production, getting.

7. Read the text and answer the questions: 1) What is data validation aimed at? 2) What is used for the verification? 3) What can incorrect data validation lead to? Use the words at the bottom.


Data Validation

In computer science, data validation is the process of ensuring that a program operates on clean, correct and useful data. It uses routines, often called «validation rules» or «check routines», that check for correctness, meaningfulness, and security of data that are input to the system. The rules may be implemented through the automated facilities of a data dictionary, or by the inclusion of explicit application program validation logic.

The simplest data validation verifies that the characters provided come from a valid set. For example, telephone numbers should include the digits and possibly the characters +, -, ( and ) (plus, minus, and parentheses). A more sophisticated data validation routine would check to see that the user had entered a valid country code, i.e., that the number of digits entered matched the convention for the country or area specified.
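Such a character-set check can be sketched in Python, for instance, as follows (the allowed character set and the sample numbers are only illustrative):

    import re

    # Character-set check for telephone numbers: only digits, '+', '-',
    # spaces and parentheses are treated as valid (an illustrative choice).
    ALLOWED_PHONE = re.compile(r"^[0-9+\-() ]+$")

    def is_valid_phone(value: str) -> bool:
        """Return True if every character of value comes from the allowed set."""
        return bool(ALLOWED_PHONE.match(value))

    print(is_valid_phone("+7 (495) 123-45-67"))   # True
    print(is_valid_phone("call me"))              # False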

Incorrect data validation can lead to data corruption or security vulnerability. Data validation checks that data are valid, sensible, reasonable, and secure before they are processed.

Some methods used for validation are:

Format or picture check checks that the data is in a specified format (template), e.g., dates have to be in the format DD/MM/YYYY.

Data type check checks the data type of the input and gives an error message if the input data does not match with the chosen data type. For instance, in an input box accepting numeric data, if the letter 'O' was typed instead of the number zero, an error message would appear.

Range check checks that the data lies within a specified range of values, e.g., the month of a person's date of birth should lie between 1 and 12.

Limit check checks data for one limit only, upper OR lower, e.g., data should not be greater than 2 (>2).

Presence check checks that important data are actually present and have not been missed out, e.g., customers may be required to have their telephone numbers listed.

Check digits are used for numerical data. An extra digit is added to a number which is calculated from the digits. The computer checks this calculation when data are entered, e.g., the ISBN for a book. The last digit is a check digit calculated using a modulus 11 method.
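For illustration, the modulus 11 calculation for a 10-digit ISBN can be sketched in Python roughly as follows (the sample ISBN is just an example):

    def isbn10_check_digit(first_nine: str) -> str:
        """Compute the ISBN-10 check digit with the modulus 11 method."""
        total = sum(int(d) * w for d, w in zip(first_nine, range(10, 1, -1)))
        remainder = (11 - total % 11) % 11
        return "X" if remainder == 10 else str(remainder)

    def is_valid_isbn10(isbn: str) -> bool:
        """Verify a 10-character ISBN: the weighted sum must be divisible by 11."""
        digits = [10 if c == "X" else int(c) for c in isbn]
        return sum(d * w for d, w in zip(digits, range(10, 0, -1))) % 11 == 0

    print(isbn10_check_digit("030640615"))   # '2', so the full ISBN is 0-306-40615-2
    print(is_valid_isbn10("0306406152"))     # True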

Batch total checks for missing records. Numerical fields may be added together for all records in a batch. The batch total is entered and the computer checks that the total is correct, e.g., add the «Total Cost» field of a number of transactions together.

Hash total is just a batch total done on one or more numeric fields which appears in every record, e.g., add the Telephone Numbers together for a number of Customers.

Spelling check looks for spelling and grammar errors.

Consistency check checks fields to ensure data in these fields corresponds, e.g., if Title = «Mr.», then Gender = «M».

Cross-system consistency check compares data in different systems to ensure it is consistent, e.g., the address for the customer with the same id is the same in both systems. The data may be represented differently in different systems and may need to be transformed to a common format to be compared, e.g., one system may store customer name in a single Name field as 'Doe, John Q', while another in three different fields: First_Name (John), Last_Name (Doe) and Middle_Name (Quality); to compare the two, the validation engine would have to transform data from the second system to match the data from the first, for example, using SQL: Last_Name || ', ' || First_Name || ' ' || substr(Middle_Name, 1, 1) would convert the data from the second system to look like the data from the first: 'Doe, John Q'.
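The same transformation can also be written outside SQL; a rough Python equivalent of that expression (field names follow the example above) might be:

    def normalize_name(first_name: str, last_name: str, middle_name: str) -> str:
        """Rebuild 'Last_Name, First_Name M' so that records from the second
        system can be compared with the single Name field of the first one."""
        return f"{last_name}, {first_name} {middle_name[:1]}"

    print(normalize_name("John", "Doe", "Quality"))   # 'Doe, John Q'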

verify; valid; parenthesis (pl. parentheses); sophisticated; vulnerability; input box; batch total; transaction

8. Match the beginning of each sentence in the left column with the rest of it in the right column. Translate the sentences.

1) Spelling check    a) checks that the data is in a specified format (template), e.g., dates have to be in the format DD/MM/YYYY.

2) Data type check    b) compares data in different systems to ensure it is consistent, e.g., the address for the customer with the same id is the same in both systems.

3) Format or picture check    c) checks that important data are actually present and have not been missed out.

4) Cross-system consistency check    d) looks for spelling and grammar errors.

5) Batch total    e) checks fields to ensure data in these fields corresponds.

6) Limit check    f) checks that the data lies within a specified range of values.

7) Hash total    g) are used for numerical data.

8) Consistency check    h) gives an error message if the input data does not match with the chosen data type.

9) Range check    i) is just a batch total done on one or more numeric fields which appears in every record.

10) Check digits    j) checks data for one limit only, upper OR lower, e.g., data should not be greater than 2 (>2).

11) Presence check    k) checks for missing records.

9. Read the text, divide it into parts and give the title to each of them. Make a one-sentence summary of each part of the text.

Data Mining

Data mining is the process of sorting through large amounts of data and picking out relevant information. It is usually used by business intelligence organizations, and financial analysts, but is increasingly being used in the sciences to extract information from the enormous data sets generated by modern experimental and observational methods. It has been described as «the nontrivial extraction of implicit, previously unknown, and potentially useful information from data» and «the science of extracting useful information from large data sets or databases.» Data mining in relation to enterprise resource planning is the statistical and logical analysis of large sets of transaction data, looking for patterns that can aid decision making.

Traditionally, business analysts have performed the task of extracting useful information from recorded data, but the increasing volume of data in modern business and science calls for computer-based approaches. As data sets have grown in size and complexity, there has been a shift away from direct hands-on data analysis toward indirect, automatic data analysis using more complex and sophisticated tools.

The modern technologies of computers, networks, and sensors have made data collection and organization much easier. However, the captured data needs to be converted into information and knowledge to become useful. Data mining is the entire process of applying computer-based methodology, including new techniques for knowledge discovery, to data. Data mining identifies trends within data that go beyond simple analysis. Through the use of sophisticated algorithms, non-statistician users have the opportunity to identify key attributes of business processes and target opportunities. However, abdicating control of this process from the statistician to the machine may result in false-positives or no useful results at all.

Although data mining is a relatively new term, the technology is not. For many years, businesses have used powerful computers to sift through volumes of data such as supermarket scanner data to produce market research reports (although reporting is not considered to be data mining). Continuous innovations in computer processing power, disk storage, and statistical software are dramatically increasing the accuracy and usefulness of data analysis.

The term data mining is often used to apply to the two separate processes of knowledge discovery and prediction. Knowledge discovery provides explicit information that has a readable form and can be understood by a user. Forecasting, or predictive modeling provides predictions of future events and may be transparent and readable in some approaches (e.g., rule-based systems) and opaque in others such as neural networks. Moreover, some data-mining systems such as neural networks are inherently geared towards prediction and pattern recognition, rather than knowledge discovery.

Metadata, or data about a given data set, are often expressed in a condensed data-minable format, or one that facilitates the practice of data mining. Common examples include executive summaries and scientific abstracts.

Data mining relies on the use of real world data. This data is extremely vulnerable to collinearity precisely because data from the real world may have unknown interrelations. An unavoidable weakness of data mining is that the critical data that may expose any relationship might have never been observed. Alternative approaches using an experiment-based approach such as Choice Modelling for human-generated data may be used. Inherent correlations are either


controlled for or removed altogether through the construction of an experimental design.

Recently, there have been some efforts to define a standard for data mining, for example the CRISP-DM standard for analysis processes or the Java Data-Mining Standard. Independent of these standardization efforts, freely available open-source software systems like RapidMiner and Weka have become an informal standard for defining data-mining processes.

There are also privacy and human rights concerns associated with data mining, specifically regarding the source of the data analyzed. Data mining provides information that may be difficult to obtain otherwise. When the data collected involves individual people, there are many questions concerning privacy, legality, and ethics. In particular, data mining government or commercial data sets for national security or law enforcement purposes has raised privacy concerns.

Since the early 1960s, with the availability of oracles for certain combinatorial games, also called tablebases (e.g. for 3x3-chess) with any beginning configuration, small-board dots-and-boxes, small-board hex, and certain endgames in chess, dots-and-boxes, and hex, a new area for data mining has been opened up. This is the extraction of human-usable strategies from these oracles. Current pattern recognition approaches do not seem to fully have the required high level of abstraction in order to be applied successfully. Instead, extensive experimentation with the tablebases, combined with an intensive study of tablebase-answers to well designed problems and with knowledge of prior art, i.e. pre-tablebase knowledge, is used to yield insightful patterns. Berlekamp in dots-and-boxes etc. and John Nunn in chess endgames are notable examples of researchers doing this work, though they were not and are not involved in tablebase generation.

Data mining in customer relationship management applications can contribute significantly to the bottom line. Rather than contacting a prospect or customer through a call center or sending mail, only prospects that are predicted to have a high likelihood of responding to an offer are contacted. More sophisticated methods may be used to optimize across campaigns so that we can predict which channel and which offer an individual is most likely to respond to — across all potential offers. Finally, in cases where many people will take an action without an offer, uplift modeling can be used to determine which people will have the greatest increase in responding if given

an offer. Data clustering can also be used to automatically discover the segments or groups within a customer data set.
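As a simple illustration of such clustering, customer records could be grouped with k-means (a sketch assuming the scikit-learn library is available; the features and numbers are made up):

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical customer features: [annual spend, purchases per year]
    customers = np.array([
        [200.0, 2], [250.0, 3],
        [1800.0, 24], [1700.0, 20],
        [950.0, 10], [900.0, 12],
    ])

    # Group the customers into three segments; each customer gets a segment id.
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(customers)
    print(labels)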

In recent years, data mining has been widely used in areas of science and engineering, such as bioinformatics, genetics, medicine, education, and electrical power engineering.

In the area of study on human genetics, the important goal is to understand the mapping relationship between the inter-individual variation in human DNA sequences and variability in disease susceptibility. In lay terms, it is to find out how the changes in an individual's DNA sequence affect the risk of developing common diseases such as cancer. This is very important to help improve the diagnosis, prevention and treatment of the diseases. The data mining technique that is used to perform this task is known as multifactor dimensionality reduction.

In the area of electrical power engineering, data mining techniques have been widely used for condition monitoring of high voltage electrical equipment. The purpose of condition monitoring is to obtain valuable information on the insulation's health status of the equipment. Data clustering such as the self-organizing map (SOM) has been applied to the vibration monitoring and analysis of transformer on-load tap-changers (OLTCs). Using vibration monitoring, it can be observed that each tap change operation generates a signal that contains information about the condition of the tap changer contacts and the drive mechanisms. Obviously, different tap positions will generate different signals. However, there was considerable variability amongst normal condition signals for the exact same tap position. SOM has been applied to detect abnormal conditions and to estimate the nature of the abnormalities.

Data mining techniques have also been applied to dissolved gas analysis (DGA) on power transformers. DGA, as a diagnostics for power transformers, has been available for decades. Data mining techniques such as SOM have been applied to analyze data and to determine trends which are not obvious to the standard DGA ratio techniques such as Duval Triangle.

A fourth area of application for data mining in science/engineering is within educational research, where data mining has been used to study the factors leading students to choose to engage in behaviours which reduce their learning and to understand the factors influencing university student retention.


Other examples of applying data mining techniques are biomedical data facilitated by domain ontologies, mining clinical trial data, traffic analysis using SOM, et cetera.

implicit; aid; shift; abdicate; sift; opaque; neural network; gear; metadata; abstract; collinearity; law enforcement; oracle; hex; insightful; bottom line; prospect; susceptibility; condition monitoring; dissolved gas; clinical trial

10. Choose the most suitable word from those given in brackets without consulting the text. Prove your choice.

  1. Data mining in customer relationship management applications can contribute significantly to the ... . (delay line, bottom line)

  2. Some data-mining systems such as ... are inherently geared towards prediction and pattern recognition, rather than knowledge discovery. (enterprise application, neural networks)

  3. Data mining that relies on the use of real world data is extremely vulnerable to ... precisely because data from the real world may have unknown interrelations. (permutation, collinearity)

  4. As data sets have grown in size and complexity, there has been a ... away from direct hands-on data analysis toward indirect, automatic data analysis using more complex and sophisticated tools. (shift, aid)

  5. ... recognition approaches do not seem to fully have the required high level of abstraction in order to be applied successfully. (Cross-platform, Current pattern)

  6. The process of data mining has been described as «the nontrivial extraction of ..., previously unknown, and potentially useful information from data» and «the science of extracting useful information from large data sets or databases.» (implicit, versatile)

  7. ..., or data about a given data set, are often expressed in a condensed data-minable format, or one that facilitates the practice of data mining. (Metadata, Chunk)

  8. Other examples of applying data mining techniques are biomedical data facilitated by domain ontologies, mining ... data, traffic analysis using SOM, et cetera. (clinical trial, mapping)

  9. Data mining government or commercial data sets for national security or ... purposes has raised privacy concerns. (lawsuit, law enforcement)

  10. In the area of study on human genetics, the important goal is to understand the mapping relationship between the inter-individual variation in human DNA sequences and variability in disease ... . (inconsistency, susceptibility)

  11. Rather than contacting a ... or customer through a call center or sending mail, only prospects that are predicted to have a high likelihood of responding to an offer are contacted. (successor, prospect)

  12. The purpose of ... is to obtain valuable information on the insulation's health status of the equipment. (warehousing, condition monitoring)

11. Translate the text without a dictionary.

Data Cleaning

Data cleaning is an essential step in populating and maintaining data warehouses. Owing to likely differences in conventions between the external sources and the target data warehouse as well as due to a variety of errors, data from external sources may not conform to the standards and requirements at the data warehouse. Therefore, data has to be transformed and cleaned before it is loaded into the warehouse so that downstream data analysis is reliable and accurate. This is usually accomplished through an Extract-Transform-Load (ETL) process.
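A deliberately simplified sketch of such an ETL step in Python (the file name and field names are made up) could look like this:

    import csv

    def extract(path):
        """Read the external source, here a CSV export."""
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(rows):
        """Clean the rows: trim stray spaces and drop records without an id."""
        cleaned = []
        for row in rows:
            row = {k: v.strip() for k, v in row.items()}
            if row.get("customer_id"):          # presence check
                cleaned.append(row)
        return cleaned

    def load(rows, warehouse):
        """Append the cleaned rows to the target table (a list stands in for it)."""
        warehouse.extend(rows)

    warehouse = []
    # load(transform(extract("export.csv")), warehouse)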

Typical data cleaning tasks include record matching, deduplication, and column segmentation which often go beyond traditional relational operators. This has led to development of utilities that support data transformation and cleaning. Such software falls into two broad categories. The first category consists of verticals such as Trillium that provide data cleaning functionality for specific domains, e.g., addresses. By design, these are not generic and hence cannot be applied to other domains. The other category of software is that of ETL tools such as Microsoft SQL Server Integration Services (SSIS)


that can be characterized as «horizontal» platforms that are applicable across a variety of domains. These platforms provide a suite of operators including relational operators such as select, project and equijoin. A common feature across these frameworks is extensibility — applications can plug in their own custom operators. A data transformation and cleaning solution is built by composing these (default and custom) operators to obtain an operator tree or a graph.

While the second category of software can in principle support arbitrarily complex logic by virtue of being extensible, it has the obvious limitation that most of the data cleaning logic needs to be incorporated as custom code, since creating optimized custom code for data cleaning software is nontrivial. It would be desirable to extend its repertoire of «built-in» operators beyond traditional relational operators with a few core data cleaning operators so that, with very little extra code, we can obtain a rich variety of data cleaning solutions.
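For instance, two such core operators, normalization and exact deduplication, could be composed in Python like this (a toy sketch, not tied to any particular ETL tool):

    def normalize(record):
        """Trim whitespace and lower-case text fields so equal records compare equal."""
        return {k: v.strip().lower() for k, v in record.items()}

    def deduplicate(records):
        """Keep only the first occurrence of each normalized record."""
        seen, result = set(), []
        for rec in map(normalize, records):
            key = tuple(sorted(rec.items()))
            if key not in seen:
                seen.add(key)
                result.append(rec)
        return result

    rows = [
        {"name": "Doe, John Q", "city": "Boston "},
        {"name": "doe, john q", "city": "Boston"},
    ]
    print(deduplicate(rows))   # only one record remains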

owing to; downstream; Extract-Transform-Load (ETL); equijoin; extensibility; by virtue of smth.; repertoire

12. Translate the text into English.

Обработка данных

Обработка данных — обобщенное наименование разнородных процессов выполнения последовательности операций над данными. Термин нашел преимущественное применение в контексте с вычислительной техникой и разного рода автоматизированными системами (информационными, библиотечными, управленческими и др.) и, как правило, относится к рутинным операциям обработки и хранения больших массивов документов и данных.

Термины, связанные с видами обработки данных: Интегрированная обработка данных — принцип организации обработки данных в автоматизированной системе, при котором процессы или операции, ранее выполнявшиеся в различных организациях, подразделениях или участках технологической цепи, объединяются или оптимизируются с целью повышения эффективности системы. Одной из возможных целей «интегрированной обработки данных» является создание интегрированных баз данных.

Распределенная обработка данных — обработка данных, проводимая в распределенной системе, при которой каждый из технологических или функциональных узлов системы может независимо обрабатывать локальные данные и принимать соответствующие решения. При выполнении отдельных процессов узлы распределенной системы могут обмениваться информацией через каналы связи с целью обработки данных или получения результатов анализа, представляющих для них взаимный интерес.

Автоматизированная обработка (данных/документов) — обработка (данных или документов), выполняемая автоматически, без участия человека или при ограниченном его участии. Техническими средствами реализации «автоматизированной обработки» могут быть ЭВМ или иные устройства, машины.

Машинная обработка — выполнение операций над данными с помощью ЭВМ или других устройств обработки данных.

Предмашинная обработка, подготовка данных для ввода — этап аналитико-синтетической переработки или обработки документов, связанный с формализацией итоговых документов и записью их содержания на рабочий лист. Часто с этим этапом также связывают и ввод документов в ЭВМ, в том числе — клавиатурный ввод и бесклавиатурный ввод (например, с использованием сканера).

Сортировка — автоматическое или ручное распределение документов или данных по каким-либо заданным признакам.

Обновление файла — совокупность процессов, связанных с приведением записей в файле в соответствие с последними изменениями в предметной области или полученными новыми сведениями (данными). «Обновление файла» предполагает выполнение следующих операций: просмотр записей, добавление новых записей, стирание (удаление) или исправление (редактирование) существующих записей.

подразделение — subdivision, sub-unit; участок цепи — subcircuit; узел (сети) — node, unit; предмашинная обработка — data preparation; бесклавиатурный — nonkeyboarding; ручной —


manual; заданный — specified; признак — character, criterion, (начала или окончания блока данных) marker, sign; просмотр — (напр., файла) browsing, (от начала к концу, напр., информационного массива) drop, look-up, overview; удаление — deletion, demounting, (ненужной информации) purge, removal, removing

13. Talking points:

  1. Data processing: its definition, elements and application.

  2. Data processing system.

  3. Data validation.

  4. Data mining.

  5. Data cleaning.

  6. Data processing types.