Artificial Intelligence in Operation Monitoring Discovers Patterns Within Drilling Reports

In well-drilling activities, successful execution of a sequence of operations defined in a well project is critical. To provide proper monitoring, operations executed during drilling procedures are reported in daily drilling reports (DDRs). The complete paper provides an approach using machine-learning and sequence-mining algorithms for predicting and classifying the next operation based on textual descriptions. The general goal is to exploit the rich source of information represented by the DDRs to derive methodologies and tools capable of performing automatic data-analysis procedures and assisting human operators in time-consuming tasks.


Classification Tasks

fastText. fastText is a library, described in the literature, for learning word embeddings and classifying text. It implements a simple linear model with a rank constraint; the text representation is a hidden state that is fed to the classifier, and a softmax function computes the probability distribution over the predefined classes.
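The idea behind such a linear classifier can be sketched in a few lines of Python. This is an illustrative sketch only, not the fastText library's API: the function names, the toy embeddings, the weights, and the operation labels are all assumptions made up for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(tokens, embeddings, weights):
    """Average the token embeddings (the hidden state), apply a linear
    layer, and return a probability for each class."""
    dim = len(next(iter(embeddings.values())))
    known = [embeddings[t] for t in tokens if t in embeddings]
    if known:
        hidden = [sum(v[i] for v in known) / len(known) for i in range(dim)]
    else:
        hidden = [0.0] * dim  # no known word: fall back to a zero vector
    labels = list(weights)
    scores = [sum(w * h for w, h in zip(weights[l], hidden)) for l in labels]
    return dict(zip(labels, softmax(scores)))

# Hypothetical toy values; real embeddings and weights are learned in training.
EMBEDDINGS = {"drill": [1.0, 0.0], "hole": [0.8, 0.1], "cement": [0.0, 1.0]}
WEIGHTS = {"DRILLING": [2.0, -1.0], "CEMENTING": [-1.0, 2.0]}

probs = classify("drill hole to depth".split(), EMBEDDINGS, WEIGHTS)
print(max(probs, key=probs.get))
```

Because the hidden state is an average, permuting the words of a description leaves the prediction unchanged.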

Conditional Random Fields (CRFs). CRFs are a class of undirected graphical models that combine features from each timestep of a sequence with the ability to transition between labels at each position in the input sequence. They were proposed to overcome the label-bias problem associated with techniques such as hidden Markov models and maximum-entropy Markov models.
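How per-timestep features and label transitions interact can be illustrated with Viterbi decoding over a linear-chain CRF. The sketch below assumes that emission and transition scores are already available; the labels and scores are invented for illustration and are not taken from the paper.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label path.

    emissions:   list of {label: score}, one dict per timestep
    transitions: {(previous_label, current_label): score}
    """
    best = [{l: emissions[0][l] for l in labels}]
    back = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in labels:
            # Best previous label given that we move into `cur`.
            prev = max(labels, key=lambda p: best[-1][p] + transitions[(p, cur)])
            scores[cur] = best[-1][prev] + transitions[(prev, cur)] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda l: best[-1][l])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Hypothetical labels and scores.
LABELS = ["DRILL", "TRIP"]
EMISSIONS = [
    {"DRILL": 2.0, "TRIP": 0.0},
    {"DRILL": 0.4, "TRIP": 0.5},  # taken alone, this entry slightly favors TRIP
    {"DRILL": 2.0, "TRIP": 0.0},
]
TRANSITIONS = {(a, b): (1.0 if a == b else 0.0) for a in LABELS for b in LABELS}

path = viterbi(EMISSIONS, TRANSITIONS, LABELS)
print(path)
```

In this toy case the middle entry is labeled DRILL even though its own scores favor TRIP, because the transition scores reward staying in the same operation.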

Recurrent Models. Despite achieving good results in several scenarios and learning word embeddings as a byproduct of its training, the fastText classifier does not properly consider word-ordering information, which can be useful for several classification tasks. This shortcoming can be addressed by a recurrent neural network (RNN), which treats a fragment of text as an ordered sequence of words. The authors consider the gated recurrent unit (GRU) variant, which is easier to train than traditional RNNs and achieves results comparable with those of the long short-term memory (LSTM) unit while having fewer parameters to learn. The methodology of these classifiers is detailed mathematically in the complete paper.
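For orientation, a single GRU update in a commonly used textbook formulation can be written as below. This is a generic sketch and not necessarily the exact formulation in the complete paper; references differ on which of the two terms the update gate weights. All parameter values are made up.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def gru_step(x, h, p):
    """One GRU update. p holds input weights W*, recurrent weights U*, biases b*."""
    z = [sigmoid(a + b + c)  # update gate
         for a, b, c in zip(matvec(p["Wz"], x), matvec(p["Uz"], h), p["bz"])]
    r = [sigmoid(a + b + c)  # reset gate
         for a, b, c in zip(matvec(p["Wr"], x), matvec(p["Ur"], h), p["br"])]
    reset_h = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(a + b + c)  # candidate state
            for a, b, c in zip(matvec(p["Wh"], x), matvec(p["Uh"], reset_h), p["bh"])]
    # Convention used here: z weights the candidate state.
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

def encode(sequence, p, hidden_size):
    """Run the GRU over a sequence of input vectors; return the final state."""
    h = [0.0] * hidden_size
    for x in sequence:
        h = gru_step(x, h, p)
    return h

# Hypothetical one-dimensional parameters, chosen only to keep the example small.
PARAMS = {
    "Wz": [[1.0]], "Uz": [[0.0]], "bz": [0.0],
    "Wr": [[0.0]], "Ur": [[0.0]], "br": [0.0],
    "Wh": [[1.0]], "Uh": [[1.0]], "bh": [0.0],
}

forward = encode([[1.0], [-1.0]], PARAMS, hidden_size=1)
backward = encode([[-1.0], [1.0]], PARAMS, hidden_size=1)
print(forward, backward)
```

Unlike an averaged bag of words, the final state depends on the order of the inputs, so the two encodings above differ.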

Sequence Prediction. Sequential pattern mining can be defined broadly as the task of discovering interesting subsequences in a set of sequences, where the level of interest is measured by criteria such as occurrence frequency, length, or profit, depending on the application. In this paper, the authors focus on the specific task of sequence prediction.

In the scenario considered, the alphabet is given by an ontology of operations of drilling activities. The sequence is defined according to data stored in DDRs. The proposed methodology considers various sequence prediction algorithms, specifically the following:

  • Compact prediction tree+ (CPT+)
  • Dependency graph (DG)
  • All-k order Markov (AKOM)
  • LZ78
  • Prediction by partial matching (PPM)
  • Transitional directed acyclic graph (TDAG)

These algorithms are detailed in the complete paper. The sequential pattern mining framework (SPMF) was used for algorithm implementation. SPMF is an open-source data-mining library specialized in frequent pattern mining.
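To give a feel for one of the listed methods, the following is a minimal Python sketch of an all-k order Markov predictor. It is not the SPMF implementation (SPMF is a Java library), and the operation names and training sequences are hypothetical.

```python
from collections import Counter, defaultdict

def train_akom(sequences, max_order):
    """Count which symbol follows every context of length 1..max_order."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(1, len(seq)):
            for k in range(1, max_order + 1):
                if i - k < 0:
                    break
                counts[tuple(seq[i - k:i])][seq[i]] += 1
    return counts

def predict_next(counts, prefix, max_order):
    """Predict using the longest context seen in training; None if none match."""
    for k in range(min(max_order, len(prefix)), 0, -1):
        context = tuple(prefix[-k:])
        if context in counts:
            return counts[context].most_common(1)[0][0]
    return None

# Hypothetical operation sequences, one per wellbore.
TRAINING = [
    ["RIG_UP", "DRILL", "TRIP_OUT", "RUN_CASING", "CEMENT"],
    ["DRILL", "TRIP_OUT", "RUN_CASING", "CEMENT"],
    ["DRILL", "CIRCULATE", "DRILL", "TRIP_OUT"],
]

model = train_akom(TRAINING, max_order=2)
print(predict_next(model, ["DRILL", "TRIP_OUT"], max_order=2))
```

Falling back from longer to shorter contexts lets the predictor answer even when the most recent pair of operations was never seen together in training.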

Results and Discussion

Data Sets. The data sets used for the experiments reported in this paper were extracted from different collections of DDRs. Each DDR entry is a record containing a rich set of information about the event being reported, which could be an operation or an occurrence. Two types of data sets were generated: the operations data sets and the costs data set. The former are used by both the classification and sequence-prediction tasks, whereas the latter is used only for classification.

Operations Data Sets. The operations data sets were extracted from the DDRs of 119 different wellbores, which comprise more than 90,000 entries. The DDR fields of most interest for the experiments on this collection are the description and the operation name. The former is a free-text field in which the reporter fills in important details about the event. The latter is selected by the reporter from a predefined list of operation names.

For the sequence-mining tasks, only the operation name is used. The data set is viewed as a set of sequences of operations, one for each wellbore. For the classification tasks, both fields are used for supervised learning, with the description as input object and the operation name as label.
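The two views of the same entries can be sketched as follows. The field names (`wellbore`, `description`, `operation`) and the sample entries are assumptions for illustration; the actual DDR schema is not given in this summary.

```python
from collections import defaultdict

def build_views(entries):
    """Build both views from DDR entries, assumed to be in chronological order.

    Returns:
      sequences: {wellbore: [operation, ...]} for sequence mining
      pairs:     [(description, operation), ...] for supervised classification
    """
    sequences = defaultdict(list)
    pairs = []
    for entry in entries:
        sequences[entry["wellbore"]].append(entry["operation"])
        pairs.append((entry["description"], entry["operation"]))
    return dict(sequences), pairs

# Hypothetical entries.
ENTRIES = [
    {"wellbore": "W1", "description": "Drilled hole from 1000 m to 1100 m", "operation": "DRILLING"},
    {"wellbore": "W1", "description": "Pulled out of hole to surface", "operation": "TRIPPING"},
    {"wellbore": "W2", "description": "Ran casing to bottom", "operation": "CASING"},
]

sequences, pairs = build_views(ENTRIES)
print(sequences)
```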

The DDRs were preprocessed by an industry specialist with the objective of, first, removing the inconsistencies and, second, normalizing operation names to unify operations that shared semantics. Given the large number of documents, the strategy used for the former objective was to remove entries with the wrong operation name (instead of fixing each one, which would be a much harder task). As for the second objective, after an analysis of the list of operation names and samples of descriptions, each group of overlapping operations was transformed into a single operation.
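In code, the two preprocessing steps amount to a filter followed by a name mapping. The sketch below assumes the specialist's decisions are available as a set of flagged entry identifiers and a dictionary of merged names; both structures, and all names in them, are hypothetical.

```python
def preprocess(entries, flagged_ids, merge_map):
    """Drop entries flagged as inconsistent, then unify overlapping operation names."""
    cleaned = []
    for entry in entries:
        if entry["id"] in flagged_ids:
            continue  # removed rather than corrected
        unified = merge_map.get(entry["operation"], entry["operation"])
        cleaned.append({**entry, "operation": unified})
    return cleaned

# Hypothetical inputs.
RAW = [
    {"id": 1, "operation": "TRIP_IN"},
    {"id": 2, "operation": "DRILLING"},  # flagged: name contradicts description
    {"id": 3, "operation": "TRIP_OUT"},
    {"id": 4, "operation": "DRILLING"},
]
MERGE_MAP = {"TRIP_IN": "TRIPPING", "TRIP_OUT": "TRIPPING"}

cleaned = preprocess(RAW, flagged_ids={2}, merge_map=MERGE_MAP)
print([e["operation"] for e in cleaned])
```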

This process yielded one data set containing more than 38,000 samples and 39 operation types for the classification task and another containing more than 51,000 samples and 41 operation types for the sequence-prediction task.

Costs Data Set. The costs data set is a collection of DDRs with an extra field (the target field) meant to be used for calculating the cost of each operation performed in a wellbore project. That field is usually multivalued because the free-text field of a DDR entry may describe more than one activity of interest. Each value in that list is a pair containing two types of information: a label for the activity described in the entry and a number pointing to a diameter value.

In contrast to the operations data set, the target field of the costs data set was filled in on land by a small group of employees trained specifically for this task. Nevertheless, the costs data set still had to be preprocessed before use in the experiments.

Classification Results. Before evaluating the models, the best values for each hyperparameter are determined using the validation set through a grid search. The proposed models are trained for 30 epochs.
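A grid search of this kind can be sketched generically in Python. The hyperparameter names, their candidate values, and the scoring function below are placeholders; in practice `evaluate` would train a model with the given settings and return its score on the validation set.

```python
from itertools import product

def grid_search(grid, evaluate):
    """Try every combination in `grid`; return the best parameters and score."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical grid and a stand-in for "train, then score on the validation set".
GRID = {"lr": [0.1, 0.01, 0.001], "hidden": [64, 128]}

def toy_validation_score(params):
    return -abs(params["lr"] - 0.01) - abs(params["hidden"] - 128) / 1000

best_params, best_score = grid_search(GRID, toy_validation_score)
print(best_params)
```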

The experimental results in terms of accuracy and macro-F1 for the costs and operations data sets are presented in the complete paper. In both cases, the fastText classifier, despite being quite simple, performs well and provides a strong baseline for the proposed models. One should recall, however, that the word vectors learned by this first classifier are also used as the embeddings of the proposed models.
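For reference, the two measures can be computed as in the sketch below, which uses the standard definitions; macro-F1 averages the per-class F1 scores with equal weight, so rare operation types count as much as frequent ones. The labels are toy values, not results from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denominator = precision + recall
        f1_scores.append(2 * precision * recall / denominator if denominator else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Toy labels for illustration only.
Y_TRUE = ["DRILLING", "DRILLING", "DRILLING", "CASING"]
Y_PRED = ["DRILLING", "DRILLING", "CASING", "CASING"]

print(accuracy(Y_TRUE, Y_PRED), macro_f1(Y_TRUE, Y_PRED))
```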

The other neural networks also consider the complete word ordering in the samples, allowing them to achieve better results than the baseline. These metrics are further improved by replacing the traditional softmax output layer with a CRF, which lets the model label each entry in a segment on the basis of not only its extracted characteristics but also the ordering of operations. With this change, the model improves on the baseline accuracy by 10.94% and 3.85% on the costs and operations data sets, respectively. The proposed model thus learns not only the most relevant characteristics of each sample but also the patterns in the sequence of operations performed in a well-drilling project.

Sequence-Mining Results. The data set was divided into 10 segments, and the methods were evaluated according to a cross-validation protocol. The cross-validation protocol varies the training and testing data over several executions to reduce bias in the results. For the classification tasks, approaches based on word embeddings and CRFs are exploited. Evaluations were made considering sequences of size 5 to 10 in the data set, using the sequence-prediction methods to predict the next drilling operation.
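A cross-validation split of this kind can be sketched as follows. The summary does not state how the segments were formed, so the round-robin assignment and the wellbore identifiers here are assumptions.

```python
def k_fold_splits(items, k):
    """Yield (train, test) pairs; each item lands in exactly one test fold."""
    folds = [items[i::k] for i in range(k)]  # round-robin assignment (an assumption)
    for i, test in enumerate(folds):
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield train, test

# Hypothetical identifiers standing in for per-wellbore operation sequences.
WELLBORES = [f"well_{n}" for n in range(20)]

splits = list(k_fold_splits(WELLBORES, k=10))
print(len(splits), splits[0][1])
```

Each execution trains on nine segments and tests on the remaining one, so every sequence is used for testing exactly once.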

Table 1 presents the accuracy obtained when considering the sequences of operations as they appear in the data set. Table 2 shows the accuracy obtained after removing consecutive repetitions of drilling operations from the data. The data set contains many operations repeated contiguously, which makes the next operation easier for a sequence-prediction model to predict and explains the higher accuracy obtained in the experiments shown in Table 1.
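One plausible reading of this step is that each run of identical, contiguous operations is collapsed to a single occurrence. Under that assumption, which the summary does not spell out, the transformation is a one-liner with `itertools.groupby`; the operation names are hypothetical.

```python
from itertools import groupby

def collapse_repeats(sequence):
    """Keep one entry per run of identical, contiguous operations."""
    return [operation for operation, _ in groupby(sequence)]

# Hypothetical operation sequence.
raw = ["DRILL", "DRILL", "DRILL", "TRIP", "TRIP", "DRILL"]
print(collapse_repeats(raw))
```

After collapsing, a predictor can no longer score well simply by repeating the last operation, which is consistent with the lower accuracy expected on the deduplicated data.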


DDR Processing Framework. To make the models discussed available for use in a real-world scenario, a framework is proposed that allows the end user to upload DDRs and analyze them with different applications, one for each specific purpose. A major advantage of this framework is that the user supplies the data once and then has access to several tools for analyzing it.

A working version of an application for performing the classification tasks has already been implemented. It encapsulates the classification models generated in the experiments and allows the processing of a large number of DDRs for either operation or cost classification.

