BinaryEdge - Science and Technology

Thoughts, stories and ideas.

Tuesday, 08 September 2015

The Data Science Workflow

by Florentino Bexiga, on datascience, machinelearning

When dealing with data, it helps to have a well-defined workflow. Whether we want to perform an analysis with the sole intent of "telling the story" (Data Visualisation/Journalism) or build a system that relies on data to model a certain task (Data Mining), process matters. With a methodology defined in advance, teams stay in sync and lose less time figuring out what the next step should be. This enables faster production of results and publication of materials.

With that in mind, and following the previous blogpost about the Ashley Madison leak data analysis, we saw an opportunity to show the workflow that we are currently using. This workflow is used not only to analyse data leaks (such as the case of AshMad), but also to analyse our own internal data. It is important to mention, however, that this workflow is a work in progress: it may change over time so that we can obtain results more effectively.

Preliminary (dirty) analysis

First, one should begin by taking a first look at the data and identifying any obvious issues (problems that have to be fixed) as well as the relevant (useful) data points.

If easily fixable problems are identified, a quick fix and data cleaning should be performed to make the data immediately usable.

At this point, a preliminary report on the data is produced, which includes a brief description of the data points (format, type, meaning). The goal is to produce publishable, insightful content as soon as possible, in order to capture the attention of interested parties at an early stage.
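As a rough illustration of what the first look at format and type could involve, the sketch below guesses a type for every raw value and counts the guesses per column. The helper names (`guess_type`, `profile`) and the sample data are hypothetical and only there to make the example runnable; this is not a description of our internal tooling.

```python
import csv
import io
from collections import Counter


def guess_type(value):
    """Classify a raw string as 'empty', 'int', 'float' or 'text'."""
    text = value.strip()
    if not text:
        return "empty"
    for name, cast in (("int", int), ("float", float)):
        try:
            cast(text)
            return name
        except ValueError:
            continue
    return "text"


def profile(rows):
    """Count the guessed types seen in each column of a list of dict rows."""
    report = {}
    for row in rows:
        for column, value in row.items():
            report.setdefault(column, Counter())[guess_type(value)] += 1
    return report


# Hypothetical sample standing in for a freshly received data file.
SAMPLE = "id,age,city\n1,34,Lisbon\n2,,Porto\n3,n/a,Zurich\n"

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
summary = profile(rows)
for column, counts in summary.items():
    print(column, dict(counts))
```

A column that mixes `int`, `empty` and `text`, like `age` in this sample, is the kind of obvious issue worth noting in the preliminary report.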

Exploratory Data Analysis

After the first quick view, a more methodical approach must be adopted. The first step is to start asking questions that could potentially be answered by the data.

The relevant data points that were previously identified must then be cleaned and filtered. The cleaning process can involve several strategies, such as removing spaces and non-printing characters from text, converting dates, extracting usable data from garbage fields, and so on.
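The three strategies just mentioned can be sketched with the Python standard library alone. The accepted date formats (day-first for slashed dates), the e-mail pattern and the function names are assumptions made for this example; real data will usually need its own rules.

```python
import re
import unicodedata
from datetime import datetime

# Assumed input formats; slashed dates are taken to be day-first.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")

# Deliberately simple pattern, good enough for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def clean_text(value):
    """Drop non-printing characters and collapse runs of whitespace."""
    # Map every kind of whitespace (tabs, non-breaking spaces) to a plain space.
    spaced = "".join(" " if ch.isspace() else ch for ch in value)
    # Unicode categories starting with "C" cover control and format characters.
    printable = "".join(
        ch for ch in spaced if unicodedata.category(ch)[0] != "C"
    )
    return " ".join(printable.split())


def parse_date(value):
    """Return an ISO 8601 date string, or None if no assumed format matches."""
    text = clean_text(value)
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def extract_email(value):
    """Pull the first e-mail-like token out of a noisy field, lower-cased."""
    match = EMAIL_RE.search(value)
    return match.group(0).lower() if match else None


print(clean_text("  Lisbon\u200b \x00 Portugal\t"))
print(parse_date("08/09/2015"))
print(extract_email("contact: <Jane.Doe@Example.com>;;"))
```

Returning `None` instead of raising keeps the unparseable values visible, so they can be counted and reviewed rather than silently dropped.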

The clean data can also be converted to a format (CSV, JSON, etc.) that will facilitate its loading into an adequate framework or tool.
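A minimal sketch of such a conversion, assuming the clean records are held as a list of dictionaries; the field names and sample records are hypothetical. It writes the same records as CSV and as JSON, one object per line.

```python
import csv
import io
import json


def to_csv(records, fieldnames):
    """Serialise a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_json_lines(records):
    """Serialise a list of dicts as one JSON object per line."""
    return "\n".join(
        json.dumps(record, sort_keys=True, ensure_ascii=False)
        for record in records
    )


# Hypothetical clean records; None marks a value that could not be recovered.
records = [
    {"id": 1, "city": "Lisbon", "joined": "2015-09-08"},
    {"id": 2, "city": "Porto", "joined": None},
]

csv_text = to_csv(records, ["id", "city", "joined"])
json_text = to_json_lines(records)
print(csv_text)
print(json_text)
```

Note that CSV flattens everything to text (the missing date becomes an empty string and `id` becomes `"1"`), while JSON keeps `null` and numbers distinct. Reading the output back in is a cheap way to catch details that escaped the cleaning step before the data reaches another tool.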

If some problem should occur during the loading process, it's likely that some detail escaped the cleaning process and consequently one or more of the previous steps should be reviewed.

After loading the clean and filtered data successfully, the next step is to thoroughly explore the data. The main objective is to provide an insight into the data set, i.e., transform the raw data into actual informational content.

The exploratory data analysis process involves tasks such as summarising the data, detecting outliers and anomalies, and identifying trends and patterns that could benefit from further study.
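To make summarisation and outlier detection concrete, here is a small sketch using only the standard library. The 1.5 × IQR rule is one common convention chosen for the example, not the only option, and the `ages` sample is made up.

```python
import statistics


def summarise(values):
    """Basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


def iqr_outliers(values, k=1.5):
    """Return values more than k interquartile ranges outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical column with one suspicious entry.
ages = [23, 25, 27, 29, 31, 33, 35, 37, 120]

print(summarise(ages))
print(iqr_outliers(ages))
```

Whether a flagged value is an error or a genuinely interesting data point is a judgement call that the rule cannot make; it only tells us where to look.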

“Exploratory data analysis” is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those we believe to be there. — John Tukey

At this point one must be able to answer the question "What do we want to do with this data?".

Data Visualisation / Story Telling

If the aim is only to provide a better understanding of the data, one must find the best way to present the results obtained in the exploratory analysis.

From determining the most interesting results to finding the most appealing and understandable visualisation techniques, every detail matters in creating a story that will engage the reader. It is common that, once some analysis is done, other interesting aspects become noticeable and more exploration is needed.
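Even before reaching for a plotting library, a quick text chart can help decide which results are worth a proper visualisation. The sketch below is a hypothetical helper written for this post, with made-up sample data; it renders category counts as aligned bars.

```python
from collections import Counter


def text_bar_chart(counts, width=20):
    """Render {label: count} as aligned text bars, largest first."""
    if not counts:
        return ""
    top = max(counts.values())
    label_width = max(len(str(label)) for label in counts)
    lines = []
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0])))
    for label, count in ordered:
        # Scale each bar relative to the largest count.
        bar = "#" * max(1, round(width * count / top)) if count else ""
        lines.append(f"{str(label):<{label_width}} | {bar} {count}")
    return "\n".join(lines)


# Hypothetical categorical column.
cities = ["Lisbon", "Porto", "Lisbon", "Zurich", "Lisbon", "Porto"]

chart = text_bar_chart(Counter(cities))
print(chart)
```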

"Visualization is critical to data analysis. It provides a front line of attack, revealing intricate structure in data that cannot be absorbed in any other way. We discover unimagined effects, and we challenge imagined ones." — William S. Cleveland

Once visualisations are created and the "script" of the story is aligned, the final step is to report the findings to the audience, e.g. by writing a blogpost.

Knowledge Discovery

If the aim is to explore the relationship among the variables and to understand the underlying process that generated the data in order to create a model, some additional steps must be taken.

First of all, the type of problem/task must be identified (e.g. regression, classification, clustering). Then, feature extraction and selection must be performed in order to filter relevant information from data and ascertain which of the features are most important for the specific learning task.
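One simple way to get a first ranking of features is to score each one by the absolute Pearson correlation with the target. This is only one filter among many and it captures linear relationships only; the feature names and values below are invented for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_features(features, target):
    """Sort feature names by |correlation with target|, strongest first."""
    scores = {
        name: abs(pearson(column, target))
        for name, column in features.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical features and a numeric target for a regression task.
features = {
    "logins_per_week": [1, 2, 3, 4, 5, 6],
    "profile_length": [40, 10, 35, 20, 30, 15],
    "constant_flag": [1, 1, 1, 1, 1, 1],
}
target = [10, 21, 29, 41, 50, 62]

ranking = rank_features(features, target)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

A feature that never varies, like `constant_flag` here, scores zero and can usually be dropped; features with a low linear score may still matter to a non-linear model, so the ranking is a starting point rather than a verdict.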

Once the dataset is ready for training/testing, a model can be built. Depending on the performance shown by the model when testing, it is possible that the datasets might have to be reformulated or that different algorithms have to be used in order to optimise the results.
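The train/test loop can be sketched as follows. A nearest-centroid classifier is used purely because it fits in a few lines; it stands in for whichever algorithm suits the task, and the two synthetic clusters are generated for the example rather than taken from real data.

```python
import random


def train_test_split(rows, test_ratio=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]


def fit_centroids(rows):
    """Compute the mean feature vector of each label."""
    sums, counts = {}, {}
    for features, label in rows:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {
        label: [total / counts[label] for total in acc]
        for label, acc in sums.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the feature vector."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def accuracy(centroids, rows):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(predict(centroids, f) == label for f, label in rows)
    return hits / len(rows)


# Synthetic, well-separated clusters: "a" near the origin, "b" near (10, 10).
data = [((x, y), "a") for x in range(4) for y in range(4)]
data += [((x + 10, y + 10), "b") for x in range(4) for y in range(4)]

train, test = train_test_split(data)
model = fit_centroids(train)
print(len(train), len(test), accuracy(model, test))
```

On real data the test score will rarely be this clean. When it is unsatisfactory, the same loop is rerun after changing the features, the split, or the algorithm.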

Once we obtain a model with satisfactory performance, the results and eventual findings must be reported.

Final Remarks

As we have seen, process is important, even more so when dealing with data. From the initial phase, where timely and insightful results are of the essence, to the later phases, where an extensive and careful analysis is required, a well-defined workflow helps a team reach its goals effectively, whether these goals are to perform Data Journalism (tell a story) or Data Mining (model the data).

As such, the Data Science team at BinaryEdge has worked on creating a workflow that represents the work that has already been done and will continue to be done. This workflow, as shown, can be divided into four main phases.

The first phase, Preliminary Analysis, is when the data is brand new, and it is imperative to produce results fast in order to get an overview of the datapoints. In this phase, the focus is to make the data usable as quickly as possible and get quick and interesting insights.

The second phase, Exploratory Data Analysis, is when "questions" will be asked systematically and when the data will be cleaned and filtered in order to answer those questions.

Depending on what we ultimately want to do with the data, there are two more phases that should be considered.

If we want to show the results of the Exploratory Data Analysis, we turn to the phase of Data Visualisation and Story Telling. This is when the focus is turned to "how" to present the results. The main concern in this phase is to produce data visualisations and "stories" that captivate the users while telling them all the "secrets" discovered in the original data.

Otherwise, if we want to explore the patterns in the data in order to build models, we turn to the Knowledge Discovery phase. In this phase, the focus is on producing a model that best "explains" the data, by filtering and/or creating data points (feature engineering) and then testing several algorithms in search of the best possible performance.

The image below shows the full data science workflow that was described.

As mentioned before, the workflow described is not definitive and will be in constant development in order to improve the efficiency of our work. With that in mind, we would like to invite readers to share their opinions and/or suggestions in the comments section.

If you would like to keep up to date with our analyses and posts, please consider following us on Twitter, Google+ and Facebook.