Big Data Analysis

'Big Data' describes the massive amounts of varied data generated by online activities, location-based services, and many other applications. As defined by IBM, there are four main challenges in Big Data:

  • Volume: the amount of data is growing enormously, 
  • Velocity: the data changes rapidly and constantly,
  • Variety: the data stems from various sources and formats, and
  • Veracity: there is uncertainty in the data.

Over time, additional issues have been identified, adding to our understanding of Big Data as rich data sources that pose challenges in complexity and scale.

Data Science sets out to extract information and to generate knowledge from this data. An important part of knowledge generation relies on data analysis methods to automatically extract patterns. Users may build knowledge from these patterns, with the potential to analyse and ultimately understand Big Data.

However, at its current level, Big Data analytics falls dramatically short of this potential. In particular, many different data analysis techniques exist, and the variety and complexity in Big Data lead to an explosion in the number of potential patterns. Large volumes of potential patterns, possibly volatile and of unknown relevance to the user, fail to lighten the information overload and might even worsen it.

At a crucial time when the research community, governments and major companies like IBM, Oracle and Google recognize the potential in analysing Big Data, we still lack a well-founded approach to support data owners and data analysts in automatically identifying the most relevant patterns.

Our research addresses these issues in extracting and making sense of Big Data by providing the means to identify high-quality data sources, to handle noise, and to describe the quality of the analysis itself. The goal is to give the domain expert or end user transparent access to Big Data through explanatory components that aid understanding. In particular, we make information from unstructured text available, even for highly dynamic and complex domains, while ensuring that sensitive data is not leaked unintentionally. We develop computationally efficient and scalable methods for standard computers, so that large data volumes can be analysed without costly investments in specialized hardware and Big Data analytics becomes feasible in interactive settings. We substantiate our approach with evaluation methodology and benchmarking scenarios that reflect the challenges met in real Big Data applications. The resulting evaluation and benchmarking infrastructure allows industry, authorities and end users to understand the technology produced and to identify its potential for their business, administration or application interest.

In the following, we describe our research activities in four core areas that together define, extract, manage and evaluate relevant and transparent knowledge from Big Data.

  1. Big Data analysis quality, reliability and information content.
    As Big Data stems from a variety of sources, quality standards and noise are core issues for the reliability and validity of data analysis. We will devise quality measures for data and data analysis methods. For domain experts working with Big Data, we provide explanatory components that make data analyses transparent, so that findings can be verified and validated. We study active learning methods to adapt Big Data analytics to the needs of the application while reducing the load on the domain expert in training the methods.
  2. Handling complex textual information in Big Data.
    With the recent advances in Natural Language Processing, we can tap into the various text sources available in the Big Data age. Varying and dynamically changing language use, as well as diversity in document length and complexity, require methods capable of extracting semantic concepts and organizing them in representations that are easily accessible for the user. In particular, we consider the problem of data leak prevention, where sensitive content is identified prior to publication of text documents.
  3. Efficient scalable Big Data analysis algorithms.
    Existing tools for analysis of Big Data generally assume a single-core computing model. This is in stark contrast to current computers, which feature multi-core CPUs as well as graphics cards (GPUs) that increasingly support general-purpose computation, i.e., a much wider range of operations than the graphical operations that GPUs were originally developed for. We render large-scale analysis practical for Big Data by exploiting the characteristics of modern hardware, especially the inherent parallelism of CPUs and GPUs. By leveraging today's standard computers, Data Science at Big Data scale becomes widely accessible without the need for costly investments in specialized hardware.
  4. Performance evaluation and benchmarking for Big Data scale.
    In order to establish and consolidate Big Data research, we provide evaluation methodology and benchmarks. Existing evaluation setups are poorly suited for the study of large-scale algorithms, both in terms of satisfying information needs and in terms of running times. This means that current methods are typically not subjected to realistic empirical studies, which limits our understanding of the state of the art. We establish procedures as well as benchmarking data and workloads, and make them available to the research community.
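
To illustrate the active learning idea mentioned under the first area, the following is a minimal sketch of uncertainty sampling, one common active learning strategy. It is not taken from our methods: the function names, the document ids and the probabilities are hypothetical and chosen only for illustration. Given a model's predicted probability that each item is relevant, it picks the items the model is least certain about, so that the domain expert labels only those.

```python
import math


def binary_entropy(p):
    """Entropy (in bits) of a predicted probability p for the positive class."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def select_queries(predictions, budget):
    """Return the ids of the `budget` items the model is least certain about.

    `predictions` maps a (hypothetical) item id to the model's predicted
    probability that the item is relevant. Items are ranked by entropy,
    most uncertain first.
    """
    ranked = sorted(
        predictions,
        key=lambda item: binary_entropy(predictions[item]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical model outputs for four unlabeled documents.
predictions = {"doc_a": 0.97, "doc_b": 0.52, "doc_c": 0.10, "doc_d": 0.61}

# Ask the domain expert to label only the two most uncertain documents.
queries = select_queries(predictions, budget=2)
print(queries)
```

In this sketch, the labeling budget is the knob that trades expert effort against how quickly the model adapts. Other selection criteria could be substituted for entropy without changing the overall loop.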


Ira Assent

Associate professor