'Big Data' describes the massive volumes of heterogeneous data generated by online activities, location-based services and many other applications.
Our research aims to extract and make sense of Big Data by providing the means to identify high-quality data sources, to handle noise, and to describe the quality of the analysis itself. The goal is to give the domain expert or end user transparent access to Big Data, with explanatory components that aid understanding.
We pursue the following research activities, which together define, extract, manage and evaluate relevant and transparent knowledge from Big Data.
- Big Data analysis quality, reliability and information content.
As Big Data stems from a variety of sources, quality standards and noise are core issues for the reliability and validity of data analysis. We devise quality measures for data and for data analysis methods. For domain experts working with Big Data, we provide explanatory components that transparently describe the data analyses, so that findings can be verified and validated. We study active learning methods to adapt Big Data analytics to the needs of the application while reducing the load on the domain expert in training the methods.
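The active learning idea mentioned above can be sketched as pool-based uncertainty sampling: instead of asking the domain expert to label everything, the method repeatedly queries only the point it is least certain about. The toy 1-D "model", the oracle, and all names below are hypothetical stand-ins, not the group's actual method.

```python
# Minimal sketch of pool-based active learning with uncertainty sampling.
# The model, data, and oracle are illustrative stand-ins only.

def train(labelled):
    """Toy 1-D threshold 'model': midpoint between the two class means."""
    pos = [x for x, y in labelled if y == 1]
    neg = [x for x, y in labelled if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def uncertainty(x, threshold):
    """Closer to the decision boundary means more uncertain."""
    return -abs(x - threshold)

def active_learning(pool, oracle, seed, rounds):
    labelled = list(seed)   # a few expert-labelled points to start from
    pool = list(pool)       # unlabelled candidates
    for _ in range(rounds):
        threshold = train(labelled)
        # Query only the single most uncertain point, sparing the expert.
        query = max(pool, key=lambda x: uncertainty(x, threshold))
        pool.remove(query)
        labelled.append((query, oracle(query)))
    return train(labelled)
```

With a handful of queries, the learned threshold homes in on the true class boundary while the oracle (the expert) labels only the queried points.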
- Handling complex textual information in Big Data.
With the recent advances in Natural Language Processing, we can tap into the various text sources available in the Big Data age. Varying and dynamically changing language use, as well as diversity in document length and complexity, require methods capable of extracting semantic concepts and organising them in representations that are easily accessible for the user. In particular, we consider the problem of data leak prevention, where sensitive content is identified prior to publication of text documents.
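A minimal form of the data leak prevention task above is a pre-publication scan that flags sensitive content. The rule set below is a hypothetical illustration; real systems combine such patterns with learned semantic classifiers over the document representations described above.

```python
import re

# Hypothetical pre-publication scan: flag documents containing
# sensitive patterns. The pattern set is illustrative only.
SENSITIVE_PATTERNS = {
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "id_number": re.compile(r"\b\d{6}-\d{4}\b"),  # assumed national-ID format
}

def scan_document(text):
    """Return the list of sensitive categories detected in the text."""
    return [name for name, pat in SENSITIVE_PATTERNS.items() if pat.search(text)]

def safe_to_publish(text):
    """A document is publishable only if no sensitive category matches."""
    return not scan_document(text)
```

For example, `scan_document("contact: alice@example.com")` flags the email category, while a document with no matches passes `safe_to_publish`.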
- Efficient scalable Big Data analysis algorithms.
Existing tools for analysis of Big Data generally assume a single-core computing model. This is in stark contrast to current computers, which feature multi-core CPUs as well as graphics cards (GPUs) that increasingly support general-purpose computation, i.e. a much wider range of operations than the graphical operations that GPUs were originally developed for. We render large-scale analysis practical for Big Data by exploiting the characteristics of modern hardware, especially the inherent parallelism of CPUs and GPUs. Leveraging today's standard computers makes Data Science at Big Data scale widely accessible without costly investments in specialised hardware.
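The multi-core exploitation described above typically follows a data-parallel pattern: split the input into chunks, process each chunk on its own core, and combine the partial results. The sketch below illustrates this on a stand-in workload (sum of squares); it is not the group's actual analysis code.

```python
from concurrent.futures import ProcessPoolExecutor

# Data-parallel sketch for a multi-core CPU: split, map, reduce.
# 'partial_sum_of_squares' is a stand-in for a real analysis kernel.

def chunks(data, n):
    """Split data into n roughly equal contiguous chunks."""
    size, rem = divmod(len(data), n)
    out, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < rem else 0)
        out.append(data[start:end])
        start = end
    return out

def partial_sum_of_squares(chunk):
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, workers=4):
    # Each chunk runs in its own process, using one CPU core each.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum_of_squares, chunks(data, workers)))

if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(1000))))  # prints 332833500
```

GPU versions follow the same split/map/reduce shape, but with far more, far lighter-weight parallel workers.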
- Performance evaluation and benchmarking for Big Data scale.
In order to establish and consolidate Big Data research, we provide evaluation methodology and benchmarks. Existing evaluation setups are poorly suited for the study of large-scale algorithms, both in terms of satisfying information needs and in terms of running times. This means that current methods are typically not subjected to realistic empirical studies, limiting the understanding of the state of the art. We establish procedures as well as benchmarking data and workloads, and make them available to the research community.
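At its core, a running-time benchmark of the kind described above repeats a workload on fixed input and reports robust statistics rather than a single measurement. The harness below is a minimal, hypothetical sketch; realistic Big Data benchmarks would add representative datasets, query workloads and quality metrics alongside the timings.

```python
import statistics
import time

# Minimal micro-benchmark harness (illustrative sketch): run a workload
# several times on fixed input and report robust timing statistics.

def benchmark(workload, data, repeats=5):
    """Time workload(data) 'repeats' times; return per-run seconds."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload(data)
        times.append(time.perf_counter() - start)
    return times

def report(times):
    """Median is more robust to outlier runs than the mean."""
    return {
        "median_s": statistics.median(times),
        "min_s": min(times),
        "max_s": max(times),
    }
```

For example, `report(benchmark(sorted, list(range(100_000))[::-1]))` times Python's built-in sort on a reversed list and summarises the runs.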