Notes on Big Data Analytics: Pitfalls and Overfitting

i. Big Data Analytics Pitfalls

Big data analytics offers powerful capabilities for uncovering patterns, building models, and delivering insights. However, the size, complexity, and multidimensionality of these data sets can lead to problems. A key issue is spurious correlations. With millions of variables and billions of records, purely coincidental patterns may appear statistically significant. Treating these correlations as cause-and-effect relationships can make predictive models wrong and misleading. Another concern is data quality. Large data sets often contain inconsistencies, noise, or biases that can skew analyses. Interpretability can also be lost when complex models are built without a clear understanding of why the predictions work, making it harder for decision-makers to trust and act on the results. Without careful validation of the data and the models built on it, analyses can produce false confidence and flawed policy or business decisions.
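To make the spurious-correlation problem concrete, the short Python sketch below (an illustration added here, not part of the original notes) generates thousands of purely random variables and tests each against an equally random outcome; roughly five percent of them look "statistically significant" at the 0.05 level even though no real relationship exists. The variable names, counts, and threshold are arbitrary assumptions.

```python
# Illustrative sketch: with enough candidate variables, some correlate with
# the outcome "significantly" by pure chance. All names, sizes, and the
# 0.05 threshold below are arbitrary assumptions for demonstration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_records = 1_000       # observations
n_variables = 5_000     # candidate predictors, all pure noise

target = rng.normal(size=n_records)                   # outcome with no real signal
features = rng.normal(size=(n_records, n_variables))  # predictors with no real signal

# Test each noise variable against the noise outcome at the usual 0.05 level.
false_hits = 0
for j in range(n_variables):
    _, p_value = stats.pearsonr(features[:, j], target)
    if p_value < 0.05:
        false_hits += 1

print(f"'Significant' correlations found by chance: {false_hits} "
      f"(~{false_hits / n_variables:.1%} of variables, as expected under the null)")
```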

ii. Overfitting and Overparameterization

Overfitting occurs when a model learns not only the underlying trend in the data but also the random noise. This typically happens when a model is too complex, such as having too many parameters relative to the amount of data. An overfit model may perform extremely well on historical (training) data but fail to generalize to new or unseen cases. In big data, overparameterization increases this risk. Overparameterization means using too many features or model forms that are overly flexible. For example, if a flu surveillance model includes irrelevant search queries that just happen to align with flu season, like "Oscar nominations", it may identify seasonal coincidences instead of actual influenza activity, leading to incorrect predictions when conditions change. To avoid overfitting, analysts rely on methods such as cross-validation, regularization, and out-of-sample testing. The goal is to build models that balance complexity, effectiveness, and generalizability, ensuring they remain useful when applied beyond the original dataset.
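The sketch below is a minimal illustration of these ideas, assuming a small synthetic dataset and scikit-learn as the toolkit (neither is specified in the notes): an overparameterized degree-15 polynomial fits the training data almost perfectly, while 5-fold cross-validation reveals that it generalizes poorly; adding ridge regularization trades a little training accuracy for better out-of-sample performance. The polynomial degree and penalty strength are arbitrary choices for demonstration.

```python
# A minimal sketch, assuming a synthetic dataset and scikit-learn; the degree
# and alpha values are arbitrary choices for illustration, not recommendations.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(60, 1))                                  # one real predictor
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.3, size=60)   # signal + noise

# Overparameterized: a degree-15 polynomial with no regularization.
overfit = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
# Same flexibility, but ridge regularization penalizes large coefficients.
regularized = make_pipeline(PolynomialFeatures(degree=15), Ridge(alpha=1.0))

for name, model in [("unregularized", overfit), ("ridge-regularized", regularized)]:
    train_r2 = model.fit(X, y).score(X, y)             # in-sample (training) fit
    cv_r2 = cross_val_score(model, X, y, cv=5).mean()  # out-of-sample estimate
    print(f"{name:>17}: train R^2 = {train_r2:.2f}, 5-fold CV R^2 = {cv_r2:.2f}")
```

A large gap between the training score and the cross-validated score is the practical signal of overfitting that the methods named above are designed to detect and control.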

Example from Google Flu Trends: Google Flu Trends was a project that tried to track influenza outbreaks using web search data. While it initially performed well, the model later produced large errors. One reason was that search behavior was shaped by media attention and public concern, not only by actual flu cases. This illustrates how big data can reflect social noise, not just health outcomes.