What is Big Data?
Big data is a very large set of data that is analyzed to discover what is happening. It is not about explaining why certain things are happening. Instead of drawing a random sample from a large quantity of information, big data takes all the information that is available. It then adds more relevant, and sometimes not-so-relevant, information and allows us to visualize what is happening. In short, it consists of processes that take all available data and help answer the question: what is happening? One example of big data is analyzing trains that are running late.
Example of big data – trains running late
Let’s say our goal is to study trains and their timeliness. Timetables serve as the standard against which trains’ departure and arrival times are compared to determine whether they were running late. Instead of collecting data only when Train-A and Train-D arrived at a station, we can choose to take all trains’ timetables, arrival and departure records, the stations they stopped at, how long they stopped, etc., and then analyze how often Train-A was late and where it was late.
This may be a relatively small set of 300-400 million records, but that’s not big data yet. This data is relatively easy to churn through. Our analysis may indicate, for example, that the trains that were late had stopped at Station-3, or that trains on Route-4 were late more often than trains on any other route.
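As a rough illustration of this first step, the sketch below compares scheduled and actual arrival times and computes how often trains were late per station and per route. The records, the tuple layout, and the 5-minute lateness threshold are all assumptions made up for this example; a real dataset would have hundreds of millions of rows and would live in a database rather than a Python list.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical records: (train, route, station, scheduled arrival, actual arrival)
RECORDS = [
    ("Train-A", "Route-4", "Station-3", "2024-01-06 08:00", "2024-01-06 08:12"),
    ("Train-A", "Route-4", "Station-5", "2024-01-06 09:00", "2024-01-06 09:01"),
    ("Train-D", "Route-2", "Station-3", "2024-01-06 08:30", "2024-01-06 08:41"),
    ("Train-D", "Route-2", "Station-7", "2024-01-06 09:30", "2024-01-06 09:30"),
]

LATE_THRESHOLD_MIN = 5  # assumption: more than 5 minutes behind schedule counts as late
TIME_FORMAT = "%Y-%m-%d %H:%M"


def delay_minutes(scheduled, actual):
    """Return how many minutes after the timetable the train actually arrived."""
    delta = datetime.strptime(actual, TIME_FORMAT) - datetime.strptime(scheduled, TIME_FORMAT)
    return delta.total_seconds() / 60


def late_rate_by(records, key_index):
    """Group records by the chosen column and return the fraction that were late."""
    totals = defaultdict(int)
    late = defaultdict(int)
    for rec in records:
        key = rec[key_index]
        totals[key] += 1
        if delay_minutes(rec[3], rec[4]) > LATE_THRESHOLD_MIN:
            late[key] += 1
    return {key: late[key] / totals[key] for key in totals}


by_station = late_rate_by(RECORDS, 2)  # column 2 is the station
by_route = late_rate_by(RECORDS, 1)    # column 1 is the route
print(by_station)
print(by_route)
```

Note that the result only says where lateness shows up, not why it happens.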
Add to this information hourly weather conditions and storms that were starting up within 50 miles of each station when the train was about to pass. Add information about drivers: their age, their experience in years and the number of certifications they hold. Now we have a few billion records, and we are talking about a bigger dataset. The analysis with this additional data may help us find that trains were late when there were thunderstorms within 50 miles. So far, we have not asked the question: why? Big data is not about why. It’s about what! Big data shows us what happened in greater detail than random sampling.
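Enriching the train records with weather data amounts to joining two datasets on a shared key. The following is a minimal sketch, assuming each arrival is keyed by station and hour and that a separate, hypothetical weather table records whether a thunderstorm was within 50 miles at that station and hour. All names and values are invented for illustration.

```python
from collections import defaultdict

# Hypothetical arrival records: (station, hour, was_late)
ARRIVALS = [
    ("Station-3", "2024-01-06 08", True),
    ("Station-3", "2024-01-06 09", False),
    ("Station-5", "2024-01-06 08", True),
    ("Station-5", "2024-01-06 09", False),
    ("Station-7", "2024-01-06 09", False),
]

# Hypothetical hourly weather: (station, hour) -> thunderstorm within 50 miles?
STORM_NEARBY = {
    ("Station-3", "2024-01-06 08"): True,
    ("Station-3", "2024-01-06 09"): False,
    ("Station-5", "2024-01-06 08"): True,
    ("Station-5", "2024-01-06 09"): False,
    ("Station-7", "2024-01-06 09"): False,
}


def late_rate_by_storm(arrivals, storm_nearby):
    """Join arrivals to weather on (station, hour) and compare late rates."""
    totals = defaultdict(int)
    late = defaultdict(int)
    for station, hour, was_late in arrivals:
        storm = storm_nearby.get((station, hour))
        if storm is None:
            continue  # skip arrivals that have no matching weather record
        totals[storm] += 1
        late[storm] += int(was_late)
    return {storm: late[storm] / totals[storm] for storm in totals}


rates = late_rate_by_storm(ARRIVALS, STORM_NEARBY)
print(rates)
```

The output describes what happened alongside storms; it makes no claim that the storms caused the delays.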
Additionally, add any events that may be happening on or around each train’s route. Add the attendance level of any football games played within 50 miles of the route. Add other events such as power failures, whether trains were affected by them, unusually heavy passenger traffic, etc.
Once all of this information is aggregated, various trends will emerge. We may be able to say that trains on Route-4 going through Station-3 were always late on Saturdays when the drivers were younger than 30 years. Or we may find that trains were rarely late when football matches were in progress. Relationships ranging from simple to complicated can be visualized with big data. It is relatively easy to find examples of big data all around us.
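A trend such as "Route-4 through Station-3, driver under 30, on Saturdays" is simply a late rate computed for one combination of attributes. The sketch below groups a handful of made-up trips by such a combination. The field names, the age cut-off, and the sample trips are assumptions for illustration only.

```python
from collections import Counter
from datetime import date

# Hypothetical trips: (route, station, driver_age, trip_date, was_late)
TRIPS = [
    ("Route-4", "Station-3", 27, date(2024, 1, 6), True),    # Saturday
    ("Route-4", "Station-3", 29, date(2024, 1, 13), True),   # Saturday
    ("Route-4", "Station-3", 45, date(2024, 1, 6), False),   # Saturday
    ("Route-4", "Station-3", 27, date(2024, 1, 8), False),   # Monday
]

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def late_rates(trips):
    """Late rate for each (route, station, driver under 30?, weekday) combination."""
    totals = Counter()
    late = Counter()
    for route, station, age, day, was_late in trips:
        key = (route, station, age < 30, DAY_NAMES[day.weekday()])
        totals[key] += 1
        late[key] += int(was_late)
    return {key: late[key] / totals[key] for key in totals}


rates = late_rates(TRIPS)
for key, rate in sorted(rates.items()):
    print(key, rate)
```

With only a few trips per combination the rates mean little; patterns like these become trustworthy only when every combination is backed by a large number of records, which is where the volume of big data matters.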