Data Exploration Vs. Presentation


Finally, we want to touch on the difference between using visualization for data exploration and using it to present results to stakeholders. The plots and tips that we’ve discussed try to make the details of the data as clear as possible, so that the data scientist can see structure and relationships. These technical graphs don’t always convey the necessary information effectively to non-technical stakeholders. For them, we want crisp graphics that focus on the message we want to get across. We will cover this topic in more detail in Module 6, but for now we’ll share a small example.

 

As data scientists, we can read information off the top plot, a density plot of the log-scaled account values, that is relevant to downstream analysis. The account values are distributed approximately lognormally, in the range from $100 to $10M. The median account value is in the area of $30,000 (10^4.5), with the bulk of the accounts between $1,000 and $1M. It would be hard to explain this graph to stakeholders, however. For one thing, density plots are fairly technical; for another, it is awkward to explain why you are taking the log of the data before showing it. You can convey essentially the same information by partitioning the data into “log-like” bins and presenting a histogram of those bins, as we do in the bottom plot. Here, we can see that the bulk of the accounts are in the $1,000 to $1M range, with the peak concentration in the $10K-50K range, extending out to about $500K. This gives the stakeholders a better sense of the customer base than the top graphic would.

[Note: the lower graph isn’t symmetric like the upper graph because the bins are only “log-like”; they aren’t truly log10-scaled. Log10-scaled bins would be closer to 1K-3K, 3K-10K, 10K-30K, and so on. As an exercise, we could try splitting the bins that way, and we would see that the resulting bar chart is symmetric. The bins we chose, however, might seem more “natural” to the stakeholders.]
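For readers who want to try the binning exercise, here is a minimal sketch in Python using only the standard library. It is not the code used to produce the plots. The data are synthetic (a lognormal sample with median 10^4.5, chosen only to resemble the description above), and the exact bin edges in `loglike_edges`, as well as the helper names `bin_counts` and `show`, are assumptions made for illustration.

```python
import bisect
import random

# Synthetic stand-in for the account values (NOT the course data):
# lognormal with median 10**4.5, loosely matching the description in the text.
random.seed(0)
values = [10 ** random.gauss(4.5, 0.75) for _ in range(10_000)]


def bin_counts(data, edges):
    """Count how many values fall in each half-open bin [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in data:
        i = bisect.bisect_right(edges, v) - 1
        if 0 <= i < len(counts):  # values outside the edges are ignored
            counts[i] += 1
    return counts


# "Log-like" bins with round edges (assumed here for illustration).
loglike_edges = [100, 1_000, 5_000, 10_000, 50_000,
                 100_000, 500_000, 1_000_000, 10_000_000]

# True log10-scaled bins: half-decade steps 10**2, 10**2.5, ..., 10**7,
# i.e. roughly 1K-3K, 3K-10K, 10K-30K, ...
log10_edges = [10 ** (k / 2) for k in range(4, 15)]


def show(title, edges):
    """Print a crude text bar chart: one '#' per 100 accounts in the bin."""
    print(title)
    for lo, hi, n in zip(edges, edges[1:], bin_counts(values, edges)):
        print(f"{int(lo):>11,} - {int(hi):<11,} {'#' * (n // 100)}")


show("Log-like bins", loglike_edges)
show("Log10-scaled bins", log10_edges)
```

With a roughly lognormal sample like this one, the log10-scaled bins should come out close to symmetric around the median, while the round-edged bins alternate between wider and narrower steps on the log scale and so look lopsided. Swapping in real account values only requires replacing `values`.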