Silos, Windmills and the Tower of Babel

Occasionally data analysts are asked to take existing data and put it into a system which permits some analysis. In school districts, the biggest challenge in collecting data for an in-depth analysis like a program evaluation is accommodating the various data sources. To do this task the analyst must: 1. Identify the data sources; 2. identifying what the information sources are supposed to provide the analyst; and 3. ensuring that the data is internally consistent within each data source.  But unfortunately in the analyst’s world, we might have: 1. data silos (sources of data not integrated with other data sources; 2. windmills, which have large data stores which look impressive, but are not really very useful to the analyst; and 3. the Tower of Babel, where tables do not have consistent fields to link information between tables (such as a Student ID) and inconsistent data entry within the data table.

Data Silos

School districts are like a farm with many silos. One silo holds student testing and accountability data. Another silo, Power School, holds demographic, discipline and grade information. Other silos hold information on subgroups, such as exceptional students, gifted students, and limited language learners. Each silo holds information which makes sense within that silo, but cannot be shared or merged with the data in another silo. Most importantly, the persons creating and updating the data do not confer with persons with other data silos, so the data is not able to be easily joined. A good example is a collection of data tables one for each school, that does not contain school code, student ID, or consistent names for students. If that data table holds critical information about instructional programs for students and the analyst needs to match it to test score data, the task of joining that data is overwhelming.

Data Windmills

At times when data sources are identified, a closer look exposes that the information contained in these giant tables really are not what they appear. In one school district, there were about 90 small data files that together held program information on a large special population. Basically, the data was a collection of information which could be found in other places, but a few pieces of information were only available in that file. Teachers kept the windmills spinning by adding information but it had no use outside of that data table and the few people who could access the information could not use it for analysis. 

The Tower of Babel

This problem to the analyst presents the problem of having consistent information which can be used to join tables such as a student ID field. This critical information needs to be from an accurate source, such as being pre-populated in the table from an authoritative source, not hand-entered by a user who could make data-entry errors.  If a field in the table needs to contain information about the amount of service time a student receives, then a suggested format for the data entry is needed or a drop-down select list with the possible options is even better. It is essential that the design of the data table defines the information to go into that table and most importantly define what the information must conform to be consistent. In a database, “one time per week for 30 minutes”, “1 x/week 3o minutes”, and “one 30 minute session 1 time per week” are all different and cannot be easily queried. Likewise, “Lep” and “LEP” are interpreted as two different entries in the same data table.

Conclusion

As an analyst, my advice is to have conversations with others in your organization who are at some time likely to need to share information about what your silo needs to contain and how the information is to be entered and recorded. This practice eliminates the most challenging and unnecessary part of the work of a data analyst of rebuilding tables and doing data clean up BEFORE any analysis can be performed. In some cases, the data cleanup is so large that the high cost of the data project is in the preparation, not in the analysis.

Dr. Lewis R. Johnson is owner and lead data consultant for Data Smart LLC, a North Carolina business with a mission to assist schools to use data analytics for improving student performance.