Analysis Plan

Process

At this point, the data is available in Patrick Lam’s GDrive. We now need to write scripts that process that data without us having direct access to it, and then output the results to a file.

Normalize data

There are several steps to this (a sketch of the shuffle and unshuffle scripts follows the list).
  1. Patrick runs a script that takes all of the entries in a column, shuffles their order, and exports the shuffled rows along with a mapping from shuffled positions back to the original order.
  2. The shuffled rows are stored in a shared folder; the mapping is kept private.
  3. On our end, we download the shuffled rows and manually go through each one to ensure it is normalized.
  4. Once we’ve finalized the shuffled rows, we transfer the verified rows back to the shared folder.
  5. The verified rows are sorted back to their original row numbers using the saved private mapping.
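A minimal sketch of what the shuffle and unshuffle scripts might look like, assuming one column per file and a simple CSV layout (all file and column names here are placeholders, not the real ones):

```python
import csv
import random

def shuffle_column(in_path, column, shuffled_path, mapping_path):
    # Read one column of the survey CSV.
    with open(in_path, newline="") as f:
        values = [row[column] for row in csv.DictReader(f)]

    # Shuffle a list of indices so the permutation can be inverted later.
    order = list(range(len(values)))
    random.shuffle(order)

    # Shared file: shuffled values only, no ordering information.
    with open(shuffled_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([column])
        writer.writerows([values[i]] for i in order)

    # Private file: the mapping needed to restore the original order.
    with open(mapping_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["shuffled_row", "original_row"])
        writer.writerows(enumerate(order))

def unshuffle_column(verified_path, mapping_path, out_path):
    # Put the verified rows back into their original positions.
    with open(verified_path, newline="") as f:
        rows = list(csv.reader(f))
    header, values = rows[0], [r[0] for r in rows[1:]]

    with open(mapping_path, newline="") as f:
        mapping = list(csv.DictReader(f))

    restored = [None] * len(values)
    for m in mapping:
        restored[int(m["original_row"])] = values[int(m["shuffled_row"])]

    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows([v] for v in restored)
```

The key design point is that the mapping file never leaves Patrick’s side, so the shared shuffled file on its own carries no ordering information.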

Process individual data

Once we’ve verified the data, this step should be pretty straightforward (a sketch follows the list).
  1. We download one shuffled column.
  2. We produce some sort of insight or visualization from it.
  3. We save that visualization.
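A sketch of what a single-column insight might look like, assuming the grouped counts are plotted with matplotlib (the file name is a placeholder; “Favourite programming language” is one of the columns listed under Data Verification below):

```python
import csv
from collections import Counter

import matplotlib.pyplot as plt

# Group and count responses in one verified, shuffled column
# (the file name is a placeholder).
with open("favourite_language_shuffled.csv", newline="") as f:
    counts = Counter(row["Favourite programming language"] for row in csv.DictReader(f))

# Plot the grouped counts as a bar chart and save it.
labels, values = zip(*counts.most_common())
plt.bar(labels, values)
plt.xticks(rotation=45, ha="right")
plt.ylabel("Responses")
plt.tight_layout()
plt.savefig("favourite_language.png")
```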

Process correlations

This one gets a bit trickier, because if it’s not done properly, we could reveal sensitive information about a person.
  1. On Patrick’s end, a script grabs the rows of multiple columns and shuffles each column independently, deliberately breaking row integrity (see the sketch after this list). That way we can’t infer personal information about anyone, since any value could belong to any respondent.
  2. We download that scrambled data set and prototype a visualization or derived dataset aimed at a particular hypothesis.
  3. We transfer the script back to Patrick, and he runs it on the real data set.
  4. Patrick transfers the output back, and we draw conclusions from it.
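A sketch of the scrambling step, assuming each selected column is shuffled with an independent permutation so that values in the same output row never reliably come from the same respondent (file and column names are placeholders):

```python
import csv
import random

# Columns we want to correlate; names are placeholders.
COLUMNS = ["Favourite programming language", "Text editor"]

with open("survey.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Shuffle each column with its own independent permutation,
# deliberately breaking row integrity.
scrambled = {}
for col in COLUMNS:
    values = [row[col] for row in rows]
    random.shuffle(values)
    scrambled[col] = values

with open("scrambled.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(COLUMNS)
    writer.writerows(zip(*(scrambled[c] for c in COLUMNS)))
```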

Process other stuff

This one can be a bit of anything; for example, ranking all of the co-ops. The process differs depending on the problem being solved.

Creating the report

I’ll be working on planning how the report should be laid out, as well as its sections. Once we’ve finished all of the processing, we can plug the results into the report.

Data Verification

The following data needs to be verified and normalized (an example mapping follows the list):
  • Gender - normalize
  • City and Country - normalize
  • HS and Uni Extracurriculars - split and normalize
  • Age started programming - normalize
  • Favourite programming language - normalize
  • Text editor - normalize
  • Co-op - normalize company name, salary, and location; handle terms off; filter out column
  • Favourite and least favourite courses - normalize
  • Rechoose program - normalize
  • Things you look for in career - normalize and categorize
  • SE Advice - normalize and categorize
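As a concrete example of what “normalize” means here, a mapping for the Gender column might look like the following (the mapping entries are illustrative, not the real ones):

```python
# Illustrative only: collapse free-form gender responses into canonical values.
GENDER_MAP = {
    "M": "Male", "m": "Male", "male": "Male",
    "F": "Female", "f": "Female", "female": "Female",
}

def normalize_gender(value):
    # Fall back to the raw (stripped) value so unexpected responses stay visible for manual review.
    return GENDER_MAP.get(value.strip(), value.strip())
```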

Tools

Tools that will ease data verification and normalization (a sketch follows the list):
  • Given a column of a CSV:
    • group and count responses
    • map values to new values and save the new data
    • split by token, then group and count responses
    • map values within token-separated strings to new values and save the new data
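A minimal sketch of what these four tools might look like in Python, assuming one-column CSVs as input (the function names and the default token are assumptions):

```python
import csv
from collections import Counter

def read_column(path, column):
    # Load one column of a CSV as a list of strings.
    with open(path, newline="") as f:
        return [row[column] for row in csv.DictReader(f)]

def save_column(values, path, column):
    # Save a (possibly re-mapped) column back out as a one-column CSV.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([column])
        writer.writerows([v] for v in values)

def group_and_count(values):
    # Tool 1: group identical responses and count them.
    return Counter(values)

def map_values(values, mapping):
    # Tool 2: map raw values to normalized ones; unmapped values pass through.
    return [mapping.get(v, v) for v in values]

def split_group_and_count(values, token=","):
    # Tool 3: split multi-answer responses by a token, then group and count.
    return Counter(part.strip() for v in values for part in v.split(token))

def map_within_token_strings(values, mapping, token=","):
    # Tool 4: normalize each part of a token-separated response.
    return [
        token.join(mapping.get(p.strip(), p.strip()) for p in v.split(token))
        for v in values
    ]
```

With these, verification becomes: run group_and_count to see the messy raw values, build a mapping, then apply map_values (or the token-aware variant for multi-answer columns) and save the result.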