At this point in the process, the data lives in Patrick Lam’s GDrive. We now need to write scripts that process that data without us having direct access to it, and then output the results to a file.
Normalize data
There are several steps to this.
Patrick runs a script that takes all of the entries in a column, shuffles their order, and then exports the shuffled rows along with a mapping from shuffled positions back to the original order.
The shuffled rows are stored in a shared folder, and the mapping is kept private.
On our end, we download the shuffled rows and manually go through each one to ensure that it’s normalized.
Once we’ve finalized the shuffled rows, we transfer the verified rows back to the shared folder.
The verified rows are then sorted back to their original positions using the private mapping that was saved.
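The shuffle-and-restore steps above could be sketched as follows; `shuffle_with_mapping` and `restore_order` are hypothetical names, and the real script would run on Patrick’s side against the GDrive data:

```python
import random

def shuffle_with_mapping(rows, seed=None):
    """Shuffle rows and return (shuffled_rows, mapping).

    mapping[i] is the original index of the row now at position i;
    the mapping stays private while the shuffled rows are shared.
    """
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    shuffled = [rows[i] for i in indices]
    return shuffled, indices

def restore_order(shuffled, mapping):
    """Invert the shuffle using the private mapping."""
    restored = [None] * len(shuffled)
    for pos, orig in enumerate(mapping):
        restored[orig] = shuffled[pos]
    return restored

rows = ["response A", "response B", "response C"]
shuffled, mapping = shuffle_with_mapping(rows, seed=42)
assert restore_order(shuffled, mapping) == rows
```

Keeping the mapping out of the shared folder is what lets us edit the rows without knowing which respondent each one belongs to.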
Process individual data
Once we’ve verified the data, this step should be pretty straightforward.
We download one shuffled column.
We produce some sort of insight or visualization from it.
We save that visualization.
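A minimal sketch of this step, assuming the shuffled column arrives as a single-column CSV; the file names and the plain-text bar chart are placeholders for whatever visualization we actually produce:

```python
import csv
from collections import Counter

def column_insight(in_path, out_path):
    """Count responses in a single-column CSV and save a text bar chart.

    in_path and out_path are hypothetical; the real files live in the
    shared folder.
    """
    with open(in_path, newline="") as f:
        values = [row[0] for row in csv.reader(f) if row]
    counts = Counter(values)
    with open(out_path, "w") as f:
        for value, n in counts.most_common():
            f.write(f"{value:20s} {'#' * n} ({n})\n")
    return counts
```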
Process correlations
This one gets a bit trickier, because if it’s not done properly, we could reveal sensitive information about a person.
On Patrick’s end, a script grabs all of the rows of multiple columns and shuffles each column independently, without preserving row integrity. That way, we can’t infer any personal information, since any given value could belong to anyone.
We download that data set and use it to prototype a visualization or derived data set aimed at a certain hypothesis.
We transfer the script back to Patrick, and he runs it on the real data set.
Patrick transfers back the output, and then we can draw conclusions.
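The independent per-column shuffle could be sketched like this; `decorrelate` is a hypothetical name for the script on Patrick’s side:

```python
import random

def decorrelate(columns, seed=None):
    """Shuffle each column independently so rows no longer line up.

    columns maps a column name to its list of values; afterwards,
    values at the same index may come from different respondents,
    so cross-column pairings reveal nothing about any one person.
    """
    rng = random.Random(seed)
    return {name: rng.sample(values, len(values))
            for name, values in columns.items()}
```

Each column keeps its full distribution, which is enough to write and debug the analysis script before it runs on the real, aligned data.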
Process other stuff
This one can be a bit of anything; for example, ranking all of the co-ops. The process differs depending on the problem being solved.
Creating the report
I’ll be working on planning how the report should be laid out, as well as its sections. Once we’ve finished all of the processing, we can plug the results into the report.
Data Verification
The following data needs to be verified and normalized:
Gender - normalize
City and Country - normalize
HS and Uni Extracurriculars - split and normalize
Age started programming - normalize
Favourite programming language - normalize
Text editor - normalize
Co-op - normalize company name, salary, and location; handle terms, or filter out the column
Favourite and least favourite courses - normalize
Rechoose program - normalize
Things you look for in career - normalize and categorize
SE Advice - normalize and categorize
Tools
Tools that will ease data verification and normalization. Given a column of a CSV, each tool does one of the following:
groups and counts responses
maps values to new values and saves the new data
splits by token, then groups and counts responses
maps values within token-separated strings to new values and saves the new data
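A sketch of two of those tools, assuming the data arrives as a CSV with headers; the function names, the `token` separator, and the mapping format are assumptions:

```python
import csv
from collections import Counter

def group_and_count(path, column, token=None):
    """Group and count responses in one CSV column.

    If token is given, each cell is split on it first (e.g. ";" for
    multi-answer fields like extracurriculars).
    """
    counts = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            cell = row[column]
            parts = cell.split(token) if token else [cell]
            counts.update(p.strip() for p in parts if p.strip())
    return counts

def map_values(path, column, mapping, out_path, token=None):
    """Replace values in one column via mapping and save the new CSV."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        fields = reader.fieldnames
    for row in rows:
        cell = row[column]
        parts = cell.split(token) if token else [cell]
        mapped = [mapping.get(p.strip(), p.strip()) for p in parts]
        row[column] = token.join(mapped) if token else mapped[0]
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
```

The usual loop would be: run `group_and_count` to see the messy variants, build a mapping like `{"Vim ": "vim"}`, then run `map_values` and recount to confirm the column is clean.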