Amar Jesani shared in mfc group link to an article about v-safe data release.
The actual data could be downloaded from this website called icandecide.
The 5GB file can be extracted with p7zip to a 25GB CSV file.
$ md5sum consolidated_health_checkin.zip
53ff7a8153f44eaab4166f722b726fe1 consolidated_health_checkin.zip
$ md5sum consolidated_health_checkin.csv
345cf6ca148832141260aab8638bf0dc consolidated_health_checkin.csv
$ wc -l consolidated_health_checkin.csv
144856044 consolidated_health_checkin.csv
(That's 144 million records in this CSV file)
$ head -n 5 consolidated_health_checkin.csv
SURVEY_STATIC_ID,REGISTRANT_CODE,RESPONSE_ID,STARTED_ON,STARTED_ON_TIME,DAYS_SINCE,ABDOMINAL_PAIN,CHILLS,DIARRHEA,FATIGUE,FEELING_TODAY,FEVER,HAD_SYMPTOMS,HEADACHE,HEALTH_IMPACT,HEALTH_NOW,HEALTH_NOW_COMPARISON,VACCINE_CAUSED_HEALTH_ISSUES,HEALTHCARE_VISITS,ITCHING,JOINT_PAINS,MUSCLE_OR_BODY_ACHES,NAUSEA,PAIN,PREGNANT,PREGNANCY_TEST,RASH_OUTSIDE_INJECTION_SITE,REDNESS,SITE_REACTION,SWELLING,SYSTEMIC_REACTION,TEMPERATURE_CELSIUS,TEMPERATURE_FAHRENHEIT,TEMPERATURE_READING,TESTED_POSITIVE,TESTED_POSITIVE_DATE,VOMITING,DURATION_MINS,PREFERRED_LANGUAGE
vsafe-0-day-dose1,222-10271-84782,s244305050865137831057660547899056617007,12/31/2020,4:55:13 PM,0,,,,,Good,No,,Mild,N/A,,,,,,,,,Mild,,,,,Pain,,Headache,,,,,,,,English
vsafe-0-day-dose1,222-10325-02776,s258811629454233188277362395339553379505,05/19/2021,3:16:15 PM,0,,,,,Good,No,,,N/A,,,,,,,,,,,,,,None,,None,,,,,,,0.85,English
vsafe-0-day-dose1,222-10368-05218,s256518678527351061889187968276580937945,04/27/2021,4:11:31 PM,0,,,,,Good,No,,,N/A,,,,,,,,,,No,,,,None,,None,,,,,,,0.72,English
vsafe-0-day-dose1,222-10453-23273,s245552707728162053684731534374544736656,01/12/2021,3:31:16 PM,0,,,,,Good,No,,,N/A,,,,,,,,,,No,,,,None,,None,,,,,,,,English
As you can see there are many columns, which we will have to decode.
Combing through the whole file again and again is taking a lot of time on my computer. So I decided to write a python script that'll do all analysis in one pass of the file.
But that was taking even more time.
So I decided to put this data into postgreSQL to do the analysis.
$ sudo -u postgres createuser health
$ sudo -u postgres createdb vsafe -O health
$ cat load.sql
SET datestyle TO dmy;
CREATE table if not exists checkin (
SURVEY_STATIC_ID varchar, -- eg: vsafe-0-day-dose1
REGISTRANT_CODE varchar, -- eg: 222-10271-84782
RESPONSE_ID varchar,
STARTED_ON DATE,
STARTED_ON_TIME varchar,
DAYS_SINCE int,
ABDOMINAL_PAIN varchar,
CHILLS varchar,
DIARRHEA varchar,
FATIGUE varchar,
FEELING_TODAY varchar,
FEVER varchar,
HAD_SYMPTOMS varchar,
HEADACHE varchar,
HEALTH_IMPACT varchar,
HEALTH_NOW varchar,
HEALTH_NOW_COMPARISON varchar,
VACCINE_CAUSED_HEALTH_ISSUES varchar,
HEALTHCARE_VISITS varchar,
ITCHING varchar,
JOINT_PAINS varchar,
MUSCLE_OR_BODY_ACHES varchar,
NAUSEA varchar,
PAIN varchar,
PREGNANT varchar,
PREGNANCY_TEST varchar,
RASH_OUTSIDE_INJECTION_SITE varchar,
REDNESS varchar,
SITE_REACTION varchar,
SWELLING varchar,
SYSTEMIC_REACTION varchar,
TEMPERATURE_CELSIUS varchar,
TEMPERATURE_FAHRENHEIT varchar,
TEMPERATURE_READING varchar,
TESTED_POSITIVE varchar,
TESTED_POSITIVE_DATE varchar,
VOMITING varchar,
DURATION_MINS FLOAT,
PREFERRED_LANGUAGE varchar
);
\COPY checkin FROM 'consolidated_health_checkin.csv' DELIMITER ',' CSV HEADER
$ psql -U health vsafe -f load.sql
...
COPY 144856043
That took about 25G space as well.
Beautiful. Now we can do all kinds of querying.
Actually, not yet. There's one more thing we have to do. Create some indexes for making queries easier.
CREATE INDEX checkin_health_impact_idx ON public.checkin USING btree (health_impact);
Now, there are some issues with this data. For example:
ERROR: could not create unique index "checkin_pk"
Detail: Key (response_id)=(s252082802016465320050574992159464366472) is duplicated.
Where: parallel worker
response_id is duplicated, although it looks like every response might be unique.
But let's ignore that now for an interesting query result:
That's the distribution of the Health Impact column. 81 million responses say N/A, 56 million responses include no value (null) for this column and the tail kind of begins there.
When I do select count(distinct(registrant_code)) from checkin; I get 9,552,127 which means only 9.5 million registrant_codes are included in the dataset. Since v-safe allows adults to respond on behalf of children, it is probably likely that there are more individuals in the dataset than the registrant_codes.
Then I ran select count(distinct(registrant_code)) from checkin where health_impact like '%Get care from a doctor or other healthcare professional%'; and it returned 797,396. Which means at least 797K people checked this option (with or without other options)
Now let us look at the variable of interest, healthcare_visits. The query I ran is select healthcare_visits, count(*) from checkin where health_impact like '%Get care from a doctor or other healthcare professional%' group by healthcare_visits ;
The result is
Note that I haven't deduplicated by registrant_code here.
So I tried a different query: select count(*) from checkin where healthcare_visits like '%Hospitalization%'; the answer to which is 83,690.
Let us try deduplicating by registrant_code on that:
select count(distinct(registrant_code)) from checkin where healthcare_visits like '%Hospitalization%'; returns: 71,911
Which means, there's some amount of duplication in the row data as to registrant_codes and reports. In other words, from the same registrant_code, you can have multiple reports of Hospitalization.
This data is rather messy and I'm not exactly sure how icandecide is arriving at "individual" in their numbers because all I see are registrant_code.
Now, on to some more interesting stuff. What is the distribution of systemic_reaction in registrant_codes who reported Hospitalization?
select systemic_reaction, count(distinct(registrant_code))from checkin where healthcare_visits like '%Hospitalization%' group by systemic_reaction ;
That turned out disappointing because the result was 68,170 NULL fields.
But among the non-null fields, "None", "Fatigue or tiredness", "Headache", etc are leading. (Do note that this is a multi-value column and there could theoretically be a symptom that appears in the tail of this column multiple times thus occurring more number of times than these ones.)
I also looked at the other files available for download.
It seems like the Consolidated_health_checkin_u3[1].zip must be under 3 children. The consolidated_registrants[1].zip file makes me think that each registrant_code actually uniquely identifies an individual. Because children are having separate registrant_code with guardian registrant_code mapped in this file. The other files are about race/ethnicity and vaccine that was administered.
The under 3 file includes 116394 reports. Some of the discrepancies in number between my analysis and ICAN's dashboard probably comes from them adding both these together.
No comments:
Post a Comment