"ETL Concepts.xlsx" there's an excel file which creates the column definitions for the flat table. It's fairly self explanatory.
flat_table_template.sql : this has instructions in it. basically you need to change a few things, add the column definitions created for you in the excel file in the right places (they are marked) and you get a script to produce the flat table for the concepts specified. These sql scripts are designed to only take any new data after the last time that particular table was updated.
You will see a number of files already created. Its faster to have smaller tables when building them so I divided concepts up into categories. Ideally, you combine concepts that are likely to be answered at the same time. This limits the amount of null space in the table. Right now i'm combining by content but its somewhat arbitrary and certainly doesn't maximize the principle i just stated. it's probably best to combine these by encounter type.
You'll see a folder called "reporting indicator definitions". These contain the sql code used by the reporting team to pull out a cohort as well as the sas used to generate their reports. I haven't checked to carefully yet but my sense is there are at least a few (if not many) mistakes in how they made their definitions. for example, for the on cotrimoxazole cohort, anyone who has an obs with a value coded equal to bactrim gets included. obviously bactrim might be a prescription for other uses. I think you get the point.
flat_moh_731_indicators.sql : so this is the file that will create the encounter level calculations to figure things out like if the patient is pregnant or on arvs. basically it uses the columns from the flat tables to calculate a result.
I'll need to create another file, "moh_731_report" which then comes up with queries which aggregates over a clinic and time period. This will produce the actual report. I'm holding off on that until I have all the indicators defined.
Lab data : this is a bit of an issue. I know in the past Jeremy decided to dump the lab data as obs without encounter_ids. i think this is a mistake as it forces us to use a separate process for denormalizing this data since i'm denormalizing based on encounter_id. I don't think it makes sense to use an alternative column to denormalize. So, I'm wondering if we should change the data and add encounters for all the lab data. What are your thoughts?
The new data layer. ETL really isn't just about making for fast queries. It's also about developing a set of calculations based on existing data. This data layer has a lot more meaning than obs. It (should) converts data to information/truth about (in our case) patients. The alternative would be to have a layer which does runtime calculations on data to provide the same information. In our case, we need to solve a reporting issue which makes run time calculations a non-viable alternative given the time it takes to run those calculations. So, we'll see how it works, but the ETL layer is now a very powerful tool. So, the question is what's the best way of designing this tool and how to we incorporate it into OpenMRS work flows. My current version is really just to get us by for now. But over the next year, it would be very cool to figure out this problem. Would love to hear both short-term (e.g. should we expose these etl tables via rest services which we could then use in our ui/angular?) and long term (i.e. the discussion Burke and I were having on the plane).
Powered by a free Atlassian Confluence Community License granted to Indiana Institute for Global Health, Inc. Evaluate Confluence today.