ICT Surveys

Surveys are likely one of the most common use scenarios for ICT. Baselining or endlining, documentation of activities, or gathering of data for technology or impact evaluation - in some way there is always a survey or, more generally, some data capturing which is needed. There are many such tools available, both free and paid-for. We used KIPUS (which incidentally is also available as an open source solution), since it is our tool, we know it best and we can make improvements or adjustments wherever needed. This gave us a lot of flexibility.

Converting paper-based questionnaires

What to keep in mind when transitioning from the paper-based to the digital questionnaire

Most surveys do not allow for a green-field approach. Rather, it is likely that there are paper-based questionnaires which are recycled across different projects with minor amendments and/or additions. Those questionnaires must be taken into account, not least because people are used to them and like to find their structure again in the digital solution. Many questionnaires are also quite elaborate and might even have codes which are then used for further analysis. Ignoring all of this will only lead to difficulties when rolling out the solution - people will not recognize "their" questionnaire and may not be able to use predefined codes in their analysis tools. A sure way to look for trouble...

There are many ways of getting from the paper-based version of a questionnaire to a definition suitable for a digital solution. You can write a beautiful new description from scratch (see here the specification of the SAI trials), hoping that everybody will be able to adjust to it and perform the abstraction it requires. Or you can distribute the paper version electronically or - in the worst case - just print it out, and allow everybody to put their wish list directly onto the digital or printed copy (see here an amended questionnaire). Working without a detailed specification usually requires multiple feedback loops (see here for feedback on an already configured questionnaire).

Likely a mix of both approaches is best, where you gather the changes required for the use cases in an informal way and then create a formal description of them, to be reviewed by selected stakeholders and (power) users.

Users

Think about all the users, not just the data analysts

When you are working on a new survey, try to integrate all of the future users of the survey and its results; from the data scientists and the business analysts down to the data collectors needing to execute the survey. And if you can, even someone from the group who will be asked to answer the survey.

It has happened to us that wording which seemed perfectly clear to us has given rise to questions and interpretation, especially in a context where the questions had to be translated in the field. Anything unclear or interpretable will lower the quality of your data.

Also spend some time considering training on using the questionnaire. In general we have had good experiences with data collectors pairing up and going through the questionnaire with each other in the language they will use in the field.

Data structures

Give structures like (dynamic) tables and lists a thought

With a paper-based survey, people do not think about data structure. After all, it is just a piece of paper, and if a box does not fit reality they can always write beside or beneath it, maybe even giving a valid explanation. Most systems represent a paper form (potentially several sheets) logically as a single line. This is convenient, since when it comes to export you just export one long line with many columns holding the entries made.

On the other hand, when going for a digital solution, you will almost always have to integrate lists. Let us assume you want to ask about tools and you have a list of 10 tools. You can either

1. create a check-mark for each of the list members, or
2. assume that you will never have more than 5 tools and add the same choice list 5 times, or
3. create a dynamic table which allows the user to enter exactly the number of tools they own

Solution 3 used for "Equipment on the farm"

Solution 1 does not break your data structure and has a fixed column for every tool (you will simply have 10 columns, which makes each line longer). Solution 2 saves you 5 columns which you would likely never need, but leaves you guessing which tool was entered into which column. Solution 3 goes one step further and breaks your nice 1-line-fits-all approach: it adds the complexity of a second (list) dimension with an unknown number of entries. You either need to resolve this through e.g. comma-separated data in one column, or you need to cope with a two-dimensional result set with a different size for every list.

There is no right or wrong here, just make sure that you think about it and that you will be able to post-process and analyze the collected data. Most people, by the way, seem to opt for solution 1, or for solution 3 with comma-separated entries in one column.
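To make the options tangible, here is a minimal Python sketch of what one answer could look like on export under each of them. The tool names, column names and the survey ID are invented for illustration and are not taken from our questionnaire; the list is shortened to four tools and three slots to keep the output readable.

```python
# Hypothetical choice list; a real survey might have ten tools.
TOOLS = ["hoe", "plough", "sprayer", "wheelbarrow"]


def one_column_per_tool(owned):
    """Solution 1: one fixed yes/no column per tool in the choice list."""
    return {f"tool_{t}": (t in owned) for t in TOOLS}


def fixed_slots(owned, slots=3):
    """Solution 2: a fixed number of slots; unused slots stay empty."""
    if len(owned) > slots:
        raise ValueError("more tools than slots")
    padded = list(owned) + [None] * (slots - len(owned))
    return {f"tool_{i + 1}": t for i, t in enumerate(padded)}


def comma_separated(owned):
    """Solution 3, variant a: the dynamic list squeezed into one column."""
    return {"tools": ",".join(owned)}


def child_rows(survey_id, owned):
    """Solution 3, variant b: the dynamic list as rows of a second table."""
    return [{"survey_id": survey_id, "tool": t} for t in owned]


owned = ["hoe", "sprayer"]
print(one_column_per_tool(owned))
print(fixed_slots(owned))
print(comma_separated(owned))
print(child_rows("S-001", owned))
```

Only the last variant leaves the single-line format, which is exactly the point where you should check what your export and your analysis tools can cope with.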

Very complex data structures

There is even more complexity in normal questionnaires: instead of just a list to establish, many questionnaires have a "questionnaire inside a questionnaire". Imagine that you want to capture livestock, and for each kind of animal you want to know additional information like how many animals there are, how many of them are male and female, what type of fodder they get - you probably get it.

Very quickly you have a questionnaire in a questionnaire, or even a questionnaire in a questionnaire in a questionnaire (if you e.g. want to know in more detail what the farmer feeds to each kind of animal). If you are in such a situation, make sure that the digital solution you plan to use supports this type of data and that you can export it in a way which allows for subsequent analysis. It is better to invest some time up-front than to face the need to remediate later.
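One possible way to export such nested data is to split it into several flat tables which are linked by IDs. The following Python sketch shows the idea for the livestock example; the field names, the sample record and the way the `animal_id` is built are assumptions made for illustration, not the export format of any particular tool.

```python
def flatten(survey):
    """Split one nested survey record into three flat, linked tables."""
    farms, animals, feeds = [], [], []
    farms.append({"survey_id": survey["survey_id"], "village": survey["village"]})
    for a_no, animal in enumerate(survey["livestock"], start=1):
        # Hypothetical key: survey ID plus running number of the animal kind.
        animal_id = f'{survey["survey_id"]}/{a_no}'
        animals.append({
            "animal_id": animal_id,
            "survey_id": survey["survey_id"],
            "kind": animal["kind"],
            "male": animal["male"],
            "female": animal["female"],
        })
        # Third level: one row per fodder type fed to this kind of animal.
        for fodder in animal["fodder"]:
            feeds.append({"animal_id": animal_id, "fodder": fodder})
    return farms, animals, feeds


survey = {
    "survey_id": "S-001",
    "village": "V-12",
    "livestock": [
        {"kind": "goat", "male": 2, "female": 5, "fodder": ["hay", "maize bran"]},
        {"kind": "chicken", "male": 1, "female": 9, "fodder": ["grain"]},
    ],
}

farms, animals, feeds = flatten(survey)
for table in (farms, animals, feeds):
    print(table)
```

Each level of nesting becomes its own table, and the IDs allow the analyst to put the pieces back together.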

Data quality

Ensuring data quality requires cross-survey thinking

When collecting your data you will want to make sure that they are of the best possible quality. One much underestimated way to achieve this is to thoroughly train the people who will capture the data. They are the ones who will explain and maybe even translate the questions to be answered. The better they understand what you have in mind, the better the data they collect will be.

A second point is data consistency. If you compile a complex questionnaire (as we have done in our project) you will likely assemble several different questionnaires which might ask about the same facts in different ways. Unless you intentionally want to use this as a means of data validation, it might prove beneficial to weed out duplicate questions so that the survey returns a single picture without ambiguities. This might require intense discussions with the different working groups contributing to the questionnaire. In our project there was a central "clearing house" making sure that everybody got what they needed.

Do not ask what cannot be answered, and limit yourself.

A pretty obvious point is not to ask what cannot easily or correctly be answered. While for your research you might want to know exactly which inputs the farmers bought and at what price over the last years, the farmers themselves could be at a loss to tell you. Since they do not want to disappoint you (and maybe are even paid for participation) they are at risk of inventing replies. If you ask such questions, be extra careful when interpreting the results.

Another, less obvious factor impacting data quality is the time it takes to answer a questionnaire. Our data gatherers tell us that 30 minutes is a good limit, while anything over one hour will lead to fatigue and likely produce substandard answers and hence substandard data.

There are also some technical aspects which do support good quality data

From a technical point of view there are also different possibilities to support data quality:

Using IDs
Master data
Value checks
Trustworthiness

Try to avoid free text answers, base the replies on predefined lists and relate them to existing analysis codes if you can.

Using IDs (or at least pre-defined choice lists) is one of the very basic approaches to tremendously improve data quality. While on a paper form anything can be written, which might be impossible to analyze later, a digital questionnaire has the advantage that predefined answers can and must be selected. This ensures that you have uniform answers which you can just count and compare, something which is impossible with a free paper format. Of course, if you have already been using IDs on the paper questionnaire, this is nothing new to you and you will use this approach automatically.
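A tiny Python illustration of why this matters for counting: the same answers once as free text and once as codes from a choice list. The crop names and the codes are made up for the example.

```python
from collections import Counter

# Hypothetical choice list: analysis code -> label shown to the data collector.
CROP_CODES = {1: "maize", 2: "beans", 3: "cassava"}

# The same four answers, once written freely and once selected from the list.
free_text = ["Maize", "maize ", "mais", "beans"]
coded = [1, 1, 1, 2]

# Free text: every spelling variant is counted as a separate answer.
print(Counter(free_text))

# Coded answers: counting works immediately, labels are added afterwards.
counts = Counter(coded)
print({CROP_CODES[code]: n for code, n in counts.items()})
```

With free text somebody has to clean up the spelling variants by hand before anything can be counted; with codes that step simply does not exist.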

Sometimes using free text fields is a temptation even in digital questionnaires. After all, you never know if the respondent does not have some surprise for you which you have not been prepared for. While it can be useful, try to resist it as much as you can. Data analysts will be grateful!

Also think about the master data of your project. Farmers e.g. might have an ID already.

Something which is a bit more remote, but pays off big once you try to relate different surveys/studies to each other, is to pay attention to the so-called master data. Master data are data which do not change much and stay consistent over many years or at least across different studies. Farmers might have an ID already, regions and villages might be known under administrative or postal codes, inputs might be accredited and have an accreditation number uniquely identifying them.

If you do not consider such "fixed" IDs you will harvest different (line or table row) IDs for every survey. While for one given survey this might not be a problem, it would be almost impossible to relate data from different surveys to each other later. Or you will have a hell of a time properly configuring your ETL tools for data upload into a data warehouse.

Data-wise it is in general a good idea to identify such fixed and universal data in your project and prepare to use them as the dimensions of a BI model which you later can feed into any BI analysis tool. They all work in similar ways and if you structure the data along dimensions, hierarchies and fact tables you have a pretty good chance to analyse data using such modern tools.
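As a minimal sketch of the idea in Python: two surveys which both store the same stable farmer ID can be related with a simple lookup, and the master data act as a small dimension table. All IDs, field names and figures below are hypothetical, and this is not how any specific BI or ETL tool does it.

```python
# Dimension table: master data which stay the same across surveys.
farmers = {
    "F-0017": {"village_code": "V-12"},
    "F-0042": {"village_code": "V-07"},
}

# Fact rows from two separate surveys, both keyed on the same farmer ID.
baseline = [
    {"farmer_id": "F-0017", "yield_kg": 900},
    {"farmer_id": "F-0042", "yield_kg": 1200},
]
endline = [
    {"farmer_id": "F-0017", "yield_kg": 1100},
]


def relate(baseline, endline, farmers):
    """Put baseline and endline values of the same farmer side by side."""
    end_by_farmer = {row["farmer_id"]: row for row in endline}
    joined = []
    for row in baseline:
        fid = row["farmer_id"]
        after = end_by_farmer.get(fid)  # None if the farmer dropped out
        joined.append({
            "farmer_id": fid,
            "village_code": farmers[fid]["village_code"],
            "baseline_yield_kg": row["yield_kg"],
            "endline_yield_kg": after["yield_kg"] if after else None,
        })
    return joined


for row in relate(baseline, endline, farmers):
    print(row)
```

If each survey had generated its own row IDs instead, there would be nothing to match on and the lookup above could not be written at all.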

Checking entries for plausibility and automatic unit conversion are very useful ways to contribute to data quality.

One of the central advantages of digital solutions compared to paper-based questionnaires is the possibility to check data while they are entered. Almost all systems allow at least for the definition of required fields and some bounds checking (minimum and maximum values), many of them can check against some simple calculated values and there are even those which allow you to inject arbitrary JavaScript code for in-depth checks of data just being entered.

It makes sense to check phone numbers for national or international standards, quantities for maximum and minimum values as well as corresponding to sensible normalized values (e.g. fertilizer per ha), but also individual entries against known sums (e.g. fields against the total size of a farm). If you think a bit, it will come very naturally to you where logical checks make sense. Often it will be related to averages related to surface, like yield per ha or amount of fertilizer per ha or income per farm size. Be sure that your solution supports the type and complexity of data checks which you are expecting.
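The kind of check described above could look roughly like the following Python sketch. The field names and the fertilizer limit of 500 kg/ha are placeholders chosen for the example; sensible bounds depend on crop and region and have to come from your own project.

```python
def check_entry(entry, max_fertilizer_kg_per_ha=500.0):
    """Return a list of problems found; an empty list means plausible."""
    problems = []
    farm_ha = entry["farm_size_ha"]

    # Bounds check: without a positive farm size no ratio can be computed.
    if farm_ha <= 0:
        problems.append("farm size must be positive")
        return problems

    # Individual entries against a known sum: fields vs. total farm size.
    if sum(entry["field_sizes_ha"]) > farm_ha:
        problems.append("fields add up to more than the farm size")

    # Normalized value: fertilizer per hectare against an upper bound.
    rate = entry["fertilizer_kg"] / farm_ha
    if rate > max_fertilizer_kg_per_ha:
        problems.append(
            f"fertilizer rate of {rate:.0f} kg/ha exceeds "
            f"{max_fertilizer_kg_per_ha:.0f} kg/ha"
        )
    return problems


plausible = {"farm_size_ha": 2.0, "field_sizes_ha": [1.0, 0.5], "fertilizer_kg": 100}
suspicious = {"farm_size_ha": 2.0, "field_sizes_ha": [1.5, 1.0], "fertilizer_kg": 5000}

print(check_entry(plausible))
print(check_entry(suspicious))
```

Returning a list of messages rather than a plain yes/no lets the data collector see everything that needs correcting while still sitting with the respondent.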

One very useful support for data entry is the conversion of units. Everybody will love you if you allow them to enter quantities in the unit they are used to while storing them in some sensible general unit or even an SI unit. You can even completely decouple the storage unit from the display unit - much the same way this is done with the storage of local time in databases.
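A minimal Python sketch of such a decoupling, assuming kilograms as the storage unit. The `bag_50kg` unit is a hypothetical local unit added for illustration; real local units and their factors would have to be agreed on in the project.

```python
# Conversion factors from the entry/display unit to the storage unit (kg).
# "bag_50kg" is a hypothetical local unit used only for this example.
TO_KG = {"kg": 1.0, "g": 0.001, "t": 1000.0, "bag_50kg": 50.0}


def to_storage(value, unit):
    """Convert a value entered in `unit` to kilograms for storage."""
    return value * TO_KG[unit]


def to_display(value_kg, unit):
    """Convert a stored kilogram value to the unit the user wants to see."""
    return value_kg / TO_KG[unit]


stored = to_storage(3, "bag_50kg")   # entered as 3 bags, stored in kg
print(stored)
print(to_display(stored, "t"))       # the same quantity shown in tonnes
```

Because only the stored value is kept, different users can see the same data point in different units without any risk of mixing them up in the analysis.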

How trustworthy are your data?

As a last point in this section I want to at least mention another data quality dimension, which is alas much more difficult to handle. Data points asked for in a survey can have very different trustworthiness. Let us consider the case where we want to know how much harvest a farmer had on their farm in a given year. The farmer might be able and willing to tell us - but we still do not have any idea how correct this information is. While the answer could very well be correct (because the farmer has a good memory and nothing to hide), it could also be utterly wrong (because memory is bad or the farmer is worried about the potential impact honesty could have on tax payments) - we just do not know.

Now if we were shown an invoice about the sale, this would make us much more confident that the data is correct - let us say we attribute the data point a confidence level of 3, while the mere fact that the farmer states some number would be attributed with a confidence level of 1. Then there could be some intermediate level 2, where we have some form of corroboration of the number stated - say somebody else is confirming it. Storing the confidence level together with the data point would enable us later to much better evaluate the results of a questionnaire. We could give weight to the figures according to the confidence level, which would make a data point having a confidence level of 3 three times more important than one only having the level 1.
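As a sketch of what such a weighting could look like, here is a weighted average in Python where the confidence level is used directly as the weight. We have not implemented this, so the function and the sample figures are purely illustrative.

```python
def weighted_mean(points):
    """Average of (value, confidence_level) pairs, weighted by the level.

    `points` must be a list (it is read twice); levels are 1, 2 or 3.
    """
    total_weight = sum(level for _, level in points)
    if total_weight == 0:
        raise ValueError("no data points with a positive confidence level")
    return sum(value * level for value, level in points) / total_weight


# Hypothetical harvest figures in kg: one merely stated (level 1),
# one backed by an invoice (level 3).
harvest = [(1000, 1), (1400, 3)]
print(weighted_mean(harvest))
```

The invoice-backed figure pulls the result three times as strongly as the merely stated one: (1000 * 1 + 1400 * 3) / 4 = 1300.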

We have not implemented such a system in our project, but there was much discussion about the trustworthiness of data, and in a future project we would likely go down that road, especially for financial data, which are notoriously prone to errors.

Data protection

Personal data and how to sell them

Data protection is a complex subject and I do not want to go into a detailed discussion. Obviously you should have good quality software, secure computers and servers, encrypted disks in case of loss or theft, and up-to-date operating systems, and I will not talk about any of that. I would rather write a few lines about aspects which are not so easy to settle.

The data you are collecting are potentially of huge interest - of course, otherwise you would not be collecting them :). But here I do not think about your own interest regarding your project, but the one of administrations (think taxes), organisations (think memberships) or companies (think input sales). They all would be glad to get access to the data you are collecting and the ones giving you their data might, quicker than they thought, experience negative consequences from being open with you. This calls for caution...

Do you need to personalize?

One important point is personalized data. As long as data cannot be traced back to individuals and are only used as a statistical ensemble for statistical analysis, the danger to the individual is minor or even non-existent. But as soon as you link data to persons, you potentially bring them into trouble.

There are different ways to link data of a questionnaire to persons. The most evident is if you ask for their name. Right after that comes the GPS location of their farm. But also a phone number (which comes in handy if you want to contact them for participation in the project) or even a birthdate makes it easy or even trivial to access the individual behind the scenes.

Obviously identifying a person is also something very useful. Once you know that a person experiences difficulties, you can help - something you cannot do if you do not know who the person is. There is no good middle ground: either you know who somebody is, or you do not. Data protection rules in many countries say that if you relate data to persons you need their agreement, and you need to be extra careful regarding the safeguarding of their data - no putting the data set on a web server which is easy for anybody to access!

A compromise (which we have also used in the project) is to register the personal information (e.g. the name) in a different system, maybe even only on paper so that it cannot be accessed electronically at all. The ID attributed can then be entered into the questionnaire. Someone stealing the questionnaire data will be none the wiser when it comes to the identity of the individuals behind the data.
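Where the separate register is kept electronically, the split could be sketched in Python as follows. The list of personal fields and the use of a random ID are assumptions made for this example; in our project the ID was attributed in the separate register, not generated like this.

```python
import uuid

# Fields treated as personal in this sketch (an assumption, adjust as needed).
PERSONAL_FIELDS = {"name", "phone", "gps", "birthdate"}


def split_record(record):
    """Split a raw record into a register entry and a pseudonymous row."""
    respondent_id = uuid.uuid4().hex  # random, carries no personal meaning
    register_entry = {"respondent_id": respondent_id}
    survey_row = {"respondent_id": respondent_id}
    for key, value in record.items():
        target = register_entry if key in PERSONAL_FIELDS else survey_row
        target[key] = value
    return register_entry, survey_row


raw = {"name": "A. Farmer", "phone": "000", "farm_size_ha": 2.0, "yield_kg": 900}
register_entry, survey_row = split_record(raw)

print(sorted(register_entry))  # goes into the separately kept register
print(sorted(survey_row))      # the only part stored with the survey data
```

The two parts share nothing but the ID, so the survey data alone do not reveal who answered, as long as the register is kept somewhere else and better protected.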

Yes, it is cumbersome, and yes, it takes time and makes things less efficient. But given the downsides of personalized data running loose, you should seriously consider this option - that is, if you need personalized data at all!

Who has access to your data?

Access to data, especially but not only in combination with storing personalized data, is the second big risk factor. Someone who has access to data can also use them for their own goals or sell them to interested parties. They might even be under pressure from local administrations or power game players to hand over their data. How can you prevent data collected by a data collector from being handed over to somebody else?

The easiest way is not to give the data collector access to those data. This is also the approach which we have followed in this project. The solution was configured in a way which allowed the data collector to see all the data of one questionnaire as long as the questionnaire was not finalized. As soon as the data for one questionnaire were all entered, they became inaccessible to the data collector and were, after syncing, eventually removed from the device. If your solution can be configured in such a way, this is a big advantage and brings you one step closer to the goal of data protection.

Extracting data for further analysis

Eventually you will need to extract data - make sure the system gives you what you need.

At some point, when you have gathered all the data, you will need to get them out of the system for analysis and further processing. It definitely makes sense to think about that stage before you settle on a specific tool. You should also consider this in the context of the data structure you are going to create, as outlined above.

One way to post-process data is to clean them. While this should never be required - after all, we have put all the energy into the interface configuration and automated checks - it still happens a lot. You have to face the question of whether you want to offer a way to clean the data in the system (by those who enter the data or their supervisors), or whether you want to leave the responsibility for it with those who analyze the data and ultimately have to justify why and to what extent they trust the data they are analysing.

To clean or not to clean?

In the first case you have to address - in addition to the legitimacy of the cleaning process itself - data security and confidentiality issues (should the data collectors even see their data once they have been entered?), but you can hand over a clean data set. In the second case the data set is as good or bad as it is, and it is everyone's own responsibility to work with it. Be prepared to fight expectations regarding a centralized data cleaning approach.

When it comes to data export formats, there will likely be two factions:

the research faction
the quick win faction

The first faction has huge experience with the tools they have been using for many years (be it Maple, Octave, Mathematica, SPSS or PSPP). They have established analysis codes and a precise idea of how the data should be extracted so that they can make good use of them. Surprisingly often this is a line-based format with additional dimensions expanded into additional columns. It makes for huge line sizes, but their software is able to crunch it.

Single rows against connected tables

The second faction, on the other hand, are more self-made people who want to have their own look into the data and are not willing to bother with the heavy-weight statistical tools the researchers rejoice in. They have used BI-based analysis before and just want a model with a few tables so that they can replicate it in their BI software using a graphical interface. Averages, tops, bottoms and basic fits along with easy-to-understand graphics are all they need.

There is not one better than the other and you might need to service both - that is what we have done in our project.