The validity of self-reported cancer in a population-based cohort compared to that in formally registered sources

Background: Self-reported cancer has been validated with heterogeneous results across populations. The aim was to assess the validity of self-reported cancer in the Lifelines population-based cohort and to search for explanations for not reporting cancer. Methods: Data from adult participants (n = 152,780) from Lifelines were linked to the Dutch nationwide pathology databank (PALGA), which has nearly 100% coverage of cancer diagnoses in the Netherlands and was considered the gold standard for ascertainment of cancer diagnosis in this study. Sensitivity and positive predictive value (PPV) for self-reported cancers, which were reported as hand-written free text, were described. Logistic regression analyses were performed to evaluate whether socio-demographic factors were associated with the presence of self-reported cancer when there was a diagnosis in PALGA. Results: 6611 (4.50%) participants had at least one self-reported diagnosis of cancer, whereas 9960 (6.97%) participants had at least one cancer diagnosis in PALGA. The sensitivity of self-reported cancer was 64.68% [95% CI: 63.71–65.66], and 70.18% [95% CI: 68.83–71.56] after excluding skin and cervical cancers. Skin and cervical cancers represented 61.24% of non-self-reported cancers. The overall PPV was 97.45% [95% CI: 97.45–97.81], and 97.33% [95% CI: 96.72–97.82] after the exclusion of skin and cervical cancers. Participants who did not self-report their cancer were more likely to be male, to have a longer time since diagnosis and to have a lower educational level. Conclusion: Overall, the reports of cancer in Lifelines have a high positive predictive value and moderate sensitivity. One third of the cancers were not reported, mainly skin and cervical cancers. Male participants, those with a lower educational level and those with a longer time since diagnosis were less likely to self-report a diagnosed cancer.


Introduction
Global cancer incidence was estimated at 18 million new cases in 2018, and cancer caused 9.6 million deaths, or one sixth of global mortality, in the same year, making it one of the leading causes of mortality. Low- and middle-income countries account for 70% of the total cancer mortality [1]. In the Netherlands, the incidence of cancer has been increasing steadily since 2011, with approximately 116,500 new cases in 2018. The ten-year prevalence of cancer in the Netherlands on January 1st, 2018, was approximately 571,304, or about 3.5% of the Dutch population. Cancer mortality was about 45,513 people for the same year [2]. Large population-based cohort studies have become a useful tool for understanding the impact of unhealthy lifestyles and other potential risk factors on the incidence, prevalence and mortality of many different diseases, including cancer [3]. Many of these population-based cohorts are based on self-reported data, and it can be questioned how valid these self-reported data are. The use of self-reported data can lead to potential recall biases, as certain information may be incomplete, forgotten or incorrectly reported by the participants [4,5]. To evaluate the validity of self-reported cancer, some studies have compared the self-reported data with medical records: in Europe [4,6,7], Asia [8][9][10] or Australia [11][12][13]. The levels of agreement varied across studies: some reported a sensitivity below 60% [6,7,13], whereas others reported a sensitivity over 70% [8,11]. The main explanations for the differences in agreement were the specific site of cancer (e.g., breast, prostate, skin, cervical) and possible variations due to cultural differences between the populations included. For instance, skin cancer was included in calculating agreement in the study from Asia, where skin cancer incidence is lower [8], while it was excluded in the study from Spain due to a higher incidence and potential misreporting [7]. In addition, studies found higher sensitivities associated with a higher educational level [7][8][9][14]. Considering these diverse results on the validity of self-reported cancer diagnoses, more evidence is needed for assessing other possible sources of variation in validation studies.
The aim of this study was to assess the validity of self-reported cancer diagnoses in a large population-based cohort of adults from the north of the Netherlands as compared to the data from the Dutch Nationwide Pathology Databank (PALGA), which was considered the gold standard in this study, and to search for explanations for not reporting cancer.

Lifelines
Lifelines is a multidisciplinary prospective population-based cohort study examining, in a unique three-generation design, the health and health-related behaviours of more than 167,000 persons (or 10% of the population) living in the north of the Netherlands. The baseline assessment started in 2006 and was completed in 2013; for that reason, the year 2006 was used as the year for stratification [3,15]. The Lifelines cohort was primarily set up to study the aetiology of healthy ageing and encompasses long-term follow-up, a large number of participants, and extensive biomaterial and data collection, including multiple exposure variables and endpoints. Lifelines is conducted according to the Declaration of Helsinki, was approved by the medical ethics committee of the University Medical Center Groningen (UMCG) (no. 2010/109) and is ISO certified (9001:2008 Healthcare). Written informed consent was collected from all participants. The present study used a cross-sectional approach that included all participants aged ≥ 18 years (n = 152,780) from the baseline assessment in the Lifelines cohort.

PALGA
PALGA is the nationwide network and registry of histopathology and cytopathology in the Netherlands, which has had nationwide coverage since 1991, comprising practically 100% of the pathology reports in the Netherlands [16,17]. PALGA diagnoses are in line with the ICD-10 [16][17][18]. In the present study, the cancer diagnoses registered by the PALGA Foundation were considered as the gold standard. In the PALGA records, a participant can have several diagnoses of cancer, including metastatic cancers. Therefore, for each participant, up to the ten most recent cancer diagnoses from PALGA were used.

The collection of cancer data in Lifelines
Self-reported data on cancers were collected by Lifelines in the form of (hand-)written free-text answers on a health survey in the baseline assessment [3,15]. Participants were considered as having a history of self-reported cancer diagnosis when they answered affirmatively to the question whether they had ever been diagnosed with cancer and provided a cancer diagnosis. We calculated the calendar year range by subtracting the age at their self-reported cancer diagnosis from the age at which the questionnaire was filled in.
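The age subtraction described above can be sketched as follows. This is a minimal illustration, not the procedure used by the authors: the function name, its parameters and the one-year margin on either side (which follows from ages being recorded in whole years) are assumptions that do not appear in the text.

```python
from typing import Tuple


def diagnosis_year_range(
    questionnaire_year: int,
    age_at_questionnaire: int,
    age_at_diagnosis: int,
) -> Tuple[int, int]:
    """Return the (earliest, latest) calendar year of a self-reported diagnosis.

    Hypothetical helper: ages are assumed to be whole years, so the elapsed
    time is only known to within one year on either side.
    """
    if age_at_diagnosis > age_at_questionnaire:
        raise ValueError("diagnosis age cannot exceed age at questionnaire")
    years_since = age_at_questionnaire - age_at_diagnosis
    earliest = questionnaire_year - years_since - 1
    # A diagnosis cannot be dated after the questionnaire itself.
    latest = min(questionnaire_year, questionnaire_year - years_since + 1)
    return earliest, latest


# Example: questionnaire completed in 2010 at age 50, diagnosis reported at age 45.
print(diagnosis_year_range(2010, 50, 45))
```

Returning a range rather than a single year keeps the uncertainty of whole-year ages visible when the self-reported year is later compared with a registry date.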

The handling of cancer data in Lifelines
An algorithm was developed to categorise the self-reported free-text cancer information from Lifelines into ICD-10 categories [18]. First, after discussion among four of the authors (BP, AS, BV, TdB), the following responses were defined: certainly no cancer; unclassifiable cancer (unknown if it is cancer); pre-stage cancer; cancer, but location unknown. Next, a decision was made by the same four authors to classify every response containing variations of the word 'beginning' as 'pre-stage' unless further information was provided in the response, whereas 'early stage' was interpreted as cancer. Polyps were all classified as 'certainly no cancer', rather than 'pre-stage' or 'unclassifiable cancer', unless further specifications were given (e.g., 'a polyp was removed and found to be malignant'). The remaining text fields were classified as a specific type of cancer. If no specific tumour location was mentioned by a respondent, but the tumour could be clearly attributed to one of the ICD tumour classification subgroups, the response was classified accordingly (e.g., 'tumour in my head' was classified as CNS tumour). Non-specific descriptions that could not be attributed to a specific tumour type (e.g., 'tumour', 'tumours', 'kanker') were classified as 'tumour, subtype unknown'.
The label indicating possible presence of metastases (i.e., 'metastasised') was given to each instance where metastases were specified (e.g., 'breast cancer with metastases'). Responses that mentioned a metastasis without mentioning the primary tumour were classified as 'location unknown' and given the label 'metastasised' (e.g., 'metastases in the liver'). In those cases where both a primary tumour and metastases were specified, only the primary tumour was coded while applying the label 'metastasised' (e.g., 'breast cancer with metastases in the liver' was classified as 'breast cancer' and given the label 'metastasised').
Finally, in cases where multiple tumours were mentioned without specifying metastases, all the tumours were coded and the same label 'metastasised' was given (e.g., the response 'bladder and prostate' was classified as 'bladder cancer' and 'prostate cancer' with the label 'metastasised'). A full overview of the labels applied can be found in supplementary table S1.
To further improve the precision of the labelling by the algorithm, a manual check of the classification was performed by going over each recorded term used by respondents and the label given to it by the algorithm. Any discrepancies found in the labelling were then discussed by four of the authors (BP, AS, BV, TdB) before implementing several adjustments to the labelling. In addition to the rules of the algorithm, several manual exceptions were added for those cancers that were incorrectly labelled because the algorithm rules, although correct, could not recognise the specific formulation. For instance, the sentence 'breast cancer and removal of lymph node' would be recognised as 'breast cancer' and 'lymph node cancer' with 'metastasised', but this was manually corrected to only 'breast cancer'. The manual exceptions mentioned above represented less than 1% of the classifications. To facilitate the comparison of the data from Lifelines to PALGA, two new categories of tumours were added (i.e., 'haematological cancer (other)' and 'digestive tract cancer (other)').
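The labelling rules described in this section can be illustrated with a small rule-based classifier. This is a simplified sketch and not the authors' algorithm: the keyword table, the English keywords (the original responses were in Dutch), the treatment of 'malignant' as a generic cancer term and all names in the code are hypothetical. The 'beginning'/'pre-stage' rule is left out because the text does not specify what counts as further information.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical keyword table; the real algorithm mapped Dutch free text to ICD-10 groups.
SITE_KEYWORDS = {
    "breast": "breast cancer",
    "bladder": "bladder cancer",
    "prostate": "prostate cancer",
    "lymph node": "lymph node cancer",
    "head": "CNS tumour",
}
# Hypothetical list of non-specific terms.
GENERIC_TERMS = {"tumour", "tumours", "kanker", "cancer", "malignant"}
# Manual exceptions override the rules for formulations the rules mislabel.
MANUAL_EXCEPTIONS = {
    "breast cancer and removal of lymph node": ["breast cancer"],
}


@dataclass
class Classification:
    categories: List[str]
    metastasised: bool = False


def classify(response: str) -> Classification:
    text = response.strip().lower()
    if text in MANUAL_EXCEPTIONS:
        return Classification(list(MANUAL_EXCEPTIONS[text]))
    # Polyps are 'certainly no cancer' unless reported as malignant.
    if "polyp" in text and "malignant" not in text:
        return Classification(["certainly no cancer"])
    mentions_metastasis = "metasta" in text
    # Only the primary tumour is coded: ignore text from the metastasis mention onward.
    primary_part = text.split("metasta")[0]
    sites = [
        label
        for keyword, label in SITE_KEYWORDS.items()
        if re.search(rf"\b{re.escape(keyword)}\b", primary_part)
    ]
    if sites:
        # Several tumours without an explicit metastasis also get the label.
        return Classification(sites, mentions_metastasis or len(sites) > 1)
    if mentions_metastasis:
        return Classification(["cancer, but location unknown"], True)
    if GENERIC_TERMS & set(re.findall(r"[a-z]+", text)):
        return Classification(["tumour, subtype unknown"])
    return Classification(["unclassifiable cancer"])


print(classify("breast cancer with metastases in the liver"))
print(classify("bladder and prostate"))
```

Checking the manual exceptions before the general rules mirrors the description above, where exceptions were added for formulations that the otherwise correct rules could not recognise.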

The handling of cancer data in PALGA and data linkage
All personal data of people in the PALGA and Lifelines databases were pseudonymised based on surname, date of birth, gender, and initials [19,20]. Based on these pseudonyms, the PALGA records were linked with Lifelines in March 2021. When present, the zip code (only four digits) was used to further improve linkage correctness. The pathology diagnoses per participant with histologically confirmed malignancies were matched to the diagnoses in Lifelines. When the first match of PALGA diagnoses and self-reported cancers in Lifelines had been evaluated, the outcomes were discussed with a PALGA representative to ensure the correctness of the approach. After that check, we conducted the second revision as the final linkage for analyses. For this linkage, pathology diagnoses per participant were matched to the self-reported cancer information. It was assessed whether a participant who had a self-reported cancer at the baseline assessment of Lifelines also had a diagnosis in PALGA records (i.e., if a participant self-reported a cancer diagnosis in 2006, the PALGA diagnoses after 2006 were not considered for that participant). Cancer diagnoses in PALGA after 2013 were not compared to self-reported cancers in Lifelines. If a participant had more than one diagnosis from PALGA, the self-reported cancer in Lifelines was compared to up to ten (if available) diagnoses in PALGA to search for a match.
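The matching rules above (ignore PALGA diagnoses dated after the self-report, ignore diagnoses after 2013, compare against up to ten diagnoses) can be sketched as follows. The record layout, the function name and the order in which the filters are applied are assumptions made for illustration; the text does not describe the implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Diagnosis:
    category: str  # tumour category after harmonisation (hypothetical field)
    year: int      # calendar year of the pathology diagnosis


def find_match(
    report_category: str,
    report_year: int,
    palga: List[Diagnosis],
    max_diagnoses: int = 10,
    last_year: int = 2013,
) -> Optional[Diagnosis]:
    """Return the most recent eligible PALGA diagnosis matching the self-report."""
    cutoff = min(report_year, last_year)
    # Diagnoses made after the self-report (or after 2013) are not considered.
    eligible = [d for d in palga if d.year <= cutoff]
    # Compare against at most the ten most recent eligible diagnoses.
    recent = sorted(eligible, key=lambda d: d.year, reverse=True)[:max_diagnoses]
    for diagnosis in recent:
        if diagnosis.category == report_category:
            return diagnosis
    return None


records = [Diagnosis("breast cancer", 2005), Diagnosis("skin cancer", 2007)]
print(find_match("breast cancer", 2008, records))
```

A self-report with no match among the eligible diagnoses returns `None`, which corresponds to a self-reported cancer that is not confirmed by the registry.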

Study variables
To search for explanations for the observed differences between the two data sources (Lifelines and PALGA), the following variables were included: year of cancer diagnosis as provided by PALGA, age, sex and socioeconomic status. To evaluate socioeconomic status, education level was classified as follows: low (i.e., no education, primary education, lower or preparatory vocational education, or lower general secondary education), medium (i.e., intermediate vocational education or apprenticeship, higher general senior secondary education, or pre-university secondary education), and high (i.e., higher vocational education or university) [21].

Statistical analyses
With PALGA diagnoses considered as the gold standard in this study, we evaluated the validity of self-reported cancer by calculating sensitivity (true positive rate) and positive predictive value (PPV), which are described as proper metrics to evaluate screening test attributes in relation to a gold standard [22]. This was done for all reported cancers and after excluding skin and cervical cancers, because those two types of cancer tend to be under- or over-reported in self-reported questionnaires. Sensitivity is presented as the fraction of participants who self-reported a cancer diagnosis in Lifelines out of those with a confirmed diagnosis according to the PALGA database; PPV is presented as the fraction of participants with a confirmed cancer diagnosis according to the PALGA database out of those who self-reported a cancer diagnosis in Lifelines. The Wilson score method was used to calculate 95% confidence intervals for sensitivity and PPV. In addition, we evaluated by logistic regression whether socio-demographic factors (i.e., age, sex, education level and year of cancer diagnosis as provided by PALGA, where a more recent diagnosis was defined as each increasing year in the PALGA database) were associated with the presence of a correct self-reported cancer. The outcome measure (i.e., the dependent variable) was a correct match between a self-reported cancer and a cancer report in PALGA (yes or no). Self-reported cancers with the label "cancer, but location unknown" were not included in the overall analysis [4,23,24]. Statistical analyses were performed in SPSS v. 23.0 (SPSS Inc., Chicago, IL).
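The two validity metrics and the Wilson score interval defined above can be written out in a few lines. The sketch below is illustrative only: the analyses in the study were run in SPSS, the function names are hypothetical, and the counts in the example are invented rather than taken from the study.

```python
import math
from statistics import NormalDist
from typing import Tuple


def sensitivity(true_pos: int, false_neg: int) -> float:
    """Self-reported and confirmed, out of all confirmed by the gold standard."""
    return true_pos / (true_pos + false_neg)


def ppv(true_pos: int, false_pos: int) -> float:
    """Self-reported and confirmed, out of all self-reported."""
    return true_pos / (true_pos + false_pos)


def wilson_interval(successes: int, n: int, level: float = 0.95) -> Tuple[float, float]:
    """Wilson score confidence interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Invented counts, for illustration only.
tp, fn, fp = 70, 30, 5
print(f"sensitivity = {sensitivity(tp, fn):.3f}, 95% CI = {wilson_interval(tp, tp + fn)}")
print(f"PPV = {ppv(tp, fp):.3f}, 95% CI = {wilson_interval(tp, tp + fp)}")
```

Unlike the simple normal approximation, the Wilson interval stays within 0 and 1 for proportions close to the boundary, which is relevant for PPV values near 100%.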

Results
More than half of the participants were female (58.40%), the average age was 44.65 (SD ± 13.12) years, and 142,820 of the participants did not have any history of cancer at the baseline assessment (see Table 1). For the self-reported cancer diagnoses, a total of 6611 participants reported a previous diagnosis of cancer when completing the Lifelines questionnaires. The calendar year range for PALGA diagnoses was 1983-2013, and for the self-reported cancers it was 1952-2013 (two cases accounted for the extended self-reported calendar year range and could therefore not be matched to PALGA records).
Of the study variables included that could potentially explain the differences between both data sources, it was found that the presence of a self-reported cancer, when there was a confirmed cancer diagnosis, was less likely in older participants (OR 0.98 [95% CI 0.97-0.99]). Participants with a higher educational level were more likely to self-report a diagnosis of cancer compared to those with a lower educational level (OR 1.15 [95% CI 1.01-1.31]; see Table 3).
After excluding skin and cervical cancer from the analysis, results showed that females, those with a higher education level and those with a recent diagnosis were more likely to give a correct self-reported cancer diagnosis (OR 1.44 [95% CI 1.18-1.76], OR 1.48 [95% CI 1.14-1.93] and OR 1.03 [95% CI 1.02-1.04], respectively).
In a stratified analysis in females (excluding skin and cervical cancer), older females were more likely to self-report cancer when there was a confirmed cancer diagnosis (OR 1.02 [95% CI 1.01-1.03]). A separate analysis for males showed that older males were less likely to self-report cancer when there was a confirmed cancer diagnosis (OR 0.97 [95% CI 0.96-0.98]). Compared to males with a low educational level, males with a middle or higher educational level were more likely to self-report cancer when there was a confirmed cancer diagnosis (OR 1.32 [95% CI 1.06-1.64] and OR 1.38 [95% CI 1.12-1.70], respectively; see Table 3). After excluding skin cancer, older males were still less likely to self-report cancer when there was a confirmed cancer diagnosis (OR 0.98 [95% CI 0.97-0.99]). Compared to males with a low educational level, males with a middle or higher educational level were more likely to self-report cancer when there was a confirmed cancer diagnosis (OR 1.52 [95% CI 1.05-2.20] and OR 1.85 [95% CI 1.25-2.73], respectively; see Table 3).
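Odds ratios close to 1 for age and year of diagnosis can look negligible, but they apply per unit of the predictor. The short calculation below shows how such odds ratios compound over ten units. It assumes that age entered the models as a continuous variable in years, which the text does not state explicitly (for year of diagnosis, the per-year definition is given in the methods), and it relies on the log-linearity of a logistic model; the function name is hypothetical.

```python
import math


def compounded_or(or_per_unit: float, units: float) -> float:
    """Odds ratio implied by a logistic model for a difference of `units`."""
    return math.exp(units * math.log(or_per_unit))


# OR 1.03 per more recent diagnosis year, over ten years.
print(round(compounded_or(1.03, 10), 2))
# OR 0.98 per year of age, over ten years (assumes age was modelled in years).
print(round(compounded_or(0.98, 10), 2))
```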

Discussion
The present study aimed to evaluate the validity of self-reported cancer diagnoses in the Dutch Lifelines cohort as compared to the data from PALGA, where validity was based on sensitivity and positive predictive value. Overall, there was a moderate sensitivity of 64.68%, which increased to 70.18% when skin and cervical cancers were excluded. The cancer-specific sensitivity varied from 35.48% (liver cancer) to 80.24% (testicular cancer). The overall PPV was 97.45%, and 97.33% after the exclusion of skin and cervical cancers. Participants who did not self-report their cancer were more likely to be male, to have a longer time since diagnosis and to have a lower educational level. This is the first large study in the Netherlands that evaluated the validity of a self-reported cancer diagnosis. The overall findings regarding sensitivity and PPV are in line with those observed in other studies: Australia (sensitivity 71.1%; PPV 65.7%) [11]; Sweden (sensitivity 53%) [4]; United States (sensitivity 79%; PPV 75%) [14,25,26]; Canada (sensitivity 92.1%; PPV 77.8%) [27]; and Korea (sensitivity 72%; PPV 81.9%) [8]. A previous small study in the Netherlands published in 1994 [6] showed a lower sensitivity (55.2%) compared to the findings of the present study (70.18%). This may be explained by the year of publication, as in those days PALGA did not yet have nationwide coverage. Some participants might have changed their residence area since diagnosis, which could hamper the linkage of the data.
We found a wide range of sensitivities across the subtypes of cancer (35.48% for liver to 80.24% for testicular). This is in line with previous population-based studies showing high heterogeneity in the sensitivity depending on the cancer site of the diagnosis. In a recent study from Korea, the sensitivity varied from 52.10% for cervical cancer to 81.20% for breast cancer [8]. In a study from Australia, the range was from 36.90% for skin cancer to 90.70% for breast cancer [11].
With regard to the sensitivity per specific site, the sensitivity for breast cancer observed in this study was relatively high and similar to other studies [10][11][12][14]. This could be because breast cancer might be considered a major event in females compared to other types of cancer [12,28]. Hence, females with a breast cancer diagnosis may be more likely to self-report it [12]. Previous studies provide evidence that the types of cancer with a clear definition are more likely to be self-reported, and those with a less severe histologic type, such as early-stage cervical cancer or skin cancer, are less likely to be self-reported [7,29,30]. We observed high sensitivity for thyroid cancer, and this finding is consistent with other large population-based studies [7,8,11,12]. Male-specific cancers in the present study, such as prostate and testicular cancers, were the ones with the highest sensitivity. For prostate cancer this is also consistent with other studies [8,11]. A possible explanation for this is that both prostate and testicular cancers are very clear diagnoses for the patient, making them easier to understand. Younger age and better education were shown to be related to a more correct self-reporting of a cancer diagnosis [7,8]. The improvement in sensitivity of self-reported cancer data after excluding skin and cervical cancers is similar to other recent studies [7,8,11]. The high number of false negatives might be caused by confusion among participants, who could interpret pre-neoplastic/in situ lesions as invasive cancers [4,12,23]. This has been observed in previous studies in which early-stage lesions were treated with simple procedures and the diagnosis may not have been explained properly to the patients, in order not to cause them unnecessary stress [8,23]. In addition, there are cultural differences in medical practice: patients are better informed about their cancer diagnosis in western Europe and the United States compared to southern Europe or Asia [4,7,8,23].
Several strengths of the present study are worth mentioning. The first is the large sample size and the representativeness of the Lifelines cohort for the Dutch population [15]. The second strength is the use of data from a nationwide pathology registry (PALGA) as a gold standard, which has nearly 100% coverage of the cancer diagnoses in the Netherlands. The third strength is the use of a careful homogenisation algorithm matching the free-text data on diagnoses in both databases; our algorithm required less than 5% of manual corrections (n = 273). Fourth, when researchers ask for access to Lifelines data, they may not have access to PALGA records, as this will lead to additional costs in terms of time and budget. As our study shows that the validity of self-reported data for specific cancers (i.e., breast, prostate, thyroid) is good, Lifelines data can be used for these cancers. However, for other types, such as skin and cervical cancers, self-reported information might not be enough and should be verified with a more reliable source like PALGA. Finally, this study provides a separate evaluation of twenty-seven different site-specific cancers.
Apart from these strengths, some limitations need to be considered. First, although much attention was paid by the authors while making the labels in the algorithm to increase the precision of self-reported cancers, small differences between the precise labelling of tumours in the algorithm and PALGA records cannot be avoided. Second, respondents who reported terms such as "in situ", "invasive" or "benign" may not have been certain about the meaning of those terms in the context of a medical diagnosis, which may have influenced the interpretation by the algorithm. Third, since not all cancer diagnoses are confirmed histologically or cytologically, some specific cancer types (e.g., melanoma of the eye and hematopoietic tumours) may be under-represented in the PALGA data, with an estimated 10% of missing information [31]. Fourth, female participants reported having a tumour of the uterus without specifying whether this was a cervical cancer or an endometrial cancer. This is because the Dutch lay terms for cervix and uterus are much alike, and many respondents struggled with the distinction between uterine and cervical cancer.

Conclusion
Overall, the self-reported assessments of cancer in Lifelines have a high positive predictive value and moderate sensitivity. The high positive predictive value indicates that when there was a self-reported diagnosis of cancer, it was nearly always found in PALGA records. This indicates the high quality of the PALGA registration. It also indicates that the self-reported assessment of cancer in the Lifelines population-based cohort can be used for research. The moderate sensitivity implies that there is underreporting of cancer in the Lifelines cohort. The major sources of incorrect reports are related to skin and cervical cancer. Male participants, those with a lower educational level and those with a longer time since diagnosis are less likely to correctly self-report a cancer.

Table 1
Baseline characteristics of participants in the Lifelines cohort: those with a self-reported cancer at baseline; those with a histologically confirmed cancer in PALGA; and those without cancer in the Lifelines cohort.

Table 2
Comparison of self-reported cancer to PALGA cancer diagnosis in the Lifelines cohort.
a Sensitivity is presented as the fraction of participants who self-reported a cancer diagnosis in Lifelines out of those with a confirmed diagnosis according to PALGA.
b Positive predictive value is presented as the fraction of participants with a confirmed cancer diagnosis in the PALGA database out of those who self-reported a cancer diagnosis in Lifelines.
c Includes the ICD-10 categories: D01.0-D01.4.
d Includes the ICD-10 categories: C43-C44.
e Includes the ICD-10 categories: C81-C96.
f Terms in Dutch included in this category were: wekedelenkanker, kanker in de weke delen.
F.O. Cortés-Ibáñez et al.