Information Synchronization Across Multilingual Semi-Structured Tables

About

Representing information consistently across languages poses significant challenges, particularly when it comes to synchronizing semi-structured data. Wikipedia is a notable example: although English Wikipedia has the largest number of editors, it comprises only 11.68% of all pages, and 94% of the global population lacks access to comprehensive information in their native language. Most non-English Wikipedia pages are outdated and inadequately maintained, and Wikipedia translations are frequently inaccurate. An illustration of the issue is provided below.

Bridging this language gap and ensuring accurate representation are essential for promoting inclusivity and global knowledge sharing. Maintaining the consistency and integrity of Wikipedia tables across multiple languages requires meticulous attention to detail, and doing so paves the way for a trustworthy, comprehensive, and language-inclusive knowledge source.

To address this problem, we introduce the InfoSync dataset and a two-step method for tabular synchronization.

Dataset Details

To systematically assess the challenge of information synchronization and evaluate our methods, we build InfoSync, a large-scale table synchronization dataset based on entity-centric Wikipedia Infoboxes.

We collected a dataset of approximately 99,440 infoboxes comprising 1,078,717 rows across 14 languages: English, German, French, Spanish, Dutch, Russian, Turkish, Arabic, Hindi, Chinese, Korean, Afrikaans, Cebuano, and Swedish. The infoboxes cover various categories such as Airport, Album, Animal, Athlete, Book, City, College, Company, Country, Diseases, Food, Medicine, Monument, Movie, Musician, Nobel, Painting, Person, Planet, Shows, and Stadium. This diverse dataset serves as the foundation for our analysis and experiments.

Language   Average Table Transfer %          Language Statistics
           C1 -> L        L -> C1            # Tables    Avg. Rows
af         17.46          400.50             1575        9.91
ar         34.02          27.38              7648        13.01
ceb        42.87          134.88             3870        7.82
de         40.73          27.12              8215        7.88
en         45.85          0.32               12431       12.60
es         38.78          9.00               9950        12.59
fr         41.25          4.73               10858       10.30
hi         18.39          358.97             1724        10.91
ko         31.13          40.51              6601        9.35
nl         33.69          24.60              7837        10.46
ru         36.98          14.54              9066        11.41
sv         35.53          24.62              7985        9.89
tr         28.99          59.33              5599        10.14
zh         36.16          32.71              7140        12.43
Table: Average Table Transfer: the "C1 -> L" column shows the average percentage of tables missing in the other languages that can be transferred from C1; the "L -> C1" column shows the average percentage of tables missing in C1 that can be transferred to C1 from all other languages. Here L is the set of all languages except the source/transfer language C1. Language Statistics: the number of tables and average rows (AR) per table for each language.

Topic      # Tables   Avg. Rows
Airport    18512      9.66
Food       6184       7.93
Album      5833       7.58
Animal     3209       8.27
Category Statistics: the number of tables in each category and the average number of rows (AR) per table across languages. Only a subset of categories is shown here; statistics for all categories are reported in the paper.

Test Sets

We created several test sets to evaluate the alignment accuracy of our pipeline for different configurations.

Synchronization Methodology

Our proposed approach for table synchronization involves two steps: 1) Information Alignment, which maps corresponding rows across tables in different languages, and 2) Information Update, which transfers missing or outdated information between the aligned tables.

We evaluate the effectiveness of both tasks using the InfoSync dataset. Additionally, we conduct an online experiment adhering to Wikipedia editing guidelines, where we submit detected mismatches for review by Wikipedia editors. We track the number of edits approved or refused by the editors.

Information Alignment

The proposed method consists of five modules applied sequentially; each module generates additional alignments by matching table rows under progressively relaxed requirements.

  1. Corpus-based: Aligns rows based on the cosine similarity of their English translations, using majority voting over multiple translations. To obtain accurate key translations, additional context such as key values and categories is taken into account.

  2. Key-only: Attempts to align unaligned pairs from the previous module by computing cosine similarity of their English translations, with a threshold for selecting mutually most similar keys.

  3. Key value bidirectional: Similar to the previous step, but computes similarities using the entire row (key + value) and applies a threshold for alignment.

  4. Key value unidirectional: Relaxes the bidirectional mapping constraint by considering the highest similarity in either direction, using a higher threshold to avoid spurious alignments.

  5. Multi-key: Allows for the selection of multiple keys (up to two) based on a threshold, with a soft constraint for value-combination alignment. A multi-key alignment is considered valid when the merged value-combination similarity score exceeds that of the most similar single key.

In summary, these five modules progressively relax the matching requirements, incorporating different aspects of the table rows, to generate alignments based on cosine similarity scores.
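To make the matching procedure concrete, here is a minimal Python sketch of the bidirectional, threshold-based matching used in modules 2 and 3. It assumes embeddings of the English translations are precomputed with some sentence encoder; the encoder and the threshold value shown here are illustrative and not necessarily those used in the paper.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def bidirectional_align(rows_x, rows_y, threshold=0.85):
    """Align row i of table X to row j of table Y only when the two rows are
    mutually most similar and their similarity clears the threshold."""
    sims = np.array([[cosine(u, v) for v in rows_y] for u in rows_x])
    aligned = []
    for i in range(len(rows_x)):
        j = int(sims[i].argmax())                       # best candidate in Y for row i
        if int(sims[:, j].argmax()) == i and sims[i, j] >= threshold:
            aligned.append((i, j, float(sims[i, j])))   # mutually most similar pair
    return aligned
```

The unidirectional variant (module 4) drops the mutual-best-match check and instead uses a higher threshold to limit spurious alignments.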

Alignment Example And Evaluation

Explanation of the alignment performance metrics: T_en and T_hi denote the sets of all rows in the English and Hindi tables, respectively, and R_x^n denotes the n-th row of the table in language x. R_x(X) retrieves all rows in language x covered by a mapping X, and |.| denotes set cardinality. Every alignment is stored as a tuple of the form (R_x^m, R_y^n). G is the set of all gold (human) alignments and P is the set of predicted alignments (one can see that the prediction contains some mistakes).
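Since every alignment is stored as a tuple, the evaluation reduces to set overlap between G and P. A minimal sketch of the standard precision/recall/F1 computation over such tuples (the tuple format is taken from the description above):

```python
def alignment_scores(gold, pred):
    """Precision, recall, and F1 over alignment tuples (row_x, row_y)."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                       # correctly predicted alignments
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Example with one spurious and one missing alignment:
# alignment_scores({(0, 0), (1, 2)}, {(0, 0), (1, 1)}) -> (0.5, 0.5, 0.5)
```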

Information Updation

We proposed a rule-based heuristic approach for information updates. These rules are applied sequentially according to their priority rank (P.R.); a minimal sketch of this sequential application follows the list below.

  1. Row Transfer (R1): Unaligned rows are transferred from one table to another.

  2. Multi-Match (R2): Updating the table by handling multi-alignments and merging information to address cases with multiple key alignments.

  3. Time Based (R3): Updating aligned values using the latest timestamp to ensure the information reflects the most current data.

  4. Trends positive/negative (R4): Updating values based on identified monotonic patterns (increasing or decreasing) over time, particularly applicable to athlete career statistics.

  5. Append Values (R5): Appending additional value information from an up-to-date row to update outdated rows.

  6. HR to LR (R6): Transferring information from a high-resource language to a low-resource language to update outdated information.

  7. #Rows (R7): Transferring information from a table with a greater number of rows to a table with fewer rows.

  8. Non-Popular Keys (R8): Updating outdated tables with information from a table in which recently added, non-popular keys are likely to exist.
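Below is a minimal sketch of how such a priority-ordered rule cascade can be applied. The rule function signatures are hypothetical and the paper's actual implementation may differ; each rule inspects an aligned (or unaligned) row pair and either proposes an update or defers to the next rule.

```python
def apply_update_rules(row_pair, rules):
    """rules: list of (name, rule_fn) sorted by priority rank (R1 first).
    Each rule_fn returns a proposed update, or None if it does not apply."""
    for name, rule_fn in rules:
        update = rule_fn(row_pair)
        if update is not None:
            return name, update        # first applicable rule wins
    return None, None                  # no rule fired; leave the rows unchanged
```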

Updation Example

Below is an update example. The infobox for Shirley Strickland de la Hunty is updated using information from both English and Spanish. It shows row transfer for missing information and value substitution because "Aged 78" is absent from the Died field. Additionally, one medal entry (Bronze, 1952, 100m) is added to the medal tally.


Human Assisted Wikipedia Updates

Information update results are given to human editors for updating Wikipedia infoboxes. Following Wikipedia's guidelines, rule set, and policies, update requests were submitted along with evidence supporting each claim. This evidence consists of the up-to-date entity page URL in the source language, the relevant table rows together with the source language, details of the proposed changes, and an additional citation provided by the editor for further validation.


Flow           Accepted   Rejected   Total
Eng => X       161        43         204
X => Y         169        47         216
X => English   136        47         183
Total          466        137        603
Table: Human-Assisted Wikipedia infobox updates: accept/reject counts for different flows of information.
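Overall, 466 of the 603 submitted edits were accepted, i.e. an acceptance rate of 466 / 603 ≈ 77.28%, matching the figure reported in the paper.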

People

The InfoSync dataset was prepared through a collaboration across multiple institutions (University of Utah, IIT Guwahati, CTAE, and Bloomberg LP) by the following people:

From left to right: Siddharth Khincha, Chelsi Jain, Vivek Gupta*, Tushar Kataria*, and Shuo Zhang.

Citation

Please cite our paper as below if you use the InfoSync dataset.

 @inproceedings{khincha-etal-2023-infosync,
    title = "{I}nfo{S}ync: Information Synchronization across Multilingual Semi-structured Tables",
    author = "Khincha, Siddharth  and
      Jain, Chelsi  and
      Gupta, Vivek  and
      Kataria, Tushar  and
      Zhang, Shuo",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.159",
    pages = "2536--2559",
    abstract = "Information Synchronization of semi-structured data across languages is challenging. For example, Wikipedia tables in one language need to be synchronized with others. To address this problem, we introduce a new dataset InfoSync and a two-step method for tabular synchronization. InfoSync contains 100K entity-centric tables (Wikipedia Infoboxes) across 14 languages, of which a subset ({\textasciitilde}3.5K pairs) are manually annotated. The proposed method includes 1) Information Alignment to map rows and 2) Information Update for updating missing/outdated information for aligned tables across multilingual tables. When evaluated on InfoSync, information alignment achieves an F1 score of 87.91 (en {\textless}-{\textgreater} non-en). To evaluate information updation, we perform human-assisted Wikipedia edits on Infoboxes for 532 table pairs. Our approach obtains an acceptance rate of 77.28{\%} on Wikipedia, showing the effectiveness of the proposed method.",
} 

Acknowledgement

The authors thank members of the Utah NLP group for their valuable insights and suggestions at various stages of the project, and the ACL 2023 reviewers for pointers to related work, corrections, and helpful comments. The authors also thank Wikipedia, the largest free knowledge resource, which is the source of the InfoSync tables.