How to compare vCard address book entries

Following a little mishap with the DAVdroid app on my Fire HDX Android tablet, I ended up with a lot of duplicate contacts in my CardDAV account. On closer inspection I then noticed that, in the process of copying the contacts from the local contact store (on my BlackBerry smartphone) to the CardDAV account, certain contact details had been lost and spurious characters (like spaces and question marks) had been introduced.

Additionally, I had started editing the contact details so that they display properly in BlackBerry's Android contacts app: unfortunately, for contacts not connected to a person (such as institutions), the app does not show the organization info in the address book view, unlike the same view on BlackBerry 10.

So, I needed an estimate of the mess my contact data was in. An internet search turned up this article, but that method produced way too many hits to be of much use in my case:

diff after.vcf before.vcf | egrep "(<|>) N:" | sort -k 2 | uniq -c -i -s 2 | tee diff.txt | wc -l && cat diff.txt

The idea then was to give the vcardtools found on GitHub a try for normalization, but they crashed and burned (vobject.base.NativeError) while parsing the VCF data. Ultimately, it came down to nitty-gritty text manipulation with UNIX console tools, as a preparatory step for further processing:

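#!/bin/sh
# Argument: path to a .vcf file. Writes <name>.cleaned, .sorted, .counted, .names
filename=$(basename "$1" .vcf)

# grep: drop empty name lines
# sed:  strip charset tags, zero-width spaces and question marks, normalize
#       the N/FN/ORG prefix, squeeze and trim spaces
# awk:  print FN (or ORG as a fallback) once per card
grep -v -E '^N(:|;);;;' "$1" |
  sed 's/CHARSET=UTF-8://g' |
  sed 's/\xe2\x80\x8b//g' |
  sed 's/?//g' |
  sed -E 's/^(N|FN|ORG);/\1:/g' |
  sed 's/ \+/ /g' |
  sed -E 's/(:|;) /\1/' |
  sed 's/^[[:space:]]*//;s/[[:space:]]*$//' |
  awk 'BEGIN { name = ""; fname = ""; org = ""; prev = "" }
    /^N:/   { name = $0 }
    /^FN:/  { fname = $0 }
    /^ORG:/ { org = $0 }
    /^END:VCARD/ {
      if (name != "" && fname != "") { print fname; prev = fname }
      else if (org != "") { print org; prev = org }
      else if (NR > 1) print "ERROR"
      name = ""; fname = ""; org = ""
    }' |
  tee "${filename}.cleaned" |
  cut -d":" -f2 |
  sort |
  tee "${filename}.sorted" |
  uniq -c -i -s 2 |
  tee "${filename}.counted" |
  cut -c9- > "${filename}.names"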

First, we eliminate unwanted lines using grep, then we clean out unwanted characters and spaces using sed, and finally we select the relevant contact fields (name or organization) using awk to end up with a manageable data subset on which to check for differences.
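To make the awk selection step easier to follow, here is a minimal sketch that runs a simplified version of the selection rule on two made-up cards. The sample data (a person called Jane Doe and an institution called City Library) is purely hypothetical and only serves as an illustration:

```shell
#!/bin/sh
# Hypothetical sample data: one person and one institution.
sample='BEGIN:VCARD
VERSION:3.0
N:Doe;Jane;;;
FN:Jane Doe
ORG:Example Corp
END:VCARD
BEGIN:VCARD
VERSION:3.0
ORG:City Library
END:VCARD'

# Simplified selection rule: prefer FN when both N and FN are present,
# otherwise fall back to ORG, otherwise flag the card as an error.
selected=$(printf '%s\n' "$sample" | awk '
  /^N:/   { name = $0 }
  /^FN:/  { fname = $0 }
  /^ORG:/ { org = $0 }
  /^END:VCARD/ {
    if (name != "" && fname != "") print fname
    else if (org != "")            print org
    else                           print "ERROR"
    name = ""; fname = ""; org = ""
  }')

printf '%s\n' "$selected"
```

The person is reported by its FN line and the institution by its ORG line, so each card contributes exactly one line to the later comparison.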

Note: The initial version was based on a comparison with the previous contact entry (the prev variable in the awk part). That comparison is not used in the script right now, but I have not cleaned it out (as it might prove helpful in the future).
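Once two exports have been run through the script, one possible way to check for differences is to compare the resulting .names files with comm. This is only a sketch: the file names before.names and after.names and their contents are made up for illustration, and comm relies on both inputs being sorted the same way.

```shell
#!/bin/sh
# Hypothetical inputs standing in for the .names files of two exports.
workdir=$(mktemp -d)
printf '%s\n' 'Alice Example' 'Bob Example' 'City Library' > "$workdir/before.names"
printf '%s\n' 'Alice Example' 'City Library' 'Dora Example' > "$workdir/after.names"

# comm -23 keeps lines found only in the first file,
# comm -13 keeps lines found only in the second file.
only_before=$(LC_ALL=C comm -23 "$workdir/before.names" "$workdir/after.names")
only_after=$(LC_ALL=C comm -13 "$workdir/before.names" "$workdir/after.names")

echo "Only in before: $only_before"
echo "Only in after:  $only_after"

rm -rf "$workdir"
```

With real data, the files should be sorted under the same locale that comm runs in, otherwise comm may complain about unsorted input.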
