I have implemented this code:
import pandas as pd
import recordlinkage as rl
from recordlinkage.index import Full

# Read every compared column as a string (object) so no value is coerced to a number
dfA = pd.read_csv(
    args.file,
    index_col="Full_url",
    sep=",",
    engine="c",
    skipinitialspace=True,
    encoding="utf-8",
    dtype={
        "City": object, "Country": object, "State": object, "Email": object,
        "Identifier": object, "Family": object, "Given": object,
        "Prefix": object, "Suffix": object, "Phone": object,
    },
)

# Build all candidate pairs within dfA (deduplication of a single table)
indexer = rl.Index()
indexer.add(Full())
candidate_links = indexer.index(dfA)

# Define one comparison per column
compare_cl = rl.Compare()
compare_cl.exact('Identifier', 'Identifier', label='Identifier')
compare_cl.string('City', 'City', method='jarowinkler', threshold=0.85, label='City')
compare_cl.string('Country', 'Country', method='jarowinkler', threshold=0.85, label='Country')
compare_cl.string('State', 'State', method='jarowinkler', threshold=0.85, label='State')
compare_cl.string('Email', 'Email', method='damerau_levenshtein', threshold=0.80, label='Email')
compare_cl.string('Family', 'Family', method='jarowinkler', threshold=0.80, label='Family')
compare_cl.string('Given', 'Given', method='jarowinkler', threshold=0.80, label='Given')
compare_cl.string('Prefix', 'Prefix', method='jarowinkler', threshold=0.80, label='Prefix')
compare_cl.string('Suffix', 'Suffix', method='jarowinkler', threshold=0.80, label='Suffix')
compare_cl.exact('Phone', 'Phone', label='Phone')

# Compute the comparison vectors for every candidate pair
features = compare_cl.compute(candidate_links, dfA)
However, I have a problem: the column 'Family' holds a list of names of variable length.
For example, a record could be:
Family=Daniel||Alex||John||Felix
The items in the list are always separated by the string "||". Can I compare the column 'Family' as a vector? How do I specify the separator?
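To make the kind of comparison I mean concrete, here is a minimal sketch in plain Python. It splits each 'Family' value on "||" and scores two records by the overlap of their name sets (Jaccard similarity). The helper names `split_names` and `jaccard_similarity` are hypothetical and are not part of recordlinkage. Lowercasing the names and scoring two empty values as 0 are my own assumptions.

```python
def split_names(value, sep="||"):
    """Split a delimited string into a set of normalized names."""
    if not isinstance(value, str):
        return set()  # treat missing values (e.g. NaN) as an empty list
    # Assumption: names are compared case-insensitively and without padding
    return {part.strip().lower() for part in value.split(sep) if part.strip()}


def jaccard_similarity(a, b, sep="||"):
    """Jaccard overlap between two delimited name lists, in [0, 1]."""
    set_a, set_b = split_names(a, sep), split_names(b, sep)
    if not set_a and not set_b:
        return 0.0  # assumption: two empty values do not count as a match
    return len(set_a & set_b) / len(set_a | set_b)


# Two names in common out of five distinct names overall
score = jaccard_similarity("Daniel||Alex||John||Felix", "Alex||Felix||Maria")
```

What I do not know is how to plug a function like this into rl.Compare so that it is applied to the 'Family' column for every candidate pair.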
Thanks.
What I have tried:
I haven't tried anything yet because I can't find a viable solution.