[Python] Apply SHA-3 to MySQL Dataset with pysha3

SHA-3, a subset of the cryptographic primitive family Keccak is a cryptographic hash function, designed to be very efficient in hardware but is relatively slow in software. SHA-3 takes about double the time compared to SHA-2 to run in software and about a quarter of the time to run in hardware.

Although many of you might still be discovering the newest NIST adoption, SHA-3, some companies are already trying to implement this algorithm to securely encrypt their data.

In the latest project I was involved we have a very particular scenario in which we decided to use SHA-3:

  • We were interested in comparing the (#percentage of common customers) among several companies of the same Group

For example:

  • COMPANY A
    • C1
    • C2
  • COMPANY B
    • C1
    • C3

In this example we can see that customer C1 exists in both companies and C2 and C3 exist only in one company, this way we can say that: “33% of our customers are shared between COMPANY A and COMPANY B”.

Because we were not interested in comparing the customer real data like Name or Address and there were some legal constraints related to sharing customer data between companies, we decided to hash in SHA-3, the customers name and phone number, which were defined as being unique by the group and share the results as CSV.

To orchestrate this we decided to develop a Python script with the following structure:

  1. Query MySQL and retrieve customer data
  2. Hash the customer Name and Phone column with SHA-3 with the library Pysha3
  3. Export data set to CSV

In the following code block you will find the entire script, that I hope can be usefull for you followed by a simple explanation on how to use it.


from __future__ import generators    # needs to be at the top of your module
import codecs
import mysql.connector
import csv
import sys
import hashlib
import sha3


#CONFIG REGION - The only section you need to change to use the script
header 					= 	["customer_name","customer_phone"]
query 					=	"SELECT customer_name,customer_phone FROM customer"
host					=	'localhost' 
database				=	'my_customer_database'
mysql_user				=	'MYSQL_USER'
mysql_password			=	'MYSQL_PASSWORD'
mysql_port				=	'MY_SQL_PORT' #Default 3306
output_csv_file_path	=	'PATH_TO_CSV_FILE'
csv_delimiter			= 	','
hash_alg 				=	'sha3_256' #Can be sha3_256 or sha3_512



# === [BEGIN FUNCTIONS REGION] ===

def ResultIter(cursor, arraysize=100):
    'An iterator that uses fetchmany to keep memory usage down'
    while True:
        results = cursor.fetchmany(arraysize)
        if not results:
            break
        for result in results:
            yield result

#execute mysql command and return data set
def mysql_execute(query_r,host_r,database_r,user_r,password_r,port_r):	
	output_c = {}		
	#connect setup
	cnx = mysql.connector.connect(
	host=host_r,
	database=database_r,
	user=user_r,
	password=password_r,
	port=port_r)
	#result cursor
	cursor = cnx.cursor()
	#execution query
	query = (query_r)
	#save to cursor
	print '2/8 executing query'
	cursor.execute(query)
	#iterate cursor
	counter=0
	batch_controler=0 
	print '3/8 finished executing query'
	print '4/8 iterating in cursor'		
	for x in ResultIter(cursor):
		output_c[counter]=x
		counter+=1
	print '5/8 closing cursor'
	cursor.close()
	cnx.close()	
	return output_c

# apply hash on a string 
def apply_hash(string_in,algorithm):
	#new_safe_str = string_in.encode('utf-8','ignore')
	new_safe_str = string_in.encode('utf8')
	#new_safe_str = u' '.join(string_in.encode('utf-8')).strip()
	#create hash worker
	hash_worker = hashlib.new(algorithm)	
	#apply hash
	hash_worker.update(new_safe_str)
	#return hexadecimal hash
	return hash_worker.hexdigest()

# === [END FUNCTIONS REGION] ===
# ==============================	
# ===   [BEGIN MAIN REGION]  ===

#get values from DB
print '1/8 fetching dataset'
data_set_result = mysql_execute(query,host,database,mysql_user,mysql_password,mysql_port)

print '6/8 got the dataset'
#create a new hashed list
hashed_list = []	
#add header to list
hashed_list.append(header)	
#iterate each value of dictionary
print '7/8 hashing'
for x in data_set_result:
	new_list_row =[]	
	for y in data_set_result[x]:
		#apply hash
		new_list_row.append(apply_hash(y,hash_alg))		
	hashed_list.append(new_list_row)

#for debug
#print hashed_list

#save to csv
print '8/8 saving to csv'

with open(output_csv_file_path, 'wb') as myfile:	
	wr = csv.writer(myfile,delimiter=csv_delimiter,quoting=csv.QUOTE_ALL)
	for row in hashed_list:
		#print row		
		wr.writerow(row)
myfile.close()

# ===    [END MAIN REGION]   ===

Script Explanation & Instructions:

As you can see in the previous script you will find 3 defined regions:

  • Config Region
    • This is the region were you define the parameters for executing the script.
    • This is actually the only region you need to change to make the script work
    • You will find the following parameters
      • header – the header for the csv file
      • query – the query that you run against mysql to get your target dataset
      • host  – MySQL server
      • database – MySQL schema you want to query
      • mysql_user – mysql username to connect to the instance
      • mysql_password – mysql password to connect to the instance
      • mysql_port – mysql port were the mysql server is listening
      • output_csv_file_path – the path to the csv file you want the result to be saved
      • csv_delimiter – the CSV delimiter you want to use
      • hash_alg – the hash algorithm to be used (The pysha3 module provides `sha3_228()`, `sha3_256()`,`sha3_384()`, and `sha3_512()`)
  • Functions Region
    • Were the main behaviors are encapsulated.
    • This is the region were you will find:
      • Function that queries MYSQL
        • mysql_execute(query_r,host_r,database_r,user_r,password_r,port_r)
      • Function that encapsulates Pysha3 usage (Apply SHA-3)
        • apply_hash(string_in,algorithm)
  • Main Region
    • This is the orchestration region
    • Where you can set up your call sequence
    • Where you will find the logic that exports to CSV

To call the script you need to change each parameter of the config region and then call the script.

The result should be something like:

cust

 

Any question that you have don´t hesitate in contacting me.

Thank you,

Rui Machado

 

 

 

 

Anúncios

Deixe uma Resposta

Preencha os seus detalhes abaixo ou clique num ícone para iniciar sessão:

Logótipo da WordPress.com

Está a comentar usando a sua conta WordPress.com Terminar Sessão / Alterar )

Imagem do Twitter

Está a comentar usando a sua conta Twitter Terminar Sessão / Alterar )

Facebook photo

Está a comentar usando a sua conta Facebook Terminar Sessão / Alterar )

Google+ photo

Está a comentar usando a sua conta Google+ Terminar Sessão / Alterar )

Connecting to %s