Concerned Project Repository
You can find a complete version of the project described in this paper on my GitHub account.
https://github.com/DarkCoderSc/Snippets
(Python) Extract ASCII/Unicode Strings from File
We (Unprotect Team) are currently working on a new web application that will use the open source knowledge available at Unprotect to scan and automatically highlight suspicious artefacts from a sample.
One planned feature of this new project is the ability to extract and view the printable strings embedded in a sample.
After working on this feature, I decided to share the portion of code used for that purpose.
This is one possible method to extract both ASCII and Unicode strings from any file (without using regex).
This Python script supports:
- ASCII string extraction
- Unicode string extraction
- Optional display of each extracted string's offset
- Configurable minimum length for extracted strings
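To make the difference between the two modes concrete, here is a small illustration of how the same text is laid out in each encoding. It assumes that "Unicode" means UTF-16LE restricted to the ASCII range, which is what the script's two-byte check (a printable byte followed by a zero byte) implies; the sample text is arbitrary.

```python
# Illustration only: the same text in the two layouts the script looks for.
text = "Unprotect"

ascii_bytes = text.encode("ascii")
wide_bytes = text.encode("utf-16-le")  # assumption: "Unicode" means UTF-16LE here

print(ascii_bytes.hex(" "))
print(wide_bytes.hex(" "))

# For characters in the ASCII range, the wide form is the ASCII form with a
# 0x00 byte inserted after every character.
assert wide_bytes[0::2] == ascii_bytes
assert set(wide_bytes[1::2]) == {0}
```

This is why the Unicode mode can reuse the same printable-character table: it only has to read two bytes at a time and require the second one to be zero.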
Code Snippet
#!/usr/bin/env python3
# Jean-Pierre LESUEUR (@DarkCoderSc)
# https://keybase.io/phrozen

import argparse
import mmap
from itertools import chain


def extract_strings(file, min_length=4, unicode=False):
    """Yield (offset, string) for each printable run of at least min_length characters."""
    printable_ascii = b"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!\"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ "

    # In Unicode mode each character is read as two bytes: a printable byte followed by 0x00.
    char_size = 2 if unicode else 1

    with mmap.mmap(file.fileno(), length=0, access=mmap.ACCESS_READ) as mmap_obj:
        string = b""
        offset = 0

        for cursor in range(0, mmap_obj.size(), char_size):
            b = mmap_obj.read(char_size)

            is_printable = b[0] in printable_ascii
            if char_size == 2:
                # Reject a truncated trailing byte and any non-zero second byte.
                is_printable = is_printable and len(b) == 2 and b[1] == 0

            if is_printable:
                if not string:
                    offset = cursor  # Remember where the current run starts.
                string += b[:1]
            else:
                if len(string) >= min_length:
                    yield offset, string.decode('ascii')
                string = b""

        # Flush a run that extends to the end of the file.
        if len(string) >= min_length:
            yield offset, string.decode('ascii')


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Binary String Extractor")

    parser.add_argument('-f', '--file', type=argparse.FileType('rb'), dest="file", required=True, help="Binary file to inspect for strings.")
    parser.add_argument('-o', '--offset', default=False, dest="show_offset", action="store_true", help="Show string location in file (string offset).")
    # type=int is required: without it, a value passed on the command line stays a str.
    parser.add_argument('-l', '--min-length', default=4, type=int, required=False, dest="min_length", action="store", help="Minimum length of extracted string.")
    parser.add_argument('-m', '--extract-mode', dest="mode", default='all', choices=['all', 'ascii', 'unicode'], help="Filter string extraction by its encoding nature.")

    try:
        argv = parser.parse_args()
    except IOError as e:
        parser.error(str(e))

    ascii_strings = iter([])
    unicode_strings = iter([])

    if argv.mode in ("all", "ascii"):
        ascii_strings = extract_strings(argv.file, argv.min_length)

    if argv.mode in ("all", "unicode"):
        unicode_strings = extract_strings(argv.file, argv.min_length, True)

    for offset, string in chain(ascii_strings, unicode_strings):
        if argv.show_offset:
            print("{} : {}".format(offset, string))
        else:
            print(string)
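To experiment with the scanning logic without a file on disk, the sketch below applies the same idea to an in-memory bytes object. The helper `scan_runs`, the `PRINTABLE` table built from the `string` module, and the sample buffer are hypothetical names and data introduced for illustration; they are not part of the script above. As in the script, the wide mode only inspects even-aligned two-byte units.

```python
import string

# Hypothetical table for this sketch: digits, letters, punctuation and space.
PRINTABLE = (string.digits + string.ascii_letters + string.punctuation + " ").encode("ascii")


def scan_runs(data, min_length=4, wide=False):
    """Return (offset, text) pairs for printable runs found in a bytes object."""
    step = 2 if wide else 1
    results = []
    run = bytearray()
    start = 0

    for cursor in range(0, len(data), step):
        unit = data[cursor:cursor + step]
        ok = unit[0] in PRINTABLE
        if wide:
            # A wide character is a printable byte followed by 0x00.
            ok = ok and len(unit) == 2 and unit[1] == 0

        if ok:
            if not run:
                start = cursor  # first character of a new run
            run.append(unit[0])
        else:
            if len(run) >= min_length:
                results.append((start, run.decode("ascii")))
            run.clear()

    # Keep a run that reaches the end of the buffer.
    if len(run) >= min_length:
        results.append((start, run.decode("ascii")))
    return results


# Arbitrary sample: an ASCII string at offset 2 and a UTF-16LE string at offset 10.
sample = b"\x01\x02hello!\xff\xfe" + "wide".encode("utf-16-le") + b"\x00\x00"

print(scan_runs(sample))             # [(2, 'hello!')]
print(scan_runs(sample, wide=True))  # [(10, 'wide')]
```

Working on a plain bytes object keeps the example easy to test; for large samples, a memory-mapped file as used in the script avoids loading the whole file at once.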
Nov. 21, 2022, 12:54 p.m. | By Jean-Pierre LESUEUR