Question

Compare files and sizes between folders

  • 31 January 2018

Hi,

I would like to use a PythonCaller to compare the files (and their sizes) in two folders.

My idea is to use a PythonCaller that gives me the list of file names for each folder, and then use a FeatureMerger to compare the two lists name by name and size by size.

Any ideas?

Arthy

8 replies

Not that you can't use python, but why not simply use the Directory and File Pathnames reader, with Retrieve File Properties set to Yes, followed by a ChangeDetector.

Hi @arthy, thanks for posting your question!

Do wait for more serpent-like answers from Python experts, but in the meantime you might want to look at the Directory and File Pathnames reader. It'll grab all your directory and/or file names and attributes (like size, owner, date created, etc.) for your comparison. Transformers like the Matcher, ChangeDetector, or DuplicateFilter could be used to compare the results after reading.

Hope this helps!

Nathan

Thanks @NathanAtSafe and @jdh for your replies,

The Directory and File Pathnames reader can give the results, but it is extremely slow. That's why I would like to use a PythonCaller, given that it is a workbench that I will have to run manually several times.

@takashi, @david_r, or anyone else: any thoughts?

For example, a PythonCaller with this script creates three list attributes (see below) from two folder paths given by the attributes "_original_folder" and "_revised_folder", and adds the lists to the input feature. Note: this script only illustrates one possible way to collect the sizes of all files under a folder. It may not be optimal for your final goal, so please modify it as needed.

# PythonCaller Script Example
import os

class FileSizesComparer(object):
    def input(self, feature):
        # Collect sizes of all files under the two folders.
        originalSizes, revisedSizes = {}, {} # {<relative file path> : <size>}
        collectFileSizes(originalSizes, feature.getAttribute('_original_folder'))
        collectFileSizes(revisedSizes, feature.getAttribute('_revised_folder'))

        originalPaths = set(originalSizes.keys())
        revisedPaths = set(revisedSizes.keys())

        # Files present in both folders: store both sizes and their difference.
        # Paths are sorted so that the list order is the same on every run.
        for i, path in enumerate(sorted(originalPaths & revisedPaths)):
            sizeDiff = revisedSizes[path] - originalSizes[path]
            feature.setAttribute('_unchanged{%d}.filename' % i, path)
            feature.setAttribute('_unchanged{%d}.size_original' % i, originalSizes[path])
            feature.setAttribute('_unchanged{%d}.size_revised' % i, revisedSizes[path])
            feature.setAttribute('_unchanged{%d}.size_diff' % i, sizeDiff)

        # Files present only in the revised folder.
        for i, path in enumerate(sorted(revisedPaths - originalPaths)):
            feature.setAttribute('_added{%d}.filename' % i, path)
            feature.setAttribute('_added{%d}.size' % i, revisedSizes[path])

        # Files present only in the original folder.
        for i, path in enumerate(sorted(originalPaths - revisedPaths)):
            feature.setAttribute('_deleted{%d}.filename' % i, path)
            feature.setAttribute('_deleted{%d}.size' % i, originalSizes[path])

        self.pyoutput(feature)

# Helper function: Collect sizes of all files under a specified root folder.
# Arguments
# - pathToSize: dictionary {<relative file path> : <size>}
# - absRoot: absolute path of the root folder
# - relRoot: relative path of the root folder (empty by default)
def collectFileSizes(pathToSize, absRoot, relRoot=''):
    for name in os.listdir(absRoot):
        absPath = os.path.join(absRoot, name) # absolute path of file or directory
        relPath = os.path.join(relRoot, name) # relative path of file or directory
        if not os.path.islink(absPath): # skip symbolic links
            if os.path.isfile(absPath):
                pathToSize[relPath] = os.path.getsize(absPath)
            elif os.path.isdir(absPath):
                collectFileSizes(pathToSize, absPath, relPath) # recursive call

The "_unchanged{}" list contains information on files that exist under both the original and the revised folder, whether or not their sizes differ. Each element has these four members.

  • "filename" stores the relative file path.
  • "size_original" stores the size of the file under the original folder.
  • "size_revised" stores the size of the file under the revised folder.
  • "size_diff" stores the revised size minus the original size (0 if the size did not change).

The "_added{}" list contains the information (filename and size) on files that exist only under the revised folder.

The "_deleted{}" list contains the information (filename and size) on files that exist only under the original folder.
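
If you want to try the comparison logic outside FME before pasting it into a PythonCaller, something like the following may help. It is only a rough sketch under a few assumptions: it needs Python 3.6 or later, the names collect_file_sizes, compare_folders and demo are made up for this illustration, and it swaps os.listdir for os.scandir, which often reads file sizes with fewer system calls (whether that is noticeably faster depends on your disk and folder layout). The demo builds two small temporary folders instead of using real data.

```python
import os
import tempfile


def collect_file_sizes(root):
    """Return {relative path: size in bytes} for every regular file under root."""
    sizes = {}
    stack = [""]  # relative directories still to visit
    while stack:
        rel_dir = stack.pop()
        with os.scandir(os.path.join(root, rel_dir)) as entries:
            for entry in entries:
                rel_path = os.path.join(rel_dir, entry.name)
                if entry.is_symlink():
                    continue  # skip links, as the PythonCaller script does
                if entry.is_file():
                    sizes[rel_path] = entry.stat().st_size
                elif entry.is_dir():
                    stack.append(rel_path)
    return sizes


def compare_folders(original, revised):
    """Classify files as present in both folders, added, or deleted."""
    orig = collect_file_sizes(original)
    rev = collect_file_sizes(revised)
    return {
        # (original size, revised size) for files found in both folders
        "both": {p: (orig[p], rev[p]) for p in orig.keys() & rev.keys()},
        "added": {p: rev[p] for p in rev.keys() - orig.keys()},
        "deleted": {p: orig[p] for p in orig.keys() - rev.keys()},
    }


def _write(folder, rel_path, text):
    """Create a small test file, including any missing subfolders."""
    path = os.path.join(folder, rel_path)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(text)


def demo():
    """Build two throwaway folders and compare them."""
    with tempfile.TemporaryDirectory() as pre, tempfile.TemporaryDirectory() as post:
        _write(pre, "same.txt", "abc")
        _write(post, "same.txt", "abc")
        _write(pre, os.path.join("sub", "grown.txt"), "abc")
        _write(post, os.path.join("sub", "grown.txt"), "abcdef")
        _write(pre, "old.txt", "x")
        _write(post, "new.txt", "xy")
        return compare_folders(pre, post)


if __name__ == "__main__":
    print(demo())
```

To compare real folders, call compare_folders with your two paths instead of running demo. Inside the PythonCaller, the returned dictionaries could be turned into list attributes in the same way the script above does.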

Not that you can't use python, but why not simply use the Directory and File Pathnames reader, with Retrieve File Properties set to Yes, followed by a ChangeDetector.

Agreed, much simpler.

Thanks @takashi

@takashi I have the same issue and found this post online. Can you explain in more detail how to modify this code and test it? I need to compare the file sizes of pre-upgrade and post-upgrade output files.

@tosinbabs The workspace suggested by @jdh would look something like this (2018.1): filechangedetector.fmw
