Reducing file storage consumption

I'd like to share a process I use in storage reduction scenarios involving the file system.

Sugar stores files such as email and notes attachments, document revisions, embedded images and files in Knowledge Base articles, user pictures, and even custom module attachments, if applicable.

All of the SELECT database queries I provide for data gathering can be used in the Custom Queries section of the Advanced Reports module.

Alternatives

SugarCRM's team can run a de-duplication script for your file system. I’ve seen numerous duplicates occur due to files included in email signatures, such as business logos or social media icons. Imagine your users have archived 10,000 emails, all containing the same image. In that scenario, there are 10,000 copies of that file stored in your file system, consuming valuable storage space.

Upon request through a support case, the support team can run the de-duplication script. It will identify the 10,000 copies, keep one, and map all references to that single copy, freeing up 99.99% of the storage space consumed by that one image.

Additionally, wherever Data Archiver provides the ability to filter records to accurately target those you wish to hard delete, I recommend using it. If you delete notes, emails, or documents using Data Archiver, their attachments will also be deleted. However, at the time of writing this post, Data Archiver cannot target email attachments alone. If you want to delete attachments but keep the emails, you will need to use a process like the one described below.

Now, onto the steps for identifying and deleting files to reduce storage consumption.


Step 1: Determine types of files, i.e., attachments or documents?

The first step is to determine what types of files are consuming the most space. The two most common culprits are email/notes attachments or document revisions.

For this post, I won’t focus on Knowledge Base articles, user pictures, or custom module attachments, as these are less common. Instead, I'll concentrate on the scenarios you're most likely to encounter.

Email and notes attachments are both recorded as attachments in the same database table: notes.

The following queries can indicate how much of your file system storage is being used by attachments or document revisions:

SELECT COUNT(id), SUM(file_size) FROM notes WHERE file_size > 0;
SELECT COUNT(id), SUM(file_size) FROM document_revisions WHERE file_size > 0;

The results are in bytes, so adding commas (or periods) will help you visualize the file sizes in KB, MB, or GB.

In most cases, the majority of the size is recorded in the notes module. For simplicity, I will focus on notes examples below, but similar queries can be used for document_revisions by adjusting the module name in the queries.


Step 2: Identify when volume of files were created.

In most storage cases I’ve worked on, decisions about what to delete are filtered by date range.

If you want to see a breakdown of how much content was created each year, you can use this query:

SELECT SUBSTRING(date_entered, 1, 4) a, COUNT(id), SUM(file_size) FROM notes WHERE file_size > 0 GROUP BY a ORDER BY a ASC;

To break it down by month in 2024:

SELECT SUBSTRING(date_entered, 1, 7) a, COUNT(id), SUM(file_size) FROM notes WHERE SUBSTRING(date_entered, 1, 4) = '2024' AND file_size > 0 GROUP BY a ORDER BY a ASC;

And to break it down by day in August 2024:

SELECT SUBSTRING(date_entered, 1, 10) a, COUNT(id), SUM(file_size) FROM notes WHERE SUBSTRING(date_entered, 1, 7) = '2024-08' AND file_size > 0 GROUP BY a ORDER BY a ASC;


Step 3: Gather a list of file names to delete

When a file is associated with a note or document revision, it is usually named after the ID number of the corresponding database row.

Let’s say I’ve decided to delete the attachments for all records created in August 2024. To get the IDs of the relevant rows, thereby listing the file names I want to target for deletion, I can run this query:

SELECT id FROM notes WHERE SUBSTRING(date_entered, 1, 7) = '2024-08' AND file_size > 0;

Export the results into a CSV file and open it in Excel to perform the following operations.

Note: There are other columns in the notes module, like attachment_flag, that you can filter on to distinguish note records from attachments. While this can be helpful, focusing on file_size is typically sufficient when the goal is storage reduction.


Step 4: Identify the parent folder containing each file.

All files associated with records in Sugar are stored in the upload/ directory, but they’re split across up to 4,095 subdirectories named in hexadecimal numbers, 000 through fff.

Each file’s subdirectory is predictable based on its ID. For example, the file named f6376262-8663-11ef-99ae-0684a87b501c will be stored at: 

upload/262/f6376262-8663-11ef-99ae-0684a87b501c

The name of the subdirectory is determined by the sixth, seventh, and eighth characters of the file name.

In Excel, with the ID numbers in column A, starting at row 1, you can use this function in column B:

=MID(A1,6,3)

Copying this formula down the column will automatically adjust the reference for each row. You can quickly copy it down by double-clicking the bottom-right corner of cell B1.


Step 5: Prepare the list of delete commands.

With columns A and B filled in, generate the necessary delete command for each row by entering this function in C1:

=CONCATENATE("rm -f ~/upload/",B1,"/",A1)

Each row should look something like this:

rm -f ~/upload/262/f6376262-8663-11ef-99ae-0684a87b501c

Copy the values from column C into a text file, and add #!/bin/bash as the first line of the file. Save it as a .sh file.


Step 6: Upload, move, and run your .sh file into the file system.

If you have direct file system access, you may already have a reliable method to copy files into the instance’s file system. However, I’ll share a trick I find handy.

Open the Notes module in the application, create a new note, attach your .sh file, and save it.

In the database, locate the new note's ID. Assuming the .sh file is named f6376262-8663-11ef-99ae-0684a87b501c, navigate to upload/262/ and run:

cp f6376262-8663-11ef-99ae-0684a87b501c ../../ExampleDeletionList.sh

With the file in the root directory and the necessary permissions set, run:

bash ./filename.sh

The script will execute, iterating through each line to remove the targeted files.


Conclusion

I’ve provided a detailed, step-by-step guide for those with On-Site instances and file system access to complete the described process. If your instance is hosted on SugarCloud, you will need support to complete the final steps of moving and running your .sh file, but you can still follow most of the steps up to that point.

I genuinely hope this has been helpful!

Parents
  • Thank you for your post  

    Some great tips here that are much appreciated. 
    Can you please confirm that purging records (by deleting them) will also purge the corresponding file?

    What happens if we turn on auto-archiver (moving records to _archive table, not deleting them), will the attachments also move to an "uploads archive"?

    Just an Idea:

    It would be really great if files in the upload directory could be organized in subdirs based on the module they belong in.

    This would separate Email attachments from, say, Contract Documents (a custom module created via Studio based on the Documents template).

    I for one would like to have much more frequent and robust off-site backups for things like Contracts than I do for all the tiny png images from email signatures for emails (I have a pet-peeve about those) and attachments related to Cases which are just a waste of space. 
    That would be so much easier if I could just identify a directory to back up.

    Thanks,

    Francesca

Reply
  • Thank you for your post  

    Some great tips here that are much appreciated. 
    Can you please confirm that purging records (by deleting them) will also purge the corresponding file?

    What happens if we turn on auto-archiver (moving records to _archive table, not deleting them), will the attachments also move to an "uploads archive"?

    Just an Idea:

    It would be really great if files in the upload directory could be organized in subdirs based on the module they belong in.

    This would separate Email attachments from, say, Contract Documents (a custom module created via Studio based on the Documents template).

    I for one would like to have much more frequent and robust off-site backups for things like Contracts than I do for all the tiny png images from email signatures for emails (I have a pet-peeve about those) and attachments related to Cases which are just a waste of space. 
    That would be so much easier if I could just identify a directory to back up.

    Thanks,

    Francesca

Children
No Data